Hi,
I'm trying to extract multiple company listings from an HTML document of a business directory. I have renamed the file from .htm to .txt for processing and would like to place the data into an Excel spreadsheet. Specifically, I need to extract 1) Street Address 2) City 3) Province 4) Phone 5) Key Contact 6) Key Contact Title.
Here is a snippet of code of a typical directory listing:
==========================================
CODE
==========================================
<SPAN class="bh1">
170 Laurier Ave. West, Suite 902<BR>
Ottawa ON, K1P 5V5<BR><BR>
Phone #: 613-234-1810<BR> Fax #: 613-234-0797<BR>
<BR>Key contact: Kevin Dee , Chief Executive Officer<BR>
Toll free #: 888-361-0579<BR>
<BR>
Website: <A href="http://www.eagleonline.com/" target="_blank">www.eagleonline.com</A><BR> Email: <A href="mailto:David_Obrien@eagleonline.com">David_Obrien@eagleonline.com</A><BR>
<BR><BR>Approximately 80 employees work at this location
</SPAN>
<BR><BR>
<TABLE border="0" width="280" cellspacing="0" cellpadding="0">
<TBODY><TR>
<TD>
<H2>Business Activity</H2> </TD>
</TR>
<TR>
<TD class="bh1">
Service<BR> </TD>
</TR>
</TBODY></TABLE>
==========================================
I'm currently trying to figure out how to extract the address, which I know "usually" begins with 1+ digits followed by 1+ letters, 0+ whitespace, 0+ symbols, and 0+ punctuation. Here is what I have come up with, which is not working:
[mustBeginWith('span class="bh1"',rightAngle)][capture(0+ letters or digits or whitespace or symbols)] [mustEndWith(leftAngle,'br',rightAngle)]
Can anyone help me out? I have read the entire reference and the tutorial on how to extract data from a web page, but I do not quite understand what I'm doing wrong or what I'd have to do to extract the fields to CSV format.
Thanks
Extract multiple data fields from webpage directory into CSV
Moderators: DataMystic Support, Moderators
-
- Posts: 3
- Joined: Sat May 22, 2010 1:43 pm
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
Re: Extract multiple data fields from webpage directory into CSV
Try something like this:
Code:
<SPAN class="bh1">
[ capture( 0+ not '<' ) ]<BR>
[ capture( 0+ not '<' ) ]<BR><BR>
[ capture( 0+ not '<' ) ]<BR>[ capture( 0+ not '<' ) ]<BR>
<BR>[ capture( 0+ not '<' ) ]<BR>
[ capture( 0+ not '<' ) ]<BR>
<BR>
Website: <A [ 0+ not '>' ]>[ capture( 0+ not '<' ) ]</A><BR> Email: <A [ 0+ not '>' ]>[ capture( 0+ not '<' ) ]</A><BR>
<BR>[ capture( 0+ char ) ]
</SPAN>
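
If a scripting route is also an option, the same capture idea can be sketched outside TextPipe. The Python below is only an illustration and not a TextPipe feature: it assumes every listing sits in its own `<SPAN class="bh1">` block laid out like the sample in the question (street on the first line, "City PR, postal code" on the second, and labelled "Phone #:" and "Key contact:" lines). The names `parse_listing`, `listings_to_csv` and `FIELDS` are made up for this example.

```python
import csv
import io
import re

# Sample listing taken from the question; a real run would read the saved file instead.
SAMPLE = '''<SPAN class="bh1">
170 Laurier Ave. West, Suite 902<BR>
Ottawa ON, K1P 5V5<BR><BR>
Phone #: 613-234-1810<BR> Fax #: 613-234-0797<BR>
<BR>Key contact: Kevin Dee , Chief Executive Officer<BR>
Toll free #: 888-361-0579<BR>
<BR>
Website: <A href="http://www.eagleonline.com/" target="_blank">www.eagleonline.com</A><BR>
</SPAN>'''

FIELDS = ["street", "city", "province", "phone", "contact", "title"]

BLOCK = re.compile(r'<SPAN class="bh1">(.*?)</SPAN>', re.IGNORECASE | re.DOTALL)
BR = re.compile(r"<BR\s*/?>", re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")


def parse_listing(block):
    """Turn the inner HTML of one listing into a dict of the six fields."""
    # Split on <BR>, drop any remaining tags, and discard blank pieces.
    parts = [TAG.sub("", piece).strip() for piece in BR.split(block)]
    lines = [p for p in parts if p]
    row = dict.fromkeys(FIELDS, "")
    if len(lines) < 2:
        return row
    # Assumption: line 1 is the street, line 2 is "City PR, postal code".
    row["street"] = lines[0]
    m = re.match(r"(.+?)\s+([A-Z]{2}),", lines[1])
    if m:
        row["city"], row["province"] = m.group(1), m.group(2)
    # The remaining fields are found by their labels, so a missing line is left blank.
    for line in lines[2:]:
        if line.startswith("Phone #:"):
            row["phone"] = line[len("Phone #:"):].strip()
        elif line.startswith("Key contact:"):
            name, _, title = line[len("Key contact:"):].partition(",")
            row["contact"], row["title"] = name.strip(), title.strip()
    return row


def listings_to_csv(html):
    """Return CSV text with one row per <SPAN class="bh1"> listing."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for match in BLOCK.finditer(html):
        writer.writerow(parse_listing(match.group(1)))
    return out.getvalue()


if __name__ == "__main__":
    print(listings_to_csv(SAMPLE))
```

The csv module quotes the street field because it contains a comma, so the output opens cleanly in Excel. Listings that deviate from the sample layout (for example, no province code on the second line) would leave those columns empty rather than fail.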