Extract multiple data fields from webpage directory into CSV
Posted: Sat Jun 05, 2010 1:03 pm
Hi,
I'm trying to extract multiple company listings from an HTML document of a business directory. I have renamed the file from .htm to .txt for processing and would like to place the data into an Excel spreadsheet. Specifically, I need to extract: 1) Street Address 2) City 3) Province 4) Phone 5) Key Contact 6) Key Contact Title.
Here is a snippet of code of a typical directory listing:
==========================================
CODE
==========================================
<SPAN class="bh1">
170 Laurier Ave. West, Suite 902<BR>
Ottawa ON, K1P 5V5<BR><BR>
Phone #: 613-234-1810<BR> Fax #: 613-234-0797<BR>
<BR>Key contact: Kevin Dee , Chief Executive Officer<BR>
Toll free #: 888-361-0579<BR>
<BR>
Website: <A href="http://www.eagleonline.com/" target="_blank">www.eagleonline.com</A><BR> Email: <A href="mailto:David_Obrien@eagleonline.com">David_Obrien@eagleonline.com</A><BR>
<BR><BR>Approximately 80 employees work at this location
</SPAN>
<BR><BR>
<TABLE border="0" width="280" cellspacing="0" cellpadding="0">
<TBODY><TR>
<TD>
<H2>Business Activity</H2> </TD>
</TR>
<TR>
<TD class="bh1">
Service<BR> </TD>
</TR>
</TBODY></TABLE>
==========================================
I'm currently trying to figure out how to extract the street address, which I know "usually" begins with 1+ digits and is followed by 1+ letters, 0+ whitespace, 0+ symbols, 0+ punctuation. Here is what I have come up with, but it is not working:
[mustBeginWith('span class="bh1"',rightAngle)][capture(0+ letters or digits or whitespace or symbols)] [mustEndWith(leftAngle,'br',rightAngle)]
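For comparison, here is a minimal sketch of the same idea as a Python regular expression rather than the pattern syntax above. It assumes three things that are visible in the sample listing and that may be why the pattern misses: the tag is written in uppercase (`<SPAN`, `<BR>`), a line break follows the opening tag before the address starts, and the address contains punctuation such as periods and commas. The name `ADDRESS_RE` is only illustrative.

```python
import re

# Fragment copied from the sample listing in this post.
html = '''<SPAN class="bh1">
170 Laurier Ave. West, Suite 902<BR>
Ottawa ON, K1P 5V5<BR><BR>
'''

# Match the opening tag case-insensitively, skip the whitespace (including
# the newline) after it, then capture everything up to the first <BR>.
# [^<] accepts letters, digits, spaces and punctuation alike.
ADDRESS_RE = re.compile(r'<span\s+class="bh1">\s*([^<]+?)\s*<br>', re.IGNORECASE)

match = ADDRESS_RE.search(html)
street_address = match.group(1) if match else None
print(street_address)
```

Capturing "anything that is not a left angle bracket" avoids having to enumerate every character class the address might contain.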
Can anyone help me out? I have read the entire reference and the tutorial on extracting data from a web page, but I do not understand what I'm doing wrong or what I'd have to do to output the results in CSV format.
Thanks
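One possible way to get all six fields into CSV, sketched in Python under some assumptions taken from the single sample listing: every listing starts with `<SPAN class="bh1">`, the second line has the form "City PR, postal code" with a two-letter province, and the key contact line always has a comma between the name and the title. Listings that break these assumptions (for example, one with no phone number) would need extra handling. The names `LISTING_RE`, `extract_listings` and `to_csv` are hypothetical.

```python
import csv
import io
import re

# One pattern per listing; named groups become the CSV columns.
LISTING_RE = re.compile(
    r'<span\s+class="bh1">\s*'
    r'(?P<street>[^<]+?)\s*<br>\s*'                       # first line: street address
    r'(?P<city>[^<]+?)\s+(?P<province>[A-Z]{2}),[^<]*<br>'  # "Ottawa ON, K1P 5V5"
    r'.*?Phone\s*#:\s*(?P<phone>[\d-]+)'                  # "Phone #: 613-234-1810"
    r'.*?Key contact:\s*(?P<contact>[^,<]+?)\s*,\s*(?P<title>[^<]+?)\s*<br>',
    re.IGNORECASE | re.DOTALL,
)

FIELDS = ["street", "city", "province", "phone", "contact", "title"]


def extract_listings(html):
    """Return one dict per listing found in the HTML text."""
    return [m.groupdict() for m in LISTING_RE.finditer(html)]


def to_csv(rows):
    """Serialize the extracted rows as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Shortened copy of the sample listing from this post.
SAMPLE = '''<SPAN class="bh1">
170 Laurier Ave. West, Suite 902<BR>
Ottawa ON, K1P 5V5<BR><BR>
Phone #: 613-234-1810<BR> Fax #: 613-234-0797<BR>
<BR>Key contact: Kevin Dee , Chief Executive Officer<BR>
</SPAN>'''

rows = extract_listings(SAMPLE)
print(to_csv(rows))
```

The `csv` module quotes the street address automatically because it contains a comma, so the resulting text can be saved to a .csv file and opened directly in Excel.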