Extract multiple data fields from webpage directory into CSV

Post by **trpaquette** » Sat Jun 05, 2010 1:03 pm

Hi,

I'm trying to extract multiple company listings from an HTML document of a business directory. I have renamed the file from .htm to .txt for processing and would like to place the data into an Excel spreadsheet. Specifically, I need to extract 1) Street Address 2) City 3) Province 4) Phone 5) Key Contact 6) Key Contact Title.

Here is a snippet of code of a typical directory listing:

==========================================
CODE
==========================================
<SPAN class="bh1">
170 Laurier Ave. West, Suite 902<BR>
Ottawa ON, K1P 5V5<BR><BR>
Phone #: 613-234-1810<BR> Fax #: 613-234-0797<BR>
<BR>Key contact: Kevin Dee , Chief Executive Officer<BR>

Toll free #: 888-361-0579<BR>
<BR>
Website: <A href="http://www.eagleonline.com/" target="_blank">www.eagleonline.com</A><BR> Email: <A href="mailto:David_Obrien@eagleonline.com">David_Obrien@eagleonline.com</A><BR>
<BR><BR>Approximately 80 employees work at this location
</SPAN>
<BR><BR>

<TABLE border="0" width="280" cellspacing="0" cellpadding="0">
<TBODY><TR>
<TD>
<H2>Business Activity</H2> </TD>
</TR>
<TR>
<TD class="bh1">
Service<BR> </TD>

</TR>
</TBODY></TABLE>
==========================================
I'm currently trying to figure out how to extract the address, which I know "usually" begins with 1+ digits followed by 1+ letters, 0+ whitespace, 0+ symbols, and 0+ punctuation. Here is what I have come up with, which is not working:

[mustBeginWith('span class="bh1"',rightAngle)][capture(0+ letters or digits or whitespace or symbols)] [mustEndWith(leftAngle,'br',rightAngle)]

Can anyone help me out? I have read the entire reference and the tutorial on how to extract data from a web page, but I do not understand what I'm doing wrong or what I'd have to do to extract to CSV format.

Thanks

Post by **DataMystic Support** » Mon Jun 07, 2010 9:21 am

Try something like this:

Code:

<SPAN class="bh1">
[ capture( 0+ not '<' ) ]<BR>
[ capture( 0+ not '<' ) ]<BR><BR>
[ capture( 0+ not '<' ) ]<BR>[ capture( 0+ not '<' ) ]<BR>
<BR>[ capture( 0+ not '<' ) ]<BR>

[ capture( 0+ not '<' ) ]<BR>
<BR>
Website: [ capture( 0+ not '<' ) ]<BR>[ capture( 0+ not '<' ) ]<BR>
<BR>[ capture( 0+ char ) ]
</SPAN>
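
If you want to cross-check the result outside TextPipe, the same idea (capture everything up to the next `<BR>`, then pick out the labelled lines) can be sketched in Python. This is only a rough illustration, not part of TextPipe: it assumes every listing sits in its own `<SPAN class="bh1">` block laid out like the snippet above, with the street on the first line and "City PR, postal code" on the second. The names `extract_rows`, `parse_listing` and `to_csv` are made up for this example.

```python
import csv
import io
import re

# Trimmed copy of the listing posted above, used as sample input.
SAMPLE = '''<SPAN class="bh1">
170 Laurier Ave. West, Suite 902<BR>
Ottawa ON, K1P 5V5<BR><BR>
Phone #: 613-234-1810<BR> Fax #: 613-234-0797<BR>
<BR>Key contact: Kevin Dee , Chief Executive Officer<BR>
Website: <A href="http://www.eagleonline.com/">www.eagleonline.com</A><BR>
</SPAN>'''

FIELDS = ["street", "city", "province", "phone", "contact", "title"]

# Assumption: one listing per <SPAN class="bh1"> ... </SPAN> block.
LISTING_RE = re.compile(r'<SPAN class="bh1">(.*?)</SPAN>', re.I | re.S)
BR_RE = re.compile(r'<BR\s*/?>', re.I)
TAG_RE = re.compile(r'<[^>]+>')
# Assumption: second line looks like "Ottawa ON, K1P 5V5".
CITY_RE = re.compile(r'^(?P<city>.+?)\s+(?P<province>[A-Z]{2}),')
PHONE_RE = re.compile(r'^Phone #:\s*(.+)$')
CONTACT_RE = re.compile(r'^Key contact:\s*(?P<name>[^,]+?)\s*,\s*(?P<title>.+)$')


def parse_listing(block):
    """Turn the inside of one listing block into a dict of the six fields."""
    # Split on <BR>, drop any remaining tags, and discard empty lines.
    lines = [TAG_RE.sub("", part).strip() for part in BR_RE.split(block)]
    lines = [line for line in lines if line]
    row = dict.fromkeys(FIELDS, "")
    if lines:
        row["street"] = lines[0]
    if len(lines) > 1:
        m = CITY_RE.match(lines[1])
        if m:
            row["city"], row["province"] = m.group("city"), m.group("province")
    # The remaining lines are labelled, so look them up by their prefix.
    for line in lines[2:]:
        m = PHONE_RE.match(line)
        if m:
            row["phone"] = m.group(1)
        m = CONTACT_RE.match(line)
        if m:
            row["contact"], row["title"] = m.group("name"), m.group("title")
    return row


def extract_rows(html):
    """Return one dict per listing found in the HTML text."""
    return [parse_listing(block) for block in LISTING_RE.findall(html)]


def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(extract_rows(SAMPLE)), end="")
```

The `csv` module quotes the street field because it contains a comma, so Excel should open the output with the address in a single column. Listings that deviate from the assumed layout (no street line, a missing key contact) will simply leave the affected fields empty.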
