Use of Not for more than 1 character.

Posted: Wed Jul 28, 2010 4:17 pm
by blisstrader
I am data mining from HTML, but I'm running into problems because the fields in each record are not always the same.

Here's a simplified extract of the source (assume each line ends with a CRLF):

<td>
Name: joe bloggs
email: joe@bloggs.com
website: www-joeblogs-com
</td>
<td>
Name: Paul Smith
website: www-paulsmith-com
</td>
<td>
name: Fred Flintstone
email: fred@flinstones.com
</td>

I want to turn this data into a CSV, i.e. Name, email, website, etc.

Here is what I have come up with (using EasyPatterns Find/Replace):

Find:
['name:' capture(1+chars)crlf][1+chars]['email:' capture(1+chars)crlf]

Replace:
$1,$2\013\010

However, this only works if the record has both a name and an email. If the record doesn't have an email address, the pattern skips forward to the next record and grabs that record's email address instead. Ideally, the part that captures the email address should specify not '</td>', i.e. it should only match for $2 if the email address is found before we hit the next '</td>'. I am at a loss as to how to do this, and have spent days attempting to get it to work.

Any help greatly appreciated.

Thanks,
Dean.
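
For readers using ordinary regular expressions rather than EasyPatterns, the "not '</td>'" idea above is usually written as a negative lookahead that is re-checked before every character. The following is only a sketch in Python, not EasyPatterns syntax; the names SAMPLE, NAME_THEN_EMAIL and name_email_rows are made up for illustration, and the sample assumes the last record is closed with </td>.

```python
import re

# Sample data modelled on the extract above (CRLF line endings assumed).
SAMPLE = (
    "<td>\r\nName: joe bloggs\r\nemail: joe@bloggs.com\r\n"
    "website: www-joeblogs-com\r\n</td>\r\n"
    "<td>\r\nName: Paul Smith\r\nwebsite: www-paulsmith-com\r\n</td>\r\n"
    "<td>\r\nname: Fred Flintstone\r\nemail: fred@flinstones.com\r\n</td>\r\n"
)

# (?:(?!</td>).)*? consumes one character at a time, but only while the
# text at the current position is not the start of "</td>". The search for
# "email:" therefore cannot run past the end of the current record.
NAME_THEN_EMAIL = re.compile(
    r"name:[ \t]*(?P<name>[^\r\n]+)\r\n"                      # every record has a name
    r"(?:(?:(?!</td>).)*?email:[ \t]*(?P<email>[^\r\n]+))?",  # email is optional
    re.IGNORECASE | re.DOTALL,
)


def name_email_rows(text):
    """Return (name, email) pairs; email is '' when the record has none."""
    return [
        (m.group("name"), m.group("email") or "")
        for m in NAME_THEN_EMAIL.finditer(text)
    ]


if __name__ == "__main__":
    for name, email in name_email_rows(SAMPLE):
        print(f"{name},{email}")
```

With this pattern the Paul Smith record yields an empty email instead of borrowing Fred's.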

Re: Use of Not for more than 1 character.

Posted: Thu Jul 29, 2010 12:17 pm
by blisstrader
So I figured out that I need to use an Or statement.

Search:
['<TD>'crlf capture(1+chars)'crlf'][1+chars][('email:' capture(1+chars)crlf)or('</TD>')]

Replace:
$1, $2, etc

This works great: it leaves $2 blank if no email address is found, and restarts the search when it hits the </TD> tag.

It works well when there are only 2 options, i.e. get the email address before we hit </TD>.

What if I want to be a bit more thorough: get the email address (if it exists before </TD>); if not, get www: (if it exists before </TD>); if not, get Phone: (if it exists before </TD>)?

All records in my data contain Name:, but email, Phone, www, etc. are optional. Can anyone lend a hand in creating the above search/replace?

Thanks,
Dean.
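
One way to handle any number of optional fields, outside of EasyPatterns, is to stop chaining Or branches and instead cut the text into records first, then look up each label inside its own record. The sketch below is in Python and rests on assumptions: the labels are name, email, website and phone (the sample above uses "website:", not "www:"), every field sits on its own line, and the function name records_to_csv is hypothetical.

```python
import csv
import io
import re

# Column order for the CSV. These labels are an assumption based on the
# posts above; adjust them to match the real data.
FIELDS = ("name", "email", "website", "phone")

# One match per record: everything between <td> and the next </td>.
RECORD = re.compile(r"<td>(.*?)</td>", re.IGNORECASE | re.DOTALL)


def records_to_csv(html):
    """Return CSV text with one row per <td>...</td> record."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(FIELDS)
    for body in RECORD.findall(html):
        found = {}
        for line in body.splitlines():
            # Split "label: value" at the first colon only.
            label, sep, value = line.partition(":")
            if sep:
                # Keep the first occurrence of each label in a record.
                found.setdefault(label.strip().lower(), value.strip())
        # Missing fields become empty cells, so columns stay aligned.
        writer.writerow([found.get(field, "") for field in FIELDS])
    return out.getvalue()
```

Because each lookup is confined to one record, a missing email can never be filled from the next record, and the fields may appear in any order.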

Re: Use of Not for more than 1 character.

Posted: Wed Aug 11, 2010 10:31 am
by DataMystic Support
Hi Dean - you are making life very complicated for yourself! Try this:

<td>
Name: [ capture(1+ chars ) ]
[ optional( 'email:', capture( 1+ chars ), cr, lf ),
  optional( 'website:', capture( 1+ chars ), cr, lf )
]</td>
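
For comparison, here is roughly the same idea (one optional group per optional field, in a fixed order) written as a conventional regular expression in Python. This is a sketch and not a translation checked against TextPipe; the names RECORD and to_csv_lines are made up, and the plain join does not quote values that themselves contain commas.

```python
import re

# One record: a required name line, then optional email and website lines,
# in that order, then the closing tag. Each optional(...) in the pattern
# above becomes a (?: ... )? group here.
RECORD = re.compile(
    r"<td>\r?\n"
    r"name:[ \t]*([^\r\n]+)\r?\n"
    r"(?:email:[ \t]*([^\r\n]+)\r?\n)?"
    r"(?:website:[ \t]*([^\r\n]+)\r?\n)?"
    r"</td>",
    re.IGNORECASE,
)


def to_csv_lines(text):
    """Replace each record with 'name,email,website'; missing fields are empty."""
    return RECORD.sub(lambda m: ",".join(g or "" for g in m.groups()), text)
```

A group that did not take part in the match comes back as None, which the replacement turns into an empty field, much like a blank $2.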