Use of Not for more than 1 character.

blisstrader
Posts: 2
Joined: Wed Jul 28, 2010 3:58 pm

Post by blisstrader »

I am mining data from HTML, but I'm running into problems because the fields in each record are not always the same.

Here's a simplified extract of the source (assume each line ends with a CRLF):

<td>
Name: joe bloggs
email: joe@bloggs.com
website: www-joeblogs-com
</td>
<td>
Name: Paul Smith
website: www-paulsmith-com
</td>
<td>
name: Fred Flintstone
email: fred@flinstones.com
</td>

I want to turn this data into a CSV, i.e. name, email, website, etc.

Here is what I have come up with (using EasyPatterns Find/Replace):

Find:
['name:' capture(1+chars)crlf][1+chars]['email:' capture(1+chars)crlf]

Replace:
$1,$2\013\010

However, this only works if the record has both a name and an email. If the record doesn't have an email address, the pattern skips forward to the next record and grabs that record's email address instead. Ideally, the part that captures the email address should specify "not '</td>'", i.e. it should only match for $2 if the email address is found before we hit the next '</td>'. I am at a loss as to how to do this, and have spent days trying to get it to work.
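For readers who want to see the failure outside TextPipe, here is a minimal sketch using Python's `re` module rather than EasyPatterns. It assumes the sample's last tag is a closing `</td>`, and the variable names are illustrative only. The first pattern lets the gap between name and email run across a record boundary; the second uses a negative lookahead so the gap can never step over `</td>`, which is one way to express "not" for a multi-character string.

```python
import re

# Sample modelled on the extract above; the final tag is assumed to be </td>.
html = (
    "<td>\r\nName: joe bloggs\r\nemail: joe@bloggs.com\r\n"
    "website: www-joeblogs-com\r\n</td>\r\n"
    "<td>\r\nName: Paul Smith\r\nwebsite: www-paulsmith-com\r\n</td>\r\n"
    "<td>\r\nname: Fred Flintstone\r\nemail: fred@flinstones.com\r\n</td>\r\n"
)

FLAGS = re.IGNORECASE | re.DOTALL

# Unbounded gap: ".*?" is free to cross </td>, so a record without an
# email borrows the email of a later record.
unbounded = re.compile(r"name: ([^\r\n]+)\r\n.*?email: ([^\r\n]+)\r\n", FLAGS)
leaky = unbounded.findall(html)

# Bounded gap: "(?:(?!</td>).)*?" consumes one character at a time, but
# only while the text at that position does not start with </td>.
# The whole email part is optional, so records without one still match.
bounded = re.compile(
    r"name: ([^\r\n]+)\r\n"
    r"(?:(?:(?!</td>).)*?email: ([^\r\n]+)\r\n)?",
    FLAGS,
)
rows = bounded.findall(html)

print(leaky)
print(rows)
```

In the unbounded result Paul Smith is paired with Fred's address and Fred's own record is lost; in the bounded result each record keeps its own email, or an empty string when it has none.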

Any help greatly appreciated.

Thanks,
Dean.
blisstrader
Posts: 2
Joined: Wed Jul 28, 2010 3:58 pm

Re: Use of Not for more than 1 character.

Post by blisstrader »

So I figured out that I need to use an Or statement.

Search:
['<TD>'crlf capture(1+chars)'crlf'][1+chars][('email:' capture(1+chars)crlf)or('</TD>')]

Replace:
$1, $2, etc

This works well: it leaves the output for $2 blank if no email address is found, and restarts the search if it hits the </TD> tag.

It works when there are only two options, i.e. get the email address before we hit </TD>.

What if I want to be a bit more thorough: get the email address if it exists before </TD>; if not, get www: if it exists before </TD>; and if not, get Phone: if it exists before </TD>?

All records in my data contain Name:, but email, Phone, www, etc. are optional. Can anyone lend a hand in creating the above search/replace?
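As the number of optional fields grows, another approach is to stop matching the fields in order and instead parse each record into key/value pairs. The sketch below does this in Python, not EasyPatterns; the column list `FIELDS`, the function name `records_to_csv`, and the `phone` and `www` labels are assumptions for illustration.

```python
import csv
import io
import re

# Assumed output columns; extend this list to match the labels in the data.
FIELDS = ["name", "email", "website", "phone"]

def records_to_csv(html: str) -> str:
    """Turn each <td>...</td> block of 'label: value' lines into a CSV row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\r\n")
    # Each match is the text of one record, so nothing can leak between records.
    for block in re.findall(r"<td>(.*?)</td>", html, re.IGNORECASE | re.DOTALL):
        row = {}
        for line in block.splitlines():
            label, sep, value = line.partition(":")
            if sep:  # skip blank lines and lines without a label
                row[label.strip().lower()] = value.strip()
        if "name" in row:  # every record is expected to carry a name
            writer.writerow([row.get(field, "") for field in FIELDS])
    return out.getvalue()

sample = (
    "<td>\r\nName: joe bloggs\r\nemail: joe@bloggs.com\r\n"
    "website: www-joeblogs-com\r\n</td>\r\n"
    "<td>\r\nName: Paul Smith\r\nwebsite: www-paulsmith-com\r\n</td>\r\n"
)
print(records_to_csv(sample))
```

Missing fields come out as empty columns, and the order of the fields inside a record no longer matters.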

Thanks,
Dean.
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Use of Not for more than 1 character.

Post by DataMystic Support »

Hi Dean - you are making life very complicated for yourself! Try this:

<td>
Name: [ capture(1+ chars ) ]
[ optional( 'email:', capture( 1+ chars ), cr, lf ),
  optional( 'website:', capture( 1+ chars ), cr, lf )
]</td>
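
The same idea, a fixed record frame with each optional field wrapped in its own optional group, can be sketched as a Python regular expression. This is a rough analogue and not a translation verified against TextPipe; it assumes the fields always appear in the order email, then website, and that the last tag in the sample is `</td>`.

```python
import re

# One record: <td>, a mandatory name line, then optional email and
# website lines in that order, then the closing </td>.
record = re.compile(
    r"<td>\r\n"
    r"name: ([^\r\n]+)\r\n"
    r"(?:email: ([^\r\n]+)\r\n)?"
    r"(?:website: ([^\r\n]+)\r\n)?"
    r"</td>(?:\r\n)?",
    re.IGNORECASE,
)

source = (
    "<td>\r\nName: joe bloggs\r\nemail: joe@bloggs.com\r\n"
    "website: www-joeblogs-com\r\n</td>\r\n"
    "<td>\r\nName: Paul Smith\r\nwebsite: www-paulsmith-com\r\n</td>\r\n"
    "<td>\r\nname: Fred Flintstone\r\nemail: fred@flinstones.com\r\n</td>\r\n"
)

# Groups that did not take part in a match are replaced by an empty
# string (Python 3.5 and later), which yields an empty CSV column.
csv_text = record.sub(lambda m: ",".join(g or "" for g in m.groups()) + "\r\n", source)
print(csv_text)
```

Because every optional group sits between the name line and `</td>`, a missing field cannot be filled from the next record.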