Using Not for more than one character
Posted: Wed Jul 28, 2010 4:17 pm
I am mining data from HTML, but I am running into problems because the fields present in each record are not always the same.
Here's a simplified extract of the source (assume each line ends with a CRLF):
<td>
Name: joe bloggs
email: joe@bloggs.com
website: www-joeblogs-com
</td>
<td>
Name: Paul Smith
website: www-paulsmith-com
</td>
<td>
name: Fred Flintstone
email: fred@flinstones.com
</td>
I want to turn this data into a CSV, i.e. name, email, website, etc.
Here is what I have come up with (using EasyPattern Find/Replace):
Find:
['name:' capture(1+chars)crlf][1+chars]['email:' capture(1+chars)crlf]
Replace:
$1,$2\013\010
However, this only works if the record has both a name and an email. If the record doesn't have an email address, the [1+chars] between the two captures runs past the closing '</td>' into the next record and grabs that record's email address instead. Ideally, the part that captures the email address should specify "not '</td>'", i.e. $2 should only match if the email address is found before we hit the next '</td>'. I am at a loss as to how to do this, and have spent days trying to get it to work.
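To make the behaviour I'm after concrete, here is a rough sketch in Python regex rather than EasyPattern, so it is not a drop-in answer. It assumes the layout shown in the extract above; the names SAMPLE, RECORD and to_csv_rows are just made up for illustration, and leaving the email column blank when a record has none is my own choice. The key idea is that the gap between the name and the email may contain any characters as long as none of them starts a '</td>', and the whole email part is optional.

```python
import re

# Sample input modelled on the extract above (each line ends with CRLF).
SAMPLE = (
    "<td>\r\n"
    "Name: joe bloggs\r\n"
    "email: joe@bloggs.com\r\n"
    "website: www-joeblogs-com\r\n"
    "</td>\r\n"
    "<td>\r\n"
    "Name: Paul Smith\r\n"
    "website: www-paulsmith-com\r\n"
    "</td>\r\n"
    "<td>\r\n"
    "name: Fred Flintstone\r\n"
    "email: fred@flinstones.com\r\n"
    "</td>\r\n"
)

# (?:(?!</td>).)*? consumes one character at a time, but only while the
# text at that position does not begin with "</td>". This is the
# multi-character "not" that stops the match leaking into the next record.
RECORD = re.compile(
    r"name:[ \t]*(?P<name>[^\r\n]+)\r?\n"
    r"(?:(?:(?!</td>).)*?email:[ \t]*(?P<email>[^\r\n]+)\r?\n)?",
    re.IGNORECASE | re.DOTALL,
)


def to_csv_rows(text):
    """Return one 'name,email' string per record; email is empty if absent."""
    rows = []
    for match in RECORD.finditer(text):
        name = match.group("name").strip()
        email = (match.group("email") or "").strip()
        rows.append(f"{name},{email}")
    return rows


if __name__ == "__main__":
    # Rows are joined with CRLF, like the \013\010 in the Replace string.
    print("\r\n".join(to_csv_rows(SAMPLE)))
```

With the sample above, this gives "joe bloggs,joe@bloggs.com", then "Paul Smith," with an empty email column, then "Fred Flintstone,fred@flinstones.com". What I need is the EasyPattern equivalent of that "any character, as long as it is not the start of '</td>'" step.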
Any help greatly appreciated.
Thanks,
Dean.