I am mining data from HTML, but I am running into problems because the data fields are not the same for every record.
Here's a simplified extract of the source (assume each line ends with a CRLF):
<td>
Name: joe bloggs
email: joe@bloggs.com
website: www-joeblogs-com
</td>
<td>
Name: Paul Smith
website: www-paulsmith-com
</td>
<td>
name: Fred Flintstone
email: fred@flinstones.com
</td>
I want to turn this data into a CSV, i.e. Name, email, website, etc.
Here is what I have come up with (using EasyPatterns Find/Replace):
Find:
['name:' capture(1+chars)crlf][1+chars]['email:' capture(1+chars)crlf]
Replace:
$1,$2\013\010
However, this only works if the record has both a name and an email. If the record doesn't have an email address, the pattern skips forward to the next record and grabs that record's email address. Ideally, the part that captures the email address should specify not '</td>', i.e. it should only match for $2 if the email address is found before we hit the next '</td>'. I am at a loss as to how to do this, and have spent days trying to get it to work.
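For comparison only, the "not '</td>'" constraint described above can be sketched in an ordinary regular-expression engine. The following is a minimal Python illustration, not EasyPatterns syntax; the names `SOURCE`, `PATTERN` and `name_email_rows` are made up for this example, and the sample data assumes the last record is closed with `</td>`.

```python
import re

# Sample data from the post, with CRLF line endings as stated there.
# Assumption: the final record is closed with </td>.
SOURCE = (
    "<td>\r\nName: joe bloggs\r\nemail: joe@bloggs.com\r\n"
    "website: www-joeblogs-com\r\n</td>\r\n"
    "<td>\r\nName: Paul Smith\r\nwebsite: www-paulsmith-com\r\n</td>\r\n"
    "<td>\r\nname: Fred Flintstone\r\nemail: fred@flinstones.com\r\n</td>\r\n"
)

# (?:(?!</td>).)*? consumes one character at a time, but only while the
# text at that position does not start with "</td>". The search for
# "email:" therefore cannot cross into the next record.
PATTERN = re.compile(
    r"name:[ \t]*([^\r\n]+)\r\n"
    r"(?:(?:(?!</td>).)*?email:[ \t]*([^\r\n]+))?",
    re.IGNORECASE | re.DOTALL,
)


def name_email_rows(text):
    """Return (name, email) pairs; email is '' when the record has none."""
    return [(m.group(1), m.group(2) or "") for m in PATTERN.finditer(text)]


for name, email in name_email_rows(SOURCE):
    print(f"{name},{email}")
```

The second record has no email, so its second column comes out empty instead of borrowing the address from the third record.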
Any help greatly appreciated.
Thanks,
Dean.
Use of Not for more than 1 character.
Moderators: DataMystic Support, Moderators
- Posts: 2
- Joined: Wed Jul 28, 2010 3:58 pm
Re: Use of Not for more than 1 character.
So I figured out that I need to use an Or statement.
Search:
['<TD>'crlf capture(1+chars)'crlf'][1+chars][('email:' capture(1+chars)crlf)or('</TD>')]
Replace:
$1, $2, etc
This works great: it leaves the output for $2 blank if no email address is found, and restarts the search if it hits the </TD> tag.
It works well when there are only two options, i.e. get the email address before we hit </TD>.
What if I want to be more thorough: get the email address (if it exists before </TD>); if not, get www: (if it exists before </TD>); and if not, get Phone: (if it exists before </TD>)?
All records in my data contain Name:, but email, Phone, www, etc. are optional. Can anyone lend a hand in creating the above search/replace?
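When several fields are optional, one alternative to a single large pattern is to cut the text into records first and then read each "key: value" line on its own. Below is a rough Python sketch of that idea, not an EasyPatterns filter; the column list `FIELDS`, the function name `records_to_csv` and the phone number in the usage example are assumptions made for illustration.

```python
import csv
import io
import re

# Assumed column order; any field missing from a record is left blank.
FIELDS = ["name", "email", "website", "phone"]


def records_to_csv(text, fields=FIELDS):
    """Turn <td>...</td> blocks of 'key: value' lines into CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Each <td>...</td> block is treated as one record.
    for block in re.findall(r"<td>(.*?)</td>", text, re.IGNORECASE | re.DOTALL):
        record = {}
        for line in block.splitlines():
            # Split on the first colon only, so values may contain colons.
            key, sep, value = line.partition(":")
            if sep:
                record[key.strip().lower()] = value.strip()
        if record:
            writer.writerow([record.get(f, "") for f in fields])
    return out.getvalue()


sample = (
    "<td>\r\nName: joe bloggs\r\nemail: joe@bloggs.com\r\n"
    "website: www-joeblogs-com\r\n</td>\r\n"
    "<td>\r\nName: Paul Smith\r\nPhone: 555-0100\r\n</td>\r\n"  # made-up phone
)
print(records_to_csv(sample), end="")
```

Because each record is handled separately, a missing field can never be filled from the following record, and adding another optional field only means extending `FIELDS`.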
Thanks,
Dean.
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
Re: Use of Not for more than 1 character.
Hi Dean - you are making life very complicated for yourself! Try this:
Code:
<td>
Name: [ capture(1+ chars ) ]
[ optional( 'email:', capture( 1+ chars ), cr, lf ),
optional( 'website:', capture( 1+ chars ), cr, lf )
]</td>
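For readers more used to conventional regular expressions, the same idea (a required name followed by optional fields in a fixed order, all inside one record) might look roughly like this in Python. This is an approximate reading of the pattern above, not a translation verified against TextPipe; `RECORD` and `to_csv_lines` are names invented for the example.

```python
import re

# One <td> record: a required name line, then optional email and
# website lines in that order, then the closing tag.
RECORD = re.compile(
    r"<td>\r\n"
    r"name:[ \t]*([^\r\n]+)\r\n"
    r"(?:email:[ \t]*([^\r\n]+)\r\n)?"
    r"(?:website:[ \t]*([^\r\n]+)\r\n)?"
    r"</td>",
    re.IGNORECASE,
)


def to_csv_lines(text):
    """Return one 'name,email,website' line per record; gaps stay empty."""
    return [
        ",".join(group or "" for group in match.groups())
        for match in RECORD.finditer(text)
    ]


sample = (
    "<td>\r\nName: joe bloggs\r\nemail: joe@bloggs.com\r\n"
    "website: www-joeblogs-com\r\n</td>\r\n"
    "<td>\r\nName: Paul Smith\r\nwebsite: www-paulsmith-com\r\n</td>\r\n"
)
print("\n".join(to_csv_lines(sample)))
```

Note that this simple join does not quote values that themselves contain commas, and a record whose fields appear in a different order would not match.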