Extract text from HTML tag

Posted: Wed Aug 27, 2008 12:23 pm
by asoydah
How can I extract the text from an HTML tag?
Example:
<h3><b><a name="F5">Family SUV / Wagon</a></b></h3>
<h4>Mitsubishi Outlander or similar</h4>
<ul>
<li>4 door SUV</li>
<li><b>Auto</b>, Power Steering, MP3/CD player</li>
<li>Air Conditioning</li>
<li>

I want the extracted text to end up in an XML file like this:
<name>Mitsubishi Outlander or similar</name>
<desc>Family Suv</desc>

Can someone help me with the filter?

Re: Extract text from HTML tag

Posted: Wed Aug 27, 2008 2:17 pm
by DataMystic Support
Use an EasyPattern search/replace with the Extract option turned on.

Replace the variable sections with:
[ capture(1+ chars) as 'car' ]
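
For readers more used to regular expressions, the capture above is roughly comparable to a named group: the fixed tags stay literal and only the changing text is captured. A minimal Python sketch of that idea follows. It is not EasyPattern itself, and the `<h4>` literal and the variable names are assumptions taken from the example in the question.

```python
import re

# One line from the example in the question.
html = '<h4>Mitsubishi Outlander or similar</h4>'

# The tags are matched literally; (?P<car>.+?) is a named group that
# lazily matches one or more characters, i.e. the variable section.
pattern = re.compile(r'<h4>(?P<car>.+?)</h4>')

match = pattern.search(html)
car = match.group('car') if match else None

# Wrap the captured text in the XML element the question asks for.
name_xml = f'<name>{car}</name>'
print(name_xml)
```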

Re: Extract text from HTML tag

Posted: Wed Aug 27, 2008 4:53 pm
by asoydah
Sorry for being the idiot here.. :(
But which one should I replace?

Re: Extract text from HTML tag

Posted: Wed Aug 27, 2008 5:40 pm
by DataMystic Support
Are you buying TextPipe Pro? Have you looked at the web site mining docs at http://www.datamystic.com/docs ?

Re: Extract text from HTML tag

Posted: Thu Aug 28, 2008 9:32 pm
by Fixer
Hi asoydah,
I made a filter for you in TextPipe.
Download it here: http://plikojad.pl/bbg79d7lxszl (extract cars.fll from cars.rar and open it)
:)

Result:

<cars>
  <name>Mitsubishi Outlander or similar</name>
  <desc>Family SUV / Wagon</desc>
    <option>4 door SUV</option>
    <option><b>Auto</b>, Power Steering, MP3/CD player</option>
    <option>Air Conditioning</option>
</cars>
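
If the filter file is not available, a result like the one above can be approximated outside TextPipe. The following is only a sketch in Python using regular expressions, not the contents of cars.fll. The function names `strip_tags` and `cars_to_xml` are made up for illustration, and it assumes the input looks like the sample from the first post (one `<h3>`, one `<h4>`, and a list of `<li>` items).

```python
import re

# Sample input copied from the first post, including the unfinished <li>.
SAMPLE = """<h3><b><a name="F5">Family SUV / Wagon</a></b></h3>
<h4>Mitsubishi Outlander or similar</h4>
<ul>
<li>4 door SUV</li>
<li><b>Auto</b>, Power Steering, MP3/CD player</li>
<li>Air Conditioning</li>
<li>
"""


def strip_tags(fragment):
    """Remove any markup inside a captured fragment and trim whitespace."""
    return re.sub(r'<[^>]+>', '', fragment).strip()


def cars_to_xml(html):
    """Build the <cars> block from the first <h3>, <h4> and all closed <li>."""
    desc = re.search(r'<h3>(.*?)</h3>', html, re.S)
    name = re.search(r'<h4>(.*?)</h4>', html, re.S)
    # Only <li> items that have a closing tag are picked up.
    options = re.findall(r'<li>(.*?)</li>', html, re.S)

    lines = ['<cars>']
    if name:
        lines.append(f'  <name>{strip_tags(name.group(1))}</name>')
    if desc:
        lines.append(f'  <desc>{strip_tags(desc.group(1))}</desc>')
    for option in options:
        # Inner markup such as <b> is kept, as in the result shown above.
        lines.append(f'    <option>{option.strip()}</option>')
    lines.append('</cars>')
    return '\n'.join(lines)


result = cars_to_xml(SAMPLE)
print(result)
```

Regular expressions are fragile on real-world HTML, and this sketch does not escape characters such as `&` for XML, so treat it as a starting point rather than a replacement for the filter.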

Re: Extract text from HTML tag

Posted: Thu Aug 28, 2008 10:20 pm
by DataMystic Support
Thanks Fixer!

Re: Extract text from HTML tag

Posted: Thu Aug 28, 2008 11:01 pm
by asoydah
Thanks, Fixer. I already clicked that, but I can't download anything.
Maybe it's a broken link?

Re: Extract text from HTML tag

Posted: Fri Aug 29, 2008 10:19 pm
by Fixer
No, it works! Oh gosh...
You must click twice (first on the link, then on the file cars.rar).
But OK, don't worry. Try clicking this direct link now: http://plikojad.pl/download/bbk4h4wsehn ... 642cd74058
