
Please help me with this text extraction

Posted: Sun Jun 12, 2011 10:00 am
by tiler
Hi

I am really new to all this TextPipe stuff. It's far too clever for me, and I was wondering if someone could lend a hand.

I have worked out how to extract individual items from some files I have converted from HTML to text (the conversion was not done in TextPipe), but I wondered whether the following is possible.

I have a large number of pages, all with the same structure. The details I need are within the top 50 or so lines of each page. Below is a cut and paste, with the details I need in bold.

I can do this using individual filters for telephone and website, but is there a way to combine these filters to extract the text I want?


Code: Select all

Tru and Grand - Uk- For all your Ashes


,




,,,,,,,,,,,



,,

,,
,



Tru and Grand 


Ashes specialists Wales and Australia





Company,[b]Tru and Grand [/b],

Click For Website

Contact,[b]Mr Ash[/b](
Address,[b]Unit 1/
Top Road
On a hill
Wales
UK
ABC 123[/b] (MAP)

Telephone,[b]12345 456 234[/b]
Fax,[b]321 8566 999[/b]
Email,[b].......[/b]
Website,[b]wwww.oops.com[/b]

Tru and Grand  was founded in 1650 and is very sorry but a specialist in Ashes
The email is not visible because it is written by a script in the HTML. I have Offline Explorer here, so maybe I can get it by mining that way, unless there is a better idea. The (MAP) reference is a Google Maps link.


Your help would be gratefully received.

Tiler

Re: Please help me with this text extraction

Posted: Mon Jun 13, 2011 10:08 pm
by tiler
Hi

I am still struggling with the above. I have got this far:

website,[ 1 + chars ] > website,wwww.oops.com (I would like just the web address, but I guess I can remove the label in Excel)

Telephone,[ 1 + digits ] > Telephone,12345 456 234 (as above, really)

Address,[ 1 + chars ] > Address,Unit 1/ (I can't get this at all; it stops on the first line of the address)
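A plausible reason the Address match stops at "Unit 1/" is that an "any characters" pattern usually does not cross a line break. The sketch below shows the general idea outside TextPipe, in Python: let the match span lines and end it at the "(MAP)" marker. The names `SAMPLE` and `extract_address` are made up for illustration, and the sample is the pasted page text without the bold tags.

```python
import re

# Sample lines copied from the pasted page (bold tags removed).
SAMPLE = """Contact,Mr Ash(
Address,Unit 1/
Top Road
On a hill
Wales
UK
ABC 123 (MAP)

Telephone,12345 456 234
"""

def extract_address(page_text):
    """Return the multi-line address as one comma-separated string, or None."""
    # DOTALL lets "." match line breaks; the lazy ".+?" stops at the first "(MAP)".
    match = re.search(r"^Address,(.+?)\s*\(MAP\)", page_text,
                      re.DOTALL | re.MULTILINE)
    if match is None:
        return None
    # Join the address lines so they fit in a single spreadsheet cell.
    lines = [line.strip() for line in match.group(1).splitlines()]
    return ", ".join(line for line in lines if line)

print(extract_address(SAMPLE))
```

This assumes every page ends the address with "(MAP)". If some pages lack that marker, the next field label ("Telephone,") could serve as the end point instead.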


The email is a problem because it does not show up in the text page. In the page source, however, it shows up like this (well, at least I think this is it):

Code: Select all

 {
      s=s + t.charAt(l-i);
  }
  
  document.write('<A href=\'mailto:' + s + '?subject=Enquiry from ashes.co.uk\'>' + s + '</a>');
}
</SCRIPT>

Can anyone help me move on with any of the above, please?

Tiler

Re: Please help me with this text extraction

Posted: Tue Jun 14, 2011 4:43 pm
by DataMystic Support
You can't easily extract email addresses from script.

You would have to attach some script to each web page that runs at the end of the page rendering, which then extracts it.
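If the script does nothing more than reverse a string, which the `t.charAt(l-i)` loop in the fragment hints at but does not prove, the address might also be recoverable from the saved page source without running the script. The Python sketch below rests entirely on that assumption. The part of the script that sets `t` is not shown in the post, so the form `var t = "..."`, the helper name `decode_reversed_email`, and the sample address are all hypothetical.

```python
import re

def decode_reversed_email(page_html, var_name="t"):
    """Find a string literal assigned to var_name and return it reversed.

    Hypothetical: assumes the page contains something like
        var t = "moc.elpmaxe@ofni";
    and that the script only reverses that string.
    """
    pattern = r"\b" + re.escape(var_name) + r"\s*=\s*[\"']([^\"']+)[\"']"
    match = re.search(pattern, page_html)
    if match is None:
        return None
    # Slicing with a step of -1 reverses the string.
    return match.group(1)[::-1]

# Made-up page fragment, for illustration only.
html = 'var t = "moc.elpmaxe@ofni";'
print(decode_reversed_email(html))
```

If the real script scrambles the address in any other way, this will return nonsense, and running the page script as described above remains the safer route.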

Re: Please help me with this text extraction

Posted: Tue Jun 14, 2011 10:12 pm
by tiler
Hi

Thank you for that. We can safely say, then, that I won't be getting the email addresses, as I don't have a clue how to do that.


I have managed to get the website out using

Code: Select all

(?:website)(?:.+)(?:
)

I can get all the details out together as below,

Company,Tru and Grand ,
Contact,Mr Ash(
Address,Unit 1/
Top Road
On a hill
Wales
UK
ABC 123
(MAP)

Telephone,12345 456 234
Fax,321 8566 999
Email,.......
Website,wwww.oops.com


Using :

Code: Select all

(?:company)(?:.+)(?:
)(?:website)(?:.+)(?:
)
I have also made some adjustments using other filters.


What I cannot do, and maybe you would be willing to help with, is:

Get just the company and website out together.

When I get the above into Excel, the values are listed vertically; I want them listed horizontally across the page.


It has taken me 3 days to get this far, as I know nothing of code whatsoever, and I am truly stuck now.

Thank you

Tiler

Re: Please help me with this text extraction

Posted: Wed Jun 15, 2011 9:13 am
by DataMystic Support
In the 'replace with' box, put

Code: Select all

$1,$2,$3,$4 
or

Code: Select all

@company,@website,@etc
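One thing to watch with `$1,$2`: in most regex flavours a group written `(?:...)` is non-capturing, so it has no number to refer back to. Plain parentheses `(...)` do capture. The Python sketch below illustrates the idea outside TextPipe: capture only the company and website values and put them on a single comma-separated row, which a spreadsheet will show horizontally. The names `ROW`, `company_and_website` and `to_csv_row` are illustrative, not TextPipe features.

```python
import csv
import io
import re

# Sample lines copied from the extracted output earlier in the thread.
SAMPLE = """Company,Tru and Grand ,
Contact,Mr Ash(
Telephone,12345 456 234
Website,wwww.oops.com
"""

# Group 1 captures the company value, group 2 the website value.
# DOTALL lets ".*?" skip over the lines in between.
ROW = re.compile(r"^Company,([^\n]+)\n.*?^Website,([^\n]+)",
                 re.DOTALL | re.MULTILINE)

def company_and_website(page_text):
    """Return [company, website] for one page, or None if either is missing."""
    match = ROW.search(page_text)
    if match is None:
        return None
    # Trim the stray trailing comma and spaces seen in the sample.
    company = match.group(1).strip(" ,")
    website = match.group(2).strip()
    return [company, website]

def to_csv_row(fields):
    """Format the fields as one CSV line (one spreadsheet row)."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(fields)
    return buffer.getvalue()

print(to_csv_row(company_and_website(SAMPLE)), end="")
```

Writing one such row per page gives a file with one company per line and the fields spread across columns.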