Extract META tag data from long list of websites?

Post by **Moz** » Thu Aug 09, 2007 4:39 am

Hi,

I've seen the download websites filter for TextPipe. Is it possible to have a list of URLs in a text file, use TextPipe to download them, and then extract META tag data such as the Description?

Thanks,

Moz

Post by **DataMystic Support** » Thu Aug 09, 2007 9:29 am

Absolutely. Just use a Search/Replace with the Extract option, e.g. using the perl pattern:

<meta[^>]*NAME="description"[^>]*>

Replace with

@FullInputFilename: $0\r\n
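For anyone who wants to check the pattern outside TextPipe, here is a rough Python sketch of the same idea. It is not TextPipe's own implementation: the function name `extract_meta_descriptions` and the sample filename are made up for illustration, and case-insensitive matching is an assumption (so that `name="Description"` and `NAME="description"` are both caught).

```python
import re

# Same pattern as in the post. IGNORECASE is an assumption, not something
# the post specifies; without it only upper-case NAME= would match.
META_DESCRIPTION = re.compile(r'<meta[^>]*NAME="description"[^>]*>', re.IGNORECASE)


def extract_meta_descriptions(text, filename):
    """Emit one 'filename: tag' line per match, like '@FullInputFilename: $0\\r\\n'."""
    return "".join(
        f"{filename}: {match.group(0)}\r\n"
        for match in META_DESCRIPTION.finditer(text)
    )


# Hypothetical input: one downloaded page held in a string.
sample = '<html><meta name="description" content="BMS World Mission"><p>x</p>'
result = extract_meta_descriptions(sample, "pages.txt")
print(result, end="")
```

Because `[^>]*` also matches line breaks, the pattern can span a tag that wraps over several lines, as long as the whole text is searched at once.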

Post by **Moz** » Fri Aug 10, 2007 8:10 am

DataMystic Support wrote:Absolutely. Just use a Search/Replace with the Extract option, e.g. using the perl pattern:

<meta[^>]*NAME="description"[^>]*>

Replace with

@FullInputFilename: $0\r\n

Hi Simon,

Thanks for the (as usual) very speedy response on this. However, I'm still a little stuck.

I can get the sites downloaded using the WsInetTools filter (which unfortunately crashes Textpad after a while), but I don't fully understand your comment above about extracting the meta data.

I now have the websites downloaded into one long text file, and from what I understand of your post I should have a filter like this:
Input File: text file with downloaded webpages
1st Filter: Extract Lines matching perl pattern <meta[^>]*NAME="description"[^>]*>
1st Subfilter of 1st Filter: Replace perl pattern "<meta[^>]*NAME="description"[^>]*> " with " @FullInputFilename: $0\r\n"

I'm testing this on one long file of about 500 webpages and it only manages to extract two meta descriptions, yet there are hundreds of them in the file, such as:

<meta name="description" content="Bethany Home is a School and Training Centre for disabled children and adults.
    (could be because there's a new line at the end of this and no > ? )

<meta NAME="description" CONTENT="Conservation holiday volunteer work project, volunteering, wildlife and environmental volunteer. Freiwilligenarbeit, Abenteuerurlaub, Abenteuerreisen, Arbeitsurlaub, Naturreisen, Expeditionsreisen. ecotourisme, Planete Urgence, protection de l'environnement, ecovolontariat, écotourisme, écovolontariat, ecovolontaire, Des vacances studieuses, vacances actives, vacances de travail.">

<meta name="Description" content="Blue Ventures: through our projects and marine expeditions in Madagascar, Africa we enhance global marine conservation and research. Ideal for student gap year placements. Award winning, not-for-profit organisation" />

<meta name="description" content="BMS World Mission">

None of the above are saved in the output file. Can you point me in the right direction please?

Many thanks,

David

Post by **DataMystic Support** » Fri Aug 10, 2007 9:56 am

Hi David,

TextPipe can accept URLs in the file list, so you can dispense with the WsInetTools filter altogether.

However, given that you already have the data in one file, JUST use a single replace perl pattern filter with the text I described. The Extract Lines restriction you added forces it to match only META tags that fit on one line.
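The effect of the line restriction can be reproduced in a small Python sketch. This is only an illustration under assumptions: the sample page text is made up, the lower-case `name=` with case-insensitive matching is my choice, and the per-line loop merely imitates what restricting to lines does.

```python
import re

PATTERN = re.compile(r'<meta[^>]*name="description"[^>]*>', re.IGNORECASE)

# Made-up sample: the first tag wraps onto a second line, the second does not.
page = (
    '<meta name="description" content="Example page one, with a description\n'
    'that wraps onto a second line.">\n'
    '<meta name="description" content="Example page two">\n'
)

# Restricting to lines first: each line is searched on its own, so a tag
# whose closing '>' sits on a later line can never match.
per_line = [m.group(0) for line in page.splitlines() for m in PATTERN.finditer(line)]

# Searching the whole text at once: [^>]* is free to cross line breaks.
whole = [m.group(0) for m in PATTERN.finditer(page)]

print(len(per_line), len(whole))
```

The per-line search finds only the tag that sits on a single line, while the whole-text search finds both, which matches the symptom of only a few descriptions being extracted.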