Extracting text from HTML, replacing with random codes

Get help with installation and running here.

Moderators: DataMystic Support, Moderators

nelsoncruz
Posts: 7
Joined: Tue Apr 14, 2009 1:20 am

Extracting text from HTML, replacing with random codes

Post by nelsoncruz »

I'm wondering if TextPipe can do this. I want to extract every line of text from an HTML file, replacing each with a short (5 char max) random or sequential code, and output every code + text line to a separate text file.

So, I have an HTML file like this:

<.....> text line 1 </.....>
<.....> text line 2 </.....>
<.....> text line 3 </.....>

And I want to end with:

<.....> [code1] </.....>
<.....> [code2] </.....>
<.....> [code3] </.....>

Plus output to a text file or clipboard the following:
[code1]text line 1
[code2]text line 2
[code3]text line 3

My goal is to send this text file to a translator (who works only in MS Word), and then reverse the process to insert the translated text back into its proper place. Can I do this somehow with TextPipe?
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Extracting text from HTML, replacing with random codes

Post by DataMystic Support »

Yes, it's possible.

First use a Perl pattern to match the HTML text, e.g.

[^<>]*?

Use a subfilter to take this result and replace it with a random digit, but also send it to a new file.

The example filter script filter\replace filename with file contents.fll should be a good guide.
nelsoncruz
Posts: 7
Joined: Tue Apr 14, 2009 1:20 am

Re: Extracting text from HTML, replacing with random codes

Post by nelsoncruz »

That pattern doesn't seem to work...
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Extracting text from HTML, replacing with random codes

Post by DataMystic Support »

The pattern is perfect:

[^<>]*?

Replace with:

$0
nelsoncruz
Posts: 7
Joined: Tue Apr 14, 2009 1:20 am

Re: Extracting text from HTML, replacing with random codes

Post by nelsoncruz »

Either I'm doing something wrong, or [^<>]*? targets everything inside or outside <>.

If I make a "find pattern" for [^<>]*? and replace with $0, then add a subfilter replacing . with @randomdigit I get something like:
<4845>856202931309492836753331170<66489>
from
<font>You can type sample text in</font>

The objective here is:
<ignore>capture<ignore>
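
The behaviour described above can be reproduced outside TextPipe. The sketch below uses Python's re module as a stand-in, which is an assumption: TextPipe's engine may differ in details. It also uses [^<>]+ rather than the lazy [^<>]*?, because in Python the lazy form on its own only yields empty matches.

```python
import re

html = "<font>You can type sample text in</font>"

# The class [^<>] only excludes the angle brackets themselves,
# so tag names between '<' and '>' match just as well as the text.
loose = re.findall(r"[^<>]+", html)

# Requiring a '>' before and a '<' after keeps only text between tags.
between = re.findall(r">([^<>]+)<", html)

print(loose)    # ['font', 'You can type sample text in', '/font']
print(between)  # ['You can type sample text in']
```

This is why replacing every matched character with a random digit also scrambles "font" and "/font": the pattern needs the surrounding > and < as context to tell text from tag names.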

Here is something that does seem to work so far:
--Perl pattern [>(.+)<] with [$0]
|
+--Perl pattern [^(.+)$] with [[@randomdigit@@randomdigit@@randomdigit@@randomdigit@@randomdigit@]]

This does the following:
input: <font>You can type sample text in</font>
output: <font>[59473]</font>

Now the question is, how do I output a 2nd pure text file with:
[59473]You can type sample text in
nelsoncruz
Posts: 7
Joined: Tue Apr 14, 2009 1:20 am

Re: Extracting text from HTML, replacing with random codes

Post by nelsoncruz »

Solved it! :D

At the end of
+--Perl pattern [^(.+)$] with [[@randomdigit@@randomdigit@@randomdigit@@randomdigit@@randomdigit@]]
I added a tab (\t) + $1 (text) + return (\r\n).

Then I output this (only tried output to clipboard for testing). The return at the end makes sure each code/text pair goes to a new line.

Then I added a new replace step to remove the tab+text+return after each random code, and I got what I wanted. The HTML file has 5-digit codes where each text line was, and a separate tab-delimited list of code/text pairs is created. After testing with a couple of files I made a few changes to the initial Perl pattern, to make it ignore single returns, single space characters, and HTML tags within other HTML tags.
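
For readers without TextPipe, the forward step described above can be approximated in Python. This is only a sketch under assumptions: the function name extract is made up for illustration, the pattern is a simplified cousin of the one in this thread ([^<]* instead of a non-greedy .+), and the codes are sequential rather than random so the output is reproducible.

```python
import re

def extract(html):
    """Swap each text run between tags for a 5-digit code.

    Returns the coded HTML and a tab-delimited code/text list.
    """
    pairs = []

    def swap(match):
        # Sequential code, zero-padded to 5 digits (an assumption;
        # the thread uses five random digits instead).
        code = f"[{len(pairs) + 1:05d}]"
        pairs.append(f"{code}\t{match.group(1)}\r\n")
        return f">{code}<"

    # Text must follow '>' and must not start with a space, '<', '>'
    # or a carriage return; it runs up to the next '<'.
    coded = re.sub(r">([^ <>\r][^<]*)<", swap, html)
    return coded, "".join(pairs)

sample = "<font>You can type sample text in</font>\r\n<b>Bold text</b>"
coded, pairs = extract(sample)
```

Sequential codes also sidestep the small chance that two random 5-digit codes come out identical, which would make the reverse step ambiguous.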

I used a "search/replace list" filter with a .tab file with the code/text pairs to reverse the process, and I restored the HTML file to original form (verified by MD5 hash :wink:).
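
The reverse step can be sketched the same way. This is not TextPipe's search/replace list filter, just a rough Python equivalent; the function name restore is hypothetical, and it assumes each code/text pair sits on a single line with the bracketed code before the tab.

```python
def restore(html, tab_text):
    """Put each text back where its bracketed code sits in the HTML."""
    for line in tab_text.splitlines():
        if not line:
            continue  # skip blank lines in the list
        code, text = line.split("\t", 1)
        # Replace only the first occurrence of each code.
        html = html.replace(code, text, 1)
    return html

coded_html = "<font>[59473]</font><b>[10482]</b>"
pair_list = "[59473]\tYou can type sample text in\r\n[10482]\tBold text\r\n"
restored = restore(coded_html, pair_list)
print(restored)
```

Comparing a hash of the restored file with the original, as done above with MD5, is a reasonable check that the round trip lost nothing.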

I only saw one small annoyance so far. Some text lines start with &nbsp; (HTML code for a non-breaking space). No biggie. But it would be great if the initial Perl pattern could be adjusted to make TextPipe ignore them (leave them in the HTML file, and not output them to the secondary file). Any suggestions?

That, and being able to output the code/text list directly to MS Word format (which the translator prefers), would make this perfect for my needs!

|--Perl pattern [>([^ <>\r].+)<] with [$0]
|  |  [ ] Match case
|  |  [ ] Whole words only
|  |  [ ] Case sensitive replace
|  |  [ ] Prompt on replace
|  |  [ ] Skip prompt if identical
|  |  [ ] First only
|  |  [ ] Extract matches
|  |  Maximum text buffer size 4096
|  |  [ ] Maximum match (greedy)
|  |  [ ] Allow comments
|  |  [X] '.' matches newline
|  |  [ ] UTF-8 Support
|  |
|  +--Perl pattern [^(.+)$] with [[@randomdigit@@randomdigit@@randomdigit@@randomdigit@@randomdigit@]\t$1\r\n]
|     |  [ ] Match case
|     |  [ ] Whole words only
|     |  [ ] Case sensitive replace
|     |  [ ] Prompt on replace
|     |  [ ] Skip prompt if identical
|     |  [ ] First only
|     |  [ ] Extract matches
|     |  Maximum text buffer size 4096
|     |  [ ] Maximum match (greedy)
|     |  [ ] Allow comments
|     |  [X] '.' matches newline
|     |  [ ] UTF-8 Support
|     |
|     |--Output to clipboard
|     |   
|     +--Perl pattern [(\[.*]).*\r\n] with [$1]
|           [ ] Match case
|           [ ] Whole words only
|           [ ] Case sensitive replace
|           [ ] Prompt on replace
|           [ ] Skip prompt if identical
|           [ ] First only
|           [ ] Extract matches
|           Maximum text buffer size 4096
|           [ ] Maximum match (greedy)
|           [ ] Allow comments
|           [X] '.' matches newline
|           [ ] UTF-8 Support
nelsoncruz
Posts: 7
Joined: Tue Apr 14, 2009 1:20 am

Re: Extracting text from HTML, replacing with random codes

Post by nelsoncruz »

I revised the initial perl pattern to: >([^<\r][^<\r].*)<

This allows capturing text strings that start with a space, but not something like:
> <IMG...><

Neither < nor return chars are allowed as 1st or 2nd chars of the string.

I revised again to >(&nbsp;| |)([^<\r][^<\r].*|)(&nbsp;| |)(\r\n|)<. This avoids capture of empty spaces or &nbsp; at the start or end of a text string, as well as a return/new line at the end.
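
To see how the four groups of the final pattern divide a match, here is a small Python check. It is a sketch under assumptions: the helper name split_text is made up, .* is written as the lazy .*? (the filter has "Maximum match (greedy)" unchecked), and re.DOTALL stands in for the "'.' matches newline" option.

```python
import re

# Groups: leading padding, text to translate, trailing padding, line break.
PATTERN = re.compile(
    r">(&nbsp;| |)([^<\r][^<\r].*?|)(&nbsp;| |)(\r\n|)<",
    re.DOTALL,
)

def split_text(fragment):
    """Return the four captured groups for the first match, or None."""
    match = PATTERN.search(fragment)
    return match.groups() if match else None

padded = split_text(">&nbsp;Hello world&nbsp;<")
wrapped = split_text("<td> Price\r\n</td>")
blank = split_text("> <")
```

Only the second group would go to the translator's list; the padding and line break in the other groups stay behind in the HTML file.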
nelsoncruz
Posts: 7
Joined: Tue Apr 14, 2009 1:20 am

Re: Extracting text from HTML, replacing with random codes

Post by nelsoncruz »

My question now is, could I run this stuff with TextPipe Lite?

I'm only using "Find perl pattern" and secondary output functions in the filter I described, but I need a search/replace list (with a tab-delimited text file) to reverse the process. Does TextPipe Lite have that? The Standard and Pro versions are too expensive for me...
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Extracting text from HTML, replacing with random codes

Post by DataMystic Support »

Hi Nelson,

The Lite version has the search/replace facility, but it doesn't have the secondary output filter. I don't think you need a secondary output filter; just sending results to a new folder should work.
nelsoncruz
Posts: 7
Joined: Tue Apr 14, 2009 1:20 am

Re: Extracting text from HTML, replacing with random codes

Post by nelsoncruz »

Hi Simon,

Remember that I need to output 2 files, the transformed HTML file + a text file with the extracted text. How do I do that without the secondary output?
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Extracting text from HTML, replacing with random codes

Post by DataMystic Support »

You're right - you can't. Only the Pro version supports the Secondary Output (and/or VBScript). A complete reference chart is at:
http://www.datamystic.com/textpipe/pro_compare.html