Extracting text from HTML, replacing with random codes

Posted: Tue Apr 14, 2009 1:50 am
by nelsoncruz
I'm wondering if TextPipe can do this. I want to extract every line of text from an HTML file, replacing each with a short (5 char max) random or sequential code, and output every code + text line to a separate text file.

So, I have an HTML file like this:

<.....> text line 1 </.....>
<.....> text line 2 </.....>
<.....> text line 3 </.....>

And I want to end with:

<.....> [code1] </.....>
<.....> [code2] </.....>
<.....> [code3] </.....>

Plus output to a text file or clipboard the following:
[code1]text line 1
[code2]text line 2
[code3]text line 3

My goal is to send this text file to a translator (who works only in MS Word), and then reverse the process to insert the translated text back into the proper place. Can I do this somehow with TextPipe?

Re: Extracting text from HTML, replacing with random codes

Posted: Sat Apr 18, 2009 7:59 pm
by DataMystic Support
Yes, it's possible.

First use a Perl pattern to match the HTML text, e.g.

[^<>]*?

Use a subfilter to take this result and replace it with a random digit, but also send it to a new file.

The example filter script filter\replace filename with file contents.fll should be a good guide

Re: Extracting text from HTML, replacing with random codes

Posted: Sun Apr 19, 2009 4:50 am
by nelsoncruz
That pattern doesn't seem to work...

Re: Extracting text from HTML, replacing with random codes

Posted: Sun Apr 19, 2009 9:15 pm
by DataMystic Support
The pattern is perfect:

[^<>]*?
Replace with

$0

Re: Extracting text from HTML, replacing with random codes

Posted: Mon Apr 20, 2009 3:25 am
by nelsoncruz
Either I'm doing something wrong, or [^<>]*? targets everything inside or outside <>.

If I make a "find pattern" for [^<>]*? and replace with $0, then add a subfilter replacing . with @randomdigit, I get something like:
<4845>856202931309492836753331170<66489>
from
<font>You can type sample text in</font>

The objective here is:
<ignore>capture<ignore>
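
To see why the bare character class hits tag names as well as text, here is a minimal sketch in Python rather than TextPipe. Python's re module is only a stand-in for TextPipe's Perl pattern engine (an assumption, the two may differ in details), and + is used in place of *? because a lazy *? on its own only produces empty matches in Python.

```python
import re

html = "<font>You can type sample text in</font>"

# "Anything that is not < or >" is true of tag names too,
# so the class alone cannot tell markup from text.
everything = re.findall(r"[^<>]+", html)
print(everything)

# Requiring a '>' before and a '<' after restricts the match
# to the text sitting between two tags.
text_only = re.findall(r">([^<>]+)<", html)
print(text_only)
```

The first call returns the tag names along with the text, which matches the scrambled <4845>...<66489> result above. The second returns only the text between the tags.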

Here is something that does seem to work so far:
--Perl pattern [>(.+)<] with [$0]
|
+--Perl pattern [^(.+)$] with [[@randomdigit@@randomdigit@@randomdigit@@randomdigit@@randomdigit@]]

This does the following:
input: <font>You can type sample text in</font>
output: <font>[59473]</font>

Now the question is, how do I output a 2nd pure text file with:
[59473]You can type sample text in

Re: Extracting text from HTML, replacing with random codes

Posted: Mon Apr 20, 2009 5:29 am
by nelsoncruz
Solved it! :D

At the end of
+--Perl pattern [^(.+)$] with [[@randomdigit@@randomdigit@@randomdigit@@randomdigit@@randomdigit@]]
I added a tab (\t) + $1 (text) + return (\r\n).

Then I output this (only tried output to clipboard for testing). The return at the end makes sure each code/text pair goes to a new line.

Then I added a new replace step to remove the tab+text+return after each random code, and I got what I wanted. The HTML file has 5-digit codes where each text line was, and a separate tab-delimited list of code/text pairs is created.

After testing with a couple of files I made a few changes to the initial Perl pattern, to make it ignore single returns, single space characters, and HTML tags within other HTML tags.

I used a "search/replace list" filter with a .tab file with the code/text pairs to reverse the process, and I restored the HTML file to original form (verified by MD5 hash :wink:).
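
For anyone who wants to try the same round trip outside TextPipe, here is a rough Python sketch of the idea, not the filter itself. The names TEXT, extract and restore are made up for illustration, the text pattern is a simplified one-line version of the Perl pattern, and sequential codes are swapped in for @randomdigit@ because five random digits can repeat within one file.

```python
import re

# Text between a closing '>' and the next '<', on a single line
# (a simplification of the TextPipe pattern, assumed for this sketch).
TEXT = re.compile(r"(?<=>)[^<>\r\n]+(?=<)")

def extract(html):
    """Replace each text run with a bracketed code; return (coded_html, pairs)."""
    pairs = []

    def to_code(match):
        text = match.group(0)
        if not text.strip():                 # leave whitespace-only runs alone
            return text
        code = "[%05d]" % (len(pairs) + 1)   # sequential, so codes never collide
        pairs.append((code, text))
        return code

    return TEXT.sub(to_code, html), pairs

def restore(coded_html, pairs):
    """Reverse extract(): put each code's (possibly translated) text back."""
    lookup = dict(pairs)
    return re.sub(r"\[\d{5}\]",
                  lambda m: lookup.get(m.group(0), m.group(0)),
                  coded_html)

html = "<p>text line 1</p>\n<p>text line 2</p>\n"
coded, pairs = extract(html)

# Tab-delimited code/text list, one pair per line, as sent to the translator.
tab_list = "".join(f"{code}\t{text}\r\n" for code, text in pairs)

print(coded)
print(tab_list)
print(restore(coded, pairs) == html)
```

Passing restore() a list with the same codes but translated text gives the translated HTML. With the untouched pairs it reproduces the input exactly, which is the same check as the MD5 comparison above.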

I've only seen one small annoyance so far. Some text lines start with &nbsp; (the HTML code for a space char). No biggie. But it would be great if the initial Perl pattern could be adjusted to make TextPipe ignore them (leave them in the HTML file, and not output them to the secondary file). Any suggestions?

That, plus being able to output the code/text list directly in MS Word format (which the translator prefers), would make this perfect for my needs!

|--Perl pattern [>([^ <>\r].+)<] with [$0]
|  |  [ ] Match case
|  |  [ ] Whole words only
|  |  [ ] Case sensitive replace
|  |  [ ] Prompt on replace
|  |  [ ] Skip prompt if identical
|  |  [ ] First only
|  |  [ ] Extract matches
|  |  Maximum text buffer size 4096
|  |  [ ] Maximum match (greedy)
|  |  [ ] Allow comments
|  |  [X] '.' matches newline
|  |  [ ] UTF-8 Support
|  |
|  +--Perl pattern [^(.+)$] with [[@randomdigit@@randomdigit@@randomdigit@@randomdigit@@randomdigit@]\t$1\r\n]
|     |  [ ] Match case
|     |  [ ] Whole words only
|     |  [ ] Case sensitive replace
|     |  [ ] Prompt on replace
|     |  [ ] Skip prompt if identical
|     |  [ ] First only
|     |  [ ] Extract matches
|     |  Maximum text buffer size 4096
|     |  [ ] Maximum match (greedy)
|     |  [ ] Allow comments
|     |  [X] '.' matches newline
|     |  [ ] UTF-8 Support
|     |
|     |--Output to clipboard
|     |   
|     +--Perl pattern [(\[.*]).*\r\n] with [$1]
|           [ ] Match case
|           [ ] Whole words only
|           [ ] Case sensitive replace
|           [ ] Prompt on replace
|           [ ] Skip prompt if identical
|           [ ] First only
|           [ ] Extract matches
|           Maximum text buffer size 4096
|           [ ] Maximum match (greedy)
|           [ ] Allow comments
|           [X] '.' matches newline
|           [ ] UTF-8 Support

Re: Extracting text from HTML, replacing with random codes

Posted: Mon Apr 20, 2009 6:50 am
by nelsoncruz
I revised the initial Perl pattern to: >([^<\r][^<\r].*)<

This allows capturing text strings that start with a space, but not something like:
> <IMG...><

Neither < nor return chars are allowed as 1st or 2nd chars of the string.

I revised again to >(&nbsp;| |)([^<\r][^<\r].*|)(&nbsp;| |)(\r\n|)<. This avoids capture of empty spaces or &nbsp; at the start or end of a text string, as well as a return/new line at the end.
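
A quick way to check what each group of that last pattern captures is to run it in Python. This is only a sketch: Python's re module is assumed to behave like TextPipe's Perl pattern engine here, .* is written as the lazy .*? to mirror the unticked "Maximum match (greedy)" option, and split_padding is a made-up helper name.

```python
import re

# Groups: 1 = leading &nbsp; or space, 2 = the text to translate,
#         3 = trailing &nbsp; or space, 4 = trailing return.
PATTERN = re.compile(r">(&nbsp;| |)([^<\r][^<\r].*?|)(&nbsp;| |)(\r\n|)<")

def split_padding(html):
    """Return the four groups of the first match, or None if nothing matches."""
    m = PATTERN.search(html)
    return m.groups() if m else None

# Padding ends up in groups 1 and 3, so only group 2 needs a code.
print(split_padding("<td>&nbsp;Hello world </td>"))

# A lone space before a nested tag leaves group 2 empty, so it can be skipped.
print(split_padding('<td> <IMG src="x"></td>'))
```

Only group 2 would go to the code/text list. Groups 1, 3 and 4 stay in the HTML file around the code.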

Re: Extracting text from HTML, replacing with random codes

Posted: Tue Apr 21, 2009 6:28 am
by nelsoncruz
My question now is, could I run this stuff with TextPipe Lite?

I'm only using "Find Perl pattern" and secondary output functions in the filter I described, but I need a search/replace list (with a tab-delimited text file) to reverse the process. Does TextPipe Lite have that? The Standard and Pro versions are too expensive for me...

Re: Extracting text from HTML, replacing with random codes

Posted: Tue Apr 21, 2009 2:50 pm
by DataMystic Support
Hi Nelson,

The Lite version has the search/replace facility, but it doesn't have the secondary output filter. I don't think you need a secondary output filter; just sending results to a new folder should work.

Re: Extracting text from HTML, replacing with random codes

Posted: Wed Apr 22, 2009 7:50 pm
by nelsoncruz
Hi Simon,

Remember that I need to output 2 files, the transformed HTML file + a text file with the extracted text. How do I do that without the secondary output?

Re: Extracting text from HTML, replacing with random codes

Posted: Thu Apr 23, 2009 7:54 am
by DataMystic Support
You're right - you can't. Only the Pro version supports the Secondary Output (and/or VBScript). A complete comparison chart is at
http://www.datamystic.com/textpipe/pro_compare.html