Extracting text from HTML, replacing with random codes

Post by **nelsoncruz** » Tue Apr 14, 2009 1:50 am

I'm wondering if TextPipe can do this. I want to extract every line of text from an HTML file, replace each with a short (5 characters max) random or sequential code, and output every code + text line to a separate text file.

So, I have an HTML file like this:

<.....> text line 1 </.....>
<.....> text line 2 </.....>
<.....> text line 3 </.....>

And I want to end with:

<.....> [code1] </.....>
<.....> [code2] </.....>
<.....> [code3] </.....>

Plus output to a text file or clipboard the following:
[code1]text line 1
[code2]text line 2
[code3]text line 3

My goal is to send this text file to a translator (who works only in MS Word), and then reverse the process to insert the translated text back into its proper place. Can I do this somehow with TextPipe?

Post by **DataMystic Support** » Sat Apr 18, 2009 7:59 pm

Yes, it's possible.

First use a Perl pattern to match the HTML text, e.g.

[^<>]*?

Use a subfilter to take this result and replace it with random digits, and also send it to a new file.

The example filter script filter\replace filename with file contents.fll should be a good guide.

Post by **nelsoncruz** » Sun Apr 19, 2009 4:50 am

That pattern doesn't seem to work...

Post by **DataMystic Support** » Sun Apr 19, 2009 9:15 pm

The pattern is perfect:

[^<>]*?

Replace with

$0

Post by **nelsoncruz** » Mon Apr 20, 2009 3:25 am

Either I'm doing something wrong, or [^<>]*? targets everything inside or outside <>.

If I make a "find pattern" for [^<>]*? and replace with $0, then add a subfilter replacing . with @randomdigit I get something like:
<4845>856202931309492836753331170<66489>
from
You can type sample text in

The objective here is:
<ignore>capture<ignore>
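The behaviour described above can be reproduced outside TextPipe. The following is a minimal Python sketch, not a TextPipe filter; the `<p>` tags around the sample text are an assumption, since the thread shows the sample without its markup. It illustrates that a bare character class such as `[^<>]` also matches tag names, while anchoring on `>` and `<` captures only the text between tags.

```python
import re

# Assumed sample: the thread's text wrapped in hypothetical <p> tags.
sample = "<p>You can type sample text in</p>"

# A bare character class also matches the tag names, because "p" and "/p"
# contain no angle brackets either.
print(re.findall(r"[^<>]+", sample))
# ['p', 'You can type sample text in', '/p']

# Anchoring on the ">" that closes one tag and the "<" that opens the next
# captures only the text between tags.
print(re.findall(r">([^<>]+)<", sample))
# ['You can type sample text in']
```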

Here is something that does seem to work so far:
--Perl pattern [>(.+)<] with [$0]
|
+--Perl pattern [^(.+)$] with [[@randomdigit@@randomdigit@@randomdigit@@randomdigit@@randomdigit@]]

This does the following:
input: You can type sample text in
output: [59473]

Now the question is, how do I output a 2nd pure text file with:
[59473]You can type sample text in

Post by **nelsoncruz** » Mon Apr 20, 2009 5:29 am

Solved it!

At the end of
+--Perl pattern [^(.+)$] with [[@randomdigit@@randomdigit@@randomdigit@@randomdigit@@randomdigit@]]
I added a tab (\t) + $1 (text) + return (\r\n).

Then I output this (only tried output to clipboard for testing). The return at the end makes sure each code/text pair goes to a new line.

Then I added a new replace step to remove the tab+text+return after each random code, and I got what I wanted. The HTML file has 5-digit codes where each text line was, and a separate tab-delimited list of code/text pairs is created. After testing with a couple of files I made a few changes to the initial Perl pattern, to make it ignore single returns, single space characters, and HTML tags within other HTML tags.
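For readers who want to try the same idea without TextPipe, the steps can be sketched in Python. This is an illustration under stated assumptions, not the filter itself: the names `TEXT_RUN`, `encode` and `to_tab_file` are hypothetical, text runs are assumed to sit on a single line, and the sketch redraws a code when it collides with an earlier one, which five independent random digits do not guarantee on their own.

```python
import random
import re

# Text between ">" and "<", at least two characters, on a single line
# (an assumption of this sketch; multi-line text runs are left untouched).
TEXT_RUN = re.compile(r"(?<=>)[^<>\r\n]{2,}(?=<)")

def encode(html, rng=None):
    """Return (coded_html, pairs); pairs is a list of (code, text) tuples."""
    rng = rng or random.Random()
    pairs, used = [], set()

    def substitute(match):
        text = match.group(0)
        if not text.strip():
            return text  # leave whitespace-only runs in place
        code = "[%05d]" % rng.randrange(100000)
        while code in used:  # redraw on collision so each code is unique
            code = "[%05d]" % rng.randrange(100000)
        used.add(code)
        pairs.append((code, text))
        return code

    return TEXT_RUN.sub(substitute, html), pairs

def to_tab_file(pairs):
    """Format the pairs as tab-delimited lines ending in CRLF."""
    return "".join(f"{code}\t{text}\r\n" for code, text in pairs)

coded, pairs = encode("<p>You can type sample text in</p>")
print(coded)               # the HTML with a [ddddd] code in place of the text
print(to_tab_file(pairs))  # one "code<TAB>text" line per extracted string
```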

I used a "search/replace list" filter with a .tab file containing the code/text pairs to reverse the process, and I restored the HTML file to its original form (verified by MD5 hash).
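The reverse step amounts to a lookup from code to text. A minimal Python sketch follows, assuming codes of the form `[ddddd]` and a tab-delimited list with one pair per line; the function name `decode` is hypothetical and this is not how TextPipe's search/replace list is implemented.

```python
import re

def decode(coded_html, tab_text):
    """Replace each [ddddd] code with the text listed for it in tab_text."""
    mapping = {}
    for line in tab_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        code, text = line.split("\t", 1)
        mapping[code] = text
    # A single pass over the HTML: codes without an entry are left as they are,
    # and restored text is never scanned again for codes.
    return re.sub(r"\[\d{5}\]",
                  lambda m: mapping.get(m.group(0), m.group(0)),
                  coded_html)

tab = "[12345]\tHello world\r\n[67890]\tSecond line\r\n"
print(decode("<p>[12345]</p><p>[67890]</p>", tab))
# <p>Hello world</p><p>Second line</p>
```

Doing the replacement in one pass, rather than one search/replace per pair, avoids accidentally rewriting translated text that happens to contain something that looks like a code.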

I only saw one small annoyance so far. Some text lines start with &nbsp; (HTML code for a space character). No biggie. But it would be great if the initial Perl pattern could be adjusted to make TextPipe ignore them (leave them in the HTML file, and not output them to the secondary file). Any suggestions?

That, plus being able to output the code/text list directly to MS Word format (which the translator prefers), would make this perfect for my needs!

|--Perl pattern [>([^ <>\r].+)<] with [$0]
|  |  [ ] Match case
|  |  [ ] Whole words only
|  |  [ ] Case sensitive replace
|  |  [ ] Prompt on replace
|  |  [ ] Skip prompt if identical
|  |  [ ] First only
|  |  [ ] Extract matches
|  |  Maximum text buffer size 4096
|  |  [ ] Maximum match (greedy)
|  |  [ ] Allow comments
|  |  [X] '.' matches newline
|  |  [ ] UTF-8 Support
|  |
|  +--Perl pattern [^(.+)$] with [[@randomdigit@@randomdigit@@randomdigit@@randomdigit@@randomdigit@]\t$1\r\n]
|     |  [ ] Match case
|     |  [ ] Whole words only
|     |  [ ] Case sensitive replace
|     |  [ ] Prompt on replace
|     |  [ ] Skip prompt if identical
|     |  [ ] First only
|     |  [ ] Extract matches
|     |  Maximum text buffer size 4096
|     |  [ ] Maximum match (greedy)
|     |  [ ] Allow comments
|     |  [X] '.' matches newline
|     |  [ ] UTF-8 Support
|     |
|     |--Output to clipboard
|     |   
|     +--Perl pattern [(\[.*]).*\r\n] with [$1]
|           [ ] Match case
|           [ ] Whole words only
|           [ ] Case sensitive replace
|           [ ] Prompt on replace
|           [ ] Skip prompt if identical
|           [ ] First only
|           [ ] Extract matches
|           Maximum text buffer size 4096
|           [ ] Maximum match (greedy)
|           [ ] Allow comments
|           [X] '.' matches newline
|           [ ] UTF-8 Support

Post by **nelsoncruz** » Mon Apr 20, 2009 6:50 am

I revised the initial perl pattern to: >([^<\r][^<\r].*)<

This allows capturing text strings that start with a space, but not something like:
> <IMG...><

Neither < nor return chars are allowed as 1st or 2nd chars of the string.

I revised again to >(&nbsp;| |)([^<\r][^<\r].*|)(&nbsp;| |)(\r\n|)<. This avoids capturing spaces or &nbsp; at the start or end of a text string, as well as a return/new line at the end.
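The same idea, keeping leading and trailing padding in the HTML and extracting only the text in between, can be sketched in Python. This is a variant written for Python's `re` module, not a translation of the TextPipe pattern above: the name `PADDED_TEXT` is hypothetical, padding is taken to be any mix of the `&nbsp;` entity and whitespace, and unlike the two-character rule it also accepts single-character text.

```python
import re

PADDED_TEXT = re.compile(
    r"(?<=>)"                  # right after the ">" that closes a tag
    r"((?:&nbsp;|\s)*)"        # group 1: leading padding, kept in the HTML
    r"((?!&nbsp;|\s)[^<>]+?)"  # group 2: the text, not starting with padding
    r"((?:&nbsp;|\s)*)"        # group 3: trailing padding, kept in the HTML
    r"(?=<)"                   # right before the "<" that opens the next tag
)

html = "<td>&nbsp;Price list &nbsp;\r\n</td><td>&nbsp;</td>"

# Only the text itself is extracted; the padding-only cell is skipped.
print([m.group(2) for m in PADDED_TEXT.finditer(html)])
# ['Price list']

# Replacing group 2 alone leaves the padding where it was.
coded = PADDED_TEXT.sub(lambda m: m.group(1) + "[code]" + m.group(3), html)
print(repr(coded))
# '<td>&nbsp;[code] &nbsp;\r\n</td><td>&nbsp;</td>'
```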

Post by **nelsoncruz** » Tue Apr 21, 2009 6:28 am

My question now is, could I run this stuff with TextPipe Lite?

I'm only using "Find Perl pattern" and secondary output functions in the filter I described, but I need a search/replace list (with a tab-delimited text file) to reverse the process. Does TextPipe Lite have that? The Standard and Pro versions are too expensive for me...

Post by **DataMystic Support** » Tue Apr 21, 2009 2:50 pm

Hi Nelson,

The Lite version has the search/replace facility, but it doesn't have the secondary output filter. I don't think you need a secondary output filter; just sending results to a new folder should work.

Post by **nelsoncruz** » Wed Apr 22, 2009 7:50 pm

Hi Simon,

Remember that I need to output 2 files, the transformed HTML file + a text file with the extracted text. How do I do that without the secondary output?

Post by **DataMystic Support** » Thu Apr 23, 2009 7:54 am

You're right, you can't. Only the Pro version supports the Secondary Output (and/or VBScript). A complete reference chart is at http://www.datamystic.com/textpipe/pro_compare.html
