Extracting text from HTML, replacing with random codes
Moderators: DataMystic Support, Moderators
-
- Posts: 7
- Joined: Tue Apr 14, 2009 1:20 am
Extracting text from HTML, replacing with random codes
I'm wondering if TextPipe can do this. I want to extract every line of text from an HTML file, replace each one with a short (5 characters max) random or sequential code, and output every code + text line to a separate text file.
So, I have an HTML file like this:
<.....> text line 1 </.....>
<.....> text line 2 </.....>
<.....> text line 3 </.....>
And I want to end with:
<.....> [code1] </.....>
<.....> [code2] </.....>
<.....> [code3] </.....>
Plus output to a text file or clipboard the following:
[code1]text line 1
[code2]text line 2
[code3]text line 3
My goal is to send this text file to a translator (who works only in MS Word), and then reverse the process to insert the translated text back into the proper place. Can I do this somehow with TextPipe?
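For readers who want to see the requested transformation outside TextPipe, here is a minimal sketch in Python. It is not a TextPipe filter; the names `TEXT_BETWEEN_TAGS` and `extract_text` are made up for illustration, and it assumes sequential 5-digit codes (which cannot collide) and that the text to extract sits between a `>` and the next `<`.

```python
import re

# Text between a '>' and the next '<'; whitespace-only runs are skipped.
TEXT_BETWEEN_TAGS = re.compile(r">([^<>]*[^<>\s][^<>]*)<")

def extract_text(html):
    """Return (html_with_codes, [(code, text), ...]) using sequential codes."""
    pairs = []

    def swap(match):
        code = "[%05d]" % (len(pairs) + 1)  # 5-digit sequential code
        pairs.append((code, match.group(1)))
        return ">" + code + "<"

    return TEXT_BETWEEN_TAGS.sub(swap, html), pairs

sample = "<p>text line 1</p>\n<p>text line 2</p>\n"
coded, pairs = extract_text(sample)
print(coded)
for code, text in pairs:
    print(code + "\t" + text)  # one tab-delimited code/text pair per line
```

The first `print` shows the HTML with codes in place of the text; the loop prints the list that would go to the translator.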
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Extracting text from HTML, replacing with random codes
Yes, it's possible.
First use a Perl pattern to match the HTML text, e.g.
[^<>]*?
Use a subfilter to take this result and replace it with a random digit, but also send it to a new file.
The example filter script filter\replace filename with file contents.fll should be a good guide
-
- Posts: 7
- Joined: Tue Apr 14, 2009 1:20 am
Re: Extracting text from HTML, replacing with random codes
That pattern doesn't seem to work...
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Extracting text from HTML, replacing with random codes
The pattern is perfect:
[^<>]*?
Replace with
$0
-
- Posts: 7
- Joined: Tue Apr 14, 2009 1:20 am
Re: Extracting text from HTML, replacing with random codes
Either I'm doing something wrong, or [^<>]*? targets everything inside or outside <>.
If I make a "find pattern" for [^<>]*? and replace with $0, then add a subfilter replacing . with @randomdigit I get something like:
<4845>856202931309492836753331170<66489>
from
<font>You can type sample text in</font>
The objective here is:
<ignore>capture<ignore>
Here is something that does seem to work so far:
--Perl pattern [>(.+)<] with [$0]
|
+--Perl pattern [^(.+)$] with [[@randomdigit@@randomdigit@@randomdigit@@randomdigit@@randomdigit@]]
This does the following:
input: <font>You can type sample text in</font>
output: <font>[59473]</font>
Now the question is, how do I output a 2nd pure text file with:
[59473]You can type sample text in
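The behaviour described in this post can be reproduced outside TextPipe. The sketch below uses Python's `re` module, not TextPipe's engine, and uses `+` instead of `*?` so that empty matches do not clutter the result; both choices are assumptions made for illustration. It shows that a class excluding only `<` and `>` also matches tag names, while anchoring the pattern on `>` and `<` keeps only the text between tags.

```python
import re

html = "<font>You can type sample text in</font>"

# Any run of characters that are not angle brackets: tag names qualify too.
loose = re.findall(r"[^<>]+", html)

# Requiring '>' before and '<' after keeps only the text between tags.
anchored = re.findall(r">([^<>]+)<", html)

print(loose)     # ['font', 'You can type sample text in', '/font']
print(anchored)  # ['You can type sample text in']
```

This matches the output quoted above: `font` (4 characters) and `/font` (5 characters) were each replaced digit for digit.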
-
- Posts: 7
- Joined: Tue Apr 14, 2009 1:20 am
Re: Extracting text from HTML, replacing with random codes
Solved it!
At the end of
+--Perl pattern [^(.+)$] with [[@randomdigit@@randomdigit@@randomdigit@@randomdigit@@randomdigit@]]
I added a tab (\t) + $1 (text) + return (\r\n).
Then I output this (only tried output to clipboard for testing). The return at the end makes sure each code/text pair goes to a new line.
Then I added a new replace step to remove the tab+text+return after each random code, and I got what I wanted. The HTML file has 5 digit codes where each text line was, and a separate tab delimited list of code/text pairs is created. After testing with a couple files I made a few changes to the initial perl pattern, to make it ignore single returns, single space characters, and HTML tags within other HTML tags.
I used a "search/replace list" filter with a .tab file containing the code/text pairs to reverse the process, and I restored the HTML file to its original form (verified by MD5 hash).
I only saw one small annoyance so far. Some text lines start with &nbsp; (HTML code for space char). No biggie. But it would be great if the initial Perl pattern could be adjusted to make TextPipe ignore them (leave them in the HTML file, and not output them to the secondary file). Any suggestions?
That and if I could output the code/text list directly to MS Word format (which the translator guy prefers), would make this perfect for my needs!
|--Perl pattern [>([^ <>\r].+)<] with [$0]
| | [ ] Match case
| | [ ] Whole words only
| | [ ] Case sensitive replace
| | [ ] Prompt on replace
| | [ ] Skip prompt if identical
| | [ ] First only
| | [ ] Extract matches
| | Maximum text buffer size 4096
| | [ ] Maximum match (greedy)
| | [ ] Allow comments
| | [X] '.' matches newline
| | [ ] UTF-8 Support
| |
| +--Perl pattern [^(.+)$] with [[@randomdigit@@randomdigit@@randomdigit@@randomdigit@@randomdigit@]\t$1\r\n]
| | [ ] Match case
| | [ ] Whole words only
| | [ ] Case sensitive replace
| | [ ] Prompt on replace
| | [ ] Skip prompt if identical
| | [ ] First only
| | [ ] Extract matches
| | Maximum text buffer size 4096
| | [ ] Maximum match (greedy)
| | [ ] Allow comments
| | [X] '.' matches newline
| | [ ] UTF-8 Support
| |
| |--Output to clipboard
| |
| +--Perl pattern [(\[.*]).*\r\n] with [$1]
| [ ] Match case
| [ ] Whole words only
| [ ] Case sensitive replace
| [ ] Prompt on replace
| [ ] Skip prompt if identical
| [ ] First only
| [ ] Extract matches
| Maximum text buffer size 4096
| [ ] Maximum match (greedy)
| [ ] Allow comments
| [X] '.' matches newline
| [ ] UTF-8 Support
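The reverse step described in this post (a search/replace list driven by a tab-delimited file of code/text pairs) can be sketched in Python as follows. This is not TextPipe's implementation; `restore` is a hypothetical helper, and it assumes every code is unique and never occurs inside the text itself.

```python
def restore(coded_html, tab_text):
    """Replace each code in coded_html with the text paired to it in tab_text."""
    for line in tab_text.splitlines():
        if "\t" not in line:
            continue  # skip blank or malformed lines
        code, text = line.split("\t", 1)
        coded_html = coded_html.replace(code, text)
    return coded_html

coded = "<font>[59473]</font>"
table = "[59473]\tYou can type sample text in\r\n"
restored = restore(coded, table)
print(restored)  # <font>You can type sample text in</font>
```

Comparing a hash of the restored file with the original, as done above with MD5, is a reasonable way to confirm the round trip is lossless.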
-
- Posts: 7
- Joined: Tue Apr 14, 2009 1:20 am
Re: Extracting text from HTML, replacing with random codes
I revised the initial perl pattern to: >([^<\r][^<\r].*)<
This allows capturing text strings that start with a space, but not something like:
> <IMG...><
Neither < nor return chars are allowed as 1st or 2nd chars of the string.
I revised again to >( |&nbsp;|)([^<\r][^<\r].*|)( |&nbsp;|)(\r\n|)<. This avoids capturing a space or &nbsp; at the start or end of a text string, as well as a return/new line at the end.
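The same idea (leave leading and trailing padding in the HTML and extract only the text) can be sketched in Python. This is not a literal translation of the TextPipe pattern above: the groups are arranged differently, single-character strings are allowed, and the names `TEXT` and `encode` are hypothetical.

```python
import re

TEXT = re.compile(
    r">((?: |&nbsp;)*)"          # group 1: leading padding, left in the HTML
    r"((?!&nbsp;)[^<\s][^<]*?)"  # group 2: the text itself
    r"((?: |&nbsp;|\r\n)*)<"     # group 3: trailing padding, left in the HTML
)

def encode(html):
    """Swap each text run for a sequential code, keeping its padding in place."""
    pairs = []

    def swap(m):
        code = "[%05d]" % (len(pairs) + 1)
        pairs.append((code, m.group(2)))
        return ">" + m.group(1) + code + m.group(3) + "<"

    return TEXT.sub(swap, html), pairs

html = "<td>&nbsp;Hello world </td><td> <img src='x.png'></td>"
coded, pairs = encode(html)
print(coded)  # <td>&nbsp;[00001] </td><td> <img src='x.png'></td>
print(pairs)  # [('[00001]', 'Hello world')]
```

The second cell, which holds only a space followed by a tag, is left untouched, which is the case the revised pattern was meant to skip.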
-
- Posts: 7
- Joined: Tue Apr 14, 2009 1:20 am
Re: Extracting text from HTML, replacing with random codes
My question now is, could I run this with TextPipe Lite?
I'm only using "Find Perl pattern" and secondary output functions in the filter I described, but I need a search/replace list (with a tab-delimited text file) to reverse the process. Does TextPipe Lite have that? The Standard and Pro versions are too expensive for me...
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Extracting text from HTML, replacing with random codes
Hi Nelson,
The Lite version has the search/replace facility, but it doesn't have the secondary output filter. I don't think you need a secondary output filter; just sending results to a new folder should work.
-
- Posts: 7
- Joined: Tue Apr 14, 2009 1:20 am
Re: Extracting text from HTML, replacing with random codes
Hi Simon,
Remember that I need to output 2 files, the transformed HTML file + a text file with the extracted text. How do I do that without the secondary output?
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Extracting text from HTML, replacing with random codes
You're right, you can't. Only the Pro version supports the Secondary Output (and/or VBScript). A complete comparison chart is at:
http://www.datamystic.com/textpipe/pro_compare.html