How to remove lines that begin with EM DASH (U+2014)?

Posted: Mon Mar 24, 2008 2:09 am
by dfhtextpipe
Using TextPipe Standard, how can I remove all lines that begin with an EM DASH ?

The text file to be processed is encoded as UTF-8 without a BOM.

In Unicode, EM DASH is U+2014.

With the Remove Matching Lines filter, I have had no success matching this pattern.

Posted: Tue Mar 25, 2008 7:38 am
by DataMystic Support
You should check the encoding of this character.

Using this filter, you can paste a Unicode character into the Trial Run Input area (first ensuring that 'Treat Trial Input as Unicode' is checked) and see its hex equivalent:

Code:

|--Convert from UTF-16 to UTF-8
|   
|--Hex dump
|   
This shows an Em-dash pasted from MS Word as E1 90 A0.
So to find it with a search/replace, you need to use a non-Unicode search/replace with a find text of

Code:

\xE1\x90\xA0

Remove matching lines not as versatile as the Replace filter

Posted: Tue Mar 25, 2008 6:48 pm
by dfhtextpipe
The issue seems to be that the Perl matching options for the Remove Matching Lines filter are not available as a popup like they are for the Replace filter with Perl matching selected.

I later found a solution that supports UTF-8 (with 'greedy matching') using the Replace filter:

Code:

Replace [^\x{2014}.*] by []
Even so, it would be neater if the same Perl options could be made available in the Remove [Non-]Matching Lines filters.

Posted: Tue Mar 25, 2008 7:07 pm
by DataMystic Support
As far as I know, if \x2014 works for you, then your file is UTF16, not UTF8.

It was definitely UTF8 (without BOM)

Posted: Wed Mar 26, 2008 2:20 am
by dfhtextpipe
Hi Simon,

Both SC Unipad and Notepad++ confirm that my file was encoded as UTF8 without BOM.

This is the main encoding that I use for work with Go Bible Creator.

Best regards,
David Haslam

Re: How to remove lines that begin with EM DASH (U+2014)?

Posted: Mon Apr 21, 2008 2:41 pm
by DataMystic Support
Hi David,

It seems my initial statement about an EM-DASH being \xE1\x90\xA0 in UTF-8 was wrong (I don't know what I was working with). I pasted this directly from MS Word into TextPipe's Trial Run Input area.

I just pasted an EM DASH from MS Word into a Notepad file with an A and a B on either side (i.e. the three characters A, EM DASH, B), and saved this as UTF-8.

Using TextPipe to generate a hex dump of this file, I see
00000000 EF BB BF 41 E2 80 94 42 ...A...B

The EF BB BF is the UTF-8 Byte Order Mark (BOM).
Then we have 41 for 'A'
Then E2 80 94 for the EM-Dash
and then 42 for 'B'.
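
The byte breakdown above can be cross-checked outside TextPipe. The following is only a sketch in Python (not a TextPipe feature); the helper name hex_bytes is made up for illustration. It encodes the same three characters and prints the bytes in hex, with and without a BOM, plus the UTF-16 form of the EM DASH for comparison.

```python
# Sketch: reproduce the hex dump for "A", EM DASH (U+2014), "B".
text = "A\u2014B"


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs (hypothetical helper)."""
    return " ".join(f"{b:02X}" for b in data)


# "utf-8-sig" prepends the UTF-8 BOM, as Notepad did in the dump above.
with_bom = hex_bytes(text.encode("utf-8-sig"))
# Plain "utf-8" writes no BOM, matching a "UTF-8 without BOM" file.
without_bom = hex_bytes(text.encode("utf-8"))
# In UTF-16 (little-endian) the EM DASH alone is just two bytes.
utf16_le = hex_bytes("\u2014".encode("utf-16-le"))

print(with_bom)     # EF BB BF 41 E2 80 94 42
print(without_bom)  # 41 E2 80 94 42
print(utf16_le)     # 14 20
```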

Using a perl search/replace with the UTF-8 option enabled, we can use

Code:

\x{2014}
as the matching criterion, and the matching engine replaces this with \xE2\x80\x94 internally.

However, if we specify a search term of \xE2\x80\x94 with the utf8 option ON, the matching engine will not find anything. If we use \xE2\x80\x94 with utf8 UNCHECKED, then it will find it as normal.
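
The same distinction between code-point matching and byte matching shows up in other regex engines. As a rough analogy only (this is Python's re module, not TextPipe's engine, and the variable names are made up for illustration), a code-point pattern works against decoded text, a byte pattern works against raw bytes, and mixing the two finds nothing:

```python
import re

raw = "A\u2014B".encode("utf-8")  # the bytes as stored in a UTF-8 file
decoded = raw.decode("utf-8")     # the same data as Unicode text

# Code-point pattern against decoded text: comparable to \x{2014} with UTF-8 on.
unicode_hit = re.search("\u2014", decoded) is not None

# Byte pattern against raw bytes: comparable to \xE2\x80\x94 with UTF-8 off.
byte_hit = re.search(rb"\xE2\x80\x94", raw) is not None

# Byte-style escapes in a text pattern mean the characters U+00E2, U+0080
# and U+0094, which do not occur in the decoded text, so nothing matches.
mixed_hit = re.search(r"\xE2\x80\x94", decoded) is not None

print(unicode_hit, byte_hit, mixed_hit)  # True True False
```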

To remove lines beginning with an EM DASH from a UTF-8 file, we use the following pattern with the UTF-8 option unchecked:
^\xE2\x80\x94
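
For anyone who wants to check the result independently of TextPipe, here is a minimal sketch in Python that applies the same byte-level pattern line by line. The function name remove_em_dash_lines and the sample text are assumptions made up for this example.

```python
import re

# Byte-level pattern: the UTF-8 encoding of EM DASH at the start of a line.
EM_DASH_AT_START = re.compile(rb"^\xE2\x80\x94")


def remove_em_dash_lines(data: bytes) -> bytes:
    """Drop every line whose first bytes are the UTF-8 EM DASH (hypothetical helper)."""
    kept = [
        line
        for line in data.splitlines(keepends=True)  # keep original line endings
        if not EM_DASH_AT_START.match(line)
    ]
    return b"".join(kept)


sample = "keep me\n\u2014 drop me\nkeep \u2014 too\n".encode("utf-8")
result = remove_em_dash_lines(sample)
print(result.decode("utf-8"))
```

Only the line that starts with the EM DASH is dropped; a line containing one elsewhere is kept. If the file begins with a UTF-8 BOM, the first line starts with EF BB BF rather than E2 80 94, so an EM DASH on line 1 would not be matched by this sketch.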

Re: How to remove lines that begin with EM DASH (U+2014)?

Posted: Wed Apr 23, 2008 5:42 pm
by dfhtextpipe
Thanks Simon,

I see now that it was indeed more complicated than it seemed at first glance. The idea that TextPipe changes the search pattern into what the search engine processes internally is something that I had not imagined. It would be useful if this were better described in the help file.

Best regards,
David Haslam

Re: How to remove lines that begin with EM DASH (U+2014)?

Posted: Sat Oct 16, 2010 5:19 am
by dfhtextpipe
Simon,

Can the substance of your answers be added to the help file, please?

David

Re: How to remove lines that begin with EM DASH (U+2014)?

Posted: Mon Oct 18, 2010 7:45 am
by DataMystic Support
Done.