How to remove lines that begin with EM DASH (U+2014)?

dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

How to remove lines that begin with EM DASH (U+2014)?

Post by dfhtextpipe »

Using TextPipe Standard, how can I remove all lines that begin with an EM DASH?

The text file to be processed is encoded as UTF-8 without BOM.

In Unicode, EM DASH is U+2014.

With the Remove Matching Lines filter, I have had no success matching this pattern.
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Post by DataMystic Support »

You should check the encoding of this character.

Using this filter, you can paste a UTF character into the Trial Run Input area (first ensuring that 'Treat Trial Input as Unicode' is checked) and see what its hex equivalent is:

|--Convert from UTF-16 to UTF-8
|   
|--Hex dump
|   
This shows an Em-dash pasted from MS Word as E1 90 A0.
So to find it with a search/replace, you need to use a non-unicode search replace with a find text of

\xE1\x90\xA0
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Remove matching lines not as versatile as the Replace filter

Post by dfhtextpipe »

The issue seems to be that the Perl matching options for the Remove Matching Lines filter are not available as a popup like they are for the Replace filter with Perl matching selected.

I later found a UTF8 supported solution (with 'greedy matching') using the Replace filter, with

Replace [^\x{2014}.*] by []
Even so, it would be neater if the same Perl options could be made available in the Remove [Non-]Matching Lines filters.
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Post by DataMystic Support »

As far as I know, if \x2014 works for you, then your file is UTF16, not UTF8.
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

It was definitely UTF8 (without BOM)

Post by dfhtextpipe »

Hi Simon,

Both SC Unipad and Notepad++ confirm that my file was encoded as UTF8 without BOM.

This is the main encoding that I use for work with Go Bible Creator.

Best regards,
David Haslam
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: How to remove lines that begin with EM DASH (U+2014)?

Post by DataMystic Support »

Hi David,

It seems my initial statement about an EM-DASH being \xE1\x90\xA0 in UTF-8 was wrong (I don't know what I was working with). I pasted this directly from MS Word into TextPipe's Trial Run Input area.
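For anyone wanting to cross-check the byte values outside TextPipe, here is a small Python sketch (Python is my choice here, it is not part of TextPipe). It shows that E1 90 A0 is the UTF-8 encoding of U+1420, which is what you get if the two UTF-16 bytes of U+2014 are read in the wrong byte order. That is only one possible explanation for the earlier result, not a confirmed one; the helper name `hex_bytes` is made up for this example.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

em_dash = "\u2014"

# UTF-16 little-endian stores U+2014 as the two bytes 14 20.
utf16_le = em_dash.encode("utf-16-le")

# Reading those same two bytes as big-endian gives a different code point.
swapped = utf16_le.decode("utf-16-be")

print(hex_bytes(utf16_le))                 # 14 20
print(f"U+{ord(swapped):04X}")             # U+1420
print(hex_bytes(swapped.encode("utf-8")))  # E1 90 A0
print(hex_bytes(em_dash.encode("utf-8")))  # E2 80 94
```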

I just pasted an EM DASH from MS Word into a Notepad file with an A and a B on either side (i.e. A-B, where the dash is the EM DASH), and saved this as UTF-8.

Using TextPipe to generate a hex dump of this file, I see
00000000 EF BB BF 41 E2 80 94 42 ...A...B

The EF BB BF is the UTF-8 Byte Order Mark (BOM).
Then we have 41 for 'A'
Then E2 80 94 for the EM-Dash
and then 42 for 'B'.
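The same byte layout can be reproduced outside TextPipe. The following is a minimal Python sketch, assuming Python's `utf-8-sig` codec stands in for Notepad's "UTF-8 with BOM" save; `hex_dump_line` is a made-up helper that only imitates the single-line dump format shown above.

```python
def hex_dump_line(data: bytes, offset: int = 0) -> str:
    """Render one hex-dump line: offset, hex bytes, printable ASCII column."""
    hex_part = " ".join(f"{b:02X}" for b in data)
    # Non-printable and non-ASCII bytes are shown as dots.
    ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in data)
    return f"{offset:08X} {hex_part} {ascii_part}"

# "utf-8-sig" prepends the BOM (EF BB BF) before the encoded text.
sample = "A\u2014B".encode("utf-8-sig")
print(hex_dump_line(sample))
# 00000000 EF BB BF 41 E2 80 94 42 ...A...B
```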

Using a perl search/replace with the UTF-8 option enabled, we can use

\x{2014}
as the matching criteria, and the matching engine replaces this with \xE2\x80\x94 internally.

However, if we specify a search term of \xE2\x80\x94 with the utf8 option ON, the matching engine will not find anything. If we use \xE2\x80\x94 with utf8 UNCHECKED, then it will find it as normal.
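The difference between the two modes can be illustrated with Python's `re` module. This is only an analogy, not TextPipe's engine: Python patterns on `str` work on decoded characters (roughly like the utf8 option ON), while patterns on `bytes` work on raw bytes (roughly like the option UNCHECKED). Python writes the code point escape as `\u2014` rather than `\x{2014}`.

```python
import re

raw = "A\u2014B".encode("utf-8")  # bytes 41 E2 80 94 42
text = raw.decode("utf-8")        # three characters: A, U+2014, B

# Character mode: the code point escape matches the decoded EM DASH.
char_match = re.search(r"\u2014", text)

# Character mode with byte-style escapes: the pattern now means the three
# characters U+00E2, U+0080, U+0094, which are not present in the text.
byte_escapes_on_text = re.search(r"\xE2\x80\x94", text)

# Byte mode: the same escapes match the raw UTF-8 byte sequence.
byte_match = re.search(rb"\xE2\x80\x94", raw)

print(char_match is not None)         # True
print(byte_escapes_on_text is None)   # True
print(byte_match.span())              # (1, 4)
```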

To remove lines beginning with an EM-Dash from a UTF-8 file, with the utf8 option UNCHECKED, we use the pattern:
^\xE2\x80\x94
With the utf8 option ON, the equivalent pattern is ^\x{2014}.
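As a rough illustration of what that byte-level pattern does, here is a Python sketch that drops every line starting with the bytes E2 80 94. It is not a TextPipe filter; the function name `remove_em_dash_lines` and the sample text are invented for the example.

```python
import re

def remove_em_dash_lines(data: bytes) -> bytes:
    """Drop every line whose first three bytes are E2 80 94 (UTF-8 EM DASH)."""
    # (?m) lets ^ match at the start of each line; [^\n]* consumes the rest
    # of the line and \n? removes its line ending if there is one.
    return re.sub(rb"(?m)^\xE2\x80\x94[^\n]*\n?", b"", data)

sample = (
    "keep this\n"
    "\u2014 drop this\n"
    "keep \u2014 this too\n"
    "\u2014 and drop this"
).encode("utf-8")

cleaned = remove_em_dash_lines(sample)
print(cleaned.decode("utf-8"))
```

Only lines that start with the EM DASH are removed; a line that merely contains one further along is left alone.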
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: How to remove lines that begin with EM DASH (U+2014)?

Post by dfhtextpipe »

Thanks Simon,

As I see, it was indeed more complicated than it seemed at first glance. The concept that TextPipe changes its search pattern to what gets processed internally by the search engine is something that I had not imagined previously. It would be useful if this was better described in the help file.

Best regards,
David Haslam
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: How to remove lines that begin with EM DASH (U+2014)?

Post by dfhtextpipe »

Simon,

Can the substance of your answers be added to the help file, please?

David