How to remove lines that begin with EM DASH (U+2014)?
Moderators: DataMystic Support, Moderators
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
How to remove lines that begin with EM DASH (U+2014)?
Using TextPipe Standard, how can I remove all lines that begin with an EM DASH?
The text file to be processed is encoded as UTF-8 without BOM.
In Unicode, EM DASH is U+2014.
With the Remove Matching Lines filter, I have had no success matching this pattern.
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
You should check the encoding of this character.
Using this filter, you can paste a UTF character into the Trial Run Input area (first ensuring that 'Treat Trial Input as Unicode' is checked) and see what its hex equivalent is:
Code: Select all
|--Convert from UTF-16 to UTF-8
|
|--Hex dump
|
This shows an Em-dash pasted from MS Word as E1 90 A0.
So to find it with a search/replace, you need to use a non-unicode search/replace with a find text of
Code: Select all
\xE1\x90\xA0
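As a cross-check outside TextPipe, the same lookup can be sketched in Python. This is only an illustration and not a TextPipe feature; `utf8_hex` is a hypothetical helper name. Note that for U+2014 it reports E2 80 94, while the bytes E1 90 A0 decode to U+1420, so the value quoted above may belong to a different character.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated hex byte pairs."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))


em_dash = "\u2014"  # EM DASH, written as an escape so the source stays ASCII

print(utf8_hex(em_dash))              # E2 80 94
print(utf8_hex("A" + em_dash + "B"))  # 41 E2 80 94 42

# Going the other way: which code point do the bytes E1 90 A0 encode?
decoded = bytes.fromhex("E1 90 A0").decode("utf-8")
print(f"U+{ord(decoded):04X}")        # U+1420
```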
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Remove matching lines not as versatile as the Replace filter
The issue seems to be that the Perl matching options for the Remove Matching Lines filter are not available as a popup like they are for the Replace filter with Perl matching selected.
I later found a UTF8 supported solution (with 'greedy matching') using the Replace filter, with
Code: Select all
Replace [^\x{2014}.*] by []
Even so, it would be neater if the same Perl options could be made available in the Remove [Non-]Matching Lines filters.
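For comparison, here is a minimal Python sketch of the same idea working on decoded text. It assumes the square brackets in the filter summary only delimit the find and replace text, so the pattern itself is `^\x{2014}.*`. The names `EM_DASH_LINE` and `remove_em_dash_lines` are hypothetical, and unlike the pattern above this version also consumes the line ending so that no empty line is left behind.

```python
import re

# A whole line that starts with U+2014, together with its line ending
# (or the end of the text if the last line has no trailing newline).
EM_DASH_LINE = re.compile(r"^\u2014.*(?:\r?\n|\Z)", re.MULTILINE)


def remove_em_dash_lines(text: str) -> str:
    """Drop every line of text whose first character is an EM DASH."""
    return EM_DASH_LINE.sub("", text)


sample = "keep 1\n\u2014 drop me\nkeep \u2014 2\n\u2014last"
cleaned = remove_em_dash_lines(sample)
print(repr(cleaned))
```

A file saved as UTF-8 without BOM could be read with `open(path, encoding="utf-8")` before calling the function; an EM DASH that appears later in a line is left alone.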
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
It was definitely UTF8 (without BOM)
Hi Simon,
Both SC Unipad and Notepad++ confirm that my file was encoded as UTF8 without BOM.
This is the main encoding that I use for work with Go Bible Creator.
Best regards,
David Haslam
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: How to remove lines that begin with EM DASH (U+2014)?
Hi David,
It seems my initial statement about an EM-DASH being \xE1\x90\xA0 in UTF-8 was wrong (I don't know what I was working with). I pasted this directly from MS Word into TextPipe's Trial Run Input area.
I just pasted an EM DASH from MS-Word into a Notepad file, and saved this as UTF-8 with an A and a B on either side of it (A, EM DASH, B).
Using TextPipe to generate a hex dump of this file, I see
00000000 EF BB BF 41 E2 80 94 42 ...A...B
The EF BB BF is the UTF-8 Byte Order Mark (BOM).
Then we have 41 for 'A'
Then E2 80 94 for the EM-Dash
and then 42 for 'B'.
Using a Perl search/replace with the UTF-8 option enabled, we can use
Code: Select all
\x{2014}
as the matching criteria, and the matching engine replaces this with \xE2\x80\x94 internally.
However, if we specify a search term of \xE2\x80\x94 with the UTF-8 option ON, the matching engine will not find anything. If we use \xE2\x80\x94 with the UTF-8 option UNCHECKED, then it will find it as normal.
To remove lines beginning with an EM-Dash from a UTF-8 file with the UTF-8 option unchecked, we use the pattern:
^\xE2\x80\x94
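The difference between the two modes can be illustrated with a loose Python analogy. This is a sketch only and not a description of TextPipe's internals: a `str` pattern stands in for matching with the UTF-8 option on, and a `bytes` pattern for matching with it off. The helper name `remove_em_dash_lines_bytes` is hypothetical.

```python
import re

# The raw bytes as they would sit in a UTF-8 file without BOM.
data = "A\u2014B\n\u2014 drop\nkeep\n".encode("utf-8")
text = data.decode("utf-8")

# Text mode: the code point pattern finds the EM DASH.
found_by_code_point = re.search("\u2014", text) is not None

# Text mode with byte escapes: this pattern means the three code points
# U+00E2, U+0080 and U+0094, which do not occur in the text.
found_by_byte_escapes = re.search("\xE2\x80\x94", text) is not None

# Byte mode: the same escapes now match the raw UTF-8 bytes.
found_in_bytes = re.search(rb"\xE2\x80\x94", data) is not None


def remove_em_dash_lines_bytes(raw: bytes) -> bytes:
    """Drop lines whose first three bytes are E2 80 94 (EM DASH in UTF-8)."""
    return re.sub(rb"^\xE2\x80\x94[^\n]*\n?", b"", raw, flags=re.MULTILINE)


print(found_by_code_point, found_by_byte_escapes, found_in_bytes)
print(remove_em_dash_lines_bytes(data))
```

In this sketch, a file that starts with the BOM bytes EF BB BF would not have its first line removed, because that line then begins with the BOM rather than with E2 80 94.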
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: How to remove lines that begin with EM DASH (U+2014)?
Thanks Simon,
As I see, it was indeed more complicated than it seemed at first glance. The concept that TextPipe changes its search pattern to what gets processed internally by the search engine is something that I had not imagined previously. It would be useful if this was better described in the help file.
Best regards,
David Haslam
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: How to remove lines that begin with EM DASH (U+2014)?
Simon,
Can the substance of your answers be added to the help file, please?
David
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact: