Suggestion: Provide Unicode to Perl Escape Codes filter
Moderators: DataMystic Support, Moderators
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Suggestion: Provide Unicode to Perl Escape Codes filter
If you have a lot of non-ASCII text that you want to format as the search items in an external replace list file,
it would be very useful to be able to prepare that list using TextPipe as the tool.
This would require a new filter called Unicode to Perl Escape Codes.
Likewise for the replace column, a filter called Unicode to UTF-8 Byte Codes would be useful too.
Adding both of these to the Filter library would be a really good enhancement.
Best regards,
David
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Suggestion: Provide Unicode to Perl Escape Codes filter
Hi David, this sounds like a filter list that could be prepared, rather than an internal filter.
If you design one we're happy to include it in the TextPipe distribution.
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Suggestion: Provide Unicode to Perl Escape Codes filter
Hi Simon,
A filter list would be somewhat unwieldy!
It would need to have 143859 entries - one for each named character in Unicode 13.0.
Given that the outputs are eminently calculable, it would be simpler (and surely much less effort) to design two suitable internal filters.
And using formulae for the task would ensure that the filters would be intrinsically extended whenever a new version of Unicode is announced.
David
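To illustrate why the outputs are calculable, here is a minimal Python sketch. It is not TextPipe code, and the function names are made up for this example. Both forms are derived directly from each character, so no lookup table is needed and new Unicode versions are covered automatically.

```python
def perl_escape(ch: str) -> str:
    """Build a \\x{...} escape from the character's code point (at least 4 hex digits)."""
    return "\\x{%04X}" % ord(ch)


def utf8_byte_codes(ch: str) -> str:
    """Build a run of \\xNN codes, one per byte of the character's UTF-8 encoding."""
    return "".join("\\x%02X" % b for b in ch.encode("utf-8"))


if __name__ == "__main__":
    # U+0100 (in the BMP) and U+10000 (beyond the BMP)
    for ch in ("\u0100", "\U00010000"):
        print(perl_escape(ch), utf8_byte_codes(ch))
```

This sketch always pads to four digits and always uses braces; it does not yet apply any special handling for code points up to FF or for plain ASCII.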
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Suggestion: Provide Unicode to Perl Escape Codes filter
Hi David,
Can't this already be done?
Just use a restriction to match each line in turn, or the replace column:
For the Unicode to Perl Escape Codes
1. Hex dump
2. Add left margin of \x{
3. Add right margin of }
For the Unicode to UTF-8 Byte Codes
1. Hex dump
2. Search for 2 hex bytes, and add a \x before them
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Suggestion: Provide Unicode to Perl Escape Codes filter
Hi Simon,
For codepoints 00-FF (in the ANSI range), the leading 00 should be omitted, e.g. \x{00A0} becomes simply \x{A0}, because these are single byte characters.
Many UTF-8 byte codes are more than 2 bytes! e.g. ﾀ is \xEF\xBE\x80
And beyond the BMP, each Unicode character requires 4 bytes.
So yes, but it would be nice if all this could be done under the hood by means of internal filters!
David
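The variable byte counts mentioned above can be checked with a short Python sketch. The function name `utf8_length` is hypothetical; the thresholds are the standard UTF-8 range boundaries, and the loop compares the formula against Python's own encoder.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses for a code point, from the standard range boundaries."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4  # beyond the BMP


if __name__ == "__main__":
    # ASCII letter, U+0100, halfwidth katakana U+FF80, and U+10000
    for cp in (0x41, 0x100, 0xFF80, 0x10000):
        encoded = chr(cp).encode("utf-8")
        codes = "".join("\\x%02X" % b for b in encoded)
        print("U+%04X" % cp, utf8_length(cp), len(encoded), codes)
```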
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Suggestion: Provide Unicode to Perl Escape Codes filter
Let's take Unicode to UTF-8 Byte Codes.
If we assume the Unicode is in UTF-8 form already (if not, just use a filter to convert it), then all you need to do is
1. Restrict to the text (a line or column)
2. Hex dump
3. Use a pattern match to match 2 hex characters at a time, and add \x{ } around each pair. If the text is handled a byte at a time, it doesn't matter how many bytes are in a sequence. And provided the text is converted to UTF-8 at the start, the byte sequence will be correct.
Does that make sense?
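The byte-at-a-time idea can be imitated outside TextPipe with a rough Python sketch. The two function names are hypothetical, `hex_dump` is only a stand-in (TextPipe's real hex dump output format may differ), and the sketch emits the plain \xNN form rather than a braced form.

```python
import re


def hex_dump(text: str) -> str:
    """Stand-in for a hex dump step: the UTF-8 bytes as contiguous uppercase hex digits."""
    return text.encode("utf-8").hex().upper()


def prefix_hex_pairs(hex_text: str) -> str:
    """Match two hex digits at a time and put \\x in front of each pair."""
    return re.sub(r"[0-9A-F]{2}", lambda m: "\\x" + m.group(0), hex_text)


if __name__ == "__main__":
    # One 3-byte character (U+FF80): the pairs come out in the right order
    print(prefix_hex_pairs(hex_dump("\uFF80")))
```

Because each pair is handled independently, the length of the UTF-8 sequence never has to be known in advance. This covers only the UTF-8 byte code output, not the code-point-based \x{...} form.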
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Suggestion: Provide Unicode to Perl Escape Codes filter
No - that makes no sense to me!
Perl Escape Codes for Unicode characters have a variable number of bytes, depending upon how each character is represented in UTF-8.
Squishing everything to 2 bytes would not work for their use in external character replacement list files (such as tab delimited), as described when I added this topic.
David
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Suggestion: Provide Unicode to Perl Escape Codes filter
I am not suggesting that everything gets squished to 2 bytes.
Please see sample filter attached.
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Suggestion: Provide Unicode to Perl Escape Codes filter
Dear Simon,
The outputs are not proper Perl Escape codes that would be recognised by a PCRE replace filter.
For codepoints beyond \xFF there should be {braces} around each character code!
e.g.
Code: Select all
ðñòóôõö÷øùúûüýþÿ
ĀāĂăĄąĆćĈĉĊċČčĎď
𐀀𐀁𐀂𐀃𐀄𐀅𐀆𐀇𐀈𐀉𐀊𐀋𐀍𐀎𐀏
should convert to
Code: Select all
\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF
\x{0100}\x{0101}\x{0102}\x{0103}\x{0104}\x{0105}\x{0106}\x{0107}\x{0108}\x{0109}\x{010A}\x{010B}\x{010C}\x{010D}\x{010E}\x{010F}
\x{10000}\x{10001}\x{10002}\x{10003}\x{10004}\x{10005}\x{10006}\x{10007}\x{10008}\x{10009}\x{1000A}\x{1000B}\x{1000C}\x{1000D}\x{1000E}\x{1000F}
David
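The conversion rule in the example above can be written as a short Python sketch. The function names are hypothetical and this is only one reading of the rule: code points up to FF get a bare two-digit \xNN, everything higher gets braces.

```python
def pcre_escape(ch: str) -> str:
    """Escape one character: bare \\xNN up to U+00FF, braced \\x{...} above that."""
    cp = ord(ch)
    if cp <= 0xFF:
        return "\\x%02X" % cp
    return "\\x{%04X}" % cp


def pcre_escape_text(text: str) -> str:
    """Escape every character of the text in turn."""
    return "".join(pcre_escape(ch) for ch in text)


if __name__ == "__main__":
    rows = [
        "".join(chr(c) for c in range(0xF0, 0x100)),      # U+00F0..U+00FF
        "".join(chr(c) for c in range(0x100, 0x110)),     # U+0100..U+010F
        "".join(chr(c) for c in range(0x10000, 0x10010)), # U+10000..U+1000F
    ]
    for row in rows:
        print(pcre_escape_text(row))
```

Note that this version escapes ASCII characters too, which a real replace list would usually not want.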
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Suggestion: Provide Unicode to Perl Escape Codes filter
Does PCRE need to care if a character is two bytes or not? Can't it just match the sequence? Isn't it an "all or nothing" literal comparison?
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Suggestion: Provide Unicode to Perl Escape Codes filter
As I use PCRE Perl Escape codes in a lot of my replace lists for TextPipe, I think I know what I'm talking about.
Also, btw, most ASCII characters never need to be converted to Perl Escape Codes for use in replace lists.
Those should also be excluded if you do implement the suggestion as an internal TextPipe filter.
Currently, I use BabelPad to convert text to Perl Escape Codes (for search) & to UTF-8 byte codes (for replace).
Even so, I still need to simplify the Escape codes for codepoints in the range \x81 to \xFF,
i.e. by removing the {00 and }.
David
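Putting these points together, one line of a tab-delimited replace list could be produced along the following lines. This is a Python sketch under stated assumptions, not BabelPad's or TextPipe's behaviour: all names are hypothetical, every ASCII character is passed through unchanged (regex metacharacters are not handled), and the search column uses code point escapes while the replace column uses UTF-8 byte codes.

```python
def search_code(text: str) -> str:
    """Search column: ASCII kept as-is, \\xNN up to U+00FF, \\x{...} above that."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)  # assumption: ASCII never needs converting
        elif cp <= 0xFF:
            out.append("\\x%02X" % cp)
        else:
            out.append("\\x{%04X}" % cp)
    return "".join(out)


def replace_code(text: str) -> str:
    """Replace column: ASCII kept as-is, other characters as UTF-8 \\xNN byte codes."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend("\\x%02X" % b for b in ch.encode("utf-8"))
    return "".join(out)


def list_line(search: str, replace: str) -> str:
    """One tab-delimited line: search code, tab, replace code."""
    return search_code(search) + "\t" + replace_code(replace)


if __name__ == "__main__":
    print(list_line("caf\u00E9", "caf\u00E9"))
```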