PCRE POSIX Character Class [[:punct:]] and non-Roman scripts
Posted: Mon Jan 14, 2019 7:29 pm
by dfhtextpipe
The PCRE POSIX character class [[:punct:]] does not find punctuation marks in any non-Roman script.
In UTF-8 mode, characters with values greater than 255 do not match any of the POSIX character classes.
So for example, it does not find these punctuation marks in a UTF-8 file containing Farsi (Persian) content:
U+060C ، 1,679 ARABIC COMMA
U+061B ؛ 50 ARABIC SEMICOLON
U+061F ؟ 156 ARABIC QUESTION MARK
cf. The same character class does find these in Notepad++.
TextPipe should be enhanced to make this and similar character classes have the full scope of Unicode.
e.g. [[:digit:]] should be extended to cover the digit characters of all non-Roman scripts, etc.
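To illustrate what a Unicode-aware [[:punct:]] and [[:digit:]] would need to match, here is a minimal Python sketch based on Unicode general categories. It is not TextPipe or PCRE code. The function names and the sample string are made up for illustration, and treating "punctuation" as general category P* is an assumption (the ASCII [[:punct:]] also includes symbols such as $ and +, which fall under S* in Unicode). As far as I know, PCRE documents a UCP option that makes the POSIX classes use Unicode properties, but whether TextPipe's regex component exposes it is something I cannot confirm.

```python
import unicodedata

def unicode_punct(text):
    """Return the characters whose general category is punctuation (P*)."""
    return [ch for ch in text if unicodedata.category(ch).startswith("P")]

def unicode_digits(text):
    """Return the characters whose general category is Nd (decimal digit)."""
    return [ch for ch in text if unicodedata.category(ch) == "Nd"]

# Hypothetical sample: ARABIC COMMA, ARABIC SEMICOLON, ARABIC QUESTION MARK,
# followed by three EXTENDED ARABIC-INDIC digits (written as escapes to
# avoid bidirectional display problems).
sample = "a\u060c b\u061b c\u061f \u06f1\u06f2\u06f3"

found_punct = unicode_punct(sample)
found_digits = unicode_digits(sample)
print([f"U+{ord(ch):04X}" for ch in found_punct])
print([f"U+{ord(ch):04X}" for ch in found_digits])
```

Unlike a byte-range test, the category lookup does not care whether the code point is above 255.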
Re: PCRE POSIX Character Class [[:punct:]] and non-Roman scripts
Posted: Mon May 06, 2019 1:47 pm
by DataMystic Support
We will check with the component developer.
Re: PCRE POSIX Character Class [[:punct:]] and non-Roman scripts
Posted: Fri Jun 07, 2019 3:48 am
by dfhtextpipe
Hi Simon,
Any progress to report?
Further suggestion:
Wouldn't it be cool if every Unicode Script could have a named POSIX expression based on the ISO 15924 code?
Surely, I cannot be the first person to suggest such a notion?
Thus [[:Arab:]] would cover all the Unicode code points in the Arabic script.
Thus [[:Beng:]] would cover all the Unicode code points in the Bengali script.
Thus [[:Cyrl:]] would cover all the Unicode code points in the Cyrillic script.
etc.
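A rough sketch of how such script-named classes could be bolted on as a pattern pre-processor, written in Python rather than against TextPipe's actual regex component. Everything here is hypothetical: the [[:Arab:]] syntax is only the proposal above, the PRIMARY_BLOCK table and expand_script_classes function are invented names, and each code is mapped to just the script's primary Unicode block, which is NOT the same as the full script (Arabic, for instance, also has supplement and presentation-form blocks). For what it's worth, PCRE built with Unicode property support already accepts script names in the form \p{Arabic}, \p{Bengali}, \p{Cyrillic}.

```python
import re

# Hypothetical and deliberately incomplete: ISO 15924 code -> primary block only.
PRIMARY_BLOCK = {
    "Arab": (0x0600, 0x06FF),
    "Beng": (0x0980, 0x09FF),
    "Cyrl": (0x0400, 0x04FF),
}

# ISO 15924 codes are one capital followed by three lowercase letters, so the
# existing lowercase POSIX names ([:punct:], [:digit:], ...) are left alone.
_TOKEN = re.compile(r"\[:([A-Z][a-z]{3}):\]")

def expand_script_classes(pattern):
    """Rewrite [:Code:] inside a bracket expression as a code point range."""
    def repl(match):
        code = match.group(1)
        if code not in PRIMARY_BLOCK:
            return match.group(0)  # unknown code: leave the token untouched
        lo, hi = PRIMARY_BLOCK[code]
        return "\\u%04X-\\u%04X" % (lo, hi)
    return _TOKEN.sub(repl, pattern)

arabic_run = re.compile(expand_script_classes("[[:Arab:]]+"))
print(expand_script_classes("[[:Arab:]]+"))
print(arabic_run.findall("abc \u0633\u0644\u0627\u0645 xyz"))
```

A real implementation would generate the ranges from the Unicode Scripts data file instead of hard-coding blocks.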
It would be even cooler if the existing POSIX expressions [[:lower:]] and [[:upper:]] worked for every bicameral Unicode script.
cf. This seems to be already the case within Notepad++.
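As a small illustration of what Unicode-aware [[:lower:]] and [[:upper:]] would mean for bicameral scripts, this Python sketch relies on the language's built-in Unicode case tests. The split_case helper and the sample letters are my own choices for the example, not anything from TextPipe or Notepad++.

```python
def split_case(text):
    """Split text into its uppercase and lowercase letters, per Unicode case data."""
    upper = [ch for ch in text if ch.isupper()]
    lower = [ch for ch in text if ch.islower()]
    return upper, lower

# Hypothetical sample: upper/lower pairs from three bicameral scripts.
# Cyrillic DE, Greek DELTA, Armenian AYB.
sample = "\u0414\u0434 \u0394\u03b4 \u0531\u0561"

upper, lower = split_case(sample)
print("upper:", [f"U+{ord(ch):04X}" for ch in upper])
print("lower:", [f"U+{ord(ch):04X}" for ch in lower])
```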
Script Name ISO 15924 Code Characters Unicode Version Notes
Common Zyyy 7,804 1.0 Characters that are common to two or more scripts
Inherited Zinh 569 1.0 Combining characters that inherit the script of the character they are applied to
Adlam Adlm 88 9.0
Ahom Ahom 58 8.0
Anatolian Hieroglyphs Hluw 583 8.0
Arabic Arab 1,281 1.0
Armenian Armn 95 1.0
Avestan Avst 61 5.2
Balinese Bali 121 5.0
Bamum Bamu 657 5.2
Bassa Vah Bass 36 7.0
Batak Batk 56 6.0
Bengali Beng 96 1.0
Bhaiksuki Bhks 97 9.0
Bopomofo Bopo 72 1.0
Brahmi Brah 109 6.0
Braille Brai 256 3.0 First defined as a script in 4.0
Buginese Bugi 30 4.1
Buhid Buhd 20 3.2
Canadian Aboriginal Cans 710 3.0
Carian Cari 49 5.1
Caucasian Albanian Aghb 53 7.0
Chakma Cakm 70 6.1
Cham Cham 83 5.1
Cherokee Cher 172 3.0
Coptic Copt 137 1.0 Disunified from Greek in 4.1
Cuneiform Xsux 1,234 5.0
Cypriot Cprt 55 4.0
Cyrillic Cyrl 443 1.0
Deseret Dsrt 80 3.1
Devanagari Deva 156 1.0
Dogra Dogr 60 11.0
Duployan Dupl 143 7.0
Egyptian Hieroglyphs Egyp 1,080 5.2
Elbasan Elba 40 7.0
Elymaic Elym 23 12.0
Ethiopic Ethi 495 3.0
Georgian Geor 173 1.0
Glagolitic Glag 132 4.1
Gothic Goth 27 3.1
Grantha Gran 85 7.0
Greek Grek 518 1.0
Gujarati Gujr 91 1.0
Gunjala Gondi Gong 63 11.0
Gurmukhi Guru 80 1.0
Han Hani 89,233 1.0
Hangul Hang 11,739 1.0
Hanifi Rohingya Rohg 50 11.0
Hanunoo Hano 21 3.2
Hatran Hatr 26 8.0
Hebrew Hebr 134 1.0
Hiragana Hira 379 1.0
Imperial Aramaic Armi 31 5.2
Inscriptional Pahlavi Phli 27 5.2
Inscriptional Parthian Prti 30 5.2
Javanese Java 90 5.2
Kaithi Kthi 67 5.2
Kannada Knda 89 1.0
Katakana Kana 304 1.0
Kayah Li Kali 47 5.1
Kharoshthi Khar 68 4.1
Khmer Khmr 146 3.0
Khojki Khoj 62 7.0
Khudawadi Sind 69 7.0
Lao Laoo 82 1.0
Latin Latn 1,366 1.0
Lepcha Lepc 74 5.1
Limbu Limb 68 4.0
Linear A Lina 341 7.0
Linear B Linb 211 4.0
Lisu Lisu 48 5.2
Lycian Lyci 29 5.1
Lydian Lydi 27 5.1
Mahajani Mahj 39 7.0
Makasar Maka 25 11.0
Malayalam Mlym 117 1.0
Mandaic Mand 29 6.0
Manichaean Mani 51 7.0
Marchen Marc 68 9.0
Masaram Gondi Gonm 75 10.0
Medefaidrin Medf 91 11.0
Meetei Mayek Mtei 79 5.2
Mende Kikakui Mend 213 7.0
Meroitic Cursive Merc 90 6.1
Meroitic Hieroglyphs Mero 32 6.1
Miao Plrd 149 6.1
Modi Modi 79 7.0
Mongolian Mong 167 3.0
Mro Mroo 43 7.0
Multani Mult 38 8.0
Myanmar Mymr 223 3.0
Nabataean Nbat 40 7.0
Nandinagari Nand 65 12.0
New Tai Lue Talu 83 4.1
Newa Newa 94 9.0
N'Ko Nkoo 62 5.0
Nushu Nshu 397 10.0
Nyiakeng Puachue Hmong Hmnp 71 12.0
Ogham Ogam 29 3.0
Ol Chiki Olck 48 5.1
Old Hungarian Hung 108 8.0
Old Italic Ital 39 3.1
Old North Arabian Narb 32 7.0
Old Permic Perm 43 7.0
Old Persian Xpeo 50 4.1
Old Sogdian Sogo 40 11.0
Old South Arabian Sarb 32 5.2
Old Turkic Orkh 73 5.2
Oriya Orya 90 1.0
Osage Osge 72 9.0
Osmanya Osma 40 4.0
Pahawh Hmong Hmng 127 7.0
Palmyrene Palm 32 7.0
Pau Cin Hau Pauc 57 7.0
Phags-pa Phag 56 5.0
Phoenician Phnx 29 5.0
Psalter Pahlavi Phlp 29 7.0
Rejang Rjng 37 5.1
Runic Runr 86 3.0
Samaritan Samr 61 5.2
Saurashtra Saur 82 5.1
Sharada Shrd 94 6.1
Shavian Shaw 48 4.0
Siddham Sidd 92 7.0
SignWriting Sgnw 672 8.0
Sinhala Sinh 110 3.0
Sogdian Sogd 42 11.0
Sora Sompeng Sora 35 6.1
Soyombo Soyo 83 10.0
Sundanese Sund 72 5.1
Syloti Nagri Sylo 44 4.1
Syriac Syrc 88 3.0
Tagalog Tglg 20 3.2
Tagbanwa Tagb 18 3.2
Tai Le Tale 35 4.0
Tai Tham Lana 127 5.2
Tai Viet Tavt 72 5.2
Takri Takr 67 6.1
Tamil Taml 123 1.0
Tangut Tang 6,892 9.0
Telugu Telu 98 1.0
Thaana Thaa 50 3.0
Thai Thai 86 1.0
Tibetan Tibt 207 1.0 Removed in 1.1 and reintroduced in 2.0
Tifinagh Tfng 59 4.1
Tirhuta Tirh 82 7.0
Ugaritic Ugar 31 4.0
Vai Vaii 300 5.1
Wancho Wcho 59 12.0
Warang Citi Wara 84 7.0
Yi Yiii 1,220 3.0
Zanabazar Square Zanb 72 10.0
Unknown Zzzz 976,119 1.0 Private use characters, as well as reserved, non-character and surrogate code points
Best regards,
David
Re: PCRE POSIX Character Class [[:punct:]] and non-Roman scripts
Posted: Fri Jun 07, 2019 6:37 am
by DataMystic Support
Sorry - this is reliant on the other change you asked me about.
Re: PCRE POSIX Character Class [[:punct:]] and non-Roman scripts
Posted: Mon Mar 02, 2020 9:39 pm
by dfhtextpipe
Was that the update to Unicode 12.1?
Now that it's in place with TextPipe v11.x, might the features suggested in this thread be more feasible?
David
Re: PCRE POSIX Character Class [[:punct:]] and non-Roman scripts
Posted: Wed Mar 04, 2020 4:26 pm
by DataMystic Support
Yes, but it's also a question of whether we want to write a custom extension of the existing regex component and maintain that from here on.