PCRE POSIX Character Class [[:punct:]] and non-Roman scripts

dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Post by dfhtextpipe »

The PCRE POSIX Character Class [[:punct:]] does not find punctuation marks for any non-Roman scripts.
In UTF-8 mode, characters with values greater than 255 do not match any of the POSIX character classes.
So for example, it does not find these punctuation marks in a UTF-8 file containing Farsi (Persian) content:

Code point	Character	Count	Name
U+060C	،	1,679	ARABIC COMMA
U+061B	؛	50	ARABIC SEMICOLON
U+061F	؟	156	ARABIC QUESTION MARK
cf. The same character class does find these in Notepad++.

TextPipe should be enhanced to make this and similar character classes have the full scope of Unicode.

e.g. [[:digit:]] should be extended to cover the digit characters of all non-Roman scripts, etc.
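
For illustration, here is a minimal Python sketch of what Unicode-wide versions of these two classes might match. It is not TextPipe or PCRE code; it assumes that "punctuation" means the Unicode general categories beginning with P and that "digit" means category Nd. The helper names and the sample string are made up for the example.

```python
import unicodedata

def is_unicode_punct(ch: str) -> bool:
    # General categories Pc, Pd, Ps, Pe, Pi, Pf and Po all start with "P".
    return unicodedata.category(ch).startswith("P")

def is_unicode_digit(ch: str) -> bool:
    # Nd = decimal digit, in any script.
    return unicodedata.category(ch) == "Nd"

# Made-up sample: the three Arabic marks listed above, plus the
# Extended Arabic-Indic digits U+06F1..U+06F3 used in Persian.
sample = "abc\u060C def\u061B ghi\u061F \u06F1\u06F2\u06F3"

punct = [f"U+{ord(c):04X}" for c in sample if is_unicode_punct(c)]
digits = [f"U+{ord(c):04X}" for c in sample if is_unicode_digit(c)]

print(punct)
print(digits)
```

One design point to settle: the ASCII [[:punct:]] also matches characters such as $ + < = > ^ ` | ~, which Unicode classifies as symbols (S categories) rather than punctuation, so a Unicode-wide class built on category P alone would not be a strict superset of the ASCII one.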
David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Post by DataMystic Support »

We will check with the component developer.

Post by dfhtextpipe »

Hi Simon,

Any progress to report?

Further suggestion:

Wouldn't it be cool if every Unicode Script could have a named POSIX expression based on the ISO 15924 code?
Surely, I cannot be the first person to suggest such a notion?

Thus [[:Arab:]] would cover all the Unicode code points in the Arabic script.
Thus [[:Beng:]] would cover all the Unicode code points in the Bengali script.
Thus [[:Cyrl:]] would cover all the Unicode code points in the Cyrillic script.
etc.
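
As a rough sketch of the idea (not an existing feature of TextPipe or PCRE), such named classes could be expanded into code point ranges before the pattern is compiled. The Python below is hypothetical throughout: the mapping table, the [:Xxxx:] syntax and the function name are assumptions, and each script is approximated by its basic Unicode block only, which is narrower than the real Script property.

```python
import re

# Hypothetical table: ISO 15924 code -> basic Unicode block range.
# A block is only an approximation of a script.
BASIC_BLOCKS = {
    "Arab": (0x0600, 0x06FF),  # Arabic block
    "Beng": (0x0980, 0x09FF),  # Bengali block
    "Cyrl": (0x0400, 0x04FF),  # Cyrillic block
}

# Matches the proposed [:Xxxx:] syntax with a four-letter script code.
NAMED_CLASS = re.compile(r"\[:([A-Z][a-z]{3}):\]")

def expand_script_classes(pattern: str) -> str:
    """Rewrite [:Arab:] etc. into an explicit \\uXXXX-\\uXXXX range."""
    def repl(m):
        code = m.group(1)
        if code not in BASIC_BLOCKS:
            return m.group(0)  # leave unknown codes untouched
        lo, hi = BASIC_BLOCKS[code]
        return f"\\u{lo:04X}-\\u{hi:04X}"
    return NAMED_CLASS.sub(repl, pattern)

expanded = expand_script_classes("[[:Cyrl:]]+")
words = re.findall(expanded, "Привет, world")

print(expanded)
print(words)
```

Existing POSIX class names such as [:punct:] do not fit the four-letter pattern, so the sketch leaves them alone.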

It would be even cooler if the existing POSIX expressions [[:lower:]] and [[:upper:]] worked for every bicameral Unicode script.
cf. This seems to be already the case within Notepad++.
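
To make the request concrete, here is a small Python sketch of case classes that work across bicameral scripts. It assumes [[:upper:]] and [[:lower:]] would map to the Unicode general categories Lu and Ll; the function name and sample text are invented for the example.

```python
import unicodedata

def split_case(text: str):
    """Return (uppercase letters, lowercase letters) by general category."""
    upper = [c for c in text if unicodedata.category(c) == "Lu"]
    lower = [c for c in text if unicodedata.category(c) == "Ll"]
    return upper, lower

# One capital/small pair each from Latin, Cyrillic, Greek and Armenian.
sample = "Aa \u0414\u0434 \u0394\u03B4 \u0531\u0561"
upper, lower = split_case(sample)

print(upper)
print(lower)
```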

Script Name	ISO 15924 Code	Characters	Unicode Version	Notes
Common	Zyyy	7,804	1.0	Characters that are common to two or more scripts
Inherited	Zinh	569	1.0	Combining characters that inherit the script of the character they are applied to
Adlam	Adlm	88	9.0	
Ahom	Ahom	58	8.0	
Anatolian Hieroglyphs	Hluw	583	8.0	
Arabic	Arab	1,281	1.0	
Armenian	Armn	95	1.0	
Avestan	Avst	61	5.2	
Balinese	Bali	121	5.0	
Bamum	Bamu	657	5.2	
Bassa Vah	Bass	36	7.0	
Batak	Batk	56	6.0	
Bengali	Beng	96	1.0	
Bhaiksuki	Bhks	97	9.0	
Bopomofo	Bopo	72	1.0	
Brahmi	Brah	109	6.0	
Braille	Brai	256	3.0	First defined as a script in 4.0
Buginese	Bugi	30	4.1	
Buhid	Buhd	20	3.2	
Canadian Aboriginal	Cans	710	3.0	
Carian	Cari	49	5.1	
Caucasian Albanian	Aghb	53	7.0	
Chakma	Cakm	70	6.1	
Cham	Cham	83	5.1	
Cherokee	Cher	172	3.0	
Coptic	Copt	137	1.0	Disunified from Greek in 4.1
Cuneiform	Xsux	1,234	5.0	
Cypriot	Cprt	55	4.0	
Cyrillic	Cyrl	443	1.0	
Deseret	Dsrt	80	3.1	
Devanagari	Deva	156	1.0	
Dogra	Dogr	60	11.0	
Duployan	Dupl	143	7.0	
Egyptian Hieroglyphs	Egyp	1,080	5.2	
Elbasan	Elba	40	7.0	
Elymaic	Elym	23	12.0	
Ethiopic	Ethi	495	3.0	
Georgian	Geor	173	1.0	
Glagolitic	Glag	132	4.1	
Gothic	Goth	27	3.1	
Grantha	Gran	85	7.0	
Greek	Grek	518	1.0	
Gujarati	Gujr	91	1.0	
Gunjala Gondi	Gong	63	11.0	
Gurmukhi	Guru	80	1.0	
Han	Hani	89,233	1.0	
Hangul	Hang	11,739	1.0	
Hanifi Rohingya	Rohg	50	11.0	
Hanunoo	Hano	21	3.2	
Hatran	Hatr	26	8.0	
Hebrew	Hebr	134	1.0	
Hiragana	Hira	379	1.0	
Imperial Aramaic	Armi	31	5.2	
Inscriptional Pahlavi	Phli	27	5.2	
Inscriptional Parthian	Prti	30	5.2	
Javanese	Java	90	5.2	
Kaithi	Kthi	67	5.2	
Kannada	Knda	89	1.0	
Katakana	Kana	304	1.0	
Kayah Li	Kali	47	5.1	
Kharoshthi	Khar	68	4.1	
Khmer	Khmr	146	3.0	
Khojki	Khoj	62	7.0	
Khudawadi	Sind	69	7.0	
Lao	Laoo	82	1.0	
Latin	Latn	1,366	1.0	
Lepcha	Lepc	74	5.1	
Limbu	Limb	68	4.0	
Linear A	Lina	341	7.0	
Linear B	Linb	211	4.0	
Lisu	Lisu	48	5.2	
Lycian	Lyci	29	5.1	
Lydian	Lydi	27	5.1	
Mahajani	Mahj	39	7.0	
Makasar	Maka	25	11.0	
Malayalam	Mlym	117	1.0	
Mandaic	Mand	29	6.0	
Manichaean	Mani	51	7.0	
Marchen	Marc	68	9.0	
Masaram Gondi	Gonm	75	10.0	
Medefaidrin	Medf	91	11.0	
Meetei Mayek	Mtei	79	5.2	
Mende Kikakui	Mend	213	7.0	
Meroitic Cursive	Merc	90	6.1	
Meroitic Hieroglyphs	Mero	32	6.1	
Miao	Plrd	149	6.1	
Modi	Modi	79	7.0	
Mongolian	Mong	167	3.0	
Mro	Mroo	43	7.0	
Multani	Mult	38	8.0	
Myanmar	Mymr	223	3.0	
Nabataean	Nbat	40	7.0	
Nandinagari	Nand	65	12.0	
New Tai Lue	Talu	83	4.1	
Newa	Newa	94	9.0	
N'Ko	Nkoo	62	5.0	
Nushu	Nshu	397	10.0	
Nyiakeng Puachue Hmong	Hmnp	71	12.0	
Ogham	Ogam	29	3.0	
Ol Chiki	Olck	48	5.1	
Old Hungarian	Hung	108	8.0	
Old Italic	Ital	39	3.1	
Old North Arabian	Narb	32	7.0	
Old Permic	Perm	43	7.0	
Old Persian	Xpeo	50	4.1	
Old Sogdian	Sogo	40	11.0	
Old South Arabian	Sarb	32	5.2	
Old Turkic	Orkh	73	5.2	
Oriya	Orya	90	1.0	
Osage	Osge	72	9.0	
Osmanya	Osma	40	4.0	
Pahawh Hmong	Hmng	127	7.0	
Palmyrene	Palm	32	7.0	
Pau Cin Hau	Pauc	57	7.0	
Phags-pa	Phag	56	5.0	
Phoenician	Phnx	29	5.0	
Psalter Pahlavi	Phlp	29	7.0	
Rejang	Rjng	37	5.1	
Runic	Runr	86	3.0	
Samaritan	Samr	61	5.2	
Saurashtra	Saur	82	5.1	
Sharada	Shrd	94	6.1	
Shavian	Shaw	48	4.0	
Siddham	Sidd	92	7.0	
SignWriting	Sgnw	672	8.0	
Sinhala	Sinh	110	3.0	
Sogdian	Sogd	42	11.0	
Sora Sompeng	Sora	35	6.1	
Soyombo	Soyo	83	10.0	
Sundanese	Sund	72	5.1	
Syloti Nagri	Sylo	44	4.1	
Syriac	Syrc	88	3.0	
Tagalog	Tglg	20	3.2	
Tagbanwa	Tagb	18	3.2	
Tai Le	Tale	35	4.0	
Tai Tham	Lana	127	5.2	
Tai Viet	Tavt	72	5.2	
Takri	Takr	67	6.1	
Tamil	Taml	123	1.0	
Tangut	Tang	6,892	9.0	
Telugu	Telu	98	1.0	
Thaana	Thaa	50	3.0	
Thai	Thai	86	1.0	
Tibetan	Tibt	207	1.0	Removed in 1.1 and reintroduced in 2.0
Tifinagh	Tfng	59	4.1	
Tirhuta	Tirh	82	7.0	
Ugaritic	Ugar	31	4.0	
Vai	Vaii	300	5.1	
Wancho	Wcho	59	12.0	
Warang Citi	Wara	84	7.0	
Yi	Yiii	1,220	3.0	
Zanabazar Square	Zanb	72	10.0	
Unknown	Zzzz	976,119	1.0	Private use characters, as well as reserved, non-character and surrogate code points
Best regards,

David

Post by DataMystic Support »

Sorry - this is reliant on the other change you asked me about.

Post by dfhtextpipe »

Was that the update to Unicode 12.1?

Now that it's in place with TextPipe v11.x, might the features suggested in this thread be more feasible?

David

Post by DataMystic Support »

Yes, but it's also a question of whether we want to write a custom extension of the existing regex component and maintain that from here on.