Unicode extended character properties?
Posted: Tue Mar 12, 2019 6:52 am
If you examine the extended character properties, you can see (e.g.) data such as the following:
Property: Value
Code Point Type: Character
Code Point: 0304
Age: 1.1
Name: COMBINING MACRON
Jamo Short Name:
General Category: Mn [Mark, Non-Spacing]
Canonical Combining Class: 2
Decomposition Type: None
Decomposition Mapping: 0304
Numeric Type: None
Numeric Value: NaN
Bidi Class: NSM [Non-Spacing Mark]
Bidi Paired Bracket Type: None
Bidi Paired Bracket: 0304
Bidi Mirrored: No
Bidi Mirroring Glyph:
Simple Uppercase Mapping: 0304
Simple Lowercase Mapping: 0304
Simple Titlecase Mapping: 0304
Uppercase Mapping: 0304
Lowercase Mapping: 0304
Titlecase Mapping: 0304
Simple Case Folding: 0304
Case Folding: 0304
Joining Type: T [Transparent]
Joining Group: No Joining Group
East Asian Width: A [Ambiguous]
Line Break: CM [Attached Characters and Combining Marks]
Script: Zinh [Inherited]
Script Extensions: Zinh
Dash: No
White Space: No
Hyphen: No
Quotation Mark: No
Radical: No
Ideographic: No
Unified Ideograph: No
IDS Binary Operator: No
IDS Trinary Operator: No
Hangul Syllable Type: NA [Not Applicable]
Default Ignorable Code Point: No
Other Default Ignorable Code Point: No
Alphabetic: No
Other Alphabetic: No
Uppercase: No
Other Uppercase: No
Lowercase: No
Other Lowercase: No
Math: No
Other Math: No
Hex Digit: No
ASCII Hex Digit: No
Noncharacter Code Point: No
Variation Selector: No
Bidi Control: No
Join Control: No
Grapheme Base: No
Grapheme Extend: Yes
Other Grapheme Extend: No
Grapheme Link: No
Sentence Terminal: No
Extender: No
Terminal Punctuation: No
Diacritic: Yes
Deprecated: No
ID Start: No
Other ID Start: No
XID Start: No
ID Continue: Yes
Other ID Continue: No
XID Continue: Yes
Soft Dotted: No
Logical Order Exception: No
Pattern White Space: No
Pattern Syntax: No
Grapheme Cluster Break: EX [Extend]
Word Break: Extend
Sentence Break: EX [Extend]
Composition Exclusion: No
Full Composition Exclusion: No
NFC Quick Check: Maybe
NFD Quick Check: Yes
NFKC Quick Check: Maybe
NFKD Quick Check: Yes
Expands On NFC: No
Expands On NFD: No
Expands On NFKC: No
Expands On NFKD: No
FC NFKC Closure: 0304
Case Ignorable: Yes
Cased: No
Changes When Casefolded: No
Changes When Casemapped: No
Changes When NFKC Casefolded: No
Changes When Lowercased: No
Changes When Titlecased: No
Changes When Uppercased: No
NFKC Casefold: 0304
Indic Syllabic Category: Other
Indic Positional Category: NA
Prepended Concatenation Mark: No
Vertical Orientation: R
Regional Indicator: N
Block: Combining Diacritical Marks
ISO Comment:
Unicode 1 Name: NON-SPACING MACRON
One such property is Line Break, whose value here is CM [Attached Characters and Combining Marks].
It's apparent that this property value defines a class of characters called Attached Characters and Combining Marks.
As the combining marks for different writing systems are not all in one block, it would make good sense for filters to make use of such a property by assigning to it a named class.
Of necessity, this would have to be a custom extension to the existing named classes in POSIX notation.
That is, unless POSIX itself has made strides since TextPipe first adopted this construct.
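To illustrate why a property-based class beats a block range, here is a rough Python sketch. Python's standard unicodedata module does not expose the Line Break property, so the sketch assumes the General Category values Mn, Mc and Me as an approximation of the CM class. The helper name is_mark and the sample characters are my own picks for illustration, not anything TextPipe provides.

```python
import unicodedata

# Assumption: General Category M* (Mn, Mc, Me) stands in for Line Break = CM,
# because unicodedata does not expose the Line Break property.
MARK_CATEGORIES = {"Mn", "Mc", "Me"}


def is_mark(ch: str) -> bool:
    """Return True if ch is a combining mark by General Category."""
    return unicodedata.category(ch) in MARK_CATEGORIES


# Hand-picked marks from four different blocks: Combining Diacritical Marks,
# Hebrew, Arabic and Devanagari.
samples = ["\u0304", "\u05B0", "\u064B", "\u093E"]

for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} mark={is_mark(ch)}")
```

All four characters test as marks even though they sit in unrelated blocks, which is the case a single named class would cover.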
A named class like this would make operations such as "Strip diacritics" far easier to build.
Such a filter is easy to describe in a two-word phrase, but jolly hard to implement without such a predefined class.
With the class in place, Strip diacritics would become a two-stage process:
Normalize to NFD
Restrict to pattern equivalent to [[:CM:]]
+ Remove all
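For comparison, the same two stages can be sketched outside TextPipe in a few lines of Python. This is only an approximation under a stated assumption: since the standard library has no Line Break lookup, the [[:CM:]] class is stood in for by General Category M* (Mn, Mc, Me). The function name strip_diacritics is mine.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Two-stage sketch: normalize to NFD, then remove combining marks."""
    # Stage 1: NFD splits precomposed letters into base letter + mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # Stage 2: drop every mark. Assumption: General Category M* approximates
    # the CM class proposed above.
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


print(strip_diacritics("crème brûlée"))  # creme brulee
```

Note that the result is left in NFD, and that removing every mark is broader than diacritics in the everyday sense: it also strips, for example, Devanagari vowel signs.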