Observe that one such property is Line Break, which appears in the full property listing for U+0304 COMBINING MACRON:
Code:
Property: Value
Code Point Type: Character
Code Point: 0304
Age: 1.1
Name: COMBINING MACRON
Jamo Short Name:
General Category: Mn [Mark, Non-Spacing]
Canonical Combining Class: 230
Decomposition Type: None
Decomposition Mapping: 0304
Numeric Type: None
Numeric Value: NaN
Bidi Class: NSM [Non-Spacing Mark]
Bidi Paired Bracket Type: None
Bidi Paired Bracket: 0304
Bidi Mirrored: No
Bidi Mirroring Glyph:
Simple Uppercase Mapping: 0304
Simple Lowercase Mapping: 0304
Simple Titlecase Mapping: 0304
Uppercase Mapping: 0304
Lowercase Mapping: 0304
Titlecase Mapping: 0304
Simple Case Folding: 0304
Case Folding: 0304
Joining Type: T [Transparent]
Joining Group: No Joining Group
East Asian Width: A [Ambiguous]
Line Break: CM [Attached Characters and Combining Marks]
Script: Zinh [Inherited]
Script Extensions: Zinh
Dash: No
White Space: No
Hyphen: No
Quotation Mark: No
Radical: No
Ideographic: No
Unified Ideograph: No
IDS Binary Operator: No
IDS Trinary Operator: No
Hangul Syllable Type: NA [Not Applicable]
Default Ignorable Code Point: No
Other Default Ignorable Code Point: No
Alphabetic: No
Other Alphabetic: No
Uppercase: No
Other Uppercase: No
Lowercase: No
Other Lowercase: No
Math: No
Other Math: No
Hex Digit: No
ASCII Hex Digit: No
Noncharacter Code Point: No
Variation Selector: No
Bidi Control: No
Join Control: No
Grapheme Base: No
Grapheme Extend: Yes
Other Grapheme Extend: No
Grapheme Link: No
Sentence Terminal: No
Extender: No
Terminal Punctuation: No
Diacritic: Yes
Deprecated: No
ID Start: No
Other ID Start: No
XID Start: No
ID Continue: Yes
Other ID Continue: No
XID Continue: Yes
Soft Dotted: No
Logical Order Exception: No
Pattern White Space: No
Pattern Syntax: No
Grapheme Cluster Break: EX [Extend]
Word Break: Extend
Sentence Break: EX [Extend]
Composition Exclusion: No
Full Composition Exclusion: No
NFC Quick Check: Maybe
NFD Quick Check: Yes
NFKC Quick Check: Maybe
NFKD Quick Check: Yes
Expands On NFC: No
Expands On NFD: No
Expands On NFKC: No
Expands On NFKD: No
FC NFKC Closure: 0304
Case Ignorable: Yes
Cased: No
Changes When Casefolded: No
Changes When Casemapped: No
Changes When NFKC Casefolded: No
Changes When Lowercased: No
Changes When Titlecased: No
Changes When Uppercased: No
NFKC Casefold: 0304
Indic Syllabic Category: Other
Indic Positional Category: NA
Prepended Concatenation Mark: No
Vertical Orientation: R
Regional Indicator: N
Block: Combining Diacritical Marks
ISO Comment:
Unicode 1 Name: NON-SPACING MACRON
It's apparent that the Line Break property, through its value CM [Attached Characters and Combining Marks], defines a class of characters.
As the combining marks for different writing systems are not all in one block, it would make good sense for filters to make use of such a property by assigning to it a named class.
Of necessity, this would have to be a custom extension to the existing named classes in POSIX notation.
Unless, of course, POSIX itself has made strides since TextPipe first made use of this construct.
This would make operations such as "Strip diacritics" much easier to build.
Such a filter is easy to define by a two-word phrase, but jolly hard to implement without such a predefined class.
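To illustrate why a hand-written class falls short, here is a minimal Python sketch (not TextPipe syntax). It compares a character class hard-coded to the Combining Diacritical Marks block (U+0300 to U+036F) against a test based on a Unicode property. Python's standard library does not expose the Line Break property, so the sketch assumes that General Category "M*" (Mn, Mc, Me) is a close enough stand-in for CM; the function names are my own.

```python
import re
import unicodedata

# Hand-written class: covers only the Combining Diacritical Marks block.
BLOCK_ONLY = re.compile("[\u0300-\u036F]+")


def strip_by_block(text: str) -> str:
    """Remove marks that fall inside the single hard-coded block."""
    return BLOCK_ONLY.sub("", text)


def strip_by_property(text: str) -> str:
    """Remove any character whose General Category is a Mark (Mn, Mc, Me).

    Assumption: this approximates Line Break = CM, which the standard
    library cannot query directly.
    """
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("M")
    )


if __name__ == "__main__":
    samples = [
        "a\u0304",       # a + COMBINING MACRON (Combining Diacritical Marks)
        "x\u20D7",       # x + COMBINING RIGHT ARROW ABOVE (marks for symbols)
        "\u05D1\u05B0",  # HEBREW LETTER BET + HEBREW POINT SHEVA
    ]
    for s in samples:
        print(ascii(s), ascii(strip_by_block(s)), ascii(strip_by_property(s)))
```

The block-only class handles the macron but leaves the other two marks in place, which is the gap a property-based named class would close.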
Strip diacritics would become a two-stage process.
Code:
Normalize to NFD
Restrict to pattern equivalent to [[:CM:]]+
    Remove all
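For comparison, the same two stages can be sketched outside TextPipe in a few lines of Python. This is only an illustration under one stated assumption: since the standard library has no [[:CM:]] equivalent, it treats General Category "M*" as the stand-in for that class. The function name strip_diacritics is hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Two-stage sketch: decompose, then drop the combining marks."""
    # Stage 1: normalize to NFD so that precomposed letters split into
    # a base character followed by its combining marks.
    decomposed = unicodedata.normalize("NFD", text)

    # Stage 2: remove every character in the assumed CM stand-in class
    # (General Category Mn, Mc or Me).
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


if __name__ == "__main__":
    print(strip_diacritics("M\u0101ori caf\u00e9"))
```

Note that dropping every mark is blunt: in scripts where marks carry vowels, such as Devanagari or pointed Hebrew, this removes more than what most people would call a diacritic, so a real filter would probably want to restrict the class further.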