Unicode extended character properties?

Post by dfhtextpipe

If you examine the extended character properties, you can see (e.g.) data such as the following:

Property	Value
Code Point Type	Character
Code Point	0304
Age	1.1
Name	COMBINING MACRON
Jamo Short Name	
General Category	Mn [Mark, Non-Spacing]
Canonical Combining Class	2
Decomposition Type	None
Decomposition Mapping	0304
Numeric Type	None
Numeric Value	NaN
Bidi Class	NSM [Non-Spacing Mark]
Bidi Paired Bracket Type	None
Bidi Paired Bracket	0304
Bidi Mirrored	No
Bidi Mirroring Glyph	
Simple Uppercase Mapping	0304
Simple Lowercase Mapping	0304
Simple Titlecase Mapping	0304
Uppercase Mapping	0304
Lowercase Mapping	0304
Titlecase Mapping	0304
Simple Case Folding	0304
Case Folding	0304
Joining Type	T [Transparent]
Joining Group	No Joining Group
East Asian Width	A [Ambiguous]
Line Break	CM [Attached Characters and Combining Marks]
Script	Zinh [Inherited]
Script Extensions	Zinh
Dash	No
White Space	No
Hyphen	No
Quotation Mark	No
Radical	No
Ideographic	No
Unified Ideograph	No
IDS Binary Operator	No
IDS Trinary Operator	No
Hangul Syllable Type	NA [Not Applicable]
Default Ignorable Code Point	No
Other Default Ignorable Code Point	No
Alphabetic	No
Other Alphabetic	No
Uppercase	No
Other Uppercase	No
Lowercase	No
Other Lowercase	No
Math	No
Other Math	No
Hex Digit	No
ASCII Hex Digit	No
Noncharacter Code Point	No
Variation Selector	No
Bidi Control	No
Join Control	No
Grapheme Base	No
Grapheme Extend	Yes
Other Grapheme Extend	No
Grapheme Link	No
Sentence Terminal	No
Extender	No
Terminal Punctuation	No
Diacritic	Yes
Deprecated	No
ID Start	No
Other ID Start	No
XID Start	No
ID Continue	Yes
Other ID Continue	No
XID Continue	Yes
Soft Dotted	No
Logical Order Exception	No
Pattern White Space	No
Pattern Syntax	No
Grapheme Cluster Break	EX [Extend]
Word Break	Extend
Sentence Break	EX [Extend]
Composition Exclusion	No
Full Composition Exclusion	No
NFC Quick Check	Maybe
NFD Quick Check	Yes
NFKC Quick Check	Maybe
NFKD Quick Check	Yes
Expands On NFC	No
Expands On NFD	No
Expands On NFKC	No
Expands On NFKD	No
FC NFKC Closure	0304
Case Ignorable	Yes
Cased	No
Changes When Casefolded	No
Changes When Casemapped	No
Changes When NFKC Casefolded	No
Changes When Lowercased	No
Changes When Titlecased	No
Changes When Uppercased	No
NFKC Casefold	0304
Indic Syllabic Category	Other
Indic Positional Category	NA
Prepended Concatenation Mark	No
Vertical Orientation	R
Regional Indicator	N
Block	Combining Diacritical Marks
ISO Comment	
Unicode 1 Name	NON-SPACING MACRON

Observe that one such property is

Line Break	CM [Attached Characters and Combining Marks]

It's apparent that this property value defines a class of characters called Attached Characters and Combining Marks.

As the combining marks for different writing systems are not all in one block, it would make good sense for filters to make use of such a property by assigning to it a named class.
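
To illustrate the point about blocks, here is a small Python sketch (not TextPipe syntax) that looks up a few combining marks from different blocks. The sample code points and the helper name mark_properties are my own choices for illustration. Python's standard unicodedata module does not expose the Line Break property, so the sketch reports General Category instead, which is only a rough stand-in for the CM class.

```python
import unicodedata

# Hypothetical sample: combining marks taken from four different blocks.
SAMPLES = ["\u0304", "\u05B0", "\u093C", "\u20D7"]


def mark_properties(chars):
    """Return {code point label: (name, General Category)} for each character."""
    return {
        f"U+{ord(ch):04X}": (unicodedata.name(ch), unicodedata.category(ch))
        for ch in chars
    }


if __name__ == "__main__":
    for label, (name, category) in mark_properties(SAMPLES).items():
        print(label, name, category)
```

All four share a property value even though a block-based range would need four separate entries to cover them.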

Of necessity, this would have to be a custom extension to the existing named classes in POSIX notation.
Unless, of course, POSIX itself has made strides since TextPipe first made use of this construct.

This would make operations such as "Strip diacritics" much easier to build.
Such a filter is easy to describe in a two-word phrase, but jolly hard to implement without such a predefined class.

Strip diacritics would then become a two-stage process:

Normalize to NFD
Restrict to pattern equivalent to [[:CM:]]
   + Remove all
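
For comparison, the same two stages can be sketched outside TextPipe in Python. This is only a rough equivalent under stated assumptions: the function name strip_diacritics is hypothetical, and because the standard unicodedata module has no Line Break lookup, the General Category "Mark" values (Mn, Mc, Me) stand in for the proposed [[:CM:]] class. The two sets are similar but not identical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Two-stage sketch: decompose to NFD, then drop combining marks."""
    # Stage 1: normalize to NFD so precomposed letters split into base + marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Stage 2: remove every character whose General Category is a Mark (M*).
    # Assumption: this approximates, but is not the same as, Line Break = CM.
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


if __name__ == "__main__":
    print(strip_diacritics("caf\u00e9 m\u0101cron"))
```

Note that removing all marks this way also removes vowel signs and similar marks in scripts where they are not optional diacritics, so a real filter would probably want a narrower class for some writing systems.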
David