Unicode pattern reference help

Posted: Thu Jul 28, 2011 12:09 am
by dfhtextpipe
The help for Unicode Pattern Reference mentions character property classes (in the Notes section).

Character property classes are not tabulated in the help anywhere.
I found the following list in http://www.koders.com/delphi/fidDBC6499 ... rithm#L208.

Code:

  //   Notes:
  //     o  Character property classes are \p or \P followed by a comma separated
  //        list of integers between 1 and 32.  These integers are references to
  //        the following character properties:
  //
  //         N	Character Property
  //         --------------------------
  //         1	_URE_NONSPACING
  //         2	_URE_COMBINING
  //         3	_URE_NUMDIGIT
  //         4	_URE_NUMOTHER
  //         5	_URE_SPACESEP
  //         6	_URE_LINESEP
  //         7	_URE_PARASEP
  //         8	_URE_CNTRL
  //         9	_URE_PRIVATE
  //         10	_URE_UPPER   (note: the upper, lower and title case classes need
  //         11	_URE_LOWER          case-sensitive search enabled to match correctly!)
  //         12	_URE_TITLE
  //         13	_URE_MODIFIER
  //         14	_URE_OTHERLETTER
  //         15	_URE_DASHPUNCT
  //         16	_URE_OPENPUNCT
  //         17	_URE_CLOSEPUNCT
  //         18	_URE_OTHERPUNCT
  //         19	_URE_MATHSYM
  //         20	_URE_CURRENCYSYM
  //         21	_URE_OTHERSYM
  //         22	_URE_LTR
  //         23	_URE_RTL
  //         24	_URE_EURONUM
  //         25	_URE_EURONUMSEP
  //         26	_URE_EURONUMTERM
  //         27	_URE_ARABNUM
  //         28	_URE_COMMONSEP
  //         29	_URE_BLOCKSEP
  //         30	_URE_SEGMENTSEP
  //         31	_URE_WHITESPACE
  //         32	_URE_OTHERNEUT
It might be sensible to document property classes in the TextPipe help file.
Whether the above definitions are the correct ones for TextPipe is not for me to say.
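As a rough illustration of the syntax described in the notes above (\p or \P followed by a comma separated list of integers between 1 and 32), here is a small Python sketch that assembles such a property class string. The helper name `property_class` and its `negate` parameter are hypothetical, and it assumes that \P is the negated form, as in most regex dialects. Whether TextPipe uses this numbering is not confirmed.

```python
def property_class(numbers, negate=False):
    """Build a character property class string such as \\p10,11.

    Hypothetical helper: it only formats the pattern text described in the
    notes; it does not call TextPipe or any regex engine.
    """
    numbers = list(numbers)
    if not numbers:
        raise ValueError("at least one property number is required")
    for n in numbers:
        # The notes restrict property numbers to the range 1..32.
        if not 1 <= n <= 32:
            raise ValueError(f"property number out of range 1..32: {n}")
    # Assumption: \P is the negated form of \p.
    prefix = r"\P" if negate else r"\p"
    return prefix + ",".join(str(n) for n in numbers)


# Numbers taken from the table above:
# 10 = _URE_UPPER, 11 = _URE_LOWER, 3 = _URE_NUMDIGIT.
cased_letters = property_class([10, 11])
not_a_digit = property_class([3], negate=True)
print(cased_letters)
print(not_a_digit)
```

Per the note in the table, classes 10 to 12 would only match as intended with case-sensitive search enabled.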

On the other hand, the word TCharacterCategory is not defined anywhere in the help, even though the notes refer to it twice.

David

Re: Unicode pattern reference help

Posted: Thu Aug 18, 2011 11:31 am
by jorjastandish
Thank you for sharing this one. :)

Re: Unicode pattern reference help

Posted: Thu Aug 18, 2011 5:29 pm
by DataMystic Support
Thanks David.

This will go into the help file for the next release (not 8.9.4).

Code:

  TCharacterCategory = (
    // normative categories
    ccLetterUppercase,           // 0
    ccLetterLowercase,           // 1
    ccLetterTitlecase,           // 2
    ccMarkNonSpacing,            // 3
    ccMarkSpacingCombining,      // 4
    ccMarkEnclosing,             // 5
    ccNumberDecimalDigit,        // 6
    ccNumberLetter,              // 7
    ccNumberOther,               // 8
    ccSeparatorSpace,            // 9
    ccSeparatorLine,             // 10
    ccSeparatorParagraph,        // 11
    ccOtherControl,              // 12
    ccOtherFormat,               // 13
    ccOtherSurrogate,            // 14
    ccOtherPrivate,              // 15
    ccOtherUnassigned,           // 16
    // informative categories
    ccLetterModifier,            // 17
    ccLetterOther,               // 18
    ccPunctuationConnector,      // 19
    ccPunctuationDash,           // 20
    ccPunctuationOpen,           // 21
    ccPunctuationClose,          // 22
    ccPunctuationInitialQuote,   // 23
    ccPunctuationFinalQuote,     // 24
    ccPunctuationOther,          // 25
    ccSymbolMath,                // 26
    ccSymbolCurrency,            // 27
    ccSymbolModifier,            // 28
    ccSymbolOther,               // 29
    // bidirectional categories
    ccLeftToRight,               // 30
    ccLeftToRightEmbedding,      // 31
    ccLeftToRightOverride,       // 32
    ccRightToLeft,               // 33
    ccRightToLeftArabic,         // 34
    ccRightToLeftEmbedding,      // 35
    ccRightToLeftOverride,       // 36
    ccPopDirectionalFormat,      // 37
    ccEuropeanNumber,            // 38
    ccEuropeanNumberSeparator,   // 39
    ccEuropeanNumberTerminator,  // 40
    ccArabicNumber,              // 41
    ccCommonNumberSeparator,     // 42
    ccBoundaryNeutral,           // 43
    ccSegmentSeparator,          // 44: this includes tab and vertical tab
    ccWhiteSpace,                // 45
    ccOtherNeutrals,             // 46
    // self defined categories, they do not appear in the Unicode data file
    ccComposed,                  // 47: can be decomposed
    ccNonBreaking,               // 48
    ccSymmetric,                 // 49: has left and right forms
    ccHexDigit,                  // 50
    ccQuotationMark,             // 51
    ccMirroring,                 // 52
    ccSpaceOther,                // 53
    ccAssigned                   // 54: means there is a definition in the Unicode standard
  );
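
To get a feel for what the first 30 entries (the normative and informative categories) cover, the Python sketch below looks up a character's Unicode general category with the standard `unicodedata` module and reports the enum name and ordinal from the listing above. The pairing of enum names with two-letter general category codes is inferred from the names and is an assumption. The function name `character_category` is hypothetical, and nothing here is TextPipe's own API.

```python
import unicodedata

# Enum names in the order posted above (ordinals 0..29), paired with the
# Unicode general category code each name appears to correspond to.
# The pairing is an assumption based on the names alone.
GENERAL_CATEGORIES = [
    ("ccLetterUppercase", "Lu"), ("ccLetterLowercase", "Ll"),
    ("ccLetterTitlecase", "Lt"), ("ccMarkNonSpacing", "Mn"),
    ("ccMarkSpacingCombining", "Mc"), ("ccMarkEnclosing", "Me"),
    ("ccNumberDecimalDigit", "Nd"), ("ccNumberLetter", "Nl"),
    ("ccNumberOther", "No"), ("ccSeparatorSpace", "Zs"),
    ("ccSeparatorLine", "Zl"), ("ccSeparatorParagraph", "Zp"),
    ("ccOtherControl", "Cc"), ("ccOtherFormat", "Cf"),
    ("ccOtherSurrogate", "Cs"), ("ccOtherPrivate", "Co"),
    ("ccOtherUnassigned", "Cn"), ("ccLetterModifier", "Lm"),
    ("ccLetterOther", "Lo"), ("ccPunctuationConnector", "Pc"),
    ("ccPunctuationDash", "Pd"), ("ccPunctuationOpen", "Ps"),
    ("ccPunctuationClose", "Pe"), ("ccPunctuationInitialQuote", "Pi"),
    ("ccPunctuationFinalQuote", "Pf"), ("ccPunctuationOther", "Po"),
    ("ccSymbolMath", "Sm"), ("ccSymbolCurrency", "Sc"),
    ("ccSymbolModifier", "Sk"), ("ccSymbolOther", "So"),
]

# Map each general category code to (ordinal, enum name).
CODE_TO_ENTRY = {
    code: (ordinal, name)
    for ordinal, (name, code) in enumerate(GENERAL_CATEGORIES)
}


def character_category(ch):
    """Return (ordinal, enum name) for one character's general category."""
    return CODE_TO_ENTRY[unicodedata.category(ch)]


for ch in ["A", "a", "5", " ", "$", "("]:
    ordinal, name = character_category(ch)
    print(repr(ch), ordinal, name)
```

The bidirectional and self defined entries (30 to 54) are left out of the sketch because they do not map onto general category codes.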

Re: Unicode pattern reference help

Posted: Thu Aug 25, 2011 1:09 am
by dfhtextpipe
To be consistent, the CamelCase for item 36 should be ccRightToLeftOverride.

Re: Unicode pattern reference help

Posted: Thu Aug 25, 2011 1:13 am
by dfhtextpipe
Please also give some thought to providing a number of examples for using TCharacterCategory in TextPipe filters.

Re: Unicode pattern reference help

Posted: Thu Aug 25, 2011 1:17 am
by dfhtextpipe
This looks interesting. I found it by Googling for TCharacterCategory:

http://www.codeproject.com/KB/dotnet/Un ... elper.aspx