Unicode pattern reference help

Get help with installation and running here.

Moderators: DataMystic Support, Moderators

dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Unicode pattern reference help

Post by dfhtextpipe »

The help for Unicode Pattern Reference mentions character property classes (in the Notes section).

Character property classes are not tabulated in the help anywhere.
I found the following list in http://www.koders.com/delphi/fidDBC6499 ... rithm#L208.

Code:

  //   Notes:
  //     o  Character property classes are \p or \P followed by a comma separated
  //        list of integers between 1 and 32.  These integers are references to
  //        the following character properties:
  //
  //         N	Character Property
  //         --------------------------
  //         1	_URE_NONSPACING
  //         2	_URE_COMBINING
  //         3	_URE_NUMDIGIT
  //         4	_URE_NUMOTHER
  //         5	_URE_SPACESEP
  //         6	_URE_LINESEP
  //         7	_URE_PARASEP
  //         8	_URE_CNTRL
  //         9	_URE_PRIVATE
  //         10	_URE_UPPER   (note: the upper, lower and title case classes need
  //         11	_URE_LOWER          case-sensitive search enabled to match correctly!)
  //         12	_URE_TITLE
  //         13	_URE_MODIFIER
  //         14	_URE_OTHERLETTER
  //         15	_URE_DASHPUNCT
  //         16	_URE_OPENPUNCT
  //         17	_URE_CLOSEPUNCT
  //         18	_URE_OTHERPUNCT
  //         19	_URE_MATHSYM
  //         20	_URE_CURRENCYSYM
  //         21	_URE_OTHERSYM
  //         22	_URE_LTR
  //         23	_URE_RTL
  //         24	_URE_EURONUM
  //         25	_URE_EURONUMSEP
  //         26	_URE_EURONUMTERM
  //         27	_URE_ARABNUM
  //         28	_URE_COMMONSEP
  //         29	_URE_BLOCKSEP
  //         30	_URE_SEGMENTSEP
  //         31	_URE_WHITESPACE
  //         32	_URE_OTHERNEUT
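A minimal sketch of how the notation in these notes could be read, written in Python. It is not TextPipe's code or the quoted library's implementation. The function name parse_property_class is hypothetical, and treating \P as the negated form of \p is an assumption based on common regex conventions; the notes themselves do not say so.

```python
import re

# Property names in the order given by the quoted notes (numbers 1..32).
URE_PROPERTIES = [
    "_URE_NONSPACING", "_URE_COMBINING", "_URE_NUMDIGIT", "_URE_NUMOTHER",
    "_URE_SPACESEP", "_URE_LINESEP", "_URE_PARASEP", "_URE_CNTRL",
    "_URE_PRIVATE", "_URE_UPPER", "_URE_LOWER", "_URE_TITLE",
    "_URE_MODIFIER", "_URE_OTHERLETTER", "_URE_DASHPUNCT", "_URE_OPENPUNCT",
    "_URE_CLOSEPUNCT", "_URE_OTHERPUNCT", "_URE_MATHSYM", "_URE_CURRENCYSYM",
    "_URE_OTHERSYM", "_URE_LTR", "_URE_RTL", "_URE_EURONUM",
    "_URE_EURONUMSEP", "_URE_EURONUMTERM", "_URE_ARABNUM", "_URE_COMMONSEP",
    "_URE_BLOCKSEP", "_URE_SEGMENTSEP", "_URE_WHITESPACE", "_URE_OTHERNEUT",
]

# A backslash, then p or P, then a comma-separated list of integers.
_CLASS_RE = re.compile(r"\\([pP])(\d+(?:,\d+)*)")


def parse_property_class(token):
    # Returns (negated, names). "negated" is True for \P (assumed meaning).
    match = _CLASS_RE.fullmatch(token)
    if match is None:
        raise ValueError(f"not a property class: {token!r}")
    numbers = [int(n) for n in match.group(2).split(",")]
    for n in numbers:
        if not 1 <= n <= 32:
            raise ValueError(f"property number out of range 1..32: {n}")
    negated = match.group(1) == "P"
    return negated, [URE_PROPERTIES[n - 1] for n in numbers]


print(parse_property_class(r"\p10,11"))
# prints (False, ['_URE_UPPER', '_URE_LOWER'])
```

Under this reading, \p10,11 would refer to upper and lower case letters, which per the note above only match correctly with case-sensitive search enabled.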
It might be sensible to document property classes in the TextPipe help file.
Whether the above definitions are the correct ones for TextPipe is not for me to say.

Separately, the type name TCharacterCategory is not defined anywhere in the help, even though the notes refer to it twice.

David
jorjastandish
Posts: 1
Joined: Thu Aug 18, 2011 11:15 am

Re: Unicode pattern reference help

Post by jorjastandish »

Thank you for sharing this one. :)
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Unicode pattern reference help

Post by DataMystic Support »

Thanks David.

This will go into the help file for the next release (not 8.9.4).

Code:

  TCharacterCategory = (
    // normative categories
    { 0} ccLetterUppercase,
    { 1} ccLetterLowercase,
    { 2} ccLetterTitlecase,
    { 3} ccMarkNonSpacing,
    { 4} ccMarkSpacingCombining,
    { 5} ccMarkEnclosing,
    { 6} ccNumberDecimalDigit,
    { 7} ccNumberLetter,
    { 8} ccNumberOther,
    { 9} ccSeparatorSpace,
    {10} ccSeparatorLine,
    {11} ccSeparatorParagraph,
    {12} ccOtherControl,
    {13} ccOtherFormat,
    {14} ccOtherSurrogate,
    {15} ccOtherPrivate,
    {16} ccOtherUnassigned,
    // informative categories
    {17} ccLetterModifier,
    {18} ccLetterOther,
    {19} ccPunctuationConnector,
    {20} ccPunctuationDash,
    {21} ccPunctuationOpen,
    {22} ccPunctuationClose,
    {23} ccPunctuationInitialQuote,
    {24} ccPunctuationFinalQuote,
    {25} ccPunctuationOther,
    {26} ccSymbolMath,
    {27} ccSymbolCurrency,
    {28} ccSymbolModifier,
    {29} ccSymbolOther,
    // bidirectional categories
    {30} ccLeftToRight,
    {31} ccLeftToRightEmbedding,
    {32} ccLeftToRightOverride,
    {33} ccRightToLeft,
    {34} ccRightToLeftArabic,
    {35} ccRightToLeftEmbedding,
    {36} ccRightToLeftOverride,
    {37} ccPopDirectionalFormat,
    {38} ccEuropeanNumber,
    {39} ccEuropeanNumberSeparator,
    {40} ccEuropeanNumberTerminator,
    {41} ccArabicNumber,
    {42} ccCommonNumberSeparator,
    {43} ccBoundaryNeutral,
    {44} ccSegmentSeparator,      // this includes tab and vertical tab
    {45} ccWhiteSpace,
    {46} ccOtherNeutrals,
    // self defined categories, they do not appear in the Unicode data file
    {47} ccComposed,              // can be decomposed
    {48} ccNonBreaking,
    {49} ccSymmetric,             // has left and right forms
    {50} ccHexDigit,
    {51} ccQuotationMark,
    {52} ccMirroring,
    {53} ccSpaceOther,
    {54} ccAssigned               // means there is a definition in the Unicode standard
  );
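For readers who want to check which of these numbers a given character would fall under, here is a rough Python sketch. It covers only items 0 to 29, and it rests on an assumption: that those enum members correspond, in order, to the Unicode general categories (Lu, Ll, Lt, ...) their names suggest. The pairing is inferred from the names and has not been confirmed against TextPipe. The function name category_ordinal is hypothetical, and Python's Unicode database may be a different version from the one TextPipe uses.

```python
import unicodedata

# ASSUMPTION: Unicode general-category codes listed in the same order as
# enum items 0..29 above (ccLetterUppercase .. ccSymbolOther).
GENERAL_CATEGORY_ORDER = [
    "Lu", "Ll", "Lt",                    # 0..2   letters: upper, lower, title
    "Mn", "Mc", "Me",                    # 3..5   marks
    "Nd", "Nl", "No",                    # 6..8   numbers
    "Zs", "Zl", "Zp",                    # 9..11  separators
    "Cc", "Cf", "Cs", "Co", "Cn",        # 12..16 other
    "Lm", "Lo",                          # 17..18 letters: modifier, other
    "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",  # 19..25 punctuation
    "Sm", "Sc", "Sk", "So",              # 26..29 symbols
]


def category_ordinal(ch):
    # Look up the character's general category, then its position in the list.
    return GENERAL_CATEGORY_ORDER.index(unicodedata.category(ch))


for ch in ["A", "a", "7", " ", "-", "$"]:
    print(repr(ch), category_ordinal(ch))
```

The bidirectional and self-defined items (30 to 54) are not covered, since a character can belong to one of those in addition to its general category.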
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Unicode pattern reference help

Post by dfhtextpipe »

To be consistent, the CamelCase for item 36 should be ccRightToLeftOverride.
David
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Unicode pattern reference help

Post by dfhtextpipe »

Please also give some thought to providing a number of examples for using TCharacterCategory in TextPipe filters.
David
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Unicode pattern reference help

Post by dfhtextpipe »

This looks interesting. I found it by Googling for TCharacterCategory:

http://www.codeproject.com/KB/dotnet/Un ... elper.aspx
David