User-named character classes?

Post by **dfhtextpipe** » Sat Dec 31, 2011 4:09 am

Suppose I wish to match, for example, any UTF-8 character in the Czech alphabet (in either case).
See http://en.wikipedia.org/wiki/Czech_alphabet

Excluding the "Ch" digraph, a Perl pattern that does this would be as follows:

[A-Za-z\x{00C1}\x{00C9}\x{00CD}\x{00D3}\x{00DA}\x{00DD}\x{00E1}\x{00E9}\x{00ED}\x{00F3}\x{00FA}\x{00FD}\x{010C}\x{010D}\x{010E}\x{010F}\x{011A}\x{011B}\x{0147}\x{0148}\x{0158}\x{0159}\x{0160}\x{0161}\x{0164}\x{0165}\x{016E}\x{016F}\x{017D}\x{017E}]

This is equivalent to the shorter pattern

[A-Za-zÁÉÍÓÚÝáéíóúýČčĎďĚěŇňŘřŠšŤťŮůŽž]

The latter will not work when entered as a simple Perl pattern in TextPipe, so one has to use the more complicated one with all the hexadecimal codes.
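As a stopgap, the hex-escaped form need not be typed by hand. The following is a minimal Python sketch, run outside TextPipe, that builds the long pattern from the literal letters. The function name `perl_hex_class` and the NFC normalization step are my own assumptions, not part of TextPipe or Perl.

```python
import unicodedata

# Accented letters copied from the shorter pattern in this post.
CZECH_ACCENTED = "ÁÉÍÓÚÝáéíóúýČčĎďĚěŇňŘřŠšŤťŮůŽž"


def perl_hex_class(chars: str, prefix: str = "A-Za-z") -> str:
    """Return a Perl-style bracket class using \\x{HHHH} escapes.

    Hypothetical helper: 'prefix' is emitted verbatim, every character in
    'chars' becomes a four-digit (or longer) hexadecimal escape.
    """
    # NFC makes sure each accented letter is a single precomposed code point.
    composed = unicodedata.normalize("NFC", chars)
    escapes = "".join(f"\\x{{{ord(c):04X}}}" for c in composed)
    return f"[{prefix}{escapes}]"


pattern = perl_hex_class(CZECH_ACCENTED)
print(pattern)
```

The printed text can be pasted as the Perl pattern. The same helper would work for any other alphabet by changing the input string.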

It would be much simpler if there were a facility for defining user-named character classes, so that a short class name could stand in for the full pattern, perhaps by extending the POSIX notation such that

[:czech:]

would be equivalent to the above pattern.
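To make the idea concrete, here is a rough Python sketch of how such a facility might behave if it were implemented as a preprocessing step that expands the name before the pattern reaches the regex engine. Everything in it is hypothetical: the `NAMED_CLASSES` registry, the `expand_named_classes` function, and the choice to treat `[:czech:]` as a standalone token are assumptions for illustration, not an existing TextPipe or Perl feature.

```python
import re

# Hypothetical registry of user-named classes; "czech" is the name used above.
NAMED_CLASSES = {
    "czech": "[A-Za-zÁÉÍÓÚÝáéíóúýČčĎďĚěŇňŘřŠšŤťŮůŽž]",
}

# Matches a token such as [:czech:] and captures the name.
_TOKEN = re.compile(r"\[:([a-z]+):\]")


def expand_named_classes(pattern: str) -> str:
    """Replace each known [:name:] token with its full bracket class."""

    def substitute(match: re.Match) -> str:
        # Unknown names are left untouched so other tokens pass through.
        return NAMED_CLASSES.get(match.group(1), match.group(0))

    return _TOKEN.sub(substitute, pattern)


expanded = expand_named_classes("[:czech:]+")
words = re.findall(expanded, "Příliš žluťoučký kůň, 123")
print(words)
```

Because the token expands to a complete bracket class, this sketch only handles the standalone use proposed here. Nesting it inside another bracket expression, as POSIX classes normally are, would need extra handling.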

I can't use captured text and store it in a global variable, as the files to be processed will not contain it.

Am I forced to resort to VBScript, or is there a simpler, more open method?

David

Post by **DataMystic Support** » Sun Jan 22, 2012 10:28 pm

Hi David,

A proposed solution: in Perl search/replace mode, when UTF-8 support is checked, the Unicode data entered is converted to UTF-8 before being passed to the Perl module.

This allows your simpler pattern to pass through without any problems, and results in the same output as the more complex sample.
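For readers unfamiliar with what that conversion involves, the short Python sketch below shows the UTF-8 byte sequences for a few of the letters in the pattern. It only illustrates the encoding itself and makes no claim about how TextPipe performs the conversion internally. The helper name `utf8_hex` is made up for this example.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of 'text' as space-separated hex bytes."""
    return " ".join(f"{byte:02X}" for byte in text.encode("utf-8"))


# Each accented Czech letter occupies two bytes in UTF-8.
for ch in "ÁČž":
    print(f"U+{ord(ch):04X} {ch} -> {utf8_hex(ch)}")
```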

You can see this trial in action in
http://www.datamystic.com/textpipestandard2.exe, which will be available in an hour or so.

Let me know if it meets your needs, and also if there are any side effects.

Post by **dfhtextpipe** » Wed Jan 25, 2012 6:22 pm

Hi Simon,

I was away when you posted that - if I get time today, I'll give it a try.

David
