Find words starting with a capital letter

Posted: Thu Jan 05, 2012 5:08 pm
by gerd
Hi,
I am struggling with a filter that should perform the following:
Find all words in a text which start with a capital letter and consist of at least 6 characters and extract them to a csv file.
Example Text:
You can also perform Partial Trial Runs by right-clicking on filters in the Filter list.

Target of Extraction:
Partial Filter

because those two words consist of at least 6 characters.

I have been experimenting by trial and error with ([A-Z](\w{6,})[a-z]) and other variations, without any success. Any ideas?
Thanks, gerd

Re: Find words starting with a capital letter

Posted: Sat Jan 07, 2012 3:22 pm
by DataMystic Support
Hi Gerd,

Try:

Find (match case turned on)
[A-Z][a-z]{5,}?
Replace with
$0\r\n
Extract option on.
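
For readers without TextPipe, the same extraction can be sketched in Python. This is only an illustration under stated assumptions, not the TextPipe filter itself: the function names are hypothetical, the \b word boundaries are an addition, and Python quantifiers are greedy by default, so the trailing ? from the pattern above is not needed here.

```python
import csv
import io
import re

# Python quantifiers are greedy by default, so no trailing "?" is needed.
# The \b anchors (an addition, not part of the forum answer) stop the
# pattern from matching inside a longer word, e.g. "Donald" in "McDonald".
CAPITALIZED_WORD = re.compile(r"\b[A-Z][a-z]{5,}\b")


def extract_capitalized_words(text):
    """Return words that start with A-Z followed by at least 5 lowercase a-z."""
    return CAPITALIZED_WORD.findall(text)


def words_to_csv(words):
    """Render one word per CSV row and return the CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\r\n")
    for word in words:
        writer.writerow([word])
    return buffer.getvalue()


sample = ("You can also perform Partial Trial Runs by right-clicking "
          "on filters in the Filter list.")
words = extract_capitalized_words(sample)
print(words)
print(words_to_csv(words), end="")
```

With the example text from the first post, this yields Partial and Filter; "Trial" and "Runs" are too short.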

Re: Find words starting with a capital letter

Posted: Sat Jan 07, 2012 10:43 pm
by gerd
Thanks a lot,

that's the result I was looking for. I think I had tried [A-Z][a-z]{5,} at some point, but without the question mark at the end. So far I have only used ? in the sense of "at most one match". I guess I should take the time to go through pages 77 - 106 of your manual carefully. Or can you point me to the explanation of how ? is used in this respect?
Anyhow, your hint is a great help for me.
gerd

Re: Find words starting with a capital letter

Posted: Fri Jan 13, 2012 4:10 am
by dfhtextpipe
Caveat!

Any case operations or case patterns are highly dependent on the alphabet for the language of the text being processed.

For languages with diacritics, the whole topic becomes much more complex.

And for some languages, there are further pitfalls to catch out the unwary. See
http://en.wikipedia.org/wiki/Dotless_i

which is a feature of Turkish, and a few other languages.

David
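
The caveat can be illustrated with a short Python sketch. It assumes Python's built-in Unicode case rules, which may differ from how TextPipe or a given locale behaves; the function name and the sample sentence are hypothetical.

```python
import re

# Runs of letters only: \w minus digits and underscore (Unicode-aware in Python 3).
LETTER_RUN = re.compile(r"[^\W\d_]+")


def capitalized_words(text, min_length=6):
    """Words of at least min_length letters: first uppercase, rest lowercase."""
    return [
        word
        for word in LETTER_RUN.findall(text)
        if len(word) >= min_length and word[0].isupper() and word[1:].islower()
    ]


# "Z\u00fcrich" contains u-umlaut; "\u0130stanbul" starts with dotted capital I.
sample = "Z\u00fcrich and \u0130stanbul are cities; Partial is English."

ascii_only = re.findall(r"[A-Z][a-z]{5,}", sample)
unicode_aware = capitalized_words(sample)
print(ascii_only)     # the ASCII classes miss both city names
print(unicode_aware)  # the Unicode-aware filter keeps them

# Dotless i: upper-casing it gives a plain "I", so a round trip loses information.
print("\u0131".upper() == "I")
# Dotted capital I lower-cases to "i" plus a combining dot (two code points).
print(len("\u0130".lower()))
```

The ASCII character classes find only Partial and English here, while the Unicode-aware filter also keeps the two city names.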

Re: Find words starting with a capital letter

Posted: Sat Jan 14, 2012 1:12 pm
by DataMystic Support
Hi David,

You could try using the Perl regex '\w' to match word characters in a locale-specific way.

The ? at the end of a +, * or {} repetition reverses the normal greediness.

In TextPipe, the default is to be non-greedy, so [a-z]{5,} matches only 5 characters if it can, whereas [a-z]{5,}? matches as many characters as it can.

You can toggle the default greediness using the pattern options button [...] for each pattern.
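
For comparison, here is a small Python sketch of the same ? modifier. Note the assumption: Python's re module is greedy by default, the opposite of the TextPipe default described above, so the roles of the two patterns are swapped.

```python
import re

word = "Partial"

# Python default (greedy): the quantifier takes as many letters as it can.
greedy = re.match(r"[A-Z][a-z]{5,}", word).group()

# Trailing "?" reverses the default: the quantifier stops at the minimum of 5.
lazy = re.match(r"[A-Z][a-z]{5,}?", word).group()

print(greedy)  # Partial
print(lazy)    # Partia
```

In both tools the ? flips whichever greediness is the default; only the starting point differs.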

Re: Find words starting with a capital letter

Posted: Sat Jan 14, 2012 9:44 pm
by dfhtextpipe
Simon,

Although TextPipe is locale sensitive, the fact is that I retain the English locale settings (region and language) on my PC, even though I work on any number of foreign-language text files.

David