
Splitting CamelCase words in Unicode?

Posted: Sun Oct 14, 2012 2:39 am
by dfhtextpipe
I have a non-English text that includes lots of CamelCase words.
The language is based on an extended Latin alphabet with various diacritics.
The text is encoded as UTF-8 (without BOM) and the Unicode is normalized to NFC.

The following regexp can find them when used in the Notepad++ search feature:

Code:

[a-zàáãèéìíòóõùúṣẹẽọ][A-ZÀÁÃÈÉÌÍÒÓÕÙÚṢẸẼỌ]
(with Match case ticked)

Now suppose I'd like to use TextPipe to split these words where the CamelCase case change is found,
i.e. where a lowercase letter is immediately followed by an uppercase letter.
e.g. Change "CamelCase" to "Camel Case".
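
For reference outside TextPipe, the intended replacement can be sketched in Python. This is only an illustration of the requirement, not a TextPipe filter: the function name `split_camel_case` is hypothetical, and the character classes are the ones from the search expression above, so any letter missing from them will not trigger a split.

```python
import re
import unicodedata

# Character classes copied from the search expression in this post.
# They are normalized to NFC so that each accented letter is a single
# precomposed code point, matching the NFC-normalized text.
LOWER = unicodedata.normalize("NFC", "a-zàáãèéìíòóõùúṣẹẽọ")
UPPER = unicodedata.normalize("NFC", "A-ZÀÁÃÈÉÌÍÒÓÕÙÚṢẸẼỌ")

# Group 1: a lowercase letter; group 2: the uppercase letter right after it.
CAMEL_BOUNDARY = re.compile(f"([{LOWER}])([{UPPER}])")


def split_camel_case(text: str) -> str:
    """Insert a space wherever a lowercase letter is followed by an uppercase one."""
    text = unicodedata.normalize("NFC", text)
    # Equivalent of replacing with "$1 $2"; matching is case-sensitive by default.
    return CAMEL_BOUNDARY.sub(r"\1 \2", text)


if __name__ == "__main__":
    print(split_camel_case("CamelCase"))
```

The match is case-sensitive because `re.IGNORECASE` is not set, which corresponds to having Match case ticked.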

How should I do this in TextPipe?
i.e. What is the simplest method to implement this requirement?

Bear in mind that I can't just paste this code into a Perl replace filter:

Code:

Replace ([a-zàáãèéìíòóõùúṣẹẽọ])([A-ZÀÁÃÈÉÌÍÒÓÕÙÚṢẸẼỌ]) by $1 $2
with Match case ticked.
I can do this in Notepad++, but this sort of thing simply doesn't work in TextPipe.

If it's so easy in Notepad++, then why is it so difficult in TextPipe?


David

Re: Splitting CamelCase words in Unicode?

Posted: Mon Oct 22, 2012 8:29 am
by DataMystic Support
Hi David,

Did you check the 'Enable UTF-8 support' option in the Perl pattern extended options?

I pasted the expression you entered, added Match Case, and it worked fine on the 'CamelCase' sample text.

I assume I am missing something - could you please send me a sample non-English UTF-8 document to work on?

Re: Splitting CamelCase words in Unicode?

Posted: Wed Oct 24, 2012 3:16 am
by dfhtextpipe
I'm well aware of the "Enable UTF-8 support" option - it's something I use almost every day.

Sorry, the issue is more specific than I first described.

The Perl pattern doesn't work in the Replace List filter! It only works in the Replace filter.

When you paste the pattern into the Replace List, it gets changed from

Code:

([a-zàáãèéìíòóõùúṣẹẽọ])([A-ZÀÁÃÈÉÌÍÒÓÕÙÚṢẸẼỌ])
to

Code:

([a-zàáãèéìíòóõùú????])([A-ZÀÁÃÈÉÌÍÒÓÕÙÚ????])
I can't see any good reason why these two types of filter should not have the same level of Perl pattern support.

So maybe it's not the level of support while the filter is running, but the GUI that's at fault here?
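
One possible explanation, which is an assumption and not something confirmed about TextPipe's internals, is that the pattern passes through a single-byte ANSI code page somewhere in the GUI. The letters that survive (à, á, ã and so on) all exist in Windows-1252, while ṣ, ẹ, ẽ and ọ do not. The Python sketch below reproduces the same symptom with a round trip through `cp1252`; the helper name `through_code_page` is hypothetical.

```python
import unicodedata

# The pattern as pasted, normalized to NFC so each accented letter is a
# single precomposed code point.
pattern = unicodedata.normalize(
    "NFC", "([a-zàáãèéìíòóõùúṣẹẽọ])([A-ZÀÁÃÈÉÌÍÒÓÕÙÚṢẸẼỌ])"
)


def through_code_page(text: str, code_page: str = "cp1252") -> str:
    """Round-trip text through a legacy code page.

    Characters the code page cannot represent are replaced by '?'.
    """
    return text.encode(code_page, errors="replace").decode(code_page)


mangled = through_code_page(pattern)

if __name__ == "__main__":
    print(mangled)
    print(mangled.count("?"), "characters were lost")
```

If that is what happens, the loss occurs before the pattern ever reaches the regex engine, which would fit the filter working in one place and not the other.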

David

Re: Splitting CamelCase words in Unicode?

Posted: Wed Oct 24, 2012 3:31 am
by dfhtextpipe
I should also add that even the Replace filter has a GUI problem, though it only shows up when copying the filter out of TextPipe.

If I select the filter and copy it to the clipboard, then paste the clipboard contents to a Unicode text editor, the question marks appear again.

Code:

Perl pattern [([a-zàáãèéìíòóõùú????])([A-ZÀÁÃÈÉÌÍÒÓÕÙÚ????])] with [$1 $2]
   [X] Match case
   [ ] Whole words only
   [ ] Case sensitive replace
   [ ] Prompt on replace
   [ ] Skip prompt if identical
   [ ] First only
   [ ] Extract matches
   Maximum text buffer size 4096
   [ ] Maximum match (greedy)
   [ ] Allow comments
   [ ] '.' matches newline
   [X] UTF-8 Support