Splitting CamelCase words in Unicode?


Moderators: DataMystic Support, Moderators

dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Splitting CamelCase words in Unicode?

Post by dfhtextpipe »

I have a non-English text that includes lots of CamelCase words.
The language is based on an extended Latin alphabet with various diacritics.
The text is encoded as UTF-8 (without BOM) and is normalized to Unicode NFC.

The following regexp can find them when used in the Notepad++ search feature:

[a-zàáãèéìíòóõùúṣẹẽọ][A-ZÀÁÃÈÉÌÍÒÓÕÙÚṢẸẼỌ]
(with Match case ticked)

Now supposing I'd like to use TextPipe to split these words where the CamelCase case-change is found.
i.e. where a lowercase letter is followed by an adjacent uppercase letter.
e.g. Change "CamelCase" to "Camel Case".
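
For comparison outside TextPipe, the intended transformation can be sketched in Python. This is only an illustration of the requirement, not a TextPipe filter: the function name `split_camel_case` is made up for the example, and the character classes are the ones listed above, NFC-normalized so that each accented letter is a single code point.

```python
import re
import unicodedata

# Character classes copied from the post. NFC-normalizing them guarantees
# that every accented letter is one precomposed code point inside the class.
LOWER = unicodedata.normalize("NFC", "a-zàáãèéìíòóõùúṣẹẽọ")
UPPER = unicodedata.normalize("NFC", "A-ZÀÁÃÈÉÌÍÒÓÕÙÚṢẸẼỌ")

# A lowercase letter immediately followed by an uppercase letter.
CAMEL_BOUNDARY = re.compile(f"([{LOWER}])([{UPPER}])")


def split_camel_case(text: str) -> str:
    """Insert a space wherever a lowercase letter meets an uppercase one."""
    text = unicodedata.normalize("NFC", text)  # match the pattern's form
    return CAMEL_BOUNDARY.sub(r"\1 \2", text)


print(split_camel_case("CamelCase"))  # Camel Case
```

Python's `re` module is case-sensitive by default, which corresponds to having Match case ticked.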

How should I do this in TextPipe?
i.e. What is the simplest method to implement this requirement?

Bear in mind that I can't just paste this code into a Perl replace filter:

Replace ([a-zàáãèéìíòóõùúṣẹẽọ])([A-ZÀÁÃÈÉÌÍÒÓÕÙÚṢẸẼỌ]) by $1 $2
with Match case ticked.
I can do this in Notepad++, but this sort of thing simply doesn't work in TextPipe.

If it's so easy in Notepad++, then why is it so difficult in TextPipe?


David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Splitting CamelCase words in Unicode?

Post by DataMystic Support »

Hi David,

Did you check the 'Enable UTF-8 support' option in the Perl pattern extended options?

I pasted the expression you entered, added Match Case, and it worked fine on the 'CamelCase' sample text.

I assume I am missing something - could you please send me a sample non-English UTF-8 document to work on?
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Splitting CamelCase words in Unicode?

Post by dfhtextpipe »

I'm well aware of the "Enable UTF-8 support" option - it's something I use almost every day.

Sorry, the issue is more specific than I first described.

The Perl pattern doesn't work in the Replace List filter! It only works in the Replace filter.

When you paste the pattern into the Replace List, it gets changed from

([a-zàáãèéìíòóõùúṣẹẽọ])([A-ZÀÁÃÈÉÌÍÒÓÕÙÚṢẸẼỌ])
to

([a-zàáãèéìíòóõùú????])([A-ZÀÁÃÈÉÌÍÒÓÕÙÚ????])
I can't see any good reason why these two types of filter should not have the same level of Perl pattern support.

So maybe it's not the level of support while the filter is running, but the GUI that's at fault here?
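
One possible explanation, which is an assumption and not something confirmed in this thread, is that the Replace List GUI passes the pattern through a single-byte Windows code page. The Python sketch below round-trips the pattern through cp1252 (chosen here only as a plausible candidate) with unencodable characters replaced by "?". The helper name `through_code_page` is hypothetical.

```python
import unicodedata

# The pattern from the post, NFC-normalized so each letter is one code point.
pattern = unicodedata.normalize(
    "NFC", "([a-zàáãèéìíòóõùúṣẹẽọ])([A-ZÀÁÃÈÉÌÍÒÓÕÙÚṢẸẼỌ])"
)


def through_code_page(text: str, code_page: str = "cp1252") -> str:
    """Simulate a lossy round trip through a legacy single-byte code page."""
    # Characters the code page cannot represent are encoded as "?".
    return text.encode(code_page, errors="replace").decode(code_page)


mangled = through_code_page(pattern)
print(mangled)
print(mangled.count("?"))  # 8
```

The letters àáãèéìíòóõùú all exist in cp1252 and survive, while ṣẹẽọ and their uppercase forms do not, which is the same set of characters that turned into question marks above.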

David
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Splitting CamelCase words in Unicode?

Post by dfhtextpipe »

I should also add that even the Replace filter has a GUI problem, though it only shows up when copying the filter out of TextPipe.

If I select the filter and copy it to the clipboard, then paste the clipboard contents to a Unicode text editor, the question marks appear again.

Perl pattern [([a-zàáãèéìíòóõùú????])([A-ZÀÁÃÈÉÌÍÒÓÕÙÚ????])] with [$1 $2]
   [X] Match case
   [ ] Whole words only
   [ ] Case sensitive replace
   [ ] Prompt on replace
   [ ] Skip prompt if identical
   [ ] First only
   [ ] Extract matches
   Maximum text buffer size 4096
   [ ] Maximum match (greedy)
   [ ] Allow comments
   [ ] '.' matches newline
   [X] UTF-8 Support
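
A possible workaround, untested in TextPipe, is to write the pattern using only ASCII so that nothing can be lost in the GUI or on the clipboard. PCRE-style engines accept `\x{HHHH}` code point escapes when UTF-8 mode is enabled; whether TextPipe's Replace List does is an assumption here. The Python helper below (`to_ascii_pattern`, a name invented for this sketch) just generates the escaped form.

```python
import unicodedata


def to_ascii_pattern(pattern: str) -> str:
    """Rewrite every non-ASCII character as a \\x{HHHH} code point escape."""
    pattern = unicodedata.normalize("NFC", pattern)  # one code point per letter
    return "".join(
        ch if ord(ch) < 128 else f"\\x{{{ord(ch):04X}}}"
        for ch in pattern
    )


print(to_ascii_pattern("[ṣẹẽọ]"))  # [\x{1E63}\x{1EB9}\x{1EBD}\x{1ECD}]
```

ASCII characters such as brackets, ranges and parentheses pass through unchanged, so the structure of the original pattern is preserved.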
David