Pattern matching \s or \S

Posted: Thu Oct 29, 2015 6:39 am
by dfhtextpipe
The pattern matching reference states for \s:

  any white space character:
  space, formfeed, newline, carriage return, horizontal tab, and vertical tab

In Notepad++, the pattern \s also matches the no-break space character \xA0.

Why is TextPipe different?

Which program conforms to external standards in this respect?

Best regards,

David

Re: Pattern matching \s or \S

Posted: Thu Oct 29, 2015 8:30 am
by DataMystic Support
Hi David,

Which version of TextPipe? And is this using the perl pattern match?

According to the PCRE documentation from 2012 (http://www.pcre.org/pcre.txt), \s does include \xA0.

I tested this in TextPipe 9.9.2 and it works fine. I put

A0
in the trial run, then used the following filter list to convert the hex code to an actual character and matched it with \s.
Which version of TP are you using?

|
|--Hex Decode
|
|--Perl pattern [\s] with [*]
| [ ] Match case
| [ ] Whole words only
| [ ] Case sensitive replace
| [X] Prompt on replace
| [ ] Skip prompt if identical
| [ ] First only
| [ ] Extract matches
| Maximum text buffer size 4096
| [ ] Maximum match (greedy)
| [ ] Allow comments
| [X] '.' matches newline
| [ ] UTF-8 Support
|
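
The same check can be reproduced outside TextPipe. The following is a minimal sketch in Python, assuming Python's `re` module as a stand-in; it is neither PCRE nor TextPipe's engine, so it only illustrates how the answer depends on the matching mode. The variable names are illustrative.

```python
import re

# Decode the hex text "A0" into a single character, mirroring the Hex Decode step.
ch = bytes.fromhex("A0").decode("latin-1")  # U+00A0 NO-BREAK SPACE

# Unicode-aware matching (the default for str patterns): \s matches U+00A0.
unicode_match = re.fullmatch(r"\s", ch) is not None

# ASCII-only matching: \s is limited to space, \t, \n, \r, \f and \v.
ascii_match = re.fullmatch(r"\s", ch, flags=re.ASCII) is not None

print(unicode_match, ascii_match)  # True False
```

Whether \xA0 counts as white space therefore depends on the character tables in use, not on the \s escape alone.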

Re: Pattern matching \s or \S

Posted: Thu Oct 29, 2015 9:54 pm
by dfhtextpipe
I'm using TextPipe 9.9.2

I was merely going by the help text description for \s

It's now clear that \s does include \xA0, so please would you update the help file?

In the PCRE link you cited, this is significant:

  This list may vary if locale-specific matching is taking place. For example, in some locales the "non-breaking space" character (\xA0) is recognized as white space, and in others the VT character is not.

Q. Remove multiple whitespace only removes ordinary spaces and tabs, doesn't it?
i.e. it doesn't treat \xA0 as whitespace!

Thanks.

Best regards,

David

Re: Pattern matching \s or \S

Posted: Thu Oct 29, 2015 11:33 pm
by DataMystic Support
Yes - that filter is very fast and doesn't use PCRE regex. As you say, it doesn't handle \xA0.

Do you think we should change it to use regex and take advantage of the extra Unicode characters?
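
To make the trade-off concrete, here is a small Python sketch comparing the two behaviours. It assumes Python's `re` module and a made-up sample string; neither line is TextPipe's actual implementation of Remove multiple whitespace.

```python
import re

# Sample text with doubled spaces, doubled no-break spaces and doubled tabs.
text = "a  b\xa0\xa0c\t\td"

# Collapse only runs of ordinary spaces and tabs; U+00A0 is left untouched.
plain = re.sub(r"[ \t]+", " ", text)

# Collapse runs of any Unicode white space, which also swallows U+00A0.
unicode_ws = re.sub(r"\s+", " ", text)

print(repr(plain))       # 'a b\xa0\xa0c d'
print(repr(unicode_ws))  # 'a b c d'
```

The second form changes the output for any input containing no-break spaces, which is the compatibility risk for existing filters.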

Re: Pattern matching \s or \S

Posted: Fri Oct 30, 2015 1:55 am
by dfhtextpipe
Hi Simon,

I'd be inclined to say "no", so that it doesn't break existing filters, which is important for all users.

Unicode treats a number of special characters as "white space", but most users rarely come across any of them.

The few users that do can readily devise suitable replace filters.

e.g. Here's how I dealt with no break spaces:

Comment...
|  Remove redundant no break spaces
|  
|   - Except before punctuation marks :;!?
|
+--Perl pattern [\xA0[\;\:\?\!]] with []
   |  [X] Match case
   |  [ ] Whole words only
   |  [ ] Case sensitive replace
   |  [ ] Prompt on replace
   |  [ ] Skip prompt if identical
   |  [ ] First only
   |  [ ] Extract matches
   |  Maximum text buffer size 4096
   |  [ ] Maximum match (greedy)
   |  [ ] Allow comments
   |  [ ] '.' matches newline
   |  [X] UTF-8 Support
   |
   +--Perl pattern [\xA0(\.)] with [$1]
         [X] Match case
         [ ] Whole words only
         [ ] Case sensitive replace
         [ ] Prompt on replace
         [ ] Skip prompt if identical
         [ ] First only
         [ ] Extract matches
         Maximum text buffer size 4096
         [ ] Maximum match (greedy)
         [ ] Allow comments
         [ ] '.' matches newline
         [X] UTF-8 Support

         [ ] Process longest strings first
         [ ] Simultaneous search

       Further search/replace list phrases (CSV format):
       \xA0,\x20
Once the redundant no break spaces have been replaced by ordinary spaces, this can be followed by Remove multiple whitespace, if so required.

The above method can easily be adapted.
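
For readers without TextPipe, a rough Python analogue of the stated intent might look like the sketch below. It follows the comment in the filter list (keep no break spaces before : ; ! ? and turn the rest into ordinary spaces), not the filter list line by line; the function name `normalize_nbsp` is hypothetical.

```python
import re

NBSP = "\u00a0"  # no break space, \xA0

def normalize_nbsp(text: str) -> str:
    """Replace redundant no break spaces, then collapse runs of spaces and tabs."""
    # Replace a no break space with an ordinary space unless it is
    # immediately followed by one of the punctuation marks ; : ! ?
    text = re.sub(NBSP + r"(?![;:!?])", " ", text)
    # Collapse runs of ordinary spaces and tabs into a single space.
    return re.sub(r"[ \t]+", " ", text)

print(repr(normalize_nbsp("Oui\xa0! Non\xa0\xa0merci")))  # 'Oui\xa0! Non merci'
```

The set of protected punctuation marks lives in one character class, so it can be adjusted for other typographic conventions.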

Best regards,

David