Pattern matching \s or \S

Post by **dfhtextpipe** » Thu Oct 29, 2015 6:39 am

The pattern matching reference states for \s

any white space character.
space, formfeed, newline, carriage return, horizontal tab, and vertical tab

In Notepad++, the pattern \s also matches the no break space character \xA0.

Why is TextPipe different?

Which program conforms to external standards in this respect?

Best regards,

David

Post by **DataMystic Support** » Thu Oct 29, 2015 8:30 am

Hi David,

Which version of TextPipe? And is this using the perl pattern match?

According to some PCRE specs from 2012 http://www.pcre.org/pcre.txt

\s does include \xA0

I tested this in TextPipe 9.9.2 and it works fine. I put


A0

Post by **dfhtextpipe** » Thu Oct 29, 2015 9:54 pm

I'm using TextPipe 9.9.2

I was merely going by the help text description for \s

It's now clear that \s does include \xA0, so please would you update the help file?

In the PCRE link you cited, this is significant:

This list may vary if locale-specific matching
is taking place. For example, in some locales the "non-breaking space"
character (\xA0) is recognized as white space, and in others the VT
character is not.
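As an illustration of how the whitespace class can change with the matching mode, here is a minimal sketch in Python. It uses Python's own `re` module, not TextPipe or PCRE, so it only shows the general idea; the helper names `matches_unicode_space` and `matches_ascii_space` are made up for this example.

```python
import re

NBSP = "\u00A0"  # no break space, the same code point as \xA0


def matches_unicode_space(ch: str) -> bool:
    # For str patterns, Python's \s is Unicode-aware by default,
    # so it accepts the no break space as white space.
    return re.fullmatch(r"\s", ch) is not None


def matches_ascii_space(ch: str) -> bool:
    # With re.ASCII, \s is limited to space, \t, \n, \r, \f and \v.
    return re.fullmatch(r"\s", ch, flags=re.ASCII) is not None


print(matches_unicode_space(NBSP))  # True
print(matches_ascii_space(NBSP))    # False
print(matches_ascii_space("\v"))    # True
```

The same pattern gives different answers for \xA0 depending on the mode, which is the kind of variation the PCRE text describes for locales.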

Q. Remove multiple whitespace only removes ordinary spaces and tabs, doesn't it?
i.e. It doesn't treat \xA0 as whitespace!

Thanks.

Best regards,

David

Post by **DataMystic Support** » Thu Oct 29, 2015 11:33 pm

Yes - that filter is very fast and doesn't use PCRE regex. As you say, it doesn't handle \xA0.

Do you think we should change it to use regex and take advantage of the extra Unicode whitespace characters?

Post by **dfhtextpipe** » Fri Oct 30, 2015 1:55 am

Hi Simon,

I'd be inclined to say "no", so that it doesn't break existing filters, which matters to all users.

Unicode treats a number of special characters as "white space", but most users rarely come across any of them.

The few users that do can readily devise suitable replace filters.

e.g. Here's how I dealt with no break spaces:


Comment...
|  Remove redundant no break spaces
|  
|   - Except before punctuation marks :;!?
|
+--Perl pattern [\xA0[\;\:\?\!]] with []
   |  [X] Match case
   |  [ ] Whole words only
   |  [ ] Case sensitive replace
   |  [ ] Prompt on replace
   |  [ ] Skip prompt if identical
   |  [ ] First only
   |  [ ] Extract matches
   |  Maximum text buffer size 4096
   |  [ ] Maximum match (greedy)
   |  [ ] Allow comments
   |  [ ] '.' matches newline
   |  [X] UTF-8 Support
   |
   +--Perl pattern [\xA0(\.)] with [$1]
         [X] Match case
         [ ] Whole words only
         [ ] Case sensitive replace
         [ ] Prompt on replace
         [ ] Skip prompt if identical
         [ ] First only
         [ ] Extract matches
         Maximum text buffer size 4096
         [ ] Maximum match (greedy)
         [ ] Allow comments
         [ ] '.' matches newline
         [X] UTF-8 Support

         [ ] Process longest strings first
         [ ] Simultaneous search

       Further search/replace list phrases (CSV format):
       \xA0,\x20

Once the redundant no break spaces have been replaced by ordinary spaces, the Remove multiple whitespace filter can follow, if so required.

The above method can easily be adapted.
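For readers outside TextPipe, the intent described above could be sketched roughly as follows in Python. This is not a translation of the filter listing: it assumes the goal is to turn each no break space into an ordinary space unless it comes directly before one of `: ; ! ?`, and it assumes that removing multiple whitespace means collapsing runs of spaces and tabs to a single space. The function name `normalize_nbsp` is hypothetical.

```python
import re

NBSP = "\u00A0"  # no break space (\xA0)


def normalize_nbsp(text: str) -> str:
    # Assumption: a no break space is kept only when the next
    # character is one of : ; ! ? and is otherwise replaced by \x20.
    text = re.sub(NBSP + r"(?![:;!?])", " ", text)
    # Assumption: "remove multiple whitespace" collapses runs of
    # ordinary spaces and tabs into one space; \xA0 is left alone.
    return re.sub(r"[ \t]{2,}", " ", text)


sample = "Bonjour" + NBSP + " le monde" + NBSP + "!"
print(repr(normalize_nbsp(sample)))  # 'Bonjour le monde\xa0!'
```

The negative lookahead keeps the punctuation rule in a single pass; the set of punctuation marks can be adjusted to suit other texts.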

Best regards,

David
