Pattern matching \s or \S

dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Pattern matching \s or \S

Post by dfhtextpipe »

The pattern matching reference states for \s:
"any white space character: space, formfeed, newline, carriage return, horizontal tab, and vertical tab"
In Notepad++, the pattern \s also matches the no-break space character \xA0.

Why is TextPipe different?

Which program conforms to external standards in this respect?

Best regards,

David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Pattern matching \s or \S

Post by DataMystic Support »

Hi David,

Which version of TextPipe? And is this using the perl pattern match?

According to a PCRE spec from 2012 (http://www.pcre.org/pcre.txt), \s does include \xA0.

I tested this in TextPipe 9.9.2 and it works fine. I put

A0

in the trial run, then used the following filter list to convert the hex code to an actual character, and then matched on it with \s.
Which version of TP are you using?

|
|--Hex Decode
|
|--Perl pattern [\s] with [*]
| [ ] Match case
| [ ] Whole words only
| [ ] Case sensitive replace
| [X] Prompt on replace
| [ ] Skip prompt if identical
| [ ] First only
| [ ] Extract matches
| Maximum text buffer size 4096
| [ ] Maximum match (greedy)
| [ ] Allow comments
| [X] '.' matches newline
| [ ] UTF-8 Support
|
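
For anyone wanting to reproduce the same test outside TextPipe, here is a rough Python sketch of the two steps (hex decode, then replace \s with *). It is only an illustration: the hex_decode helper is a made-up name, the Latin-1 decoding is an assumption based on UTF-8 Support being unticked, and Python's re module is not the PCRE engine TextPipe uses.

```python
import re


def hex_decode(hex_text: str, encoding: str = "latin-1") -> str:
    """Turn a string of hex digits into text (hypothetical stand-in for Hex Decode)."""
    return bytes.fromhex(hex_text).decode(encoding)


# Step 1: "A0" becomes the single character U+00A0 (no-break space).
decoded = hex_decode("A0")

# Step 2: replace every whitespace character with "*".
result = re.sub(r"\s", "*", decoded)

print(repr(decoded), repr(result))
```

If \s treats \xA0 as whitespace, the result is a single asterisk, which is what the trial run showed.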
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Pattern matching \s or \S

Post by dfhtextpipe »

I'm using TextPipe 9.9.2

I was merely going by the help text description for \s

It's now clear that \s does include \xA0, so please would you update the help file?

In the PCRE link you cited, this is significant:
This list may vary if locale-specific matching
is taking place. For example, in some locales the "non-breaking space"
character (\xA0) is recognized as white space, and in others the VT
character is not.
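
The same kind of variation can be seen in other regex engines. As a small illustration only (Python's re module rather than PCRE, so it says nothing definitive about TextPipe or Notepad++), the sketch below shows \s accepting or rejecting \xA0 depending on the matching mode:

```python
import re

NBSP = "\xa0"  # U+00A0 NO-BREAK SPACE

# str patterns use Unicode rules by default, so \s includes the no-break space.
unicode_match = re.fullmatch(r"\s", NBSP) is not None

# re.ASCII restricts \s to space, \t, \n, \r, \f and \v.
ascii_match = re.fullmatch(r"\s", NBSP, flags=re.ASCII) is not None

# bytes patterns are ASCII-only by default, so the byte 0xA0 is not whitespace.
bytes_match = re.fullmatch(rb"\s", b"\xa0") is not None

print(unicode_match, ascii_match, bytes_match)
```

So whether \xA0 counts as white space depends on the engine and its mode, not on the \s syntax itself.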

Q. The Remove multiple whitespace filter only removes ordinary spaces and tabs, doesn't it?
i.e. it doesn't treat \xA0 as whitespace!

Thanks.

Best regards,

David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Pattern matching \s or \S

Post by DataMystic Support »

Yes - that filter is very fast and doesn't use PCRE regex. As you say, it doesn't handle \xA0.

Do you think we should change it to use regex and take advantage of the extra Unicode whitespace characters?
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Pattern matching \s or \S

Post by dfhtextpipe »

Hi Simon,

I'd be inclined to say "no", so that the change doesn't break existing filters, which matters to all users.

Unicode treats a number of special characters as "white space", but most users rarely come across any of them.

The few users that do can readily devise suitable replace filters.

e.g. Here's how I dealt with no break spaces:

Comment...
|  Remove redundant no break spaces
|  
|   - Except before punctuation marks :;!?
|
+--Perl pattern [\xA0[\;\:\?\!]] with []
   |  [X] Match case
   |  [ ] Whole words only
   |  [ ] Case sensitive replace
   |  [ ] Prompt on replace
   |  [ ] Skip prompt if identical
   |  [ ] First only
   |  [ ] Extract matches
   |  Maximum text buffer size 4096
   |  [ ] Maximum match (greedy)
   |  [ ] Allow comments
   |  [ ] '.' matches newline
   |  [X] UTF-8 Support
   |
   +--Perl pattern [\xA0(\.)] with [$1]
         [X] Match case
         [ ] Whole words only
         [ ] Case sensitive replace
         [ ] Prompt on replace
         [ ] Skip prompt if identical
         [ ] First only
         [ ] Extract matches
         Maximum text buffer size 4096
         [ ] Maximum match (greedy)
         [ ] Allow comments
         [ ] '.' matches newline
         [X] UTF-8 Support

         [ ] Process longest strings first
         [ ] Simultaneous search

       Further search/replace list phrases (CSV format):
       \xA0,\x20
Once the redundant no-break spaces have been replaced by ordinary spaces, the filter can be followed by Remove multiple whitespace, if required.

The above method can easily be adapted.
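
For readers without TextPipe to hand, the intent stated in the filter comment could be sketched in Python roughly as follows. This is not a translation of the filter list: normalize_nbsp is a hypothetical name, the lookahead is my own way of expressing "except before :;!?", and the second step only approximates what Remove multiple whitespace does.

```python
import re


def normalize_nbsp(text: str) -> str:
    """Replace redundant no-break spaces, then collapse runs of spaces and tabs."""
    # Turn each no-break space into an ordinary space,
    # unless it is immediately followed by one of ; : ? !
    text = re.sub(r"\xa0(?![;:?!])", " ", text)
    # Assumed equivalent of "Remove multiple whitespace": ordinary spaces and tabs only.
    return re.sub(r"[ \t]+", " ", text)


sample = "Bonjour\xa0\xa0le monde\xa0!  Oui\xa0?"
print(repr(normalize_nbsp(sample)))
```

The no-break spaces before "!" and "?" survive, while the doubled one after "Bonjour" ends up as a single ordinary space.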

Best regards,

David