Search/replace option Simultaneous search bug?

Get help with installation and running here.

Moderators: DataMystic Support, Moderators

dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Search/replace option Simultaneous search bug?

Post by dfhtextpipe »

If the search type is Perl and the search pattern contains \x{hhh..} (the character with hex code hhh.., UTF-8 mode only),
then ticking the Simultaneous search option produces a popup dialog with this kind of error message:
"Error on line #: character value in \x{...} sequence is too large."
Popup screenshot
I see no fundamental reason why this should be invalid, so I guess this must be a software bug.

Best regards,

David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Search/replace option Simultaneous search bug?

Post by DataMystic Support »

Hi David,

Here is what I found, searching within PCRE results:

Note that despite its name, in utf8 mode \x{...} does not match utf8 sequences but rather "real" unicode codepoints.

Code: Select all

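<?php
// The Cyrillic letter "ю" is code point U+044E;
// its UTF-8 representation is the two bytes D1 8E.

$u = "Hello \xd1\x8e"; // this string is UTF-8 encoded
echo $u, "<br>";
// With the u modifier, \x{44e} matches the code point, not the raw bytes.
echo preg_replace('~\x{44e}~u', '*', $u); // prints "Hello *"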
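
For comparison, here is a minimal sketch of the same distinction in Python. Python's `re` module is only a stand-in here; it is not PCRE and not what TextPipe uses, and the variable names are illustrative. It shows that a text pattern matches the code point U+044E, while a bytes pattern has to spell out the UTF-8 bytes D1 8E.

```python
import re

# U+044E (Cyrillic small letter yu) as a code point and as UTF-8 bytes.
yu = "\u044e"
utf8_bytes = yu.encode("utf-8")  # two bytes: D1 8E

text = "Hello " + yu

# A str pattern works on code points, much like \x{44e} in PCRE's UTF-8 mode.
replaced_text = re.sub(r"\u044e", "*", text)

# A bytes pattern works on raw bytes, so the UTF-8 sequence must be written out.
replaced_bytes = re.sub(rb"\xd1\x8e", b"*", text.encode("utf-8"))

print(utf8_bytes.hex())  # d18e
print(replaced_text)     # Hello *
print(replaced_bytes)    # b'Hello *'
```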
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Search/replace option Simultaneous search bug?

Post by dfhtextpipe »

How does that answer my issue?
David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Search/replace option Simultaneous search bug?

Post by DataMystic Support »

Are you specifying a code point, or a hex value?

What value are you specifying? It looks like a bug in the PCRE regex engine, but I can't confirm without this.
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Search/replace option Simultaneous search bug?

Post by dfhtextpipe »

Here's the sub-filter that fails when I tick Simultaneous search:

Code: Select all

Perl pattern [\x{05BE}] with [#]
   [X] Match case
   [ ] Whole words only
   [ ] Case sensitive replace
   [ ] Prompt on replace
   [ ] Skip prompt if identical
   [ ] First only
   [ ] Extract matches
   Maximum text buffer size 4096
   [ ] Maximum match (greedy)
   [ ] Allow comments
   [ ] '.' matches newline
   [X] UTF-8 Support

 Further search/replace list phrases (CSV format):
 \x{05C0},#
 \x{05C3},#
 \x{05C6},#
 
Here's what it looks like when that option is not ticked:

Code: Select all

Perl pattern [\x{05BE}] with [#]
   [X] Match case
   [ ] Whole words only
   [ ] Case sensitive replace
   [ ] Prompt on replace
   [ ] Skip prompt if identical
   [ ] First only
   [ ] Extract matches
   Maximum text buffer size 4096
   [ ] Maximum match (greedy)
   [ ] Allow comments
   [ ] '.' matches newline
   [X] UTF-8 Support

 Further search/replace list phrases (CSV format):
 \x{05C0},#
 \x{05C3},#
 \x{05C6},#
 
It would seem that the tick box option has no counterpart in the visual representation of the replace list filter.
This in itself is also a cause for concern.
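
To make the expected result concrete, here is a rough Python sketch of replacing all four code points from the listing in a single pass. This is an assumption about what a simultaneous search is meant to achieve, not a description of how TextPipe implements the option; `pairs`, `pattern` and `replace_all` are illustrative names.

```python
import re

# The four search/replace pairs from the listing: each code point becomes "#".
pairs = {
    "\u05be": "#",
    "\u05c0": "#",
    "\u05c3": "#",
    "\u05c6": "#",
}

# Assumption: "simultaneous" is modelled as one alternation, so the text is
# scanned once and each match is replaced during that single scan.
pattern = re.compile("|".join(re.escape(key) for key in pairs))


def replace_all(text: str) -> str:
    """Replace every listed code point with its paired replacement."""
    return pattern.sub(lambda match: pairs[match.group(0)], text)


sample = "a\u05beb\u05c0c\u05c3d\u05c6e"
result = replace_all(sample)
print(result)  # a#b#c#d#e
```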
Screenshot of my replace list filter.
David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Search/replace option Simultaneous search bug?

Post by DataMystic Support »

Hi David,

In the next release, the display text will also show the 'Simultaneous search' and 'Process longest strings first' options.

According to the PCRE spec:
\x{hhh..} - character with hex code hhh.. (non-JavaScript mode)

By default, after \x, from zero to two hexadecimal digits are read (letters can be in upper or lower case). Any number of hexadecimal digits may appear between \x{ and }, but the character code is constrained as follows:

8-bit non-UTF mode     less than 0x100
8-bit UTF-8 mode       less than 0x10ffff and a valid codepoint
16-bit non-UTF mode    less than 0x10000
16-bit UTF-16 mode     less than 0x10ffff and a valid codepoint
32-bit non-UTF mode    less than 0x80000000
32-bit UTF-32 mode     less than 0x10ffff and a valid codepoint

Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called "surrogate" codepoints), and 0xffef.

If characters other than hexadecimal digits appear between \x{ and }, or if there is no terminating }, this form of escape is not recognized. Instead, the initial \x will be interpreted as a basic hexadecimal escape, with no following digits, giving a character whose value is zero.
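
As a quick check of the limits quoted above against the values in this thread, here is a small Python sketch. It is not PCRE code; `LIMITS` and `is_allowed` are illustrative names, only the two 8-bit modes are modelled, and it assumes U+10FFFF itself is accepted in UTF-8 mode. The separately listed 0xffef is left out because it does not affect these values.

```python
# Upper bounds (exclusive) for \x{...} values, taken from the quoted list.
# Assumption: U+10FFFF itself is accepted in UTF-8 mode, so the bound is 0x110000.
LIMITS = {
    "8-bit non-UTF": 0x100,
    "8-bit UTF-8": 0x110000,
}


def is_allowed(value: int, mode: str) -> bool:
    """Return True if value passes the range (and surrogate) check for mode."""
    if value >= LIMITS[mode]:
        return False
    if mode == "8-bit UTF-8" and 0xD800 <= value <= 0xDFFF:
        return False  # surrogate code points are invalid
    return True


# Code points used in the sub-filter posted earlier in this thread.
values = [0x05BE, 0x05C0, 0x05C3, 0x05C6]

for mode in LIMITS:
    print(mode, [is_allowed(v, mode) for v in values])
```

All four values pass the UTF-8 mode check and fail the non-UTF one. That would fit the reported "too large" message if the patterns were being compiled without UTF-8 mode when Simultaneous search is ticked, but nothing in this thread confirms that.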
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Search/replace option Simultaneous search bug?

Post by dfhtextpipe »

I will try again with the next release, though I suspect that the things you've fixed for filter display do not address the underlying issue.

David