
Maximum match size

Posted: Sat Sep 20, 2008 10:31 am
by pecosbill
From help:
Changes the maximum search buffer size. If you need to match a string that is longer than 4096 bytes (4k), increase the Maximum text buffer size to be at least as large as the string you need to match.
This seems to me to be stating that it will look ahead X (default=4096) bytes from the current pointer for the potential match. Supposing that the desired match happens in 1024 chars, doesn't the pointer advance to the end of the successful match then look ahead for the next 4096 bytes?

I'm not seeing that. I have this:

Code:

EasyPattern [DIVISION : DIVISION[1+ not '|']||] with []
|  [ ] Match case
|  [ ] Whole words only
|  [ ] Case sensitive replace
|  [ ] Prompt on replace
|  [ ] Skip prompt if identical
|  [ ] First only
|  [ ] Extract matches
|  Maximum text buffer size 73728
Yet I keep having to increase the size for it to work as I find larger and larger runs within larger files. In the target text, a single pattern match spans no more than about 6000 chars (a single printed page), and these matches repeat over a run of anywhere from 2000 to 324960 chars in total. It seems to me that setting the max match size to 8096 bytes or even 14336 should be enough, yet it isn't. Yet, on my current test file with its 324960-char repeating range, it works with 14336. I would love to paste the data, but it's confidential.

Any guesses? Any reason not to set the max match to something enormous?

Re: Maximum match size

Posted: Wed Sep 24, 2008 8:29 am
by DataMystic Support
It's not the buffer size, it's the maximum match size. So if your longest possible match is 70k long, set it to 70k.

Do you mean the total file size is 2000 to 324960, but the longest match you expect is 6000 chars? If so, then 6000 will be enough.

Remember that TextPipe never loads the entire file into memory at once. Therefore it needs to keep just enough to fully match the largest text pattern you have.

However, it needs just enough to match the pattern even if it occurs towards the end of the current buffer.
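
To make the idea concrete, here is a rough Python sketch of chunked matching with a bounded carry-over. This is not TextPipe's actual code, just an illustration of the general technique. The function name stream_matches and the parameters max_match and chunk_size are made up for this example, and it assumes no single match is ever longer than max_match characters and that the pattern does not depend on context before the match (lookbehind, word boundaries).

```python
import io
import re


def stream_matches(stream, pattern, max_match=4096, chunk_size=65536):
    """Yield regex matches from a text stream without loading it all.

    Hypothetical illustration: assumes every match is at most
    `max_match` characters long.
    """
    regex = re.compile(pattern)
    buf = ""
    while True:
        chunk = stream.read(chunk_size)
        eof = not chunk
        buf += chunk
        # A match starting at or before `limit` has a full `max_match`
        # characters after its start available in the buffer, so it
        # cannot be cut short by the chunk boundary.
        limit = len(buf) - max_match
        consumed = 0
        for m in regex.finditer(buf):
            if not eof and m.start() > limit:
                # Too close to the end of the buffer: wait for more data.
                break
            yield m.group(0)
            consumed = m.end()
        if eof:
            return
        # Carry over only the unprocessed tail (fewer than `max_match`
        # characters, unless a match just ended inside it).
        buf = buf[max(consumed, limit + 1):]


# Usage: 200 bracketed runs of up to 51 characters, read 100 at a time.
text = "".join("<" + "x" * (i % 50) + ">" + "." * 37 for i in range(200))
found = list(stream_matches(io.StringIO(text), r"<x*>",
                            max_match=64, chunk_size=100))
```

In this sketch, if max_match is smaller than the longest real match, a match that straddles a chunk boundary can be truncated or missed, which is the same kind of symptom as a setting that is too small.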

The max match should be set low (4k) as this makes the overall matching process more efficient, and stops pattern matches like .* from running away with the whole file.
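
As a loose analogy only (again, not how TextPipe is implemented), the runaway effect is easy to see with an ordinary regex engine. The sample text and the 4096 cap below are arbitrary values chosen for illustration.

```python
import re

# Hypothetical sample: a short header followed by a long run of text.
text = "DIVISION : A|" + "x" * 100_000

# With DOTALL, an unbounded .* swallows everything that is available.
unbounded = re.match(r".*", text, re.DOTALL)

# An explicit upper bound stops the match after 4096 characters.
bounded = re.match(r".{0,4096}", text, re.DOTALL)

unbounded_len = len(unbounded.group(0))
bounded_len = len(bounded.group(0))
```

Here unbounded_len equals the length of the whole text, while bounded_len stops at the cap.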

BTW - are you documenting those huge filters with comments?