Split at Pattern - After this many

Posted: Thu Dec 31, 2009 12:44 am
by dfhtextpipe
The After this many control for Split at Pattern does not behave the way I understood it would.

The filter I am designing is to convert a single whole Bible text file into 66 separate books.

When I changed the value of After this many from 1 to 2, instead of obtaining 67 output files, I got only 34.

What I was looking for was a means to ignore the first instance of the pattern only, but to split at all the other occurrences.
This would avoid outputting an empty file before the first pattern that indicates the start of the book of Genesis.

It was a simple task to create earlier sub-filters to generate a suitable pattern for the start of each book.
I am using Split position = Before pattern. I do not wish to remove the pattern for Genesis.

I therefore suggest that the Split at Pattern filter be further enhanced to provide an additional control value Ignore this many.

Please review and advise.
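
The suggested Ignore this many control could behave roughly as in the Python sketch below. This only illustrates the proposed semantics and is not TextPipe code; the function name, the `ignore_this_many` parameter, and the sample text are all hypothetical.

```python
import re

def split_before_pattern(text, pattern, ignore_this_many=0):
    """Split text before each match of pattern, skipping the first few matches."""
    pieces = []
    start = 0
    for index, match in enumerate(re.finditer(pattern, text, flags=re.MULTILINE)):
        if index < ignore_this_many:
            continue  # proposed behavior: no split at the ignored matches
        pieces.append(text[start:match.start()])
        start = match.start()
    pieces.append(text[start:])  # everything after the last split point
    return pieces

# Hypothetical stand-in for the Bible file: 66 books, each opened by an \id line.
sample = "".join(f"\\id B{n:02d}\nsome text\n" for n in range(1, 67))
books = split_before_pattern(sample, r"^\\id (\w+)$", ignore_this_many=1)
print(len(books))  # 66, and no empty piece before the first \id line
```

With `ignore_this_many=0` the same function returns 67 pieces, the first of which is empty.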

Re: Split at Pattern - After this many

Posted: Thu Dec 31, 2009 12:50 am
by dfhtextpipe
The help for this feature is somewhat ambiguous. It states,
If the pattern specifies a single character only, such as a form feed (\f), you can split after this many occurrences of the pattern have been found in one file. The maximum is 2147483647.
Observations:
  • The phrase "a single character only" seems to be superfluous. The feature works for any pattern, not just a single character.
  • The phrase "found in one file" is ambiguous. The actual behavior is that this many occurrences must be found for each split.

Re: Split at Pattern - After this many

Posted: Fri Jan 08, 2010 9:52 am
by DataMystic Support
Thanks David,

We've altered the text and added an example for clarity:
If 'After this many' is set to 2, splits will occur after the 2nd, 4th, 6th, 8th, 10th, ... matches.
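
The counting rule can be checked with a short Python sketch. It is a model of the behavior described in this thread, not TextPipe's implementation; it assumes Split position = Before pattern, and the function name and sample text are made up for illustration.

```python
import re

def split_before_every_nth(text, pattern, after_this_many=1):
    """Split text before every Nth match of pattern and return the pieces."""
    pieces = []
    start = 0
    matches = re.finditer(pattern, text, flags=re.MULTILINE)
    for count, match in enumerate(matches, start=1):
        if count % after_this_many == 0:  # only every Nth match triggers a split
            pieces.append(text[start:match.start()])
            start = match.start()
    pieces.append(text[start:])  # remainder after the last split point
    return pieces

# Hypothetical stand-in for the Bible file: 66 books, each opened by an \id line.
bible = "".join(f"\\id B{n:02d}\nsome text\n" for n in range(1, 67))
ID_LINE = r"^\\id (\w+)$"

print(len(split_before_every_nth(bible, ID_LINE, 1)))  # 67 (the first piece is empty)
print(len(split_before_every_nth(bible, ID_LINE, 2)))  # 34
```

With 66 matches, a setting of 2 gives 33 split points and therefore 34 files, which matches the count reported in the first post.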

Ignore this many - we'd prefer not to add this at this stage. Can you use a search/replace with 'Match first only' to prevent the initial file being created?

Re: Split at Pattern - After this many

Posted: Sun Jan 10, 2010 4:00 am
by dfhtextpipe
Simon,

You asked, "Can you use a search/replace with 'Match first only' to prevent the initial file being created?".

I don't think this would have any effect on a split filter.

I tried using 'Restrict to lines 2 to END-0' and making the split filter a sub-filter, but it made no difference.

Maybe I am misunderstanding something. Please enlighten me.

David

Re: Split at Pattern - After this many

Posted: Sun Jan 10, 2010 1:03 pm
by DataMystic Support
You're splitting on a pattern, right?

So why not use a search/replace prior to the split filter to remove the first occurrence of that pattern?

Re: Split at Pattern - After this many

Posted: Thu Feb 25, 2010 6:31 pm
by dfhtextpipe
Because the split position is Before pattern.

Code:

Split before pattern ^\\id (\w+)$, filename %2.2d_%f.usfm, number 0 count 1
 
All the split files must contain the pattern.
If I remove the first pattern, then the first required split file would not contain the pattern.
I don't want a trivial zero-length file to be created just because the first pattern is on the first line of the file being processed.

Any better suggestions?

Re: Split at Pattern - After this many

Posted: Thu Feb 25, 2010 7:05 pm
by dfhtextpipe
I have solved the problem as follows:

The split filter is now simply:

Code:

Comment...
|  Notes on Split:
|  
|  2010-02-25
|  Added a formfeed (\f) before each \id (on the same line)
|  Removed the first formfeed (\f) using restrict to line range 1 with a subfilter
|  Changed the split pattern to \f and "At pattern"
|  Changed the split counter to start at 1
|  
|  This solved the following problem:
|  A zero length file was created with filename 00_*.usfm
|  
|  Possible further enhancement:
|  How to include the book ID in the split filename?
|
|--Restrict lines:Line 1 .. line 1
|  |
|  +--Perl pattern [^\f] with []
|        [ ] Match case
|        [ ] Whole words only
|        [ ] Case sensitive replace
|        [ ] Prompt on replace
|        [ ] Skip prompt if identical
|        [ ] First only
|        [ ] Extract matches
|        Maximum text buffer size 4096
|        [ ] Maximum match (greedy)
|        [ ] Allow comments
|        [ ] '.' matches newline
|        [X] UTF-8 Support
|      
+--Split at pattern ^\f, filename %2.2d_%f.usfm, number 1 count 1
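
For comparison, the same sequence of steps can be sketched in Python. This is an illustration of the logic only, not TextPipe code: the function names and sample text are hypothetical, and the filename scheme (a two-digit counter followed by the book ID) is just one possible answer to the filename question in the notes, not something the split filter is known to do.

```python
import re

def split_books(text):
    """Mark each \\id line with a form feed, drop the first marker, split at the rest."""
    # Step 1: insert a form feed in front of every line of the form "\id XXX".
    marked = re.sub(r"^(?=\\id \w+$)", "\f", text, flags=re.MULTILINE)
    # Step 2: remove the form feed at the very start, so no empty first piece appears.
    if marked.startswith("\f"):
        marked = marked[1:]
    # Step 3: split at the remaining form feeds; the form feeds are discarded.
    return marked.split("\f")

def name_for(number, piece):
    """Build a filename from the counter and the book ID on the piece's \\id line."""
    match = re.match(r"\\id (\w+)", piece)
    book_id = match.group(1) if match else "unknown"
    return f"{number:02d}_{book_id}.usfm"

sample = (
    "\\id GEN\nIn the beginning\n"
    "\\id EXO\nNow these are the names\n"
    "\\id LEV\nAnd the LORD called\n"
)
books = split_books(sample)
names = [name_for(n, piece) for n, piece in enumerate(books, start=1)]
print(names)  # ['01_GEN.usfm', '02_EXO.usfm', '03_LEV.usfm']
```

Each piece still begins with its own \id line, and numbering starts at 1 because no empty leading piece is produced.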