Split at Pattern - After this many

Get help with installation and running here.

Moderators: DataMystic Support, Moderators

dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Split at Pattern - After this many

Post by dfhtextpipe »

The After this many control for Split at Pattern does not behave quite as I had understood it would.

The filter I am designing is to convert a single whole Bible text file into 66 separate books.

When I changed the value of After this many from 1 to 2, instead of obtaining 67 output files, I got only 34.

What I was looking for was a means to ignore the first instance of the pattern only, but to split at all the other occurrences.
This would avoid outputting an empty file before the first pattern that indicates the start of the book of Genesis.

It was a simple task to create earlier sub-filters to generate a suitable pattern for the start of each book.
I am using Split position = Before pattern. I do not wish to remove the pattern for Genesis.

I therefore suggest that the Split at Pattern filter be further enhanced to provide an additional control value Ignore this many.
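
To make the request concrete, here is a minimal Python sketch of what a hypothetical Ignore this many control might do with Split position = Before pattern. This is not TextPipe code; the function name `split_before_pattern` and the parameter `ignore_first` are invented for illustration only.

```python
import re

def split_before_pattern(text, pattern, ignore_first=0):
    """Split text before each match of pattern, skipping the first
    `ignore_first` matches (hypothetical 'Ignore this many' behavior)."""
    starts = [m.start() for m in re.finditer(pattern, text, flags=re.MULTILINE)]
    pieces, begin = [], 0
    for cut in starts[ignore_first:]:
        pieces.append(text[begin:cut])
        begin = cut
    pieces.append(text[begin:])
    return pieces

# Tiny stand-in for the Bible file: three books, each starting with an \id line.
sample = (
    "\\id GEN\nIn the beginning\n"
    "\\id EXO\nThese are the names\n"
    "\\id LEV\nAnd the LORD called\n"
)

books = split_before_pattern(sample, r"^\\id (\w+)$", ignore_first=1)
for book in books:
    print(repr(book))
```

With `ignore_first=1` every piece starts with its own `\id` line and no empty leading piece appears; with `ignore_first=0` the same input gives an extra empty piece first.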

Please review and advise.
David
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Split at Pattern - After this many

Post by dfhtextpipe »

The help for this feature is somewhat ambiguous. It states:
"If the pattern specifies a single character only, such as a form feed (\f), you can split after this many occurrences of the pattern have been found in one file. The maximum is 2147483647."
Observations:
  • The phrase "a single character only" seems to be superfluous. The feature works for any pattern, not just a single character.
  • The phrase "found in one file" is ambiguous. The actual behavior is that this many occurrences must be found for each split.
David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Split at Pattern - After this many

Post by DataMystic Support »

Thanks David,

We've altered the text and added an example for clarity:
If 'After this many' is set to 2, splits will occur after the 2nd, 4th, 6th, 8th, 10th, ... matches.
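
The file counts reported earlier in the thread (67 with a value of 1, 34 with a value of 2) follow from this rule. The Python sketch below only models the behavior as described in this thread, assuming 66 `\id` lines and a split before the pattern; it is not TextPipe's implementation, and the function name `split_before_every_nth` is made up.

```python
import re

def split_before_every_nth(text, pattern, n=1):
    """Split text before every n-th match of pattern and return the pieces."""
    pieces, start = [], 0
    matches = re.finditer(pattern, text, flags=re.MULTILINE)
    for i, m in enumerate(matches, start=1):
        if i % n == 0:  # only the 2nd, 4th, 6th, ... match splits when n == 2
            pieces.append(text[start:m.start()])
            start = m.start()
    pieces.append(text[start:])
    return pieces

# Synthetic input: 66 "books", each introduced by an \id line.
bible = "".join(f"\\id B{k:02d}\ntext of book {k}\n" for k in range(1, 67))
pattern = r"^\\id (\w+)$"

with_one = split_before_every_nth(bible, pattern, 1)
with_two = split_before_every_nth(bible, pattern, 2)
print(len(with_one), len(with_two))  # 67 34
```

With a value of 1 the first piece is empty, because the first match sits at the very start of the input; that is the zero-length file discussed in this thread.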

Ignore this many - we'd prefer not to add this at this stage. Can you use a search/replace with 'Match first only' to prevent the initial file being created?
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Split at Pattern - After this many

Post by dfhtextpipe »

Simon,

You asked, "Can you use a search/replace with 'Match first only' to prevent the initial file being created?".

I don't think this would have any effect on a split filter.

I tried using a 'Restrict to lines 2 to END-0' filter and making the split filter a sub-filter, but it made no difference.

Maybe I am misunderstanding something. Please enlighten me.

David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Split at Pattern - After this many

Post by DataMystic Support »

You're splitting on a pattern, right?

So why not use a search/replace prior to the split filter to remove the first occurrence of that pattern?
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Split at Pattern - After this many

Post by dfhtextpipe »

Because the split position is before the pattern.


Split before pattern ^\\id (\w+)$, filename %2.2d_%f.usfm, number 0 count 1
 
All the split files must contain the pattern.
If I remove the first pattern, then the first required split file would not contain the pattern.
I don't want a trivial zero-length file being created just because the first pattern is on the first line of the file being processed.

Any better suggestions?
David
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Split at Pattern - After this many

Post by dfhtextpipe »

I have solved the problem as follows:

The split filter is now simply:


Comment...
|  Notes on Split:
|  
|  2010-02-25
|  Added a formfeed (\f) before each \id (on the same line)
|  Removed the first formfeed (\f) using restrict to line range 1 with a subfilter
|  Changed the split pattern to \f and "At pattern"
|  Changed the split counter to start at 1
|  
|  This solved the following problem:
|  A zero length file was created with filename 00_*.usfm
|  
|  Possible further enhancement:
|  How to include the book ID in the split filename?
|
|--Restrict lines:Line 1 .. line 1
|  |
|  +--Perl pattern [^\f] with []
|        [ ] Match case
|        [ ] Whole words only
|        [ ] Case sensitive replace
|        [ ] Prompt on replace
|        [ ] Skip prompt if identical
|        [ ] First only
|        [ ] Extract matches
|        Maximum text buffer size 4096
|        [ ] Maximum match (greedy)
|        [ ] Allow comments
|        [ ] '.' matches newline
|        [X] UTF-8 Support
|      
+--Split at pattern ^\f, filename %2.2d_%f.usfm, number 1 count 1
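
For readers who want to try the same idea outside TextPipe, the steps in the notes above can be sketched in Python. This is only an illustration under stated assumptions: the input contains no form feeds of its own, the function name `split_into_books` is invented, and the `NN_BOOKID.usfm` naming is just one possible answer to the "include the book ID in the split filename" question, not a TextPipe feature.

```python
import re

FORM_FEED = "\f"

def split_into_books(text):
    """Return a list of (filename, content) pairs, one per \\id section."""
    # Step 1: add a form feed in front of every line that starts with \id.
    marked = re.sub(r"^(?=\\id )", FORM_FEED, text, flags=re.MULTILINE)
    # Step 2: remove the form feed at the very start so no empty first piece appears.
    if marked.startswith(FORM_FEED):
        marked = marked[1:]
    # Step 3: split at the form feeds, numbering the pieces from 1.
    result = []
    for number, chunk in enumerate(marked.split(FORM_FEED), start=1):
        match = re.match(r"\\id (\w+)", chunk)
        book_id = match.group(1) if match else "unknown"
        result.append((f"{number:02d}_{book_id}.usfm", chunk))
    return result

sample = (
    "\\id GEN\nIn the beginning\n"
    "\\id EXO\nThese are the names\n"
    "\\id LEV\nAnd the LORD called\n"
)

for filename, content in split_into_books(sample):
    print(filename, len(content))
```

The sketch returns the pieces instead of writing files, so it can be checked without touching the disk; each piece still begins with its own `\id` line.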
    
David