Match to bounded area
Posted: Fri Jul 13, 2007 1:25 am
Hi,
I need to extract a segment from a number of files, so I'm trying to find out if TextPipe will meet my needs, rather than writing custom Perl. The segment begins with either "underwriters" or "underwriting" alone on a line, and ends with a line beginning with the word "total". I've tried a couple of different attempts at filters in TextPipe using EasyPattern and Perl, and I'm not getting the results I expect.
Before I use any other filters, I remove blanks from the beginning and end of lines, and remove blank lines. I also remove ANSI codes and binary characters, just in case.
The EasyPattern filter I'm trying looks like this:
[lineStart, ('underwriting' or 'underwriters'), lineEnd, capture(1+ (1+ paragraph, optional paragraphDelimiter)), lineStart, 'total']
not case sensitive.
Then I replace with:
*** underwriters section: $1
This pattern never matches; I don't get a "*** underwriters section" line in my output.
I can get it to return something this way:
Find:
[lineStart, ('underwriting' or 'underwriters'), lineEnd]
[capture(0+ (0+ paragraphchar, optional paragraphdelimiter))]
[capture(0+ (0+ paragraphchar, optional paragraphdelimiter))]
Replace:
*** begin underwriters section: $0
*** underwriter next line text: $1
*** underwriter next line 2 text: $2
I only get 2 lines following the section heading, though, and there may be an arbitrary number of lines I need to capture. So the problem seems to be that it won't keep matching after a line break. I'm testing this on content pasted into the trial area, in case that matters.
Then I tried using Perl regular expressions instead. My Perl pattern looks like this:
^(underwriting|underwriters)$(.*)^Total
(not case sensitive)
Replace with:
*** new underwriter section: $1 $2
I get this result:
*** new underwriter section: UNDERWRITING
but I don't get any of the additional content between that text and the word "Total", which I should, shouldn't I?
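A likely explanation, sketched here with Python's `re` module as a stand-in for TextPipe's Perl-style engine (an assumption; TextPipe's own options and output may differ): in Perl-style patterns `.` does not match a line break by default, so `(.*)` cannot reach past the end of the heading line. The sample text below is made up for illustration.

```python
import re

# Hypothetical sample input; the real files will differ.
text = "UNDERWRITING\nGeneral\nSpecial\nTotal 42\n"

# Multiline mode only: "." stops at each line break, so "(.*)" cannot span lines
# and the pattern as a whole fails to match in Python.
single_line = re.search(
    r"^(underwriting|underwriters)$(.*)^Total", text, re.I | re.M
)

# Adding "dot matches newline" (re.S) lets "." cross line breaks; the lazy "*?"
# stops at the first line that starts with "Total".
spanning = re.search(
    r"^(underwriting|underwriters)$(.*?)^Total", text, re.I | re.M | re.S
)

print(single_line)
print(repr(spanning.group(2)))
```

In Perl itself the equivalent switches are the `/s` and `/m` modifiers (or an inline `(?s)`); whether TextPipe exposes them under the same names is not confirmed here.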
But if I do it this way:
^(underwriting|underwriters)$
(.*)^Total
(note the line break between the first and second captures)
Then I get
*** new underwriter section: UNDERWRITING General
where "General" is the contents of the next line. I can add more (.*) on additional lines in the find section and get more lines, but again, I need to get an arbitrary number of lines until it hits the word "Total" at the beginning of a line. How can I do this?
Thanks,
Elizabeth Dalton