Match to bounded area

edalton
Posts: 2
Joined: Fri Jul 13, 2007 12:57 am

Post by edalton »

Hi,

I need to extract a segment from a number of files, so I'm trying to find out if TextPipe will meet my needs, rather than writing custom Perl. The segment begins with either "underwriters" or "underwriting" alone on a line, and ends with a line beginning with the word "total". I've tried a couple of different attempts at filters in TextPipe using EasyPattern and Perl, and I'm not getting the results I expect.

Before I use any other filters, I remove blanks from the beginning and end of lines, and remove blank lines. I also remove ANSI codes and binary characters, just in case.

The EasyPattern filter I'm trying looks like this:

[lineStart, ('underwriting' or 'underwriters'), lineEnd, capture(1+ (1+ paragraph, optional paragraphDelimiter)), lineStart, 'total']

not case sensitive.

Then I replace with:

*** underwriters section: $1

This pattern never matches; I don't get a "*** underwriters section" line in my output.

I can get it to return something this way:

Find:
[lineStart, ('underwriting' or 'underwriters'), lineEnd]
[capture(0+ (0+ paragraphchar, optional paragraphdelimiter))]
[capture(0+ (0+ paragraphchar, optional paragraphdelimiter))]

Replace:
*** begin underwriters section: $0
*** underwriter next line text: $1
*** underwriter next line 2 text: $2

I only get 2 lines following the section heading, though, and there may be an arbitrary number of lines I need to capture. So the problem seems to be that it won't keep matching after a line break. I'm testing this on content pasted into the trial area, in case that matters.

Then I tried using Perl regular expressions instead. My Perl pattern looks like this:

^(underwriting|underwriters)$(.*)^Total

(not case sensitive)

Replace with:

*** new underwriter section: $1 $2

I get this result:

*** new underwriter section: UNDERWRITING

but I don't get any of the additional content between that text and the word "Total", which I should, shouldn't I?

But if I do it this way:
^(underwriting|underwriters)$
(.*)^Total

(note the line break between the first and second captures)

Then I get

*** new underwriter section: UNDERWRITING General

where "General" is the contents of the next line. I can add more (.*) on additional lines in the find section and get more lines, but again, I need to get an arbitrary number of lines until it hits the word "Total" at the beginning of a line. How can I do this?

Thanks,

Elizabeth Dalton
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Post by DataMystic Support »

Hi Elizabeth,

1. Both the EasyPattern [lineEnd] and the Perl pattern $ match a position ('just before the end of the line') rather than a character (a [crlf] or \r\n).

Because they match a position and consume no characters, the line break itself still has to be matched explicitly, so you need to rewrite your pattern as:

[lineStart, ('underwriting' or 'underwriters'), lineEnd, cr, lf, capture(0+ paragraph, optional paragraphDelimiter), lineStart, 'total']

You may need to adjust this to suit the text. If you email us an example, we can help.

2. Yes, I think you should be getting something in $2. Could you please email us your filter?
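
For comparison outside TextPipe, here is a minimal Python sketch of the same idea. It uses Python's re module, which is not TextPipe's engine, so treat it as an illustration of the position-versus-character point rather than a drop-in filter. The names NAIVE, SECTION and extract_sections, and the sample text, are made up for this example.

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE

# The original attempt: $ and ^ only assert positions, and '.' does not
# cross a line break, so nothing ever consumes the newline after the heading.
NAIVE = re.compile(r"^(underwriting|underwriters)$(.*)^Total", FLAGS)

# The line break after the heading is matched explicitly (\r?\n), then
# whole lines are captured one at a time until a line starts with 'total'.
SECTION = re.compile(
    r"^(underwriting|underwriters)\r?\n"  # heading alone on its line
    r"((?:.*\r?\n)*?)"                    # as few complete lines as possible
    r"(?=^total)",                        # stop before a line beginning with 'total'
    FLAGS,
)


def extract_sections(text):
    """Return (heading, body) pairs for every bounded section in text."""
    return [(m.group(1), m.group(2)) for m in SECTION.finditer(text)]


if __name__ == "__main__":
    sample = "Intro\nUNDERWRITING\nGeneral\nAcme 100\nTotal 100\nAfter\n"
    print(NAIVE.search(sample))      # None
    print(extract_sections(sample))  # [('UNDERWRITING', 'General\nAcme 100\n')]
```

The lazy quantifier keeps each match from running past the first "total" line, and the lookahead leaves that line unconsumed. Whether TextPipe's Perl filter behaves identically depends on its options, which this sketch does not model.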