
Extract Question/Problem

Posted: Fri Dec 19, 2003 4:14 am
by Fodor
Hello, all!

I'm evaluating TextPipe Pro to data mine a web site. The site has nested tables, but I've found a way to get to the data with an extract and a regular expression rather than having to manually remove the unneeded tables. First, I convert the UNIX EOL characters to DOS and remove all leading and trailing whitespace. Then, I try to use the following:

^(<p align=left>).*.$

The problem is that it doesn't work. This regular expression should match any line (and the entire line) starting with "<p align=left>". Instead, when I run the filter, TextPipe finds the first "<p align=left>" and returns it together with the entire remainder of the file.

It looks like TextPipe might be seeing the ".*" and matching the EOL characters rather than stopping at the ".$". Is that the problem? If so, is it a poorly formed regular expression?

What am I doing wrong?

Thanks!

Extract Question/Problem Follow-up

Posted: Fri Dec 19, 2003 4:26 am
by Fodor
I've changed my regular expression to:

^(<p align=left>).*(<br>)$

This should match any entire line beginning with "<p align=left>" and ending with "<br>".

This matches 0 items in my input file, although there are 4 such lines in the file.

Any ideas?

Thanks!

Posted: Fri Dec 19, 2003 8:20 am
by DataMystic Support
'.' by default matches new lines - check the pattern settings.

You could use [^\r\n] instead of '.' to prevent this.
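
To illustrate the difference, here is a minimal sketch in Python rather than TextPipe. It assumes Python's re module with the DOTALL flag is a fair stand-in for a '.' that matches new lines, and the sample text is made up for the example. TextPipe's own pattern options may behave differently in the details.

```python
import re

# Hypothetical sample input with DOS (CRLF) line endings.
text = (
    "<table>\r\n"
    "<p align=left>First item<br>\r\n"
    "<td>noise</td>\r\n"
    "<p align=left>Second item<br>\r\n"
    "</table>\r\n"
)

# DOTALL lets '.' match \r and \n too, so '.*' runs on to the end of the input.
# This mimics the behaviour described in the first post.
greedy = re.compile(r"^<p align=left>.*", re.MULTILINE | re.DOTALL)

# [^\r\n] can never cross a line break, so each match stays on its own line.
per_line = re.compile(r"^<p align=left>[^\r\n]*", re.MULTILINE)

greedy_matches = greedy.findall(text)
line_matches = per_line.findall(text)

print(len(greedy_matches))  # one match that swallows the rest of the input
for line in line_matches:
    print(line)
```

With the negated character class, each matching line comes back separately instead of one match that runs to the end of the file.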