I want to take multiple passes over a file, or over the patterns recognized in a file.
I am processing HTML / XML and want to:
1) Convert &xxx; to the appropriate character
2) Parse a containing HTML pair (<metadata>...</metadata>)
3) Parse the metadata string for other HTML pairs and put the entire result on a single line
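The three steps above can be sketched outside TextPipe to make the intended result concrete. This is a minimal Python sketch, not TextPipe's behaviour: the function name `flatten_metadata`, the sample record, and the choice to decode entities only after the tags have been extracted are all assumptions made for illustration.

```python
import html
import re

# Hypothetical sample; the tag names follow the pattern used later in this post.
SAMPLE = (
    "<meta_data><ASIN>B000TEST</ASIN>\n"
    "<title>Cats &amp; Dogs</title>\n"
    "<authors>A. &quot;Ann&quot; Author</authors></meta_data>"
)

# Step 2: find each containing <meta_data>...</meta_data> pair
# (non-greedy, with '.' allowed to span newlines).
BLOCK = re.compile(r"<meta_data>(.*?)</meta_data>", re.DOTALL)
# Step 3: find each inner <tag>value</tag> pair inside one block.
PAIR = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def flatten_metadata(text):
    """Return one tab-separated line per <meta_data> block."""
    lines = []
    for block in BLOCK.finditer(text):
        values = []
        for pair in PAIR.finditer(block.group(1)):
            # Step 1: convert &xxx; entities, then collapse whitespace
            # so the value fits on a single line.
            value = html.unescape(pair.group(2))
            values.append(" ".join(value.split()))
        lines.append("\t".join(values))
    return lines


print(flatten_metadata(SAMPLE))
```

Decoding after extraction, rather than before, means a decoded `&lt;` or `&gt;` cannot be mistaken for part of a tag while the pairs are being parsed.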
I can't find out how a subfilter differs from a top level filter, and none of the sequences I try has worked.
Trial #1:
1) Extract metadata pair to a line
<meta_data><ASIN>(.*)</ASIN><title>(.*)</title><authors>(.*)</authors><publishers>(.*)</publishers><publication_date>(\d{4}-\d{2}-\d{2}).*</publication_date>.*></meta_data>
$1\t$2\t$3\t$4\t$5
This much appears to work
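For comparison, the same pattern can be tried in Python's `re` module. This is only a sketch: TextPipe's Perl pattern engine is not Python's, the sample record is invented, and the quantifiers are written non-greedy (`.*?`) on the assumption that the shortest match is what is wanted. The helper name `to_line` is hypothetical.

```python
import html
import re

PATTERN = re.compile(
    r"<meta_data><ASIN>(.*?)</ASIN><title>(.*?)</title>"
    r"<authors>(.*?)</authors><publishers>(.*?)</publishers>"
    r"<publication_date>(\d{4}-\d{2}-\d{2}).*?</publication_date>"
    r".*?></meta_data>",
    re.DOTALL,  # let '.' match newlines
)

# Hypothetical record containing one HTML entity.
RECORD = (
    "<meta_data><ASIN>B000TEST</ASIN><title>Cats &amp; Dogs</title>"
    "<authors>Ann Author</authors><publishers>Example Press</publishers>"
    "<publication_date>2015-01-01T00:00:00</publication_date>"
    "<x></x></meta_data>"
)


def to_line(match):
    # Equivalent of the replacement $1\t$2\t$3\t$4\t$5.
    return "\t".join(match.groups())


extracted = PATTERN.sub(to_line, RECORD)
decoded = html.unescape(extracted)  # separate pass for the &xxx; entities

print(extracted)
print(decoded)
```

In this sketch the pattern alone leaves `&amp;` untouched; the entities only change once a separate decoding pass runs over the extracted line.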
When I add subfilters to convert the &xxx; patterns, nothing happens to them. If I make them top level patterns either before or after the metadata pattern, they don't appear to have any effect.
It seems that there is something I don't understand about how passes are made over the file and how subfilters play into the data processing.
How are subfilters handled?
Moderators: DataMystic Support, Moderators
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: How are subfilters handled?
Are the patterns for handling the HTML entities (&xxx;) subfilters, i.e. placed inside the filter that identifies
<meta_data><ASIN>(.*)</ASIN><title>(.*)</
To add as a subfilter, drag and drop the html entity filters on top of the pattern match. If this doesn't work, please paste an extract from File Menu\Export.
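One way to picture the distinction being discussed is a replacement applied to the whole text versus one applied only to the text a parent pattern matched. The Python sketch below is a conceptual model only; it makes no claim about how TextPipe implements subfilters, and the helper names `decode`, `apply_everywhere` and `apply_inside_matches` are hypothetical.

```python
import re

# Entity replacements; &amp; comes last on purpose (see decode()).
ENTITIES = [("&gt;", ">"), ("&lt;", "<"), ("&quot;", '"'), ("&amp;", "&")]


def decode(text):
    # Replacing &amp; last means "&amp;lt;" becomes "&lt;" rather than "<".
    for entity, char in ENTITIES:
        text = text.replace(entity, char)
    return text


def apply_everywhere(text):
    """Model of a top-level replacement: the whole input is processed."""
    return decode(text)


def apply_inside_matches(pattern, text):
    """Model of a nested replacement: only text matched by the parent is processed."""
    return re.sub(pattern, lambda m: decode(m.group(0)), text, flags=re.DOTALL)


TEXT = "a &amp; b <title>c &amp; d</title>"
print(apply_everywhere(TEXT))
print(apply_inside_matches(r"<title>.*?</title>", TEXT))
```

Under this model, the nested version changes only the `&amp;` inside the `<title>` pair and leaves the one outside it alone.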
Re: How are subfilters handled?
I have tried the element filters as subfilters of the metadata filter, as top level filters before and after the meta filter, and none of them seem to work.
When I add subfilters to convert the &xxx; patterns, nothing happens to them. If I make them top level patterns either before or after the metadata pattern, they don't appear to have any effect.
I get the metadata line extracted, but the HTML entities remain.
Here is the export.
TextPipe Single User Edition
Purchased by: DYNAMIC Alternatives, DYNAMIC Alternatives
Filter Title: J:\DYNALT\Amazon\Kindle.fll
Filter List
-----------
Filter options
| [ ] Log to file
| [ ] Append to logfile
| Log filename: %USERPROFILE%\textpipe.log
| Threshold 500
|
|--Input from file(s)
| [ ] Confirm before processing each file
| [ ] Confirm before processing read/only files
| [ ] Delete input files after processing
| Skip binary files
| Sample size 100 characters
|
|--Perl pattern [<meta_data><ASIN>(.*)</ASIN><title>(.*)</title><authors>(.*)</authors><publishers>(.*)</publishers><publication_date>(\d{4}-\d{2}-\d{2}).*</publication_date>.*></meta_data>] with [$1\t$2\t$3\t$4\t$5]
| | [ ] Match case
| | [ ] Whole words only
| | [ ] Case sensitive replace
| | [ ] Prompt on replace
| | [ ] Skip prompt if identical
| | [ ] First only
| | [ ] Extract matches
| | Maximum text buffer size 4096
| | [ ] Maximum match (greedy)
| | [ ] Allow comments
| | [X] '.' matches newline
| | [ ] UTF-8 Support
| |
| |--Replace [&amp;] with [&]
| | [ ] Match case
| | [ ] Whole words only
| | [ ] Case sensitive replace
| | [ ] Prompt on replace
| | [ ] Skip prompt if identical
| | [ ] First only
| | [ ] Extract matches
| |
| |--Replace [&gt;] with [>]
| | [ ] Match case
| | [ ] Whole words only
| | [ ] Case sensitive replace
| | [ ] Prompt on replace
| | [ ] Skip prompt if identical
| | [ ] First only
| | [ ] Extract matches
| |
| |--Replace [&lt;] with [<]
| | [ ] Match case
| | [ ] Whole words only
| | [ ] Case sensitive replace
| | [ ] Prompt on replace
| | [ ] Skip prompt if identical
| | [ ] First only
| | [ ] Extract matches
| |
| +--Replace [&quot;] with ["]
| [ ] Match case
| [ ] Whole words only
| [ ] Case sensitive replace
| [ ] Prompt on replace
| [ ] Skip prompt if identical
| [ ] First only
| [ ] Extract matches
|
+--Output to file(s)
[ ] Only update date on changed files
[ ] Append mode
[ ] Change extension to: .txt
[ ] Open output file
Only output modified files Backup mode [ ] Remove empty output files
Files List
----------
J:\DYNALT\Amazon\KindleSyncMetadataCache.xml
Use the line below to remove common non-text files from website processing
.[ 'gif' or 'png' or 'jpg' or 'bmp' or 'avi' or 'ico' or 'mp3', lineEnd ]
Use the line below to remove common non-text folders from website processing
_vti
- DataMystic Support
Re: How are subfilters handled?
Works fine here - turn on prompt on replace for every search filter so that you can see where the issue is.
I used test text of:
<meta_data><ASIN>dsfsdj &amp; </ASIN><title>dsfsdj &amp; </title><authors>
dsfsdj &amp;
</authors><publishers>
dsfsdj &amp;
</publishers><publication_date>2015-01-01
</publication_date>
other guff
<other></other></meta_data>