
Alternative to processing multiple files in succession

Posted: Wed Apr 28, 2010 11:25 pm
by alnico
Hello,

Although I have made this same feature request in the past (I understand the difficulty), I will propose a different approach (it came to me in a dream ;-).

I have many instances where I process the output from previous TextPipe filter files in succession.
It would save a lot of headaches if the whole process could be implemented within a single TextPipe file.

Currently I run in order:
TextPipe file #1: process and save output to file.
TextPipe file #2: load output file from #1, process and save output to file.
TextPipe file #3: load output file from #2, process and save output to file.
etc.
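
(For illustration, this succession is roughly what a small driver script would do. The sketch below is just Python pseudo-automation; the textpipe.exe invocation and the filter names are placeholders, not TextPipe's actual command-line syntax.)

import subprocess

# Sketch only: run three TextPipe filter files one after another, so each
# stage sees the files the previous stage has already written and closed.
# The command-line arguments are placeholders, not TextPipe's real switches -
# check TextPipe's help for the actual command-line/batch syntax.
FILTERS = [
    "1-first-pass.fll",   # hypothetical filter file names
    "2-second-pass.fll",
    "3-third-pass.fll",
]

for fll in FILTERS:
    subprocess.run(["textpipe.exe", fll], check=True)  # placeholder invocation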

Here is an idea for processing all within the same TextPipe file.

Create a new special filter called: 'End/Re-Start'
How it would work:
1. All files for processing would load as usual and begin processing as normal.
2. Once an 'End/Re-Start' filter is encountered, TextPipe closes all open files as if it were done.
3. It then re-starts the entire filter file (loads all files, etc.), except this time processing starts right after the 'End/Re-Start' it stopped at.
4. It would only process the filters within this block (up to the next 'End/Re-Start' or the end of the file).
The files processed by each block would simply be controlled using the "Restrict to Filename" filter.
If possible, maybe all variables in memory can be retained between re-starts?
I am guessing this filter cannot be a sub-filter (meaning it can only sit on the main flow line).

This is like stringing together the filter files that would normally exist on their own. Because we already have the "Restrict to Filename" and "Secondary Output" filters to work with, this process seems (to me anyway) quite elegant ;-)

Is this feasible? It would be super powerful and save a lot of headaches of running multiple files in succession.

Thanks,
Brent

Re: Alternative to processing multiple files in succession

Posted: Thu Apr 29, 2010 8:52 am
by DataMystic Support
Hi Brent,

To me, this sounds like using a single master filter which controls the filenames, with 2 or more filters linked in using Link to filter...

Or am I missing something?

Re: Alternative to processing multiple files in succession

Posted: Thu Apr 29, 2010 1:08 pm
by alnico
Yes, similar to 'Link to filter' I think...but with a significant difference. TextPipe loads all files initially and only closes them all after all processing has stopped; I am proposing dynamic loading and closing of files on demand.

That is what this filter ('End/Re-Start') would attempt to solve.
Say we had a TextPipe file structured like this:

Load files to process
Restrict to filenames
Filters
Secondary output filter
End/Re-Start
Restrict to filenames
Filters
Secondary output filter
End/Re-Start
Restrict to filenames
Filters
Secondary output filter

TextPipe would start as normal, load files to process and process filters until it encountered End/Re-Start.
At End/Re-Start TextPipe would stop and close all open files.
TextPipe would then re-start, load all files to process again (a file may have new content this time) and resume processing after the End/Re-Start from where it ended.
This process could be repeated many times, each block in succession.

The significance here is that you can write content to a file, close it, and further down the stream load that same file with its new content for further processing.
This idea retains your open/close infrastructure and utilizes the existing filters (Restrict to Filename and Secondary Output).
It only stops and re-starts at different locations.
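
(To make the intended behaviour concrete, here is a rough pseudo-code sketch of the idea, in Python. It is not TextPipe's engine, just an illustration: the filter list is cut at each 'End/Re-Start' marker and every block is run as its own complete load/process/close pass.)

# Pseudo-code sketch of the proposed 'End/Re-Start' semantics (not TextPipe's
# actual engine): cut the filter list at each marker, then run every block as
# a complete load -> process -> close pass of its own.
END_RESTART = "End/Re-Start"

def split_into_blocks(filters):
    """Cut the filter list at every End/Re-Start marker."""
    blocks, block = [], []
    for f in filters:
        if f == END_RESTART:
            blocks.append(block)
            block = []
        else:
            block.append(f)
    blocks.append(block)
    return blocks

def run(filter_list, load_files, process, close_files):
    # load_files/process/close_files stand in for whatever the engine does;
    # the point is that files are reopened fresh for every block, so a later
    # block can read what an earlier block's Secondary Output wrote.
    for block in split_into_blocks(filter_list):
        files = load_files()       # the files may have new content this pass
        for f in block:
            process(f, files)
        close_files(files)         # everything is flushed before the next pass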

Thanks,
Brent

Re: Alternative to processing multiple files in succession

Posted: Thu Apr 29, 2010 3:00 pm
by DataMystic Support
But the point of TextPipe is to keep each file open as a stream (or 'pipe') and do everything to it in one step before writing it out to disk.

TextPipe sends one file at a time through the filters and then closes each one as it is done.
"process filters until it encountered End/Re-Start"
- this doesn't really make sense, because in TextPipe there may be pieces of text from the file inside every single filter at any point in time, e.g. chunk 1 is in the 3rd filter, chunk 2 is in the 2nd filter, chunk 3 is in the first filter...

Re: Alternative to processing multiple files in succession

Posted: Fri Apr 30, 2010 1:43 am
by alnico
Yes, I understand that TextPipe keeps each file open and runs till completion, then closes them...but...I am suggesting stringing multiple TextPipe files together, where content can be written to a file and then reopened (by way of a restart).

The blocks between 'End/Re-Start' would be completely independent of each other; they would not share anything and would process entirely on their own (just as I currently have them as separate TextPipe files that I must open and run one after another in order, as each TextPipe instance uses input from the previous TextPipe file's output).

Maybe I am missing something...but this is simply about stopping and restarting from the last position (closing and reloading files).

Thanks,

Brent

Re: Alternative to processing multiple files in succession

Posted: Fri Apr 30, 2010 9:16 am
by DataMystic Support
Hi Brent,

I still don't get it (but then, it's 9am on a Friday... :-) )

As long as the number of files stays the same (i.e. you are not splitting or merging files, or creating extra files with Secondary Output filters), then there is no reason why you can't use one master filter with Links to each child filter to do all the processing.

Can you please give me a concrete example so I can understand better? Have you looked at the Link filter?

Re: Alternative to processing multiple files in succession

Posted: Sat May 01, 2010 9:03 pm
by MJ1234
I second that idea. I repeatedly grapple with the same wish, until I realize again that the current TextPipe filters are not designed that way.

What would be needed is a filter, or rather a "program", within TextPipe that allows you to load and run several TextPipe filters. Each of these filters would load and save different file names, but they would work in linear fashion, like a highly automated factory. The second filter starts its work on the files saved by the first filter, the third filter manipulates the files saved by the second filter, etc.

The facility to let the different filters do their work on different sets of files seems to be the item that is currently missing. Currently a link to another filter still restricts that filter to dealing with the identical set of files.

I try to run a multistep data manipulation process with a Windows batch file, but why shouldn't it be possible to get the same result within TextPipe with a single filter, or a "program" if that's the better name for what we are aiming at.

MJ

Re: Alternative to processing multiple files in succession

Posted: Sat May 01, 2010 10:43 pm
by alnico
Hi Simon,

Here is an example: translating a static webpage or other document into another language.

I have a translations file, Translations_english-spanish-RAW.txt, with content like: "The ^red^ fox","El zorro ^rojo^"
*.htm (English files that I want to translate to Spanish...it will search for "The ^red^ fox" and replace it with "El zorro ^rojo^")
Note: ^ is a marker that represents any inline tag (formatting, image, etc.)

But, before I can do the search and replace I must re-form my search/replace strings in the Translations_english-spanish-RAW.txt file.
"The ^red^ fox","El zorro ^rojo^" becomes
"The (<[^>]+>)red(<[^>]+>) fox","El zorro $1rojo$2"
Then save to Translations_english-spanish-REPLACEMENTS.txt (new file)

Now I can use Translations_english-spanish-REPLACEMENTS.txt and run the replacements against the *.htm files.
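
(As an illustration only, the RAW-to-REPLACEMENTS re-forming step could look like the small Python sketch below. It assumes one quoted English,Spanish pair per line and that the ^ markers occur in the same order on both sides; the file names are the ones from the example above.)

import csv

# Sketch of the RAW -> REPLACEMENTS step described above (illustration only).
# Assumes one "english","spanish" pair per line and that the ^ markers appear
# in the same order on both sides of the pair.

def convert(search, replace):
    # English side: every ^ marker becomes a group that captures an inline tag.
    new_search = search.replace("^", "(<[^>]+>)")
    # Spanish side: every ^ marker becomes the next backreference $1, $2, ...
    new_replace, n = "", 0
    for ch in replace:
        if ch == "^":
            n += 1
            new_replace += "$%d" % n
        else:
            new_replace += ch
    return new_search, new_replace

with open("Translations_english-spanish-RAW.txt", newline="") as raw, \
     open("Translations_english-spanish-REPLACEMENTS.txt", "w", newline="") as out:
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    for row in csv.reader(raw):
        if len(row) == 2:
            writer.writerow(convert(row[0], row[1]))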

Currently, to do this with TextPipe I have to have two TextPipe files that I run in succession:
1-Create Translations_english-spanish-REPLACEMENTS.FLL
  • Files to process: Translations_english-spanish-RAW.txt
    Output file: 1-Translations_english-spanish-REPLACEMENTS.txt
2-Insert translations.FLL
  • Files to process: *.htm
    Replacement list: 1-Translations_english-spanish-REPLACEMENTS.txt
Why two files?...because TextPipe CAN send output to a secondary output file to create Translations_english-spanish-REPLACEMENTS.txt.
But TextPipe CANNOT then use this same file (with its new content) via the Replacement List filter to do the replacements on all the *.htm files.
Again, this is because TextPipe does not close files until the end.

So, by stopping TextPipe and restarting it...we can achieve a reload of Translations_english-spanish-REPLACEMENTS.txt.
This would allow us to run blocks of filters one after another, using the Restrict to Filename and Secondary Output filters for flow control.
To do this, TextPipe would need a new filter called "End/Re-start"; actually, maybe a better name is needed: "Stop: Reload Files/Restart Here".
When TextPipe encounters this filter it simply stops and closes all files. Once the files are closed, TextPipe restarts, loads all files (this time the files may have new content) and starts processing from the Restart Here mark. (You would simply have to track these marks so you know where to restart.)

Simple, right? ;-)...Actually, in concept it is. I don't believe this really changes TextPipe's current architecture (from what I know)...TextPipe simply has to be able to:
Stop at a location (close all files)
Start again (open all files) and begin processing at the previous Stop marker.

These are just a couple of simplified parts of this process; I have 9 extremely complex filters that I run in succession for this job...but I always have to mess around with copying and pasting input filenames across some of the filters...and then manually run them in succession, argh ;-(

I would be curious to know how many people are running multiple TextPipe files in succession to complete a task? Anyone ;-)

So can this be done?...Come on, I need a good reason to upgrade ;-)

Thanks,
Brent

PS...On a related note:
I did request that the Link to Filter be able to do this a few years ago (it was an email communication), as I expected the Link to Filter to run independently of its parent...but it does not...it is a hybrid. If such a Link to Filter could be built to run independently of its parent (closing all its files before returning to the parent), then that would be the best solution...but you said at the time that it could not be done.

Re: Alternative to processing multiple files in succession

Posted: Mon May 03, 2010 1:06 pm
by DataMystic Support
Thanks Brent - I now understand exactly what you need :-)

We've tossed around ideas before about a high-level TextPipe job control language.

In its simplest form it would simply invoke filters 1,2,3...x in sequence.
Options:
  • To take the output files from stage x and feed them into stage x+1.
  • To log all output from each filter to one common log file.
  • Email on completion?
The file (.tjc? suggestions for naming?) would be an XML file for easy editing and future extensions, but naturally TextPipe would provide a UI for this as well.
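
(Purely as a sketch of the shape this could take; the element names, attributes and the way each stage is invoked below are placeholders in Python, not a committed design.)

import subprocess
import xml.etree.ElementTree as ET

# Rough sketch only - the .tjc element names and the way each stage is invoked
# are placeholders, not a committed design.
#
# Example job file (hypothetical):
#   <job logfile="job.log">
#     <stage filter="1-Create Translations_english-spanish-REPLACEMENTS.FLL"/>
#     <stage filter="2-Insert translations.FLL"/>
#   </job>
def run_job(tjc_path):
    job = ET.parse(tjc_path).getroot()
    with open(job.get("logfile", "job.log"), "a") as log:
        for stage in job.findall("stage"):
            fll = stage.get("filter")
            # Placeholder invocation - substitute TextPipe's real
            # command-line or automation call here.
            result = subprocess.run(["textpipe.exe", fll],
                                    capture_output=True, text=True)
            log.write("ran %s (exit %d)\n" % (fll, result.returncode))
            if result.returncode != 0:
                break  # stop the sequence if a stage fails

run_job("translate-site.tjc")  # hypothetical job file name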

Any thoughts, suggestions, improvements..?

Re: Alternative to processing multiple files in succession

Posted: Tue May 04, 2010 7:46 am
by alnico
Great Simon, I am glad we are now on the same page ;-)

So, I am guessing that my idea: "Stop: Reload Files/Restart Here" isn't the direction we want to go with this; I concur that if you build a separate flow control process...then that would be even more powerful...as we could re-use filters at different points in a single process.

Here is what I am interpreting from what you wrote: there would be a new TextPipe file type: *.tjc
  • Users would create it as such: File/New (two options: 'New Filter' or 'New Job Control').
    If New Job Control is chosen, then a UI would open where filters would simply be inserted on a flow line and, of course, run in that order.
    Logging to a common file.
    Email on completion.
    Allow not only fll files to sit on the flow line but other tjc files as well?
    It may be beneficial to maintain variables between filters...not sure if that would cause problems though?
    Ability to enable/disable filters.
    Double-clicking a filter would open that filter for editing, etc.
    On completion, open any file on the computer with its default application (unrelated to TextPipe, but it could be an application that interprets the data TextPipe just processed).
    Run an external program and return to TextPipe on its completion. For example, right now I create an XML file in TextPipe...I then run it through the Saxon XSLT processor (TextPipe doesn't support this one, so it is an external program) and save the output; I then return to TextPipe to process the Saxon output file. (I don't know if you can control or 'know' when another application is done running a process? See the sketch just after this list.)
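
(A side note on that last question: a launcher can simply wait for the external program to exit before carrying on. A minimal sketch in Python; the Saxon command line is only a placeholder, adjust it to your actual installation.)

import subprocess

# Minimal sketch: launch an external program and wait for it to finish before
# continuing. The Saxon command line is a placeholder - adjust the jar name
# and arguments to your actual Saxon installation.
result = subprocess.run(
    ["java", "-jar", "saxon.jar",
     "-s:input.xml", "-xsl:transform.xsl", "-o:output.xml"],
    capture_output=True, text=True,
)
# subprocess.run() only returns once the child process has exited, so at this
# point the Saxon output file is complete and the next step can pick it up.
if result.returncode != 0:
    raise RuntimeError("Saxon failed: " + result.stderr)
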
I don't think we need many options here as this is simply to run the already powerful filters in sequence.
Although, I will say that in order for this processing to cover all that we want it to do...I suggest this addition to filters: that the 'Files to process' list can be loaded 'From file', just like you can for other filters. I am guessing it would require a 3-column delimited structure (Filename, Subfolders, Action (maybe we don't need the Action)). If this is possible, then the entire flow of data through the filters can be dynamically loaded and processed.

You wanted options/suggestions...there they are ;-)

One problem I see: if 'Open output on completion' is activated for a single filter, it will stop the process. You could create a warning, or the .tjc could simply ignore this setting during processing.

Naming:
*.tjc (TextPipe Job Control)...that works for me or...
*.tfc (TextPipe Flow Control) (I think I like this...it somewhat describes what it would do)
*.tsc (TextPipe Sequencing Control)
*.fsc (Filter Sequencing Control) (I like this one too)
I haven't looked, but picking an extension that isn't popular would be best.

We're onto something great here,

I hope I have interpreted your idea correctly ;-)

Thanks,
Brent

Re: Alternative to processing multiple files in succession

Posted: Wed May 05, 2010 3:14 pm
by DataMystic Support
Sounds great - thanks so much for your extra input.

Now we just have to find time to do it! We have a lot of jobs on at the moment - so I expect this to take 2 weeks.

Re: Alternative to processing multiple files in succession

Posted: Wed May 05, 2010 10:33 pm
by alnico
Thanks Simon,

If you want me to do any testing or need other input...let me know; I am willing to help out any way I can.
This definitely will take TextPipe to a whole 'nother level.

Thank you,
Brent

Re: Alternative to processing multiple files in succession

Posted: Thu May 06, 2010 8:23 am
by DataMystic Support
Thank you!

Re: Alternative to processing multiple files in succession

Posted: Sun Oct 24, 2010 10:50 pm
by alnico
Hi,

I am ramping up to start another major rip and transform project where I need this functionality.

Is this idea we discussed 5 months ago still a go?

Thanks,

Re: Alternative to processing multiple files in succession

Posted: Mon Oct 25, 2010 10:59 am
by DataMystic Support
Hi Brent,

Unfortunately we've been incredibly bogged down with performance issues on our download site, www.downloadpipe.com, and these have been taking priority. We're doing bug fixes and small changes to TP at the moment but nothing as major as what you've proposed yet. The next item on our list is DetachPipe, then onto TextPipe again.

At this stage it might be worth you looking at scripting the TextPipe filter from a .js or .vbs script - you can easily get the filenames from one stage and feed them into the next.
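
Something along these lines, sketched here in Python rather than .js/.vbs; the textpipe.exe calls are placeholders for whatever command-line or automation interface you end up using, and the file names are the ones from your translation example.

import glob
import subprocess

# Sketch of the scripting idea, shown in Python rather than .js/.vbs. The
# textpipe.exe invocations are placeholders - substitute the real command-line
# or automation calls from TextPipe's help.

# Stage 1: build the replacement list from the RAW translations file.
subprocess.run(
    ["textpipe.exe", "1-Create Translations_english-spanish-REPLACEMENTS.FLL"],
    check=True)

# Because stage 1 has exited, its output is closed and on disk, so the script
# can check that it exists before continuing.
if not glob.glob("1-Translations_english-spanish-REPLACEMENTS.txt"):
    raise SystemExit("Stage 1 produced no replacement list")

# Stage 2: run the translation filter over the *.htm files; it reads the
# replacement list stage 1 just wrote.
subprocess.run(["textpipe.exe", "2-Insert translations.FLL"], check=True)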