Process inside compressed files

dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Process inside compressed files

Post by dfhtextpipe »

Process inside compressed files currently has only these file types: ZIP, DOCX, XLSX and PPTX.

How about adding support for OpenDocument, e.g. ODT files?

See http://en.wikipedia.org/wiki/OpenDocument

It would then be feasible to process the file content.xml inside an OpenDocument word processing file.
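To make the request concrete, here is a minimal Python sketch of reading content.xml out of a ZIP-based document. It is not how TextPipe does it; the function names `read_member` and `build_demo_archive` are hypothetical, and the demo archive is a tiny stand-in rather than a valid ODT file.

```python
import io
import zipfile


def read_member(archive_bytes: bytes, member: str = "content.xml") -> str:
    """Return the decoded text of one member of a ZIP-based document."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        # Assumption: the member is UTF-8 encoded.
        return zf.read(member).decode("utf-8")


def build_demo_archive() -> bytes:
    """Build a small in-memory stand-in archive (NOT a valid ODT)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        zf.writestr(
            "content.xml",
            '<?xml version="1.0" encoding="UTF-8"?>\n<doc><p>Hello</p></doc>',
        )
    return buf.getvalue()


if __name__ == "__main__":
    print(read_member(build_demo_archive()))
```

With a real file, the bytes would come from opening the .odt in binary mode instead of from `build_demo_archive`.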

David
Last edited by dfhtextpipe on Wed Mar 30, 2011 9:22 pm, edited 1 time in total.
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Process inside compressed files

Post by DataMystic Support »

Thanks David,

But, quoting that page:
"OpenDocument files can also take the format of a ZIP compressed archive containing a number of files and directories."
So does this mean that the formats
# .odt for word processing (text) documents
# .ods for spreadsheets
# .odp for presentations
could be just XML, or could optionally be a .zip file? Or are they always in zip format these days?
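One way to sidestep the question is to sniff the container rather than trust the extension. The sketch below is only an illustration in Python; `container_kind` is a hypothetical name, and the "xml" check is a crude heuristic (it just looks for a leading `<` after any BOM and whitespace).

```python
import io
import zipfile


def container_kind(data: bytes) -> str:
    """Classify raw file bytes as 'zip', 'xml' or 'unknown'."""
    if zipfile.is_zipfile(io.BytesIO(data)):
        return "zip"
    # Skip a UTF-8 byte order mark and leading whitespace, then look for '<'.
    if data.lstrip(b"\xef\xbb\xbf \t\r\n").startswith(b"<"):
        return "xml"
    return "unknown"


def make_zip() -> bytes:
    """Build a small in-memory ZIP archive for demonstration only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.xml", "<doc/>")
    return buf.getvalue()


if __name__ == "__main__":
    print(container_kind(make_zip()))
    print(container_kind(b'<?xml version="1.0"?><doc/>'))
```

A filter could then open the archive only when the result is "zip" and treat the file as plain XML otherwise.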
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Process inside compressed files

Post by dfhtextpipe »

Simon,

If I save a file from Word 2007 in OpenDocument format (file extension .odt),
I can readily examine the contents of the saved file using an archive manager such as 7-Zip.

The compressed file contains a content.xml file along with other files, etc.
See attached image that illustrates this.

David
Attachment: InsideODT.png (70.15 KiB): Inside an OpenDocument file.
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Process inside compressed files

Post by dfhtextpipe »

Further note:

content.xml is "linearized", in that all the XML (after the schema) is on a single line of text.

However, it can be made more legible using the "Pretty-print" feature of XML Copy Editor.

See attached image. See also http://xml-copy-editor.sourceforge.net/
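For readers without XML Copy Editor, a rough equivalent of pretty-printing can be sketched with Python's standard library. This is only an illustration, not what XML Copy Editor does internally; `pretty_print` is a hypothetical name, and the approach assumes linearized input (already indented input tends to gain extra blank lines).

```python
from xml.dom import minidom


def pretty_print(xml_text: str, indent: str = "  ") -> str:
    """Re-indent a linearized XML document, one element per line."""
    dom = minidom.parseString(xml_text)
    return dom.toprettyxml(indent=indent)


if __name__ == "__main__":
    linearized = "<doc><p>Hello</p><p>World</p></doc>"
    print(pretty_print(linearized))
```

Note that minidom writes its own XML declaration at the top of the output, so the first line may differ from the original file.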

David
Attachment: PrettyContentXML.png (105.8 KiB): Pretty view of extracted content.xml
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Process inside compressed files

Post by dfhtextpipe »

The Notepad++ plugin called XML Tools is another means to "Pretty-print" an XML file, and it also has a "Linearize" option.

Just a further suggestion for your developers....

Maybe it would be nice if TextPipe could be enhanced to also include such methods by means of various XML sub-filters.

David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Process inside compressed files

Post by DataMystic Support »

Thanks David - that is very detailed and very helpful.

We have added .ODT for the next release of TP.

Also - I have attached a sample XML Linearize filter.
Attachment: xml linearize.zip (804 Bytes): Linearize XML files
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Process inside compressed files

Post by dfhtextpipe »

Thanks Simon.

I think you may have the "XML linearize" terminology flipped!
A linearized XML file is one with everything (except the schema) as a single (very long) line.
A "Pretty Print" XML file is one where the XML is "de-linearized" and intelligently indented.

Examining the rudimentary XML Linearize filter, I observe that (as yet) it does not also apply any indenting.
Something to think about for the future, perhaps. Not urgent - I can still use XML Copy Editor.

Also, it would be sensible to tick Enable UTF-8 support in the Perl sub-filters.

David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Process inside compressed files

Post by DataMystic Support »

Thanks David - I will create a new 'XML pretty print' filter, and a new XML Linearize filter that simply replaces all CR/LFs with a space and optionally compresses spaces.

I will also enable utf-8 support in those filters.
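The linearize behaviour described here (CR/LFs to a space, optional space compression) can be sketched in a few lines of Python. This is not the attached TextPipe filter; `linearize` and its `compress_spaces` parameter are hypothetical names for illustration.

```python
import re


def linearize(text: str, compress_spaces: bool = False) -> str:
    """Replace every line break with a space; optionally squeeze space runs."""
    # Treat CR LF, a lone CR and a lone LF each as one line break.
    out = re.sub(r"\r\n|\r|\n", " ", text)
    if compress_spaces:
        # Collapse runs of two or more spaces into a single space.
        out = re.sub(r" {2,}", " ", out)
    return out


if __name__ == "__main__":
    sample = "<a>\r\n  <b/>\n</a>"
    print(linearize(sample))
    print(linearize(sample, compress_spaces=True))
```

One caveat: this also alters whitespace inside text content, which may matter for documents that rely on preserved spacing.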
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Process inside compressed files

Post by dfhtextpipe »

Simon,

When linearizing XML, please take care over the XML schema.
Normally this should be on the first line of text.

Sometimes the definition lookup is spread over more than the first line of text.
Some XML validation tools fail when this is the case.

David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Process inside compressed files

Post by DataMystic Support »

I have attached updated filters. BTW, I know the pretty printer is far from pretty.
Would you like to check the schema linearizing? If it is just between < and >, then it should be put on one line anyway.
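The idea of joining anything between < and > onto one line can be sketched as below. This is a Python illustration, not the attached filter; `join_tags` is a hypothetical name, and the regex assumes no `>` inside attribute values, comments, CDATA sections or a DOCTYPE internal subset.

```python
import re

# Assumption: a "tag" is any run from '<' to the next '>' with no nested '<' or '>'.
_TAG = re.compile(r"<[^<>]*>")


def join_tags(text: str) -> str:
    """Collapse line breaks (and surrounding whitespace) inside each <...> span."""

    def fix(match: "re.Match[str]") -> str:
        return re.sub(r"\s*[\r\n]+\s*", " ", match.group(0))

    return _TAG.sub(fix, text)


if __name__ == "__main__":
    sample = '<!DOCTYPE note\n  SYSTEM "note.dtd">\n<note\n   id="1">text\nmore</note>'
    print(join_tags(sample))
```

Line breaks outside the angle brackets are left alone here, so a declaration spread over several lines ends up on one line without touching the text content.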
Attachment: xml linearize and pretty print2.zip (1.54 KiB)