Restrict to HTML or XML element or attribute

dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Restrict to HTML or XML element or attribute

Post by dfhtextpipe »

I think a very useful enhancement would be feasible for the Restrict to between tags filter.

Suppose one has an XML file containing numerous similar XML div elements,
each div identified in its start element by the same attribute having a different value.

The screenshot illustrates the concept. Notepad++ XML partially folded view.
https://www.dropbox.com/s/3fkml1dx3ptxe ... 5.png?dl=0

Suppose you wish to restrict the subfilters to operate on just one (or a few) of these divs.

The UI form for the start tag could be optionally extended to the right, so as to enable the user to specify an attribute name and a corresponding value pattern.

Notes:
  • This concept is not the same as the Restrict to attribute filter.
  • The named attribute should not need to be the sole attribute in the start tag.
  • Its position within the tag should not matter.
  • The value pattern may be Exact, Perl, Easypattern, etc., each with associated Options, where applicable.
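
To make the notes above concrete, here is a minimal Python sketch of the proposed test on a start tag. It is not TextPipe code: the function name start_tag_has_attr and its parameters are hypothetical, the value pattern is treated as a regular expression, and only quoted attribute values are handled.

```python
import re

def start_tag_has_attr(start_tag: str, attr: str, value_pattern: str) -> bool:
    """Return True if start_tag carries attr with a value fully matching value_pattern."""
    # Collect every name="value" or name='value' pair, wherever it sits in the tag,
    # so the named attribute need not be the only one or in any fixed position.
    pairs = re.findall(r'([\w:.-]+)\s*=\s*(?:"([^"]*)"|\'([^\']*)\')', start_tag)
    for name, double_quoted, single_quoted in pairs:
        if name == attr and re.fullmatch(value_pattern, double_quoted or single_quoted):
            return True
    return False

tag = '<div canonical="true" level="3" type="section">'
print(start_tag_has_attr(tag, "type", r"section|majorSection"))  # True
print(start_tag_has_attr(tag, "level", r"1"))                    # False
```

Because all attribute pairs are collected before testing, the order of attributes inside the tag makes no difference to the result.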
Thoughts?

Best regards,

David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Restrict to HTML or XML element or attribute

Post by DataMystic Support »

Sure, I agree. To be most useful, it should count the number of <yyy .... zzzz=gggg .... > tags to ensure it matches the correct end </yyy> tag.
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Restrict to HTML or XML element or attribute

Post by dfhtextpipe »

Simon wrote:
"To be most useful, it should count the number of <yyy .... zzzz=gggg .... > tags to ensure it matches the correct end </yyy> tag."
This doesn't really make sense! There's nothing that needs counting, surely?

The fact that a start tag has a certain attribute value doesn't make any difference to the XML structure.
The XML element start and end tags will still match just as they already do.
Otherwise the XML would not pass syntax check.

Best regards,

David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Restrict to HTML or XML element or attribute

Post by DataMystic Support »

Consider

Code:

<div attr="value">
	<div> ...
	</div>
	<div> ...
	</div>
	<div> ...
	</div>
...
</div>
Ideally you want a Restrict to between tags filter to match the outer pair of div tags, rather than stopping at the first </div>.
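
The counting idea can be sketched in Python roughly as follows. This is only an illustration, not how TextPipe works internally: outer_span is a hypothetical helper, and it assumes attribute values never contain a ">" character.

```python
import re

def outer_span(text: str, tag: str, start: int = 0):
    """Return (begin, end) of the first <tag>...</tag> element at or after start,
    pairing the start tag with its own end tag by counting nesting depth."""
    token = re.compile(r"<(/?)%s\b[^>]*?(/?)>" % re.escape(tag))
    depth = 0
    begin = None
    for m in token.finditer(text, start):
        closing, self_closing = m.group(1), m.group(2)
        if self_closing and not closing:
            if depth == 0:
                return m.start(), m.end()  # a lone <tag/> is a complete element
            continue                       # nested <tag/> leaves the depth unchanged
        if not closing:
            if depth == 0:
                begin = m.start()
            depth += 1
        elif depth > 0:
            depth -= 1
            if depth == 0:
                return begin, m.end()
    return None  # no balanced element found

sample = '<div attr="value">\n\t<div> a\n\t</div>\n\t<div> b\n\t</div>\n</div>\n<p>tail</p>'
begin, end = outer_span(sample, "div")
print(sample[begin:end])
```

The printed span runs from the outer start tag to the final </div>, so both inner divs stay inside the restriction.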
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Restrict to HTML or XML element or attribute

Post by dfhtextpipe »

I always assumed that the existing Restrict to between tags filter was already fully capable of matching in accordance with the XML hierarchy.

If that's not the case, then we're in deep trouble!
  • A greedy match could span too much.
  • A non-greedy match could span too little.
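
Both failure modes are easy to reproduce with plain regular expressions. A small Python illustration on made-up input (the sample string is an assumption, not taken from any real file):

```python
import re

xml = "<r><div>A<div>B</div></div><p/><div>C</div></r>"

# Greedy: runs from the first <div> to the LAST </div> in the input.
greedy = re.search(r"<div>.*</div>", xml, re.S).group(0)
# Non-greedy: runs from the first <div> to the FIRST </div> it meets.
lazy = re.search(r"<div>.*?</div>", xml, re.S).group(0)

print(greedy)  # <div>A<div>B</div></div><p/><div>C</div>
print(lazy)    # <div>A<div>B</div>
```

The element that actually starts at the first <div> is <div>A<div>B</div></div>, which neither pattern returns.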
As my proposal is for a subset of what the existing filter matches, and would be implemented by adding a tick box and two edit boxes to the UI,
what is the reason for any extra counting?

Currently a Restrict to between tags filter with rdg as the tag name logs like this:

Code:

2016-03-12 10:16:05,Info,0 replace(s) performed for pattern match [<rdg(?:\s+[\w\:\-]+(?:\s*=\s*(?:(?:'(?:[^'<>]|''|\\')*+')|(?:"(?:[^"<>]|""|\\")*+")|(?:[^>'"\s]*)))?\s*)*\s*>]
2016-03-12 10:16:05,Info,0 replace(s) performed for pattern match [<\/rdg[>'"\s]]
2016-03-12 10:16:05,Info,0 replace(s) performed for pattern match [####MARKER####(.*)####MARKER####]
I have assumed that the complicated PCRE pattern on the first line inserts the starting "####MARKER####"
and the one on the second line inserts the ending one.
I have assumed that this takes care of the XML hierarchy, and that the third line is how the restriction works.

Aside: Hard luck if your XML input file just happens to contain "####MARKER####", eh?


David
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Restrict to HTML or XML element or attribute

Post by dfhtextpipe »

One question might arise when you have a hierarchy of XML elements with the same name and with a particular attribute.

Code:

<div canonical="true" level="1" type="book" ....>
  ...
  <div canonical="false" level="2" type="majorSection" ....>
    ...
    <div canonical="true" level="3" type="section" ....>
      ...
    </div>
    ...
  </div>
  ...
</div>
And suppose you wish to restrict to between div tags, but only for divs with canonical="true".

Should everything in the level 2 div element be excluded?
Or should the level 3 div element be included?
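
The two readings can be compared with a short Python sketch using the standard xml.etree.ElementTree module. The function included_levels and its reinclude_nested flag are hypothetical names, the sample omits the elided attributes and content, and it assumes every element is a div with a level attribute.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<div canonical="true" level="1" type="book">'
    '<div canonical="false" level="2" type="majorSection">'
    '<div canonical="true" level="3" type="section"/>'
    '</div></div>'
)

def included_levels(root, attr, value, reinclude_nested):
    """List the level of each div that would fall inside the restriction."""
    levels = []

    def walk(el):
        if el.get(attr) == value:
            levels.append(el.get("level"))
        elif not reinclude_nested:
            return  # a non-matching div hides its whole subtree
        for child in el:
            walk(child)

    walk(root)
    return levels

print(included_levels(doc, "canonical", "true", reinclude_nested=False))  # ['1']
print(included_levels(doc, "canonical", "true", reinclude_nested=True))   # ['1', '3']
```

With reinclude_nested=False everything inside the level 2 div is excluded; with reinclude_nested=True the level 3 div is picked up again.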

So yes, it can get complicated. But the question still remains:

How would restrict to between div tags work without considering the proposed enhancement?
For example, if you wanted to restrict to the level 3 div elements, I imagine you'd already be using a hierarchy of restrict filters.


David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Restrict to HTML or XML element or attribute

Post by DataMystic Support »

Hi David,

No - the current matching filter does not count the levels. The ##marker## tags are definitely there to make it easier - the regex for a start and end element with various quoting and escaping styles for attributes is nasty.

To count the tags will require a state machine to parse the tags, which is not a huge issue, just a new item for my ever-growing list.
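
For illustration only, a state machine of this kind might look like the Python sketch below. It is not TextPipe's implementation: iter_tags and element_end are hypothetical names, and comments, CDATA sections and processing instructions are deliberately ignored.

```python
from enum import Enum, auto

class State(Enum):
    TEXT = auto()   # outside any tag
    TAG = auto()    # inside <...>, outside quotes
    QUOTE = auto()  # inside a quoted attribute value

def iter_tags(text):
    """Yield (start, end, source) for each <...> tag, ignoring '>' inside quotes."""
    state, quote, begin = State.TEXT, "", 0
    for i, ch in enumerate(text):
        if state is State.TEXT:
            if ch == "<":
                state, begin = State.TAG, i
        elif state is State.TAG:
            if ch in "\"'":
                state, quote = State.QUOTE, ch
            elif ch == ">":
                state = State.TEXT
                yield begin, i + 1, text[begin:i + 1]
        elif ch == quote:  # State.QUOTE: only the same quote character ends it
            state = State.TAG

def element_end(text, name):
    """Return the index just past the end tag closing the first <name ...> element."""
    depth = 0
    for _start, end, src in iter_tags(text):
        body = src[1:-1].strip()
        parts = body.lstrip("/").split()
        if not parts or parts[0] != name or body.endswith("/"):
            continue  # other element, empty tag, or self-closing <name .../>
        if body.startswith("/"):
            if depth > 0:
                depth -= 1
                if depth == 0:
                    return end
        else:
            depth += 1
    return None

text = '<div attr="a > b"><div>x</div><div>y</div></div><div>z</div>'
print(text[:element_end(text, "div")])
# <div attr="a > b"><div>x</div><div>y</div></div>
```

Tracking the quote state is what lets the ">" inside attr="a > b" pass without ending the tag early.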