Remove duplicate content inside unique data blocks

Posted: Sat Sep 19, 2009 1:22 am
by alnico
Hi,

How can I remove duplicate tables where the structure of the table and the content of each cell are the same, while ignoring any attributes, id numbers, etc. within the tags?

Here are five tables [input]. Four have the same 'content' (not in order) and I want to remove all but one of them. However, one of the four has a <div> tag that makes its structure non-identical, so only three are identical in content AND structure.
I would like to retain the table order after the duplicates are removed, keeping the first of each set of duplicates [output].
Note: I need to retain one duplicate table with all its tag attributes. Otherwise I could remove the attributes, put each table on a single line, then sort and remove duplicates. Maybe there is a way to capture the attributes and re-insert them at the end?

Any ideas on how to accomplish this?

Thanks,
Brent

Input:

<table>
	<tr id="1">
		<td id="1">
			<content>XXX</content>
		</td>
	</tr>
	<tr id="2">
		<td id="1">
			<content>XXX</content>
		</td>
	</tr>
</table>

<table>
	<tr id="3">
		<td id="2">
			<content>XXX</content>
		</td>
	</tr>
	<tr id="4">
		<td id="2">
			<content>XXX</content>
		</td>
	</tr>
</table>

<table>
	<tr id="5">
		<td id="3">
			<content>X</content>
		</td>
	</tr>
	<tr id="6">
		<td id="3">
			<content>X</content>
		</td>
	</tr>
</table>

<table>
	<tr id="7">
		<td id="4">
			<content>XXX</content>
		</td>
	</tr>
	<tr id="8">
		<td id="4">
			<content>XXX</content>
		</td>
	</tr>
</table>

<table>
	<div>
		<tr id="9">
			<td id="5">
				<content>XXX</content>
			</td>
		</tr>
	</div>
	<tr id="10">
		<td id="5">
			<content>XXX</content>
		</td>
	</tr>
</table>
Output:

<table>
	<tr id="1">
		<td id="1">
			<content>XXX</content>
		</td>
	</tr>
	<tr id="2">
		<td id="1">
			<content>XXX</content>
		</td>
	</tr>
</table>

<table>
	<tr id="5">
		<td id="3">
			<content>X</content>
		</td>
	</tr>
	<tr id="6">
		<td id="3">
			<content>X</content>
		</td>
	</tr>
</table>

<table>
	<div>
		<tr id="9">
			<td id="5">
				<content>XXX</content>
			</td>
		</tr>
	</div>
	<tr id="10">
		<td id="5">
			<content>XXX</content>
		</td>
	</tr>
</table>

Re: Remove duplicate content inside unique data blocks

Posted: Thu Sep 24, 2009 1:12 am
by alnico
I have figured out a way to do this...

Put each table on a single line
Add a line number to each table for sorting and as an ID
Duplicate each table and tag one of the copies
Remove the non-comparable text from one copy
Sort and remove duplicates
Find and extract the matches, keeping the original tables
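
For readers without the filter, the same steps can be sketched in Python. This is not the attached filter, only a rough illustration of the idea. The names `comparison_key` and `dedupe_tables` are made up for this example, and the regex-based tag handling assumes that no attribute value contains a `>` and that tables are not nested. Text outside the tables is dropped in this sketch.

```python
import re

# Assumption: tables are not nested, so a non-greedy match finds each one.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.DOTALL)
# Assumption: attribute values never contain '>'.
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w:-]*)[^>]*>")


def comparison_key(table: str) -> str:
    """Return a copy of the table with the non-comparable text removed:
    attributes are stripped from every tag and the whitespace between
    tags is dropped, which also puts the table on a single line."""
    no_attrs = TAG_RE.sub(r"<\1\2>", table)
    return re.sub(r">\s+<", "><", no_attrs).strip()


def dedupe_tables(text: str) -> str:
    """Keep the first table of each group of duplicates, in original order."""
    tables = TABLE_RE.findall(text)

    # Pair each stripped copy with its position (the "line number"), then
    # sort so that duplicates end up next to each other, earliest first.
    tagged = sorted((comparison_key(t), i) for i, t in enumerate(tables))

    keep = set()
    previous_key = None
    for key, position in tagged:
        if key != previous_key:
            keep.add(position)  # first occurrence of this key
            previous_key = key

    # Emit the untouched originals, attributes included, in input order.
    return "\n\n".join(tables[i] for i in sorted(keep))
```

Calling `dedupe_tables` on the input from the first post should leave the tables whose first rows have id 1, 5 and 9. The table with the `<div>` survives because the `<div>` tag stays in its comparison key.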

Filter attached for anybody to use.

Brent

Re: Remove duplicate content inside unique data blocks

Posted: Thu Sep 24, 2009 6:21 am
by DataMystic Support
Thanks Brent - scary what you can achieve!