Remove duplicate content inside unique data blocks


Moderators: DataMystic Support, Moderators

alnico
Posts: 74
Joined: Fri Oct 12, 2007 11:57 pm

Remove duplicate content inside unique data blocks

Post by alnico »

Hi,

How can I remove duplicate tables whose structure and cell content are the same, while ignoring any attributes, id numbers, etc. within the tags?

Here are five tables [input]. Four have the same 'content' (not in order), and I want to remove all but one. However, one of the four has a <div> tag that makes its structure non-identical, so only three are identical in content AND structure.
I would like to retain the table order after the duplicates are removed, keeping the first table of each set of duplicates [output].
Note: I need to retain one duplicate table with all its tag attributes. Otherwise I could remove these attributes, put everything on a single line, then sort and remove duplicates. Maybe there is a way to capture the attributes and re-insert them at the end?

Any ideas on how to accomplish this?

Thanks,
Brent

Input:

<table>
	<tr id="1">
		<td id="1">
			<content>XXX</content>
		</td>
	</tr>
	<tr id="2">
		<td id="1">
			<content>XXX</content>
		</td>
	</tr>
</table>

<table>
	<tr id="3">
		<td id="2">
			<content>XXX</content>
		</td>
	</tr>
	<tr id="4">
		<td id="2">
			<content>XXX</content>
		</td>
	</tr>
</table>

<table>
	<tr id="5">
		<td id="3">
			<content>X</content>
		</td>
	</tr>
	<tr id="6">
		<td id="3">
			<content>X</content>
		</td>
	</tr>
</table>

<table>
	<tr id="7">
		<td id="4">
			<content>XXX</content>
		</td>
	</tr>
	<tr id="8">
		<td id="4">
			<content>XXX</content>
		</td>
	</tr>
</table>

<table>
	<div>
		<tr id="9">
			<td id="5">
				<content>XXX</content>
			</td>
		</tr>
	</div>
	<tr id="10">
		<td id="5">
			<content>XXX</content>
		</td>
	</tr>
</table>

Output:

<table>
	<tr id="1">
		<td id="1">
			<content>XXX</content>
		</td>
	</tr>
	<tr id="2">
		<td id="1">
			<content>XXX</content>
		</td>
	</tr>
</table>

<table>
	<tr id="5">
		<td id="3">
			<content>X</content>
		</td>
	</tr>
	<tr id="6">
		<td id="3">
			<content>X</content>
		</td>
	</tr>
</table>

<table>
	<div>
		<tr id="9">
			<td id="5">
				<content>XXX</content>
			</td>
		</tr>
	</div>
	<tr id="10">
		<td id="5">
			<content>XXX</content>
		</td>
	</tr>
</table>
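
For readers outside the tool, the idea of "same content and structure, ignoring attributes" can be sketched as a comparison key. This is a minimal Python illustration, not part of the original post: the function name `table_key` is hypothetical, and it assumes attributes never contain a `>` character and that whitespace between tags is insignificant.

```python
import re

def table_key(table: str) -> str:
    """Return a comparison key made of tag names and text only."""
    # Drop attributes from opening tags: <tr id="1"> becomes <tr>.
    # Closing tags such as </tr> are left untouched.
    no_attrs = re.sub(r"<(\w+)\b[^>]*>", r"<\1>", table)
    # Remove whitespace between tags so indentation and line breaks are ignored.
    return re.sub(r">\s+<", "><", no_attrs.strip())

plain = '<table>\n\t<tr id="1">\n\t\t<td id="1">\n\t\t\t<content>XXX</content>\n\t\t</td>\n\t</tr>\n</table>'
other_ids = '<table>\n\t<tr id="3">\n\t\t<td id="2">\n\t\t\t<content>XXX</content>\n\t\t</td>\n\t</tr>\n</table>'
with_div = '<table>\n\t<div>\n\t\t<tr id="9">\n\t\t\t<td id="5">\n\t\t\t\t<content>XXX</content>\n\t\t\t</td>\n\t\t</tr>\n\t</div>\n</table>'

print(table_key(plain) == table_key(other_ids))  # same key: only the ids differ
print(table_key(plain) == table_key(with_div))   # different key: extra <div> level
```

Two tables count as duplicates when their keys are equal, while the original text with all its attributes stays available for output.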
alnico
Posts: 74
Joined: Fri Oct 12, 2007 11:57 pm

Re: Remove duplicate content inside unique data blocks

Post by alnico »

I have figured out a way to do this:

Put each table on a single line
Add a line number for sorting and as an ID
Duplicate each table and tag one of the copies
Remove non-comparable text from one copy
Sort and remove duplicates
Find and extract matches, keeping the original table
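
The steps above can be approximated outside the tool as well. The following is a loose Python analogue, not the attached filter: the function names (`split_tables`, `comparable`, `unique_tables`) are hypothetical, and it assumes tables are never nested and attributes never contain a `>` character.

```python
import re
from itertools import groupby

def split_tables(text):
    # One string per <table>...</table> block (assumes no nested tables).
    return re.findall(r"<table\b.*?</table>", text, flags=re.DOTALL)

def comparable(table):
    # Single-line copy with the non-comparable text (attributes) removed.
    stripped = re.sub(r"<(\w+)\b[^>]*>", r"<\1>", table)
    return re.sub(r">\s+<", "><", stripped.strip())

def unique_tables(text):
    tables = split_tables(text)
    # Pair every original table with its position and its comparable copy.
    rows = [(comparable(t), i, t) for i, t in enumerate(tables)]
    # Sort by comparable copy, then position, so the earliest duplicate leads each group.
    rows.sort(key=lambda row: (row[0], row[1]))
    # Remove duplicates: keep only the first row of each group of equal copies.
    firsts = [next(group) for _, group in groupby(rows, key=lambda row: row[0])]
    # Restore the original order and return the untouched originals.
    firsts.sort(key=lambda row: row[1])
    return [original for _, _, original in firsts]
```

Calling `"\n\n".join(unique_tables(text))` on the input from the first post should give the three tables shown as the desired output, each with its original attributes intact.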

Filter attached for anybody to use.

Brent
Attachments
Unique tables-remove content duplicates.zip
(988 Bytes) Downloaded 473 times
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Remove duplicate content inside unique data blocks

Post by DataMystic Support »

Thanks Brent - scary what you can achieve!