ID unique/duplicate lines

alnico · Post by **alnico** » Fri Apr 15, 2011 2:05 am

I am ripping data to create a relational database.

When I rip a column/field from my source file…it needs to be de-duplicated and then an ID assigned (add line number)…pretty simple.
I also, must create a replace list that will find the string in the source file and replace it with the ID. (running a total of two fll in succession).
Now this gets a little complex when fields overlap and share many ID’s.

Anyway, I am looking for a way to simplify this whole process. Here is what I am thinking…

A new filter ('ID lines') based on the ‘Count duplicate lines’ filter.
The filter would simply find all unique/duplicate lines and assign a number (ID)…it would only increment the ID number for each unique match. In the Output format field, replace %d with ‘The ID’)
Although this ID is just a number, it would be easy to wrap it for identification for later processing, i.e. FName=%d\t%s\t/Fname

Here is an example…the first three ‘ID lines’ process single fields…the fourth processes two fields:

Input (tab delimited for clarification):

Code: Select all

Joe	Jones	Germany
Joe	Jones	Australia
Joe	Johnson	USA
Tim	Johnson	USA
Tim	Jones	Germany
Fred	Smith

Output after ‘ID lines’ first field:

Code: Select all

FName=1	Joe	/Fname	Jones	Germany
FName=1	Joe	/Fname	Jones	Australia
FName=1	Joe	/Fname	Johnson	USA
FName=2	Tim	/Fname	Johnson	USA
FName=2	Tim	/Fname	Jones	Germany
FName=3	Fred	/Fname	Smith

Output after ‘ID lines’ second field:

Code: Select all

FName=1	Joe	/Fname	LName=1	Jones	/Lname	Germany
FName=1	Joe	/Fname	LName=1	Jones	/Lname	Australia
FName=1	Joe	/Fname	LName=2	Johnson	/Lname	USA
FName=2	Tim	/Fname	LName=2	Johnson	/Lname	USA
FName=2	Tim	/Fname	LName=1	Jones	/Lname	Germany
FName=3	Fred	/Fname	LName=3	Smith	/Lname

Output after ‘ID lines’ third field:

Code: Select all

FName=1	Joe	/Fname	LName=1	Jones	/Lname	Country=1	Germany	/Country
FName=1	Joe	/Fname	LName=1	Jones	/Lname	Country=2	Australia	/Country
FName=1	Joe	/Fname	LName=2	Johnson	/Lname	Country=3	USA	/Country
FName=2	Tim	/Fname	LName=2	Johnson	/Lname	Country=3	USA	/Country
FName=2	Tim	/Fname	LName=1	Jones	/Lname	Country=1	Germany	/Country
FName=3	Fred	/Fname	LName=3	Smith	/Lname

Output after ‘ID lines’ first and second field:

Code: Select all

FLName=1	FName=1	Joe	/Fname	LName=1	Jones	/Lname	/FLName	Country=1	Germany	/Country
FLName=1	FName=1	Joe	/Fname	LName=1	Jones	/Lname	/FLName	Country=2	Australia	/Country
FLName=2	FName=1	Joe	/Fname	LName=2	Johnson	/Lname	/FLName	Country=3	USA	/Country
FLName=3	FName=2	Tim	/Fname	LName=2	Johnson	/Lname	/FLName	Country=3	USA	/Country
FLName=4	FName=2	Tim	/Fname	LName=1	Jones	/Lname	/FLName	Country=1	Germany	/Country
FLName=5	FName=3	Fred	/Fname	LName=3	Smith	/Lname	/FLName

I would then rip all fields and create individual output for database import which would look like this:

TableName=FirstNames

Code: Select all

ID	FName
1	Joe
2	Tim
3	Fred

TableName=LastNames

Code: Select all

ID	LName
1	Jones
2	Johnson
3	Smith

TableName=Country

Code: Select all

ID	Country
1	Germany
2	Australia
3	USA

TableName=FirstLastName (shows relationships to tables: FirstNames and LastNames)

Code: Select all

ID	FNameID	LNameID
1	1	1
2	1	2
3	2	2
4	2	1
5	3	3

This may look complicated...but I don't think it is...padding the fields is the biggest issue.

Regards,
Brent

On another related note; it would be sweet if the ‘Count/Remove duplicate lines’ and similar filters could sit as a subfilter of the ‘Restrict to delimited fields’…then we would not have to worry about padding, etc. (maybe this could happen with current filters; Logic…‘IF Count/Remove duplicate lines and similar filters' IS a subfilter of ‘Restrict to delimited fields’ gray out/ignore Start column and Length). Just an idea, I have no idea of other implications