ID unique/duplicate lines
Posted: Fri Apr 15, 2011 2:05 am
I am ripping data to create a relational database.
When I rip a column/field from my source file…it needs to be de-duplicated and then an ID assigned (add line number)…pretty simple.
I also, must create a replace list that will find the string in the source file and replace it with the ID. (running a total of two fll in succession).
Now this gets a little complex when fields overlap and share many ID’s.
Anyway, I am looking for a way to simplify this whole process. Here is what I am thinking…
A new filter ('ID lines') based on the ‘Count duplicate lines’ filter.
The filter would simply find all unique/duplicate lines and assign a number (ID)…it would only increment the ID number for each unique match. In the Output format field, replace %d with ‘The ID’)
Although this ID is just a number, it would be easy to wrap it for identification for later processing, i.e. FName=%d\t%s\t/Fname
Here is an example…the first three ‘ID lines’ process single fields…the fourth processes two fields:
Input (tab delimited for clarification):
Output after ‘ID lines’ first field:
Output after ‘ID lines’ second field:
Output after ‘ID lines’ third field:
Output after ‘ID lines’ first and second field:
I would then rip all fields and create individual output for database import which would look like this:
TableName=FirstNames
TableName=LastNames
TableName=Country
TableName=FirstLastName (shows relationships to tables: FirstNames and LastNames)
This may look complicated...but I don't think it is...padding the fields is the biggest issue.
Regards,
Brent
On another related note; it would be sweet if the ‘Count/Remove duplicate lines’ and similar filters could sit as a subfilter of the ‘Restrict to delimited fields’…then we would not have to worry about padding, etc. (maybe this could happen with current filters; Logic…‘IF Count/Remove duplicate lines and similar filters' IS a subfilter of ‘Restrict to delimited fields’ gray out/ignore Start column and Length). Just an idea, I have no idea of other implications
When I rip a column/field from my source file…it needs to be de-duplicated and then an ID assigned (add line number)…pretty simple.
I also, must create a replace list that will find the string in the source file and replace it with the ID. (running a total of two fll in succession).
Now this gets a little complex when fields overlap and share many ID’s.
Anyway, I am looking for a way to simplify this whole process. Here is what I am thinking…
A new filter ('ID lines') based on the ‘Count duplicate lines’ filter.
The filter would simply find all unique/duplicate lines and assign a number (ID)…it would only increment the ID number for each unique match. In the Output format field, replace %d with ‘The ID’)
Although this ID is just a number, it would be easy to wrap it for identification for later processing, i.e. FName=%d\t%s\t/Fname
Here is an example…the first three ‘ID lines’ process single fields…the fourth processes two fields:
Input (tab delimited for clarification):
Code: Select all
Joe Jones Germany
Joe Jones Australia
Joe Johnson USA
Tim Johnson USA
Tim Jones Germany
Fred Smith
Code: Select all
FName=1 Joe /Fname Jones Germany
FName=1 Joe /Fname Jones Australia
FName=1 Joe /Fname Johnson USA
FName=2 Tim /Fname Johnson USA
FName=2 Tim /Fname Jones Germany
FName=3 Fred /Fname Smith
Code: Select all
FName=1 Joe /Fname LName=1 Jones /Lname Germany
FName=1 Joe /Fname LName=1 Jones /Lname Australia
FName=1 Joe /Fname LName=2 Johnson /Lname USA
FName=2 Tim /Fname LName=2 Johnson /Lname USA
FName=2 Tim /Fname LName=1 Jones /Lname Germany
FName=3 Fred /Fname LName=3 Smith /Lname
Code: Select all
FName=1 Joe /Fname LName=1 Jones /Lname Country=1 Germany /Country
FName=1 Joe /Fname LName=1 Jones /Lname Country=2 Australia /Country
FName=1 Joe /Fname LName=2 Johnson /Lname Country=3 USA /Country
FName=2 Tim /Fname LName=2 Johnson /Lname Country=3 USA /Country
FName=2 Tim /Fname LName=1 Jones /Lname Country=1 Germany /Country
FName=3 Fred /Fname LName=3 Smith /Lname
Code: Select all
FLName=1 FName=1 Joe /Fname LName=1 Jones /Lname /FLName Country=1 Germany /Country
FLName=1 FName=1 Joe /Fname LName=1 Jones /Lname /FLName Country=2 Australia /Country
FLName=2 FName=1 Joe /Fname LName=2 Johnson /Lname /FLName Country=3 USA /Country
FLName=3 FName=2 Tim /Fname LName=2 Johnson /Lname /FLName Country=3 USA /Country
FLName=4 FName=2 Tim /Fname LName=1 Jones /Lname /FLName Country=1 Germany /Country
FLName=5 FName=3 Fred /Fname LName=3 Smith /Lname /FLName
I would then rip all fields and create individual output for database import which would look like this:
TableName=FirstNames
Code: Select all
ID FName
1 Joe
2 Tim
3 Fred
Code: Select all
ID LName
1 Jones
2 Johnson
3 Smith
Code: Select all
ID Country
1 Germany
2 Australia
3 USA
Code: Select all
ID FNameID LNameID
1 1 1
2 1 2
3 2 2
4 2 1
5 3 3
Regards,
Brent
On another related note; it would be sweet if the ‘Count/Remove duplicate lines’ and similar filters could sit as a subfilter of the ‘Restrict to delimited fields’…then we would not have to worry about padding, etc. (maybe this could happen with current filters; Logic…‘IF Count/Remove duplicate lines and similar filters' IS a subfilter of ‘Restrict to delimited fields’ gray out/ignore Start column and Length). Just an idea, I have no idea of other implications