Validating large ~600GB CSV against schema?

Get help with installation and running here.

Moderators: DataMystic Support, Moderators

Ville
Posts: 3
Joined: Thu Nov 20, 2008 7:52 am
Contact:

Validating large ~600GB CSV against schema?

Post by Ville »

Hello! I used TextPipe many years ago, and now a new potential use case has come up.

Would I be able to use TextPipe to validate a ~600GB CSV file against its schema? I'm trying to ingest the data into AWS Redshift, but the data set (over 280,000,000 records) has quite a few records that apparently don't match the given schema, and the COPY (the ingest into Redshift) fails with "separator not found", even after I maxed out the COPY's MAXERROR parameter. The schema has about 850 rows (one per field); most are VARCHARs, with a handful of INTEGER and DOUBLE PRECISION fields.

If I can accomplish this more easily with TextPipe than by writing an ad-hoc script, I'd rather save the time and use TextPipe! :-)
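
For context, the kind of ad-hoc script I'd otherwise write is roughly the Python sketch below. It assumes a plain comma delimiter with no quoted fields containing commas or embedded newlines, which I haven't verified for this data set.

# Rough sketch of an ad-hoc structural check: split the big CSV into
# conforming and non-conforming lines in a single streaming pass.
# Assumptions: 850 columns, comma delimiter, no quoted/embedded commas.
import sys

EXPECTED_FIELDS = 850  # number of fields defined by the schema

def split_bad_lines(in_path, good_path, bad_path):
    """Write conforming lines to good_path and everything else to bad_path."""
    with open(in_path, "r", encoding="utf-8", errors="replace") as src, \
         open(good_path, "w", encoding="utf-8") as good, \
         open(bad_path, "w", encoding="utf-8") as bad:
        for line in src:
            # Structural check only: a valid record has 849 separators.
            if line.rstrip("\r\n").count(",") == EXPECTED_FIELDS - 1:
                good.write(line)
            else:
                bad.write(line)

if __name__ == "__main__":
    split_bad_lines(sys.argv[1], sys.argv[2], sys.argv[3])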

Thanks for any insights on this!
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Validating large ~600GB CSV against schema?

Post by dfhtextpipe »

What exactly does a schema mean in the context of a CSV file?

Presumably, it's to validate the syntax of each field in the data?

For comparison, I'm only familiar with using a schema to validate XML files.

David
TextPipe user since 2001
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia
Contact:

Re: Validating large ~600GB CSV against schema?

Post by DataMystic Support »

Hi there,

One approach you could use with the Redshift import is to remove lines that do not have 850 fields in them, then treat those exceptions separately.

The file size of 600GB is not an issue for TextPipe.

Use Filter Library\Extract\Extract lines not matching (inverse grep)

With an EasyPattern of:

[ 849 (CSVfield, comma), CSVfield, cr, lf ]

i.e. 849 fields each followed by a comma, then the 850th field and the CR/LF line terminator.

What this filter outputs is your dud data.
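
If you'd like to prototype the same structural check outside TextPipe, a rough regular-expression equivalent is sketched below in Python (assumptions: plain unquoted fields and a placeholder sample file name; unlike TextPipe's CSVfield, it does not handle quoted fields that contain commas):

import re

# Hypothetical stand-in for the EasyPattern above: exactly 850 simple
# (unquoted) fields per line. Quoted fields containing commas are not handled.
WELL_FORMED = re.compile(r'^(?:[^,\r\n]*,){849}[^,\r\n]*$')

with open("sample.csv", "r", encoding="utf-8", errors="replace") as f:
    for lineno, line in enumerate(f, start=1):
        if not WELL_FORMED.match(line.rstrip("\r\n")):
            print(f"line {lineno}: wrong number of fields")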

To test this on a smaller file fragment, first use the filter
Filter Library\Remove\Start or end of file
and configure it to include 2,000,000 bytes from the start of the file.
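
Equivalently, outside TextPipe you could cut the same kind of test sample with a couple of lines of Python (a sketch with placeholder file names; the cut will usually land mid-record, so the last line of the sample may be truncated):

# Copy roughly the first 2,000,000 bytes of the big file into a test sample.
# Placeholder file names; the final line of the sample may be cut short.
with open("bigfile.csv", "rb") as src, open("sample.csv", "wb") as dst:
    dst.write(src.read(2_000_000))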


Simon