
Validating large ~600GB CSV against schema?

Posted: Sun Mar 10, 2019 7:07 pm
by Ville
Hello! I used TextPipe many years ago, and now a new potential use case has come up.

Would I be able to use TextPipe to validate a ~600GB CSV file against its schema? I'm trying to ingest the data into AWS Redshift, but the data set (over 280,000,000 records) has quite a few records that apparently don't match the given schema, and the COPY (the ingest into Redshift) fails with "separator not found", even after I maxed out COPY's MAXERROR parameter. The schema defines about 850 columns, most of which are VARCHARs, with a handful of INTEGER and DOUBLE PRECISION fields.

If I can accomplish this more easily with TextPipe than by writing an ad-hoc script for the purpose, I'd rather save time and use TextPipe! :-)
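
For context, the ad-hoc script I have in mind would just count the fields in each record and report the offenders. A rough Python sketch is below; it assumes a plain comma delimiter with no quoted fields containing embedded commas or newlines, and the file name is a placeholder.

# Rough sketch: report records whose field count doesn't match the schema.
# Assumes an unquoted comma delimiter; "data.csv" is a placeholder name.
EXPECTED_FIELDS = 850  # one per schema column

with open("data.csv", "r", encoding="utf-8", errors="replace") as src:
    for line_no, line in enumerate(src, start=1):
        field_count = line.rstrip("\r\n").count(",") + 1
        if field_count != EXPECTED_FIELDS:
            print(f"line {line_no}: {field_count} fields")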

Thanks for any insights on this!

Re: Validating large ~600GB CSV against schema?

Posted: Tue Mar 12, 2019 6:58 am
by dfhtextpipe
What exactly does a schema mean in the context of a CSV file?

Presumably, it's to validate the syntax of each field in the data?

For comparison, I'm only familiar with using a schema in the context of validating XML files.

David
TextPipe user since 2001

Re: Validating large ~600GB CSV against schema?

Posted: Thu Mar 14, 2019 1:17 pm
by DataMystic Support
Hi there,

One approach you could use with the Redshift import is to remove the lines that do not have 850 fields in them, and then treat those exceptions differently.

The file size of 600GB is not an issue for TextPipe.

Use Filter Library\Extract\Extract lines not matching (inverse grep)

With an EasyPattern of:

[ 849 (CSVfield, comma), CSVfield, cr, lf ]

What is left after this process is your dud data.
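
If you'd rather do the same split in a script instead of TextPipe, a rough Python equivalent looks like the sketch below. Same caveats as any simple field count: it assumes a plain, unquoted comma delimiter, and the file names are placeholders.

# Illustration only: split records into those with exactly 850 fields and
# everything else (the dud data). File names are placeholders.
EXPECTED_FIELDS = 850

with open("data.csv", "r", encoding="utf-8", errors="replace") as src, \
     open("good.csv", "w", encoding="utf-8") as good, \
     open("dud.csv", "w", encoding="utf-8") as dud:
    for line in src:
        if line.rstrip("\r\n").count(",") + 1 == EXPECTED_FIELDS:
            good.write(line)
        else:
            dud.write(line)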

To test this on a smaller file fragment, first use the filter
Filter Library\Remove\Start or end of file

and set it to include the first 2,000,000 bytes of the file.
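
If you prefer to prepare that test fragment outside TextPipe, copying the first 2,000,000 bytes with a small script does the same job (the file names below are placeholders):

# Take the first 2,000,000 bytes of the big file as a test fragment.
with open("data.csv", "rb") as src, open("sample.csv", "wb") as out:
    out.write(src.read(2_000_000))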


Simon