Validating large ~600GB CSV against schema?
Posted: Sun Mar 10, 2019 7:07 pm
Hello! I used TextPipe many years ago, and now a new potential use case has come up.
Would I be able to use TextPipe to validate a ~600GB CSV file against its schema? I'm trying to ingest the data into AWS Redshift, but the data set (over 280,000,000 records) has quite a few records that apparently don't match the given schema, and the COPY (ingest into Redshift) fails with "separator not found", even after I maxed out the COPY's MAXERROR parameter. The schema defines about 850 fields, most of which are VARCHARs with a handful of INTEGER and DOUBLE PRECISION fields.
If I can accomplish this more easily with TextPipe than by writing an ad-hoc script for the purpose, I'd rather save the time and use TextPipe!
Thanks for any insights on this!
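For context, this is roughly the kind of ad-hoc check I'm hoping to avoid writing myself: a minimal sketch that streams through the file and flags records whose field count doesn't match the schema (the file name, delimiter, and bad-row output path below are just placeholders, not my actual setup):

    import csv

    # Assumed values -- placeholders, not the real file or schema.
    INPUT_FILE = "bigdata.csv"        # the ~600GB source file
    BAD_ROWS_FILE = "bad_rows.csv"    # records that fail the field-count check
    DELIMITER = ","                   # whatever separator the Redshift COPY expects
    EXPECTED_FIELDS = 850             # one per column in the schema

    bad = 0
    with open(INPUT_FILE, newline="", encoding="utf-8", errors="replace") as src, \
         open(BAD_ROWS_FILE, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter=DELIMITER)
        writer = csv.writer(dst, delimiter=DELIMITER)
        for lineno, row in enumerate(reader, start=1):
            # A record with the wrong number of fields is what seems to trigger
            # "separator not found" on the Redshift side.
            if len(row) != EXPECTED_FIELDS:
                bad += 1
                writer.writerow([lineno] + row)

    print(f"{bad} records did not match the expected {EXPECTED_FIELDS} fields")

Streaming row by row keeps memory flat even at 600GB, and csv.reader handles quoted fields that contain the delimiter, which a naive split would miscount. Type checks on the INTEGER and DOUBLE PRECISION columns would still have to be layered on top of this.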