chaps 3y ago · 0 points

Man, I've been down this path for a long while. It gets tough! Flattening CSVs with hierarchical headers (as in, headers that apply a category to a second row of headers) is tough.

The ways CSVs can fail are just fucking nuts. Especially when they're half hand-written, half automated, or when a failure is 20M rows in. It's hard to have speed and strong checks simultaneously.
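For anyone who hasn't hit the hierarchical-header case described above, here is a minimal sketch of one way to flatten it, using only Python's standard library. It assumes exactly two header rows, that a blank cell in the top row inherits the category to its left, and that `.` is an acceptable separator. The function names and the sample data are made up for illustration.

```python
import csv
import io


def flatten_headers(top_row, sub_row, sep="."):
    """Combine a category row and a sub-header row into one flat header list."""
    flat = []
    current = ""
    for top, sub in zip(top_row, sub_row):
        top, sub = top.strip(), sub.strip()
        if top:
            current = top  # assumption: a blank cell inherits the category to its left
        if current and sub:
            flat.append(f"{current}{sep}{sub}")
        else:
            flat.append(current or sub)
    return flat


def read_flattened(text):
    """Parse CSV text whose first two rows are hierarchical headers."""
    reader = csv.reader(io.StringIO(text))
    header = flatten_headers(next(reader), next(reader))
    return [dict(zip(header, row)) for row in reader]


# Hypothetical sample: "name" spans two columns, "address" spans two columns.
sample = "id,name,,address,\n,first,last,city,zip\n1,Ada,Lovelace,London,N1\n"
rows = read_flattened(sample)
```

Real files tend to break the "exactly two rows, blanks inherit leftward" assumption, which is a big part of why this gets tough.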

4 comments · 1 top-level

yosai 3y ago · 3 in thread

Yes, you are right. In YoBulk we flatten the CSV to a JSON schema, store it in a document DB, and do all the validations. Chunking the CSV and analysing the stream buffers for validation also gives us speed.
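To make the chunking idea concrete, here is a rough sketch of validating a CSV stream a fixed number of rows at a time. This is not YoBulk's code. The two checks (field count and an integer first column) and every name in it are placeholder assumptions. It yields problems as it finds them, so a bad row deep in the file is reported without loading the whole file into memory.

```python
import csv
import io
from itertools import islice


def validate_row(row, expected_width):
    """Return a list of problems found in one row (empty list means OK)."""
    problems = []
    if len(row) != expected_width:
        problems.append(f"expected {expected_width} fields, got {len(row)}")
    # Assumed rule for illustration only: the first column must be an integer id.
    if row and not row[0].strip().isdigit():
        problems.append(f"first field is not an integer: {row[0]!r}")
    return problems


def validate_in_chunks(stream, chunk_size=1000):
    """Yield (row_number, problem) pairs, reading chunk_size rows at a time."""
    reader = csv.reader(stream)
    header = next(reader)
    row_number = 1  # the header counts as row 1
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        for row in chunk:
            row_number += 1
            for problem in validate_row(row, len(header)):
                yield row_number, problem


# Hypothetical sample with a bad id on row 3 and a short row on row 4.
sample = "id,name\n1,Ada\nx,Bob\n3\n"
errors = list(validate_in_chunks(io.StringIO(sample), chunk_size=2))
```

Each chunk is independent once the header is known, so chunks could be handed to worker processes if the per-row checks are the bottleneck.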

chaps (OP) 3y ago

Have you been able to get something that might match a relational database? Auto-generation of a relational schema from a large dataset, or multiple datasets, is a deeply interesting idea.

anonymouse008 3y ago

Would you really pay for this? I made one for a client to analyze some X-million-line CSVs by sharding the records, then computing across 500 Lambda instances to arrive at a schema.
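A toy version of that shard-then-merge approach, to show its shape: infer a column-to-type schema per shard, then merge the per-shard schemas. This is only a sketch under assumptions and not the client implementation mentioned above. The four-type lattice, the function names, and the in-memory shards are all invented, and nothing here touches Lambda.

```python
from functools import reduce

# Ordered from narrowest to widest; merging keeps the wider of two types.
TYPE_ORDER = ["empty", "int", "float", "str"]


def infer_type(value):
    """Classify one raw CSV field as empty, int, float, or str."""
    value = value.strip()
    if value == "":
        return "empty"
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "str"


def widen(a, b):
    """Return whichever type sits later in TYPE_ORDER."""
    return max(a, b, key=TYPE_ORDER.index)


def shard_schema(header, rows):
    """Map step: infer a column -> type schema from a single shard."""
    schema = {name: "empty" for name in header}
    for row in rows:
        for name, value in zip(header, row):
            schema[name] = widen(schema[name], infer_type(value))
    return schema


def merge_schemas(left, right):
    """Reduce step: associative and commutative, so merge order does not matter."""
    return {name: widen(left[name], right[name]) for name in left}


# Hypothetical shards; in a distributed setup each would run on its own worker.
header = ["id", "price", "note"]
shards = [[["1", "2", ""]], [["2", "3.5", "ok"]]]
schema = reduce(merge_schemas, (shard_schema(header, s) for s in shards))
```

Because the merge is associative and commutative, the workers never need to coordinate. Only the small per-shard schemas have to be collected at the end.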

yosai 3y ago

Oh yes, you are spot on. It's in our upcoming release. Stay tuned, please.