Man, been down this path for a long while. It gets tough! Flattening CSVs with hierarchical headers (as in, headers that apply a category to a second row of headers) is tough.
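For anyone who hasn't hit this, here is a minimal sketch of the two-row header case in Python. It assumes a blank cell in the category row means "same category as the cell to its left", and the function names, the `.` separator, and the sample data are all made up for illustration. Real files break these assumptions in plenty of ways.

```python
import csv
import io

def flatten_headers(top_row, sub_row, sep="."):
    """Combine a category row and a column row into single flat names."""
    flat = []
    current = ""
    for top, sub in zip(top_row, sub_row):
        top = top.strip()
        sub = sub.strip()
        if top:
            # Assumption: a blank category cell continues the one to its left.
            current = top
        if current and sub:
            flat.append(f"{current}{sep}{sub}")
        else:
            # Only one of the two rows has a label for this column.
            flat.append(sub or current)
    return flat

def read_flattened(text):
    """Read CSV text whose first two rows form a hierarchical header."""
    rows = csv.reader(io.StringIO(text))
    names = flatten_headers(next(rows), next(rows))
    return [dict(zip(names, row)) for row in rows]

# Hypothetical sample: "address" spans two columns, "contact" spans two.
sample = (
    "id,address,,contact,\n"
    ",city,zip,email,phone\n"
    "1,Oslo,0150,a@example.com,555-0100\n"
)
records = read_flattened(sample)
```

With the sample above, the columns come out as `id`, `address.city`, `address.zip`, `contact.email`, and `contact.phone`.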
The ways a CSV can fail are just fucking nuts. Especially when they're half hand-written, half automated, or where a failure is 20M rows in. It's hard to have speed and strong checks simultaneously.
Yes, you are right. In YoBulk we flatten the CSV to a JSON schema, store it in a document DB, and do all the validations. Chunking the CSV and analysing the stream buffers for validation also gives us speed.
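To make the chunking idea concrete, here is a rough Python sketch of validating a CSV stream one chunk at a time. This is not YoBulk's code: the rules in `validate_row`, the column names, and the chunk size are assumptions picked for illustration. The point is only that memory use stays bounded by the chunk size and errors surface as each chunk finishes, rather than after the whole file has been read.

```python
import csv
import io
from itertools import islice

def validate_row(row, row_no):
    """Hypothetical rules: 'id' must be all digits, 'email' must contain '@'."""
    errors = []
    if not (row.get("id") or "").isdigit():
        errors.append((row_no, "id", "not an integer"))
    if "@" not in (row.get("email") or ""):
        errors.append((row_no, "email", "missing @"))
    return errors

def validate_in_chunks(stream, chunk_size=1000):
    """Yield one list of errors per chunk of data rows read from the stream."""
    reader = csv.DictReader(stream)
    row_no = 0  # 1-based count of data rows, header excluded
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        chunk_errors = []
        for row in chunk:
            row_no += 1
            chunk_errors.extend(validate_row(row, row_no))
        yield chunk_errors

# Hypothetical sample with one bad id and one bad email.
sample = "id,email\n1,a@example.com\nx,b@example.com\n3,nope\n"
errors = [
    err
    for chunk_errors in validate_in_chunks(io.StringIO(sample), chunk_size=2)
    for err in chunk_errors
]
```

Because `validate_in_chunks` is a generator, a caller can stop at the first chunk that reports errors instead of reading all 20M rows.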
Have you been able to get something that might match a relational database? Auto-generation of a relational schema from a large dataset, or multiple datasets, is a deeply interesting idea.
Would you really pay for this? I made one for a client, analyzing some X million line CSVs by sharding the records, then computing across 500 lambda instances to arrive at a schema.
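The shard-then-merge approach can be sketched in a few lines of Python. This is a guess at the general shape, not the client implementation: the four-level type ladder, the function names, and the sample shards are all assumptions, and the fan-out across workers is replaced here by a plain `map`. The property that matters is that merging is associative and commutative, so shards can be processed in any order and combined at the end.

```python
from functools import reduce

# Assumed type ladder, narrowest to widest.
ORDER = ["empty", "int", "float", "str"]

def infer_type(value):
    """Classify one raw CSV cell using Python's own int/float parsing."""
    if value == "":
        return "empty"
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "str"

def widen(a, b):
    """Return whichever of two types sits higher on the ladder."""
    return max(a, b, key=ORDER.index)

def infer_shard(rows):
    """Infer a column -> type mapping from one shard of dict rows."""
    schema = {}
    for row in rows:
        for col, value in row.items():
            schema[col] = widen(schema.get(col, "empty"), infer_type(value))
    return schema

def merge_schemas(a, b):
    """Combine two shard schemas, keeping the wider type for each column."""
    return {
        col: widen(a.get(col, "empty"), b.get(col, "empty"))
        for col in a.keys() | b.keys()
    }

# Hypothetical shards; in a real job each would be handled by its own worker.
shards = [
    [{"id": "1", "price": "3"}],
    [{"id": "2", "price": "4.5"}],
    [{"id": "3", "price": "", "note": "x"}],
]
schema = reduce(merge_schemas, map(infer_shard, shards), {})
```

Here `price` ends up as `float` because one shard saw `4.5`, even though another shard only saw an integer and a third saw an empty cell.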