Me too, for better or for worse.
As for the issues, there are many. A few, quickly:
* A data provider has an FTP server; most files are automatically generated, but some are hand-named (with inconsistencies). How do you handle, without a lot of effort, a list of exceptions alongside the regular files? (First sketch below.)
* A data provider has a good, strict XML schema, but the relevant information for a single item is spread across three files inside a tar archive. Since there are 500k files inside the archive, you'd best not extract it but process it on the fly. (Second sketch below.)
* A data provider chooses a layout that stores every item in its own XML file, nested 2-3 levels deep in directories. There are 20M of them. Unpacking the archive alone takes more than a day with default system settings and the usual tools. How do you process these things fast? (Third sketch below.)
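For the FTP case, one low-effort approach is to keep the hand-named oddballs in a small, explicit exception map next to a pattern for the generated files. A minimal Python sketch; the filename pattern and the entries in the map are made up for illustration, not any provider's actual conventions:

```python
import re

# Both the pattern and the exception map are assumptions for illustration.
REGULAR = re.compile(r"^report_\d{8}\.xml$")           # auto-generated files
EXCEPTIONS = {                                          # hand-named oddballs
    "report-2021_03_05.xml":    "report_20210305.xml",
    "Report 2021-06-01 v2.xml": "report_20210601.xml",
}

def canonical_name(name: str) -> str | None:
    """Map an FTP listing entry to its canonical name, or None if unknown."""
    if REGULAR.match(name):
        return name
    return EXCEPTIONS.get(name)    # unknown files surface as None for review
```

The point is that the exception list lives in version control and grows one line per oddball, instead of leaking special cases into the parsing code.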
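For the tar case, Python's tarfile can iterate an archive in streaming mode, so the 500k members are read in one pass without extraction. A sketch, assuming the three files of one item share a common stem (the naming scheme is hypothetical):

```python
import tarfile
from collections import defaultdict

def items(path):
    """Yield (stem, {suffix: bytes}) once all three files of an item are seen."""
    pending = defaultdict(dict)
    with tarfile.open(path, mode="r|*") as tar:    # "r|": streaming, single pass
        for member in tar:
            if not member.isfile():
                continue
            # assumed layout: "12345.meta.xml", "12345.body.xml", "12345.refs.xml"
            stem, _, suffix = member.name.partition(".")
            pending[stem][suffix] = tar.extractfile(member).read()
            if len(pending[stem]) == 3:            # item complete
                yield stem, pending.pop(stem)      # emit and free the memory
```

In streaming mode each member must be read before advancing to the next; random access would force a fresh decompression pass per lookup.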
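For the 20M-files case, the trick is to never materialize the files on disk at all: a zip archive supports random access via its central directory, so worker processes can read members straight out of the archive. A sketch; the archive name, the worker count, and the field pulled from each document are assumptions:

```python
import zipfile
import xml.etree.ElementTree as ET
from concurrent.futures import ProcessPoolExecutor

ARCHIVE = "items.zip"   # hypothetical archive name

def process(names):
    # Each worker opens its own handle; ZipFile objects don't cross processes.
    results = []
    with zipfile.ZipFile(ARCHIVE) as z:
        for name in names:
            root = ET.fromstring(z.read(name))    # decompress one member in memory
            results.append(root.findtext("id"))   # "id" is a made-up field
    return results

def main():
    with zipfile.ZipFile(ARCHIVE) as z:
        names = [n for n in z.namelist() if n.endswith(".xml")]
    chunks = [names[i::8] for i in range(8)]      # round-robin split
    with ProcessPoolExecutor(max_workers=8) as pool:
        merged = [r for chunk in pool.map(process, chunks) for r in chunk]
    print(len(merged), "items processed")

if __name__ == "__main__":
    main()
```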
There are more subtle issues as well:
* U+FFFD, the Unicode replacement character, regularly occurs in natural-language strings. Can you correct these strings? (Sketch after the list.)
* A file has a .csv ending and looks like CSV at first glance, but all the standard RFC-compliant parsers choke on it. (Sketch after the list.)
* XML files whose elements have RTF embedded in them. You need to parse the RTF inside the elements, because it contains relevant information that has to make it into the transformed version. (Sketch after the list.)
* Date issues: inconsistent formats and almost-valid dates. (Sketch after the list.)
* Combine data coming from an API with data fetched from ten different servers to produce a transformed version with a legacy command-line application (which might be slow, so you have to split your data first, parallelize the work, recombine it, and make sure it's complete). (Sketch after the list.)
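On the U+FFFD point: once the replacement character is in the string, the original bytes are gone, so the repair has to happen one step earlier, against the raw bytes. A sketch that retries a few candidate encodings; the candidate list is an assumption about the data:

```python
CANDIDATES = ("utf-8", "cp1252", "latin-1")   # assumed; latin-1 never fails

def best_decoding(raw: bytes) -> str:
    """Return the first candidate decoding that needed no replacements."""
    for enc in CANDIDATES:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue
        if "\ufffd" not in text:              # no characters were replaced
            return text
    return raw.decode("utf-8", errors="replace")  # last resort: accept the loss
```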
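On the almost-CSV point: rather than hand-rolling a parser, it is often enough to sniff the actual dialect from a sample and relax the quoting rules. A sketch; the delimiter candidates and the escape character are assumptions:

```python
import csv

def tolerant_rows(path):
    """Yield rows from a file that only pretends to be RFC-compliant CSV."""
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        sample = f.read(64 * 1024)
        f.seek(0)
        try:
            dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
        except csv.Error:
            dialect = csv.excel                # sniffing failed, use defaults
        # Unbalanced quotes are a common culprit; QUOTE_NONE sidesteps them.
        yield from csv.reader(f, dialect, quoting=csv.QUOTE_NONE, escapechar="\\")
```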
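On the RTF-inside-XML point: one workable combination is ElementTree for the XML and the third-party striprtf package for the embedded RTF. The element name is a placeholder:

```python
import xml.etree.ElementTree as ET
from striprtf.striprtf import rtf_to_text    # third-party: pip install striprtf

def plain_texts(path, tag="description"):    # "description" is a placeholder tag
    """Yield the text of matching elements, with embedded RTF converted."""
    for el in ET.parse(path).iter(tag):
        text = el.text or ""
        if text.lstrip().startswith("{\\rtf"):   # looks like an RTF document
            text = rtf_to_text(text)
        yield text
```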
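On dates: a format list handles the inconsistency; the almost-valid ones (think "2021-02-30") need an explicit policy. The format set and the clamping policy below are assumptions, not recommendations:

```python
from datetime import datetime

FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y", "%Y%m%d")   # assumed set

def parse_date(s: str):
    """Parse a date string; returns a datetime or None."""
    s = s.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    # Almost-valid dates: clamp the day into the month. Whether silently
    # repairing (instead of rejecting) is acceptable depends on the data.
    parts = s.replace(".", "-").replace("/", "-").split("-")
    if len(parts) == 3 and all(p.isdigit() for p in parts):
        y, m, d = (int(p) for p in parts)      # assumes year-first ordering
        if 1 <= m <= 12 and d >= 1:
            return datetime(y, m, min(d, 28))  # crude but explicit clamp
    return None
```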
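On the last point: threads are enough to parallelize an external program, since the actual work happens in the child processes. A sketch; the CLI name, its argument order, and the completeness check are all assumptions about the legacy tool:

```python
import subprocess
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

LEGACY = "transform"   # hypothetical legacy command-line tool

def run_chunk(chunk: Path) -> Path:
    out = chunk.with_suffix(".out")
    subprocess.run([LEGACY, str(chunk), str(out)], check=True)  # assumed CLI args
    return out

def transform_all(chunks: list[Path], merged: Path) -> None:
    with ThreadPoolExecutor(max_workers=8) as pool:   # children do the real work
        outputs = list(pool.map(run_chunk, chunks))
    # Completeness check before merging: every chunk produced a non-empty file.
    bad = [o for o in outputs if not o.exists() or o.stat().st_size == 0]
    if bad:
        raise RuntimeError(f"incomplete outputs: {bad}")
    with merged.open("wb") as dst:
        for o in outputs:                             # merge in input order
            dst.write(o.read_bytes())
```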
I am thinking about a longer article or even a short book on these kinds of data-handling and data-quality questions and the ways to address them. Would you read a book like this, and which topic would be the most pressing or relevant for you?