Me too, for better or for worse.
As for the issues, there are many. A few, quickly:
* A data provider has an FTP server; most files are automatically generated, but some are hand-named (with inconsistencies). How do you handle, without a lot of effort, a list of exceptions alongside the regular files? (First sketch below.)
* A data provider has a good, strict XML schema, but the relevant information for a single item is spread across three files inside a tar archive. Since there are 500k files inside the archive, you'd best not extract it but process it on the fly. (Second sketch below.)
* A data provider chooses a layout that stores every item in its own XML file, nested 2-3 levels deep in directories. There are 20M of them. Unpacking the archive alone takes more than a day with default system settings and the usual tools. How do you process these things fast? (Third sketch below.)
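For the FTP case, one low-effort approach is to keep the hand-named oddballs in a small, explicit exception map next to a pattern for the generated files. A minimal Python sketch; the filename pattern and the entries in the map are made up for illustration, not any provider's actual conventions:

```python
import re

# Both the pattern and the exception map are assumptions for illustration.
REGULAR = re.compile(r"^report_\d{8}\.xml$")           # auto-generated files
EXCEPTIONS = {                                          # hand-named oddballs
    "report-2021_03_05.xml":    "report_20210305.xml",
    "Report 2021-06-01 v2.xml": "report_20210601.xml",
}

def canonical_name(name: str) -> str | None:
    """Map an FTP listing entry to its canonical name, or None if unknown."""
    if REGULAR.match(name):
        return name
    return EXCEPTIONS.get(name)    # unknown files surface as None for review
```

The point is that the exception list lives in version control and grows one line per oddball, instead of leaking special cases into the parsing code.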
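For the tar case, Python's tarfile can iterate an archive in streaming mode, so the 500k members are read in one pass without extraction. A sketch, assuming the three files of one item share a common stem (the naming scheme is hypothetical):

```python
import tarfile
from collections import defaultdict

def items(path):
    """Yield (stem, {suffix: bytes}) once all three files of an item are seen."""
    pending = defaultdict(dict)
    with tarfile.open(path, mode="r|*") as tar:    # "r|": streaming, single pass
        for member in tar:
            if not member.isfile():
                continue
            # assumed layout: "12345.meta.xml", "12345.body.xml", "12345.refs.xml"
            stem, _, suffix = member.name.partition(".")
            pending[stem][suffix] = tar.extractfile(member).read()
            if len(pending[stem]) == 3:            # item complete
                yield stem, pending.pop(stem)      # emit and free the memory
```

In streaming mode each member must be read before advancing to the next; random access would force a fresh decompression pass per lookup.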
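For the 20M-files case, the trick is to never materialize the files on disk at all: a zip archive supports random access via its central directory, so worker processes can read members straight out of the archive. A sketch; the archive name, the worker count, and the field pulled from each document are assumptions:

```python
import zipfile
import xml.etree.ElementTree as ET
from concurrent.futures import ProcessPoolExecutor

ARCHIVE = "items.zip"   # hypothetical archive name

def process(names):
    # Each worker opens its own handle; ZipFile objects don't cross processes.
    results = []
    with zipfile.ZipFile(ARCHIVE) as z:
        for name in names:
            root = ET.fromstring(z.read(name))    # decompress one member in memory
            results.append(root.findtext("id"))   # "id" is a made-up field
    return results

def main():
    with zipfile.ZipFile(ARCHIVE) as z:
        names = [n for n in z.namelist() if n.endswith(".xml")]
    chunks = [names[i::8] for i in range(8)]      # round-robin split
    with ProcessPoolExecutor(max_workers=8) as pool:
        merged = [r for chunk in pool.map(process, chunks) for r in chunk]
    print(len(merged), "items processed")

if __name__ == "__main__":
    main()
```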
There are more subtle issues as well:
* U+FFFD, the Unicode replacement character, regularly occurs in natural-language strings. Can you correct these strings? (Sketch after the list.)
* A file has a .csv ending and looks like CSV at first glance, but all the standard RFC-compliant parsers choke on it. (Sketch after the list.)
* XML files whose elements have RTF embedded in them. You need to parse the RTF inside the elements, because it contains relevant information that has to make it into the transformed version. (Sketch after the list.)
* Date issues: inconsistent formats and almost-valid dates. (Sketch after the list.)
* Combine data coming from an API with data fetched from ten different servers to produce a transformed version with a legacy command-line application (which might be slow, so you have to split your data first, parallelize the work, recombine it, and make sure it's complete). (Sketch after the list.)
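On the U+FFFD point: once the replacement character is in the string, the original bytes are gone, so the repair has to happen one step earlier, against the raw bytes. A sketch that retries a few candidate encodings; the candidate list is an assumption about the data:

```python
CANDIDATES = ("utf-8", "cp1252", "latin-1")   # assumed; latin-1 never fails

def best_decoding(raw: bytes) -> str:
    """Return the first candidate decoding that needed no replacements."""
    for enc in CANDIDATES:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue
        if "\ufffd" not in text:              # no characters were replaced
            return text
    return raw.decode("utf-8", errors="replace")  # last resort: accept the loss
```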
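On the almost-CSV point: rather than hand-rolling a parser, it is often enough to sniff the actual dialect from a sample and relax the quoting rules. A sketch; the delimiter candidates and the escape character are assumptions:

```python
import csv

def tolerant_rows(path):
    """Yield rows from a file that only pretends to be RFC-compliant CSV."""
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        sample = f.read(64 * 1024)
        f.seek(0)
        try:
            dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
        except csv.Error:
            dialect = csv.excel                # sniffing failed, use defaults
        # Unbalanced quotes are a common culprit; QUOTE_NONE sidesteps them.
        yield from csv.reader(f, dialect, quoting=csv.QUOTE_NONE, escapechar="\\")
```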
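On the RTF-inside-XML point: one workable combination is ElementTree for the XML and the third-party striprtf package for the embedded RTF. The element name is a placeholder:

```python
import xml.etree.ElementTree as ET
from striprtf.striprtf import rtf_to_text    # third-party: pip install striprtf

def plain_texts(path, tag="description"):    # "description" is a placeholder tag
    """Yield the text of matching elements, with embedded RTF converted."""
    for el in ET.parse(path).iter(tag):
        text = el.text or ""
        if text.lstrip().startswith("{\\rtf"):   # looks like an RTF document
            text = rtf_to_text(text)
        yield text
```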
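On dates: a format list handles the inconsistency; the almost-valid ones (think "2021-02-30") need an explicit policy. The format set and the clamping policy below are assumptions, not recommendations:

```python
from datetime import datetime

FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y", "%Y%m%d")   # assumed set

def parse_date(s: str):
    """Parse a date string; returns a datetime or None."""
    s = s.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    # Almost-valid dates: clamp the day into the month. Whether silently
    # repairing (instead of rejecting) is acceptable depends on the data.
    parts = s.replace(".", "-").replace("/", "-").split("-")
    if len(parts) == 3 and all(p.isdigit() for p in parts):
        y, m, d = (int(p) for p in parts)      # assumes year-first ordering
        if 1 <= m <= 12 and d >= 1:
            return datetime(y, m, min(d, 28))  # crude but explicit clamp
    return None
```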
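On the last point: threads are enough to parallelize an external program, since the actual work happens in the child processes. A sketch; the CLI name, its argument order, and the completeness check are all assumptions about the legacy tool:

```python
import subprocess
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

LEGACY = "transform"   # hypothetical legacy command-line tool

def run_chunk(chunk: Path) -> Path:
    out = chunk.with_suffix(".out")
    subprocess.run([LEGACY, str(chunk), str(out)], check=True)  # assumed CLI args
    return out

def transform_all(chunks: list[Path], merged: Path) -> None:
    with ThreadPoolExecutor(max_workers=8) as pool:   # children do the real work
        outputs = list(pool.map(run_chunk, chunks))
    # Completeness check before merging: every chunk produced a non-empty file.
    bad = [o for o in outputs if not o.exists() or o.stat().st_size == 0]
    if bad:
        raise RuntimeError(f"incomplete outputs: {bad}")
    with merged.open("wb") as dst:
        for o in outputs:                             # merge in input order
            dst.write(o.read_bytes())
```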
I am thinking about a longer article or even a short book on these kinds of data-handling and data-quality questions and the ways to address them. Would you read a book like this, and which topic would be the most pressing or relevant for you?