```
df = Explorer.DataFrame.from_csv("my_file.txt", delimiter: "|", infer_schema_length: nil)

{:error,
 {:polars,
  "Could not parse `OTHER` as dtype Int64 at column 3.\nThe current offset in the file is 4447442 bytes.\n\nConsider specifying the correct dtype, increasing\nthe number of records used to infer the schema,\nrunning the parser with `ignore_parser_errors=true`\nor adding `OTHER` to the `null_values` list."}}
```
Note: I added `infer_schema_length: nil` assuming that data type discovery via sampling was just less good in `polars`, since that option makes it read the whole file before determining types, but it still failed.
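The error message itself points at the two workable fixes: declare the column's dtype up front, or add `OTHER` to the null list. Here's a minimal sketch of that idea using only Python's stdlib `csv` module (not the thread's Elixir/polars API; the sample data and `parse_code` helper are hypothetical, mirroring a pipe-delimited file whose third column is mostly integers but occasionally contains the literal `OTHER`):

```python
import csv
import io

# Hypothetical sample mimicking the failing file: column 3 ("code")
# holds integers except for the odd "OTHER" marker.
raw = "id|name|code\n1|alpha|10\n2|beta|OTHER\n3|gamma|30\n"

def parse_code(value):
    # Explicit per-column rule instead of relying on type inference:
    # treat non-numeric markers like "OTHER" as missing (None).
    return int(value) if value.isdigit() else None

rows = []
for rec in csv.DictReader(io.StringIO(raw), delimiter="|"):
    rec["code"] = parse_code(rec["code"])
    rows.append(rec)

# "OTHER" became a missing value instead of a hard parse failure.
```

The point is the same as polars' `null_values` suggestion: an explicit rule per column never depends on how many rows the sampler happened to look at.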
The best way to do exactly what you're saying is just use R and do:
```
data.table::fread('my file.txt') |> arrow::write_parquet('new_file.parquet')
```
That will do the exact same thing: sanitize the file, parse and transform the data correctly, log questionable lines, and output a binary file that other systems can use later.
When you're working with thousands of files and hundreds of millions of lines every day, your client will be rightfully pissed if their data is off by $100,000, and my only recourse is to wait two weeks for someone in IT upstream on their end to _maybe_ fix the file, hopefully without introducing a new error...
Writing my own delimited-file parser instead of leaning on a huge amount of community effort sounds like the worst case of not-invented-here syndrome ever. What stinks is how willing most of those projects are to fail silently.
The thing that would prevent such issues is validation of the data you accept.
In the case of text delimited files, it is simply too easy and too common to generate, from the start, a malformed file that other systems cannot read. Because data loss is inherent in a text-based format, folks don't even bother to check if the files they generate can be successfully interpreted by their own system. PostgreSQL, Oracle, and MS SQL will all gladly produce CSV files that cannot be read back successfully. I'm not talking about some loss of metadata, I'm talking cannot be read.
In the "real world", of course I run validations on the data I accept. A common one for me, since the files are essentially "append only" across updates, is to check for meaningfully fewer records than in previous data loads. That's my best way of detecting that records were dropped or lost when the file was read, because of things like mangled quoting or an incomplete file transfer.
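That row-count check is cheap to automate. A minimal sketch (the `validate_row_count` name and the 1% tolerance are my own illustrative choices, not anything from the thread):

```python
def validate_row_count(current: int, previous: int, tolerance: float = 0.01) -> None:
    """For an append-only feed, a new load should never be meaningfully
    smaller than the last one; a big drop usually means dropped records
    (broken quoting, truncated transfer), not real deletions."""
    if current < previous * (1 - tolerance):
        raise ValueError(
            f"row count fell from {previous} to {current}; "
            "possible truncated file or quoting error"
        )

validate_row_count(1_000_100, 1_000_000)   # fine: the feed grew
# validate_row_count(900_000, 1_000_000)   # would raise ValueError
```

It won't tell you *which* records were lost, but it turns a silent $100,000 discrepancy into a loud failure at load time.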
It's still not great that a mismatched quote, which is quite common, doesn't even trigger a warning in the validation methods of these parsers.
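Even without parser support, a mismatched quote usually leaves a visible symptom: it glues fields and lines together, so some row ends up with the wrong number of fields. A stdlib sketch of that check (the sample data and `check_field_counts` helper are hypothetical):

```python
import csv
import io

# The quote before the 2 is never closed, so the rest of the file
# gets swallowed into one field of one row.
raw = 'a,b,c\n1,"2,3\n4,5,6\n'

def check_field_counts(text, delimiter=","):
    """Return 1-based indices of rows whose field count differs
    from the header row -- a cheap tripwire for quoting damage."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    expected = len(rows[0])
    return [i for i, row in enumerate(rows, start=1) if len(row) != expected]

# The mangled row shows up in the returned list instead of being
# silently loaded with merged fields.
```

It's a weak invariant (a quoting error that happens to preserve the column count slips through), but it catches the common truncation-style failures at parse time.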
JSON is much easier to validate and transfers data about as easily as CSV. The overhead can be minimal as well if the data is stored as arrays (an array of arrays rather than an array of objects).
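A quick sketch of why that helps (stdlib only; the sample payload is made up): an array-of-arrays payload stays nearly as compact as CSV, missing values are an explicit `null` instead of a parse-time guess, and structural damage like a truncated transfer fails the parse outright instead of silently yielding fewer rows.

```python
import json

# Array-of-arrays: header row first, then data rows. Nulls are explicit.
payload = json.dumps([["id", "name", "code"],
                      [1, "alpha", 10],
                      [2, "beta", None]])

rows = json.loads(payload)
header, data = rows[0], rows[1:]
assert all(len(r) == len(header) for r in data)  # width check is trivial

try:
    json.loads(payload[:-5])  # simulate a truncated transfer
except json.JSONDecodeError:
    print("truncation detected at parse time")
```

Compare that with CSV, where a truncated file is usually still "valid" and just loads short.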