Better HN

0 points · zAy0LfpBZLC8mAC · 12y ago · 0 comments

What if there is #COMMA, in one of the fields (but no #COMMA#)?

Yes, the assumption you have to make is called the grammar, and you had better have a parser that always does what the grammar says. Global text replacement is a technique that is easy to get wrong, difficult to prove correct, and completely unnecessary at that.


4 comments · 1 top-level

lignuist · 12y ago · 3 in thread

> What if there is #COMMA, in one of the fields (but no #COMMA#)?

What should happen? Since #COMMA is not #COMMA#, it does not match and so does not get replaced.

Please keep in mind that I replied to suni's very specific question and did not try to start a discussion about general parser theory. In practice, we find a lot of files that do not respect the grammar, but we still need to find a way to make the data accessible.

zAy0LfpBZLC8mAC (OP) · 12y ago

What would happen is that you would first replace #COMMA, with #COMMA#COMMA# and then later replace that with ,COMMA#, thus garbling the data.
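To make that concrete, here is a minimal sketch of the round trip. It assumes the scheme under discussion is "replace every comma inside a field with #COMMA#, then later turn every #COMMA# back into a comma"; the function names `escape` and `unescape` are made up for the illustration.

```python
PLACEHOLDER = "#COMMA#"

def escape(field: str) -> str:
    # Hide commas inside a field behind the placeholder.
    return field.replace(",", PLACEHOLDER)

def unescape(field: str) -> str:
    # Turn every placeholder back into a comma.
    return field.replace(PLACEHOLDER, ",")

original = "#COMMA,"            # a field that merely resembles the placeholder
escaped = escape(original)      # "#COMMA#COMMA#"
restored = unescape(escaped)    # ",COMMA#"

print(repr(original), "->", repr(escaped), "->", repr(restored))
```

The restore step matches the first seven characters of the escaped string as a placeholder, even though only the trailing `#COMMA#` came from the escape step, so the field does not survive the round trip.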

The way to make the data accessible is to ask for the producer to be fixed; it's that simple. If that is completely impossible, you'll have to figure out the grammar of the data that you actually have and build a parser for that. Your suggested strategy does not work.
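For comparison, a sketch of the parser-based route, assuming the data follows conventional CSV quoting (which, as noted elsewhere in this thread, real files often do not). It uses Python's standard `csv` module; the sample row is invented for the illustration.

```python
import csv
import io

# A row whose middle field contains a comma and placeholder-like text.
row = ["1", "#COMMA, and #COMMA# both appear here", "x"]

# Write the row; the writer quotes the field that contains a comma.
buf = io.StringIO()
csv.writer(buf).writerow(row)
encoded = buf.getvalue()

# Read it back; the reader applies the same quoting rules in reverse.
decoded = next(csv.reader(io.StringIO(encoded)))

print(encoded.strip())
print(decoded == row)
```

No placeholder is involved, so there is nothing in the field contents that can collide with the escaping mechanism.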

dbro · 12y ago

Usually the person parsing the CSV data doesn't have control over the way the data gets written. If he did, he would probably prefer something like protocol buffers. CSV is the lowest common denominator, so it's a useful format for exchanging data between different organizations that are producing and consuming the data.

https://github.com/dbro/csvquote is a small and fast script that can replace ambiguous separators (commas and newlines, for example) inside quoted fields, so that other text tools can work with a simple grammar. After that work is done, the ambiguous commas inside quoted fields get restored. I wrote it to use unix shell tools like cut, awk, ... with CSV files containing millions of records.
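The idea can be sketched in a few lines of Python. This is not csvquote's actual code, only an illustration of the technique; the two substitute characters and the function names `sanitize` and `restore` are assumptions made for the example.

```python
FIELD_SUB = "\x1f"   # assumed stand-in for a comma inside quotes
RECORD_SUB = "\x1e"  # assumed stand-in for a newline inside quotes

def sanitize(text: str) -> str:
    """Replace commas and newlines that occur inside double-quoted fields."""
    out = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            # A doubled quote ("") toggles twice, leaving the state unchanged.
            in_quotes = not in_quotes
            out.append(ch)
        elif in_quotes and ch == ",":
            out.append(FIELD_SUB)
        elif in_quotes and ch == "\n":
            out.append(RECORD_SUB)
        else:
            out.append(ch)
    return "".join(out)

def restore(text: str) -> str:
    """Put the original commas and newlines back."""
    return text.replace(FIELD_SUB, ",").replace(RECORD_SUB, "\n")

data = 'id,note\n1,"a, b\nc"\n2,plain\n'
safe = sanitize(data)

# After sanitizing, naive line and comma splitting sees two fields per record.
records = [line.split(",") for line in safe.split("\n") if line]
print([len(r) for r in records])
print(restore(safe) == data)
```

The round trip is only lossless if the input never contains the substitute characters themselves, which is why nonprinting characters are a safer choice than a printable token.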


lignuist · 12y ago

I used that strategy for parsing gigabytes of CSVs containing arbitrary natural language from the web - try to get these files fixed, or figure out a grammar for gigabytes of fuzzy data...

My approach never failed for me, so saying that my strategy does not work is a strong claim when it has reliably done the job.

Your examples are all valid, but what you are describing are theoretical attacks on the method, while the method works in almost all cases in practice. We are talking about two different viewpoints: dealing with large amounts of messy data on one hand and parser theory in an ideal cosmos on the other hand.

