ASCII Delimited Text – Not CSV or Tab Delimited Text

(ronaldduncan.wordpress.com)

114 points · ejstronge · 1y ago · 117 comments

Apreche1y ago· 13 in thread

The shortcoming of using the control characters is that there is no easy way to type them on a keyboard. You can trivially edit csv in a text editor.

Asooka1y ago

Technically there is (not sure about easy, most of these require ctrl+shift, but they are on the keyboard):

    CTRL   DEC HEX CHR NAME
    Ctrl-\  28  1C  FS File Separator (Right Arrow)
    Ctrl-]  29  1D  GS Group Separator (Left Arrow)
    Ctrl-^  30  1E  RS Record Separator (Up Arrow)
    Ctrl-_  31  1F  US Unit Separator (Down Arrow)

From https://www3.rocketsoftware.com/bluezone/help/v42/en/bzadmin...

You can type them in the terminal by prefixing with Ctrl-V, so you can enter a record separator by pressing Ctrl-V, Ctrl-Shift-6. Typing Ctrl-\ is tricky because some programs interpret it to mean end of input, e.g. it exits the Python repl, but I don't think that one in particular is super important to type manually. In hindsight if these were assigned to Ctrl-<letter>, they would have been a lot easier to type and use.

zabzonk1y ago

ctrl-letters are used for other things

mattpallissard1y ago

It looks like it would just be ctrl+^, which seems pretty straightforward.

dahart1y ago

Straightforward is completely subjective. But a comma is relatively much simpler in an absolute sense.

Ctrl+shift+6 is a 3-key chord, it’s potentially hard to discover (I can’t say I’ve actually ever seen it), it seems likely to be overridden by applications, and caret isn’t a natural separator and is more commonly used for other things, like exponents.

A comma is 1 key on the keyboard, and it’s already a natural separator; the very meaning of comma is separator. Note how many commas are used in this thread compared to the number of record separators. :P

Having to type both ctrl+shift+6 and ctrl+shift+minus a lot might seem like a small physical and mental friction compared to using commas and returns, but paid each time a character is typed, it adds up to a lot of physical and mental friction over time. Enough that the eventual implication is that you need better tooling than a text editor provides in order to author delimited files, enough that it sort of undermines the idea of having a text file. It’s a mistake to think that because a key chord exists that it’s a solved problem, and a mistake to underestimate the value of making commonly used items as simple as possible, especially if it’s going to affect a lot of different people.

smarx0071y ago

ctrl+_ to separate cells, ctrl+^ to separate rows - works perfectly in notepad++.

I think the proposal can be improved by using ctrl+^ followed by a newline as a row separator, it looks much more readable plus will allow various line-based CLI tools to be used unless there are newlines in the cells.

zabzonk1y ago

> no easy way to type them on a keyboard

or probably to see them in the screen in any sensible form.

Also, what to do if you want to embed those ctrl characters in a field? you are likely back to the way that CSV does it with quotes, commas and CRLFs.

velcrovan1y ago

That’s the whole point of having field and record separators as distinct values in ASCII. There is no other valid use for them, so no escaping is necessary. Have you ever used ASCII value 30 for anything, anywhere, in your life?
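For readers who want to see the idea concretely, here is a minimal Python sketch of the scheme being discussed. It assumes the unit separator (ASCII 31) sits between fields and the record separator (ASCII 30) between rows, and it takes the "those characters are simply invalid data" position by rejecting any field that contains one. The names `dump_asv` and `load_asv` are made up for illustration.

```python
US = "\x1f"  # ASCII 31, unit separator: placed between fields
RS = "\x1e"  # ASCII 30, record separator: placed between rows


def dump_asv(rows):
    """Join rows of string fields, refusing fields that contain a delimiter."""
    for row in rows:
        for field in row:
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return RS.join(US.join(row) for row in rows)


def load_asv(text):
    """Split delimited text back into a list of rows of fields."""
    if text == "":
        return []
    return [record.split(US) for record in text.split(RS)]


rows = [["name", "quote"], ["Ada", 'said "hi", then left\non a new line']]
assert load_asv(dump_asv(rows)) == rows  # commas, quotes and newlines need no escaping
```

Commas, quotes and newlines pass through untouched, but the `ValueError` branch is exactly the restriction on field content that other commenters object to.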

edlebert1y ago

And equally as important, you can cat a CSV file and easily understand it.

bediger40001y ago

Or concatenate them, or diff or grep.

velcrovan1y ago

This is a bit silly. Any modern text editor (whether vim or VSCode or BBEdit or Notepad++ or whatever) is capable of displaying control characters and of copy/pasting them. Keyboard shortcuts for inserting any characters whatsoever are easy to add. And even with CSV files, if you’re editing them by hand rather than manipulating them with code, you’re probably doing it wrong.

pvg1y ago

All of this is easy (use the proper editor, configure it for this particular weirdass situation, do something other than the thing you want to do, etc) in a way that’s exactly analogous to ‘you can spin up your own dropbox over the weekend with ftp and rsync’.

jiehong1y ago

Maybe with a hex editor that shows dual pane: hex and text.

NBJack1y ago

I assert that is still not convenient or scalable. You will need to mentally parse each number (two characters) to ensure it is correct. Compare this with a simple glyph (i.e. a single comma) which is easy to eyeball.

Ukv1y ago· 11 in thread

> with no restrictions on the text in fields or the need to try and escape characters.

Maybe I'm missing something, but wouldn't it still need escaping for those ASCII separator characters (or alternatively, a restriction for the stored text not to have them)?

It's true that having to deal with escaping much less often (since the ASCII separator characters are rarer than commas/quotes) would be convenient for manual reading/writing, but I feel that's canceled out by the characters being hard to type/see (likely the reason why they're rare) - and it wouldn't necessarily save on writer/parser code complexity.

missblit1y ago

See this is why I once used moon-viewing-ceremony-separated-values (MVCSV). The Moon Viewing Ceremony emoji was unlikely to show up in my dataset, and not only is the emoji visible, it's quite visually pleasing.

dazzaji1y ago

I’m now free-falling down a moon viewing ceremony rabbit hole of emoji history, and enjoying the ride!

Macha1y ago

Wait until you expand into the Japanese market and all your users are talking about 月見

dec0dedab0de1y ago

Not if you just say those characters are invalid data. I first heard about them decades ago, but I don't think I have ever once seen them in use.

The real problem is that there is no easy universal way to type them with a keyboard. So it would require software interfaces in the application, and at that point it’s basically binary.

Ukv1y ago

> Not if you just say those characters are invalid data

The author's claim is "with no restrictions on the text".

It's easy if you can forbid certain characters, but then you can't store arbitrary text (e.g: filepaths, or scraped comments).

jader2011y ago

I feel like this (and some of the replies to this) is missing the point a bit.

I don’t think the goal was to make a bullet-proof delimiter that fails at nothing.

The goal was to solve the problem of not allowing things like commas, quotes, newlines, tabs, pipes, etc. in text files.

I feel like using the proposed ASCII characters would eliminate these limitations, while also allowing machine creatable and readable format (emphasis on machine as opposed to human).

Yes, it would still be tough for a human to type or read these delimiters, so in that case, go with traditional CSV or TSV (or MVCSV!).

But if you only need to use a machine to create/read the text, this sounds like a great solution, allowing all of the normal characters you might see in text.

Ukv1y ago

If you need a machine-readable format, why not go with escaping like most other formats, or length-before-text, to include all characters - instead of a format that fails on some (albeit rare) characters?

chubot1y ago

Yup exactly, it just pushes the problem around, without solving it.

The delimiters can occur in binary data, or when there's nesting -- trying to store TSV in TSV, or JSON in JSON, etc. The latter definitely happens a lot, for better or worse

smarx0071y ago

The title says ASCII Delimited Text not ASCII Delimited Binary Data.

For the purposes of CSV, I consider text to be anything that satisfies the regex ^\P{Cc}+$ (https://www.compart.com/en/unicode/category/Cc) and I normally strip chars in that category before saving some text (for single-line text). ^[\p{Cc}&&[^\n]]+$ is a regex that can be used to strip all control chars except for the newline.
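The `\p{Cc}` class in those patterns is not supported by Python's built-in `re` module, so a rough Python equivalent has to check the Unicode category directly. This is only a sketch: `strip_control` is a hypothetical helper name, and keeping the newline is the same assumption as in the second regex.

```python
import unicodedata


def strip_control(text, keep="\n"):
    """Drop every character in Unicode category Cc except those listed in keep."""
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )


cleaned = strip_control("a\x1fb\x1ec\nd\te")
assert cleaned == "abc\nde"  # separators and the tab are gone, the newline stays
```

Note that tab is also in category Cc, so it is stripped unless it is added to `keep`.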

chrishill891y ago

You can disallow those metacharacters in the data proper. Then you have a format that can store any utf8 or whatever except the non-whitespace control codes without any escaping. That solves a problem in an opinionated way. Just like how json is opinionated (utf8 only).

You can convert to another format if you need something crazier than rows and columns consisting of normal text.

huem0n1y ago

Thank you for writing that complaint out so I don't have to. It solves nothing.

croes1y ago· 7 in thread

CSV isn‘t that complicated if done right.

1. if a value includes the field separator, row separator or text qualifier, surround the value with the text qualifier.

2. if the value contains the text qualifier double it in the value.
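As an illustration of those two rules, Python's standard `csv` module applies them by default (minimal quoting, with embedded quotes doubled). The sample rows below are invented for the example.

```python
import csv
import io

rows = [
    ["id", "note"],
    ["1", 'He said "hi", then left'],  # contains the separator and the qualifier
    ["2", "line one\nline two"],       # contains a row separator
]

# Writing: fields are quoted only when needed, and embedded quotes are doubled.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
encoded = buf.getvalue()
assert '"He said ""hi"", then left"' in encoded

# Reading: a parser that follows the same two rules recovers the original rows.
decoded = list(csv.reader(io.StringIO(encoded, newline="")))
assert decoded == rows
```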

Macha1y ago

That assumes you control both the producer and consumer. But if you're doing CSV, it's likely because you're looking to integrate with someone else's system. So you have to deal with whatever they're doing that they call CSV. And if "they" are "all your customers", you're going to encounter every weird quirk of different systems' CSV parsing, from that guy who just used

    String.split(",").map(it.replace("\"\"", "\""))

to Spark insisting that backslash escapes exist in CSV.
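To make the failure mode of that split-based approach concrete, here is a small Python comparison against the standard `csv` reader. The input line is invented, and the list comprehension is only an approximation of the one-liner quoted above.

```python
import csv

line = '1,"Stockholm, Sweden","She said ""hej"""'

# Naive approach: split on every comma, then undouble quotes.
naive = [part.replace('""', '"') for part in line.split(",")]

# Quote-aware approach: let a real CSV parser handle the text qualifier.
parsed = next(csv.reader([line]))

assert len(naive) == 4  # the quoted comma was treated as a field boundary
assert parsed == ["1", "Stockholm, Sweden", 'She said "hej"']
```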

croes1y ago

That’s the real problem. Those systems claim to read or write CSV, but in reality it’s a bastard form of CSV.

Never had a problem when both sides know the rules.

But I doubt those other systems even support ASCII separated value files.

nly1y ago

It's easier in most cases to just use a parser flexible enough for you to specify whatever variant the producer actually emitted.

mrweasel1y ago

But it's so rarely done right.

I remember Klarna using ", " as their separator. Not ",". There had to be a space as well, which most CSV parsers can not do. So when giving us a CSV file, with currency, and Swedish kronor used "," as the decimal separator you'd get some fun result. Pretty much every CSV parser we tried would assume that kronor and øre was two separate fields.

croes1y ago

I doubt ASV would be done right more often

ekianjo1y ago

you don't always control how csv files are made. Most of the time you are just given them, and this is when you start pulling your hair. CSV is a terrible, terrible format, because it fails in too many use cases.

croes1y ago

CSV doesn‘t fail, they fail to handle CSV.

And I doubt those who create wrong CSVs can even handle ASVs.

CSV can be read and edited with any text editor.

theandrewbailey1y ago· 5 in thread

I've used these when I've had some code with thousands of strings. I concatenated them with the ASCII separators in the source code, then called String.split as needed. The speedup was noticeable, probably since the runtime choked on instantiating so many strings at one time when launched.

foxglacier1y ago

But really you could have used any other character that wasn't going to appear in your strings, especially a visible one like “␟” U+241F "Symbol For Unit Separator".

Jerrrrrrry1y ago

i actually used [U+263A] and [U+263B] for this purpose, ignorantly (in good faith)...in pro/gov/civ projects....not realizing the canonized name wasn't "Smiley/Inverted Smiley" at the time, which may have been an oversight.

Ironically, I looked for that very control character, and I think it may not have worked with Excel/Clipboard, so was a no-go for biz ops.

I never understood how people were able to abbreviate the U S into ␟ without triggering the USA flag, like on youtube.

After reading historical/proposed Unicode RFC's, having scrolled past every glyph that could combine into grapheme clusters and fuzzed unicode input on many systems....

today I am humbled to learn that ␟ is not nor US, but the exact unit by which its own proliferation would itself de-nomen-ize itself.

Jerrrrrrry1y ago

StringBuilder in both JS and .Net has a class especially for this.

theandrewbailey1y ago

I dug up where I used it. It was a bible in HTML and JS. At first, it was using arrays of arrays of strings (for chapters and verses), but I refactored it to really long strings for each book with those separators. The entire bible is one JSON object in the source, keyed by book and the values are those really long strings.

https://raw.githubusercontent.com/theandrewbailey/OfflineBib...

A StringBuilder wouldn't work, since there's nothing left to concatenate.

foxglacier1y ago

That's for concatenating. He's splitting.

aristus1y ago· 4 in thread

In the early 2000s, back at the beginning of the world, Yahoo's web code used ^A and ^B for field and record separators to avoid having to escape commas and quotes and newlines. That was probably the last time I ever saw ASCII control characters used as intended in the wild.

There is no technical reason why CSV should have won out, except that keyboards have a comma key and almost never a ^A key.

jbeninger1y ago

That's a huge technical obstacle for most people though. The whole point of XSV formats is to be human editable. There are better formats for computer-to-computer records. If you can't type the core delimiters on a keyboard, your format is going to lose out despite any of its other benefits.

ablob1y ago

It's an editor issue then. "Back in the old days", people used to understand how to input ^A and ^B. Showing these characters is also only a mere addition in the character set. Sure, there is inertia to change, but even rich text format is/was supported by windows.

Being unable to deal with this is a laziness of the developer that spilled into the user being unable to deal with it. This is nothing a user can't be trained on, and I'd argue it makes more sense than weird escaping sequences in the event you actually do want a ",".

P.S.: But then again, with proper editors the escaping issue vanishes - and no, I do not mean IDE's. Lots of people decided it was worth it to support rtf, I figure the decision to support 2 additional characters is way easier in a user friendly way.

packetlost1y ago

That doesn't need to be an obstacle, a graphical editor that lets you click a button to add a row/column exists for nearly every other tabular format. It only matters if you want to edit the file using a plaintext editor. If the format were popular, shortcuts would be created to enter the delimiters in many plaintext editors too, which is a chicken/egg problem but let's not kid ourselves into thinking it's not a solvable problem.

cempaka1y ago

The FIX financial protocol still uses ^A.

jiehong1y ago· 4 in thread

Perhaps we should someday have length delimited text formats, and editors should recalculate the length on the fly.

Something like:

5:hello

2:pi

Maybe with one blank line with no delimiter as a record separator.

All fields on the same line could work, and would be more greppable, but harder to read for humans.
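A rough Python sketch of the length-prefixed idea, using the all-on-one-line variant. It assumes lengths count characters rather than bytes and uses no trailing terminator (so it is not the netstring format); `encode_fields` and `decode_fields` are hypothetical names.

```python
def encode_fields(fields):
    """Prefix each field with its length in characters and a colon."""
    return "".join(f"{len(field)}:{field}" for field in fields)


def decode_fields(data):
    """Read length-prefixed fields back out; the content is never scanned for delimiters."""
    fields, pos = [], 0
    while pos < len(data):
        colon = data.index(":", pos)     # end of the length prefix
        length = int(data[pos:colon])
        start, end = colon + 1, colon + 1 + length
        if end > len(data):
            raise ValueError("declared length runs past the end of the data")
        fields.append(data[start:end])
        pos = end
    return fields


assert encode_fields(["hello", "pi"]) == "5:hello2:pi"
assert decode_fields("5:hello2:pi") == ["hello", "pi"]
```

Because the decoder jumps ahead by the declared length, fields may contain colons, digits, newlines or control characters without any escaping. The cost is that the lengths must be kept exactly right.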

nly1y ago

1966 called and wants its idea back

https://en.wikipedia.org/wiki/Hollerith_constant

As soon as DJB pays up:

http://cr.yp.to/proto/netstrings.txt

zabzonk1y ago

> but harder to read for humans

harder to write too - very easy to get the length wrong, and tiresome to have to count the length

formerly_proven1y ago

There are several formats like this, the most well-known is probably canonical S-expressions. Or, if I'm being sadistic, PDF is somewhat like this, too.

zelphirkalt1y ago

That is the annoying format that Wordpress uses to store text. It does not lend itself for search and replace very well.

NBJack1y ago· 3 in thread

Kind of a short sighted take. Sticking special characters that (in many early editors) would be invisible complicates development and maintenance. Even tabs have a visual, albeit inconsistent (if your editor wants to align columns for you) manifestation you can work with.

Technically, XML is superior for data representation on many fronts. But likewise, it is an absolute PITA to maintain without significant editor support.

It is no accident that CSV/tabs 'won'.

Miner49er1y ago

Seems like a tooling problem though. I don't think it would be that difficult to have editors draw them in a readable way.

nolok1y ago

If you are okay with needing tooling to be able to edit half decently then the problem it "fixes" in CSV doesn't exists to begin with

sandreas1y ago

Yeah, and many devs overlook that with separator, enclosure and escapechar specified and used, it even supports newlines in its cell values.

directevolve1y ago· 3 in thread

How big an issue is CSV format really? I work in bioinformatics where it seems like everything is one odd CSV-like format or another. In Python, I have access to tools like pandas, duckdb, and polars, which have detailed ingestion options and sometimes a sniffer. I can read part of a file and check in seconds if it looks right.

Dealing with the variety of formats certainly isn’t the bottleneck in my productivity. Is it for others? I’d be curious why.

fuzztester1y ago

The Python csv module has a dialect option.

https://docs.python.org/3/library/csv.html
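For example, a semicolon-separated file with decimal commas can be read by registering a dialect. This is a small sketch: the dialect name `semicolon` and the sample data are made up for illustration.

```python
import csv
import io

sample = "name;amount\nAnna;12,50\nBjörn;7,25\n"


class SemicolonDialect(csv.excel):
    """Same rules as the default Excel dialect, but with ';' between fields."""
    delimiter = ";"


csv.register_dialect("semicolon", SemicolonDialect)

rows = list(csv.reader(io.StringIO(sample), dialect="semicolon"))
assert rows[1] == ["Anna", "12,50"]  # the decimal comma stays inside one field
```

The module also provides `csv.Sniffer`, which tries to guess the dialect from a sample when the producer's variant is unknown.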

accrual1y ago

I also work with foreign CSVs regularly. I'll have to try the Python Way next time I have a weird file to work with.

I typically use PowerShell to process the files from an unknown CSV format to a known one so it's easier to work with, and I've found it easy to use to iterate on.

directevolve1y ago

Oh yeah that does sound challenging. If you’re interested, here’s my take on the three libraries I mentioned.

1. Pandas is more mature with much better batch reading of larger than memory CSV files than Polars. But it’s slower and the syntax is worse.

2. Polars is my goto for one off analysis of CSV files that fit in memory. When max performance isn’t a concern, sometimes I’ll iterate through the CSV using Pandas to get it in batches, then immediately convert to Polars to do any analysis. ChatGPT has been poisoned by Polars’ early syntax changes so it often makes mistakes, but Polars’ syntax is so clean and consistent this often doesn’t matter much as it’s easy to fix.

3. DuckDB is a different beast obviously as it’s a full database, not just a single dataframe. It’s slightly more setup, but it has a CSV sniffer, does out of memory processing really well (no need to batch iterate) and lets you use SQL. I’m not too experienced at SQL yet, and it’s nice that ChatGPT is really pretty good at creating complex SQL queries. I am now gravitating to DuckDB for any larger than memory processing that can be handled in SQL. If line by line streaming is needed for the algorithm I’m implementing then I still use pandas or the pandas+polars approach.

jmclnx1y ago· 2 in thread

I like this format best of all; CSV is my #2 favorite.

At work we settled on using ^G (0x07) as a delimiter instead of TABs for file transfers and loading data into various databases.

The reason was Excel. People/systems who create these files sometimes source from Excel. And Excel can have a habit of placing odd characters in text fields. We found the one character never encountered was BEL.

For text fields we tend to remove embedded white space and after replacing TABs with 1 space.

normie30001y ago

This sounds like a noisy format.

fuzztester1y ago

Silently ignore it.

foxglacier1y ago· 2 in thread

If people used these for ASCII delimited text, they'd have to not use them for anything else, like some other text format otherwise you might insert an entire ASCII delimited file into a text field of that other thing and break that other thing's parsing. You couldn't even insert part of a file into a string field in another ASCII-delimited file. You only get to use them once so they wouldn't be part of general purpose plain text and an ASCII delimited file wouldn't be a plain text file that you could treat in the same way as other text files, so it's effectively a binary format or has restrictions on what text characters can appear in its records without escaping - oh no, that was its entire value proposition!

nisten1y ago

they break all the time, whole point is to have less pain.

2 invisible ascii characters evidently no one's rlly used for anything else (other than nextvalue nextline) sounds like an order of magnitude less pain on parsing the escaped escapes along with the whole junk paste of llm generated markdown+code junk probably pasted in there.

foxglacier1y ago

You still have to escape the delimiters to be safe except now they're more rare so easier to forget about.

The bigger problem with CSV is all the inconsistent implementations. For example, some people want semicolons instead of commas because their culture uses commas as decimal points, so I suppose semicolon should really be the standard, if there was one.

haddr1y ago· 1 in thread

The fact that CSV is still strong is that it already covers all „shortcomings” (I.e. presence of quotations in the content) mentioned by this article.

cellardweller1y ago

Yep, the only advantage I see with using ASCII control characters is that you can save a few bytes depending on the content. To make this approach robust, escaping is still needed.

bradley131y ago· 1 in thread

Nice idea, but as others have pointed out, non-printable characters pose their own problems. People expect to be able to edit CSV files.

Someone mentioned XML, but for most use cases XML is stupidly over-engineered. JSON is simpler - the entire specification is just a dozen or so pages.

accrual1y ago

I still see a lot of XML in SOAP/WSDL APIs, typically in Microsoft shops, but thankfully JSON feels like the norm when IIS isn't involved.

chrishill891y ago· 1 in thread

You can use ASCII-separated values in qsv.[1]

For the unlikely event that you are dealing with data with the metacharacters: qsv will use some other control character as the “quote” character to deal with that.

chrishill891y ago

Whoops, meant to link to qsv https://github.com/jqnatividad/qsv

tangus1y ago

And how do we escape those characters? With ESC (27)? Inside a SI/SO (15/14) pair?

I think CSV or TSV are good enough. People keep trying to find a format where you can separate the records and fields with a simple string.split and there's no need to contemplate escapes.

But that's not possible, no matter the format you'll have to parse it right. And then, a format that uses visual delimiters has the obvious advantage of being editable with any text editor.

dahart1y ago

> The most anoying thing about the whole problem is that it was solved by design in the ASCII character set.

This is a great example of not understanding what “the problem” actually is, and then assuming that because part of a technical solution exists, that everyone should be using it and if they’re not it’s because of ignorance rather than choice. I think we all do this, at least I know I’m sometimes guilty, but it’s amusing when faced with what happens in the real world at scale, to jump to the conclusion that the world is wrong rather than to first question our own assumptions.

Personally, I think it’s funny to assume that ASCII == text. Obviously not all ASCII is “text” in the sense that most people will assume. When people say “text file” I assume it contains nothing that you can’t type on a physical typewriter, other than the annoying and persistent difference between LF and CRLF. ASCII has lots of characters you can’t type on a typewriter, and are not intended to print as a character.

But if you want to invent new “text” characters for a “text” file, the problem suddenly becomes not just having a char code, but how to easily type it, how to easily display it, how to teach everyone to recognize and use it, and how to standardize these things so everyone knows them. Personally at this point I probably wouldn’t call a file with ASCII chars 28..31 in them “text”. The ASCII characters haven’t solved the overall problem, they have created several more and bigger problems that remain unsolved, and are much easier to solve in practice by using a comma instead, which is why people aren’t using the special ASCII characters in practice.

spiffytech1y ago

Some notes from when the USV project tried using control characters:

> We tried using the control characters, and also tried configuring various editors to show the control characters by rendering the control picture characters.

> First, we encountered many difficulties with editor configurations, attempting to make each editor treat the invisible zero-width characters by rendering with the visible letter-width characters.

> Second, we encountered problems with copy/paste functionality, where it often didn't work because the editor implementations and terminal implementations copied visible letter-width characters, not the underlying invisible zero-width characters.

> Third, users were unable to distinguish between the rendered control picture characters (e.g. the editor saw ASCII 31 and rendered Unicode Unit Separator) versus the control picture characters being in the data content (e.g. someone actually typed Unicode Unit Separator into the data content).

https://github.com/SixArm/usv/tree/main/doc/faq#why-use-cont...

runarberg1y ago

I’m working on a PWA which includes a dictionary search[1] feature and only a static web server (so no server side database to optimize the search). I did want searching to work in offline mode anyway. I decided it was best to generate an index file which the users download on first visit. For some reason I found USV[2] to be the best fit for this. USV I think allows separating with ASCII control characters, but I used the unicode variants (␟, ␞, and ␝).

I really liked this as it allowed me to add the glossary as an array in one of the columns. I wrote the parser myself which searches through the text structure, and it was simple enough. The reason I opted not to use a CSV or a TSV was that I didn’t want to deal with escaping surprise commas or tabs I would find in the dictionary data plus the extra dimension was nice. Since the file is generated, I didn’t have to type the characters myself so it had none of the downsides of this format honestly.

1: https://shodoku.app/dictionary

2: https://github.com/SixArm/usv

1vuio0pswjnm71y ago

"Then you have a text file format that is trivial to write out and read in, with no restrictions on the text in fields or the need to try and escape characters."

Not being a "developer", I have been productively using these non-printing separators for personal use as a UNIX-like OS and text-only internet user for close to three decades. Of course I have a bias for ASCII and against Unicode and I only use the English language for computing. Perhaps this is why using the ASCII characters, including the record and file separators, works so well for me.

Using ASCII non-printing separators might not work for everybody but it would be false to assume it will not work for anybody.

Historically ASCII worked for some computer users. It still does today, for those who still use it, like myself.

The author states, "The most anoying[sic] thing about the whole problem is that it was solved by design in the ASCII character set."

"Developers" might not use the ASCII solution but that does not prevent other computer owners from using it.

zaxomi1y ago

I sometimes use them for machine to machine transfer. The biggest problem is that regular editors don't handle it in a sensible way.

robsh1y ago

All we need is native Excel support, and HTML5 web support. In web browsers it should be the default copy formatting, and if you’re writing an HTML document these characters should be an alternative to using TD and TR tags.

calibas1y ago

I think this would catch on much more quickly if text editors treated the Record Separator character as a new line, and there was a special character for the Unit Separator.

mannyv1y ago

Tab and commas are ascii characters, so a csv file and a tdf are ascii-delimited by definition.

This lack of precision in writing is annoying.

tpoacher1y ago

people saying \034 / \035 are not readable / printable so they don't make good human readable delimiters: make it ,\034 and \n\035. looks like csv, but is actually ascii delimited. just remove last character from all entries.

apitman1y ago

Would love to see an explanation and some examples of what this would look like to work with for common use cases.

gabrielsroka1y ago

2009. has been shared here many times before

ribcage1y ago

plaintext is obsolete. Only good for storing passwords.

j / k navigate · click thread line to collapse

117 comments

83 comments · 26 top-level

Apreche1y ago· 13 in thread

The shortcoming of using the control characters is that there is no easy way to type them on a keyboard. You can trivially edit csv in a text editor.

Asooka1y ago

Technically there is (not sure about easy, most of these require ctrl+shift, but they are on the keyboard):

    CTRL   DEC HEX CHR NAME
    Ctrl-\  28  1C  FS File Separator (Right Arrow)
    Ctrl-]  29  1D  GS Group Separator (Left Arrow)
    Ctrl-^  30  1E  RS Record Separator (Up Arrow)
    Ctrl-_  31  1F  US Unit Separator (Down Arrow)

From https://www3.rocketsoftware.com/bluezone/help/v42/en/bzadmin...

zabzonk1y ago

ctrl-letters are used for other things

mattpallissard1y ago

It looks like it would just be be ctrl+^, which seems pretty straightforward.

dahart1y ago

Straightforward is completely subjective. But a comma is relatively much simpler in an absolute sense.

1 more reply

smarx0071y ago

ctrl+_ to separate cells, ctrl+^ to separate rows - works perfectly in notepad++.

zabzonk1y ago

> no easy way to type them on a keyboard

or probably to see them in the screen in any sensible form.

Also, what to do if you want to embed those ctrl characters in a field? you are likely back to the way that CSV does it with quotes, commas and CRLFs.

velcrovan1y ago

3 more replies

edlebert1y ago

And equally as important, you can cat a CSV file and easily understand it.

bediger40001y ago

Or concatenate them, or diff or grep.

velcrovan1y ago

pvg1y ago

jiehong1y ago

Maybe you with a hex editor that shows dual pane: hex and text.

NBJack1y ago

Ukv1y ago· 11 in thread

> with no restrictions on the text in fields or the need to try and escape characters.

Maybe I'm missing something, but wouldn't it still need escaping for those ASCII separator characters (or alternatively, a restriction for the stored text not to have them)?

missblit1y ago

dazzaji1y ago

I’m now free-falling down a moon viewing ceremony rabbit hole of emoji history, and enjoying the ride!

Macha1y ago

Wait until you expand into the Japanese market and all your users are talking about 月見

dec0dedab0de1y ago

Not if you just say those characters are invalid data. I first heard about them decades ago, but I don't think I have ever once seen them in use.

The real problem is that there is no easy universal way to type them with a keyboard. So it would require software interfaces in the application, and at that point it’s basically binary.

Ukv1y ago

> Not if you just say those characters are invalid data

The author's claim is "with no restrictions on the text".

It's easy if you can forbid certain characters, but then you can't store arbitrary text (e.g: filepaths, or scraped comments).

1 more reply

jader2011y ago

I feel like this (and some of the replies to this) is missing the point a bit.

I don’t think the goal was to make a bullet-proof delimiter that fails at nothing.

The goal was to solve the problem of not allowing things like commas, quotes, newlines, tabs, pipes, etc. in text files.

I feel like using the proposed ASCII characters would eliminate these limitations, while also allowing machine creatable and readable format (emphasis on machine as opposed to human).

Yes, it would still be tough for a human to type or read these delimiters, so in that case, go with traditional CSV or TSV (or MVCSV!).

But if you only need to use a machine to create/read the text, this sounds like a great solution, allowing all of the normal characters you might see in text.

Ukv1y ago

chubot1y ago

Yup exactly, it just pushes the problem around, without solving it.

The delimiters can occur in binary data, or when there's nesting -- trying to store TSV in TSV, or JSON in JSON, etc. The latter definitely happens a lot, for better or worse

smarx0071y ago

The title says ASCII Delimited Text not ASCII Delimited Binary Data.

chrishill891y ago

You can convert to another format if you need something crazier than rows and columns consisting of normal text.

huem0n1y ago

Thank you for writing that complaint out so I don't have to. It solves nothing.

croes1y ago· 7 in thread

CSV isn‘t that complicated if done right.

1. If a value includes the field separator, row separator, or text qualifier, surround the value with the text qualifier.

2. If the value contains the text qualifier, double it in the value.
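The two rules above can be sketched in a few lines of Python. This is an illustration under assumptions, not a reference implementation: the helper names `quote_field` and `write_row` are invented here, and the comma separator, double-quote qualifier, and CRLF row ending are just the common defaults.

```python
import csv
import io

def quote_field(value, sep=",", qualifier='"'):
    """Rule 1: qualify if needed. Rule 2: double any embedded qualifier."""
    if any(ch in value for ch in (sep, qualifier, "\r", "\n")):
        return qualifier + value.replace(qualifier, qualifier * 2) + qualifier
    return value

def write_row(fields, sep=","):
    """Quote each field as needed and end the row with CRLF."""
    return sep.join(quote_field(f, sep) for f in fields) + "\r\n"

row = ["plain", "a,b", 'say "hi"', "two\nlines"]
line = write_row(row)

# Cross-check against the standard library reader.
parsed = next(csv.reader(io.StringIO(line, newline="")))
print(parsed == row)
```

Reading is the harder half: the reader has to track whether it is inside a qualified value, which is why a plain split on the separator is not enough.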

Macha1y ago

    String.split(",").map(it.replace("\"\"", "\""))

to Spark insisting that backslash escapes exist in CSV.

croes1y ago

That’s the real problem: those systems claim to read or write CSV, but in reality it’s a bastard form of CSV.

Never had a problem when both sides know the rules.

But I doubt those other systems even support ASCII separated value files.

nly1y ago

It's easier in most cases to just use a parser flexible enough for you to specify whatever variant the producer actually emitted.

mrweasel1y ago

But it's so rarely done right.

croes1y ago

I doubt ASV would be done right more often

ekianjo1y ago

croes1y ago

CSV doesn‘t fail, they fail to handle CSV.

And I doubt those who create wrong CSVs can even handle ASVs.

CSV can be read and edited with any text editor.

theandrewbailey1y ago· 5 in thread

foxglacier1y ago

But really you could have used any other character that wasn't going to appear in your strings, especially a visible one like “␟” U+241F "Symbol For Unit Separator".

Jerrrrrrry1y ago

Ironically, I looked for that very control character, and I think it may not have worked with Excel/Clipboard, so it was a no-go for biz ops.

I never understood how people were able to abbreviate the U S into ␟ without triggering the USA flag, like on youtube.

After reading historical/proposed Unicode RFCs, having scrolled past every glyph that could combine into grapheme clusters and fuzzed Unicode input on many systems....

today I am humbled to learn that ␟ is not nor US, but the exact unit by which its own proliferation would itself de-nomen-ize itself.

Jerrrrrrry1y ago

StringBuilder in both JS and .Net has a class especially for this.

theandrewbailey1y ago

https://raw.githubusercontent.com/theandrewbailey/OfflineBib...

A StringBuilder wouldn't work, since there's nothing left to concatenate.

foxglacier1y ago

That's for concatenating. He's splitting.

aristus1y ago· 4 in thread

There is no technical reason why CSV should have won out, except that keyboards have a comma key and almost never a ^A key.

jbeninger1y ago

ablob1y ago

packetlost1y ago

cempaka1y ago

The FIX financial protocol still uses ^A.

jiehong1y ago· 4 in thread

Perhaps we should someday have length delimited text formats, and editors should recalculate the length on the fly.

Something like:

5:hello

2:pi

Maybe with one blank line with no delimiter as a record separator.

All fields on the same line could work, and would be more greppable, but harder to read for humans.
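A rough sketch of what such a length-delimited format could look like in code, loosely following the layout described above (one `<length>:<text>` field per line, a blank line closing each record). The function names are hypothetical, and counting lengths in characters rather than bytes is an assumption:

```python
def encode_lengths(records):
    """Each field becomes '<length>:<text>' plus a newline; a blank line ends a record."""
    out = []
    for fields in records:
        for field in fields:
            out.append(f"{len(field)}:{field}\n")
        out.append("\n")
    return "".join(out)

def decode_lengths(text):
    """Read each field by its declared length, so field content is never scanned."""
    records, fields, pos = [], [], 0
    while pos < len(text):
        if text[pos] == "\n":          # blank line: close the current record
            records.append(fields)
            fields = []
            pos += 1
            continue
        colon = text.index(":", pos)   # the length prefix runs up to the colon
        length = int(text[pos:colon])
        start = colon + 1
        fields.append(text[start:start + length])
        pos = start + length + 1       # skip the newline after the field
    return records

data = [["hello", "pi"], ["a:b\n\n3:x", ""]]
print(decode_lengths(encode_lengths(data)) == data)
```

Because the decoder trusts the length prefix, fields may contain colons, newlines, or anything else without escaping. The cost is that a hand edit which changes a field without updating its length corrupts everything after it.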

nly1y ago

1966 called and wants its idea back

https://en.wikipedia.org/wiki/Hollerith_constant

As soon as DJB pays up:

http://cr.yp.to/proto/netstrings.txt

zabzonk1y ago

> but harder to read for humans

harder to write too - very easy to get the length wrong, and tiresome to have to count the length

formerly_proven1y ago

There are several formats like this, the most well-known is probably canonical S-expressions. Or, if I'm being sadistic, PDF is somewhat like this, too.

zelphirkalt1y ago

That is the annoying format that WordPress uses to store text. It does not lend itself to search and replace very well.

NBJack1y ago· 3 in thread

Technically, XML is superior for data representation on many fronts. But likewise, it is an absolute PITA to maintain without significant editor support.

It is no accident that CSV/tabs 'won'.

Miner49er1y ago

Seems like a tooling problem though. I don't think it would be that difficult to have editors draw them in a readable way.

nolok1y ago

If you are okay with needing tooling to be able to edit half decently, then the problem it "fixes" in CSV doesn't exist to begin with

sandreas1y ago

Yeah, and many devs overlook that with separator, enclosure, and escapechar specified and used, it even supports newlines in its cell values

directevolve1y ago· 3 in thread

Dealing with the variety of formats certainly isn’t the bottleneck in my productivity. Is it for others? I’d be curious why.

fuzztester1y ago

The Python csv module has a dialect option.

https://docs.python.org/3/library/csv.html
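For anyone who hasn't used it, here is a small sketch of the dialect mechanism. The sample data and the dialect name `vendor_export` are invented for illustration; `csv.register_dialect` and the `dialect=` argument are part of the standard library.

```python
import csv
import io

# Hypothetical export: semicolon-delimited, single-quote qualified.
sample = "name;comment\r\nalice;'likes; semicolons'\r\nbob;plain\r\n"

# Register the variant once under a made-up name, then reuse it.
csv.register_dialect("vendor_export", delimiter=";", quotechar="'")

rows = list(csv.reader(io.StringIO(sample, newline=""), dialect="vendor_export"))
print(rows)
```

The module also has `csv.Sniffer`, which tries to guess the dialect from a sample, but it is a heuristic and worth double-checking on unfamiliar files.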

accrual1y ago

I also work with foreign CSVs regularly. I'll have to try the Python Way next time I have a weird file to work with.

I typically use PowerShell to process the files from an unknown CSV format to a known one so it's easier to work with, and I've found it easy to use to iterate on.

directevolve1y ago

Oh yeah that does sound challenging. If you’re interested, here’s my take on the three libraries I mentioned.

1. Pandas is more mature with much better batch reading of larger than memory CSV files than Polars. But it’s slower and the syntax is worse.

jmclnx1y ago· 2 in thread

I like this format best of all; CSV is my #2 favorite.

At work we settled on using ^G (0x07) as a delimiter instead of TABs for file transfers and loading data into various databases.

For text fields we tend to remove embedded whitespace after replacing TABs with 1 space.

normie30001y ago

This sounds like a noisy format.

fuzztester1y ago

Silently ignore it.

foxglacier1y ago· 2 in thread

nisten1y ago

they break all the time, whole point is to have less pain.

foxglacier1y ago

You still have to escape the delimiters to be safe except now they're more rare so easier to forget about.

haddr1y ago· 1 in thread

The reason CSV is still going strong is that it already covers all the „shortcomings” (i.e. presence of quotations in the content) mentioned by this article.

cellardweller1y ago

Yep, the only advantage I see with using ASCII control characters is that you can save a few bytes depending on the content. To make this approach robust, escaping is still needed.

bradley131y ago· 1 in thread

Nice idea, but as others have pointed out, non-printable characters pose their own problems. People expect to be able to edit CSV files.

Someone mentioned XML, but for most use cases XML is stupidly over-engineered. JSON is simpler - the entire specification is just a dozen or so pages.

accrual1y ago

I still see a lot of XML in SOAP/WSDL APIs, typically in Microsoft shops, but thankfully JSON feels like the norm when IIS isn't involved.

chrishill891y ago· 1 in thread

You can use ASCII-separated values in qsv.[1]

For the unlikely event that you are dealing with data with the metacharacters: qsv will use some other control character as the “quote” character to deal with that.

chrishill891y ago

Whoops, meant to link to qsv https://github.com/jqnatividad/qsv

tangus1y ago

And how do we escape those characters? With ESC (27)? Inside a SI/SO (15/14) pair?

I think CSV or TSV are good enough. People keep trying to find a format where you can separate the records and fields with a simple string.split and there's no need to contemplate escapes.

But that's not possible: no matter the format, you'll have to parse it right. And then, a format that uses visual delimiters has the obvious advantage of being editable with any text editor.
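To make the ESC question concrete, here is one possible scheme, sketched under assumptions: ESC (27) is used as a prefix that makes the next character literal, and the function names are invented. This is not something the article or any standard specifies.

```python
ESC, RS, US = "\x1b", "\x1e", "\x1f"

def escape(field):
    """Prefix every ESC, RS, or US inside a field with ESC."""
    return "".join(ESC + ch if ch in (ESC, RS, US) else ch for ch in field)

def encode_escaped(rows):
    return RS.join(US.join(escape(f) for f in fields) for fields in rows)

def decode_escaped(text):
    """Scan one character at a time; a plain split would trip over escaped separators."""
    rows, fields, current = [], [], []
    chars = iter(text)
    for ch in chars:
        if ch == ESC:
            current.append(next(chars, ""))  # take the next character literally
        elif ch == US:
            fields.append("".join(current))
            current = []
        elif ch == RS:
            fields.append("".join(current))
            rows.append(fields)
            fields, current = [], []
        else:
            current.append(ch)
    fields.append("".join(current))
    rows.append(fields)
    return rows

rows = [["a" + US + "b", "c" + ESC + RS], ["", "plain"]]
print(decode_escaped(encode_escaped(rows)) == rows)
```

It round-trips arbitrary text, but the decoder is a small state machine again, which is the same amount of work a correct CSV reader has to do.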

dahart1y ago

> The most anoying thing about the whole problem is that it was solved by design in the ASCII character set.

spiffytech1y ago

Some notes from when the USV project tried using control characters:

> We tried using the control characters, and also tried configuring various editors to show the control characters by rendering the control picture characters.

> First, we encountered many difficulties with editor configurations, attempting to make each editor treat the invisible zero-width characters by rendering with the visible letter-width characters.

https://github.com/SixArm/usv/tree/main/doc/faq#why-use-cont...

runarberg1y ago

1: https://shodoku.app/dictionary

2: https://github.com/SixArm/usv

1vuio0pswjnm71y ago

"Then you have a text file format that is trivial to write out and read in, with no restrictions on the text in fields or the need to try and escape characters."

Using ASCII non-printing separators might not work for everybody but it would be false to assume it will not work for anybody.

Historically ASCII worked for some computer users. It still does today, for those who still use it, like myself.

The author states, "The most anoying[sic] thing about the whole problem is that it was solved by design in the ASCII character set."

"Developers" might not use the ASCII solution but that does not prevent other computer owners from using it.

zaxomi1y ago

I sometimes use them for machine to machine transfer. The biggest problem is that regular editors don't handle it in a sensible way.

robsh1y ago

calibas1y ago

I think this would catch on much more quickly if text editors treated the Record Separator character as a new line, and there was a special character for the Unit Separator.
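Until editors do that, a display-only filter can approximate it. The sketch below is an assumption about what such rendering might look like: it maps the four separators to the Unicode control picture glyphs and adds a line break after each record separator. The `render` helper is hypothetical and is meant for viewing, not for writing files back.

```python
# Map the separators to the Unicode "control picture" glyphs for display only.
PICTURES = str.maketrans({
    "\x1c": "\u241c",    # FS -> ␜
    "\x1d": "\u241d",    # GS -> ␝
    "\x1e": "\u241e\n",  # RS -> ␞ followed by a line break
    "\x1f": "\u241f",    # US -> ␟
})

def render(text):
    """Return a human-readable view of ASCII-delimited text."""
    return text.translate(PICTURES)

raw = "id\x1fname\x1e1\x1fAda\x1e2\x1fGrace"
print(render(raw))
```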

mannyv1y ago

Tabs and commas are ASCII characters, so a CSV file and a TDF are ASCII-delimited by definition.

This lack of precision in writing is annoying.

tpoacher1y ago

apitman1y ago

Would love to see an explanation and some examples of what this would look like to work with for common use cases.

gabrielsroka1y ago

2009. has been shared here many times before

ribcage1y ago

plaintext is obsolete. Only good for storing passwords.
