CTRL     DEC  HEX  CHR  NAME
Ctrl-\   28   1C   FS   File Separator (Right Arrow)
Ctrl-]   29   1D   GS   Group Separator (Left Arrow)
Ctrl-^   30   1E   RS   Record Separator (Up Arrow)
Ctrl-_   31   1F   US   Unit Separator (Down Arrow)
From https://www3.rocketsoftware.com/bluezone/help/v42/en/bzadmin...
You can type them in the terminal by prefixing with Ctrl-V, so you can enter a record separator by pressing Ctrl-V, Ctrl-Shift-6. Typing Ctrl-\ is tricky because terminals usually treat it as a quit signal (SIGQUIT), e.g. it exits the Python REPL, but I don't think that one in particular is super important to type manually. In hindsight, if these had been assigned to Ctrl-<letter>, they would have been a lot easier to type and use.
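As a rough illustration of why those particular key chords produce those codes: in ASCII, Ctrl-<key> corresponds to the key's code with bit 0x40 flipped. The helper name `ctrl` below is made up for this sketch.

```python
def ctrl(key: str) -> str:
    """Return the control character for Ctrl-<key> (hypothetical helper).

    Works for keys in the range '@'..'_' by flipping bit 0x40.
    """
    return chr(ord(key) ^ 0x40)


# Print the same columns as the table: chord, decimal, hex, abbreviation.
for key, name in [("\\", "FS"), ("]", "GS"), ("^", "RS"), ("_", "US")]:
    code = ord(ctrl(key))
    print(f"Ctrl-{key}  {code:2d}  {code:02X}  {name}")
```

The same rule explains why Ctrl-G is BEL (0x07) and Ctrl-A is 0x01.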
Ctrl+shift+6 is a 3-key chord. It’s potentially hard to discover (I can’t say I’ve ever actually seen it), it seems likely to be overridden by applications, and the caret isn’t a natural separator; it is more commonly used for other things, like exponents.
A comma is 1 key on the keyboard, and it’s already a natural separator; the very meaning of comma is separator. Note how many commas are used in this thread compared to the number of record separators. :P
Having to type both ctrl+shift+6 and ctrl+shift+minus seems like a small amount of extra physical and mental friction compared to using commas and returns, but since it is paid every time a separator is typed, it adds up to a lot of friction over time. Enough that the eventual implication is that you need better tooling than a text editor provides in order to author delimited files, which sort of undermines the idea of having a text file. It’s a mistake to think that because a key chord exists the problem is solved, and a mistake to underestimate the value of making commonly used items as simple as possible, especially if it’s going to affect a lot of different people.
I think the proposal can be improved by using ctrl+^ followed by a newline as a row separator. It looks much more readable, and it allows various line-based CLI tools to be used unless there are newlines in the cells.
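A minimal sketch of that variant, assuming US (0x1F) between fields and RS (0x1E) plus a newline after every row. The names `dump` and `load` are illustrative only, and a cell containing the exact sequence RS + newline would still break it.

```python
US, RS = "\x1f", "\x1e"
ROW_END = RS + "\n"  # record separator followed by a newline, as proposed


def dump(rows):
    """Join fields with US and terminate every row with RS + newline."""
    return "".join(US.join(row) + ROW_END for row in rows)


def load(text):
    """Split on the two-character terminator; the final piece is empty."""
    return [chunk.split(US) for chunk in text.split(ROW_END)[:-1]]


rows = [["id", "note"], ["1", "plain"], ["2", "has, comma"]]
text = dump(rows)

# With no newlines inside cells, every physical line is exactly one row.
# str.splitlines() is avoided because it also treats 0x1E as a line boundary.
lines = text.split("\n")[:-1]
```

A cell that contains a bare newline still round-trips through `load`, but it no longer lines up with line-based tools.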
or probably to see them on the screen in any sensible form.
Also, what do you do if you want to embed those control characters in a field? You are likely back to the way that CSV does it, with quotes, commas and CRLFs.
Maybe I'm missing something, but wouldn't it still need escaping for those ASCII separator characters (or alternatively, a restriction for the stored text not to have them)?
It's true that having to deal with escaping much less often (since the ASCII separator characters are rarer than commas/quotes) would be convenient for manual reading/writing, but I feel that's canceled out by the characters being hard to type/see (likely the reason why they're rare) - and it wouldn't necessarily save on writer/parser code complexity.
The real problem is that there is no easy universal way to type them with a keyboard. So it would require software interfaces in the application, and at that point it’s basically binary.
The author's claim is "with no restrictions on the text".
It's easy if you can forbid certain characters, but then you can't store arbitrary text (e.g: filepaths, or scraped comments).
I don’t think the goal was to make a bullet-proof delimiter that fails at nothing.
The goal was to solve the problem of not allowing things like commas, quotes, newlines, tabs, pipes, etc. in text files.
I feel like using the proposed ASCII characters would eliminate these limitations, while also allowing a machine-creatable and machine-readable format (emphasis on machine as opposed to human).
Yes, it would still be tough for a human to type or read these delimiters, so in that case, go with traditional CSV or TSV (or MVCSV!).
But if you only need to use a machine to create/read the text, this sounds like a great solution, allowing all of the normal characters you might see in text.
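For the machine-only case, a minimal sketch might look like the following. It assumes US (0x1F) between fields and RS (0x1E) between records, and it simply rejects fields that contain either separator rather than escaping them; `encode` and `decode` are hypothetical names.

```python
US, RS = "\x1f", "\x1e"


def encode(rows):
    """Serialize rows; refuse any field that contains a separator."""
    for row in rows:
        for field in row:
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return RS.join(US.join(row) for row in rows)


def decode(text):
    """Plain splitting is enough because separators never occur in fields."""
    return [record.split(US) for record in text.split(RS)]


rows = [["name", "note"], ['Smith, "J"', "line1\nline2\tend"]]
assert decode(encode(rows)) == rows  # commas, quotes, newlines, tabs survive
```

The restriction is the trade-off discussed in this thread: arbitrary binary data or nested ASV would need escaping or a different format.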
The delimiters can occur in binary data, or when there's nesting -- trying to store TSV in TSV, or JSON in JSON, etc. The latter definitely happens a lot, for better or worse
For the purposes of CSV, I consider text to be anything that satisfies the regex ^\P{Cc}+$ (https://www.compart.com/en/unicode/category/Cc), and I normally strip chars in that category before saving some text (for single-line text). For multi-line text, [\p{Cc}&&[^\n]]+ (Java character-class intersection syntax) matches all control chars except the newline, so replacing its matches with nothing strips them.
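Python's standard `re` module does not support \p{Cc}, so an equivalent sketch can use `unicodedata` instead. The function name `strip_controls` and its `keep` parameter are assumptions made for this example.

```python
import unicodedata


def strip_controls(text: str, keep: str = "\n") -> str:
    """Drop every character in Unicode category Cc unless it is in `keep`."""
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )


# Tab and the unit separator are both in category Cc; the newline is kept.
cleaned = strip_controls("a\x1fb\tc\nd")
```

Passing `keep=""` gives the single-line behaviour, where newlines are stripped as well.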
You can convert to another format if you need something crazier than rows and columns consisting of normal text.
1. If a value includes the field separator, row separator or text qualifier, surround the value with the text qualifier.
2. If the value contains the text qualifier, double it in the value.
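The two rules above can be sketched as a single function. The name `quote_field` and its default arguments are assumptions for illustration; the round trip through Python's `csv` module is only a sanity check that a standard reader accepts the output.

```python
import csv
import io


def quote_field(value, sep=",", row_sep="\n", qualifier='"'):
    """Apply rule 1 (wrap when needed) and rule 2 (double the qualifier)."""
    if sep in value or row_sep in value or qualifier in value:
        return qualifier + value.replace(qualifier, qualifier * 2) + qualifier
    return value


row = ["plain", "a,b", 'say "hi"', "two\nlines"]
line = ",".join(quote_field(v) for v in row)

# newline="" keeps the embedded newline intact for the reader.
parsed = next(csv.reader(io.StringIO(line, newline="")))
```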
String.split(",").map(it.replace("\"\"", "\""))
to Spark insisting that backslash escapes exist in CSV. Never had a problem when both sides know the rules.
But I doubt those other systems even support ASCII separated value files.
I remember Klarna using ", " as their separator, not ",". There had to be a space as well, which most CSV parsers cannot handle. So when they gave us a CSV file containing currency amounts, with Swedish kronor using "," as the decimal separator, you'd get some fun results. Pretty much every CSV parser we tried would assume that kronor and øre were two separate fields.
And I doubt those who create wrong CSVs can even handle ASVs.
CSV can be read and edited with any text editor.
Ironically, I looked for that very control character, and I think it may not have worked with Excel/Clipboard, so it was a no-go for biz ops.
I never understood how people were able to abbreviate the U S into ␟ without triggering the USA flag, like on youtube.
After reading historical/proposed Unicode RFCs, having scrolled past every glyph that could combine into grapheme clusters and fuzzed Unicode input on many systems....
today I am humbled to learn that ␟ is not nor US, but the exact unit by which its own proliferation would itself de-nomen-ize itself.
https://raw.githubusercontent.com/theandrewbailey/OfflineBib...
A StringBuilder wouldn't work, since there's nothing left to concatenate.
There is no technical reason why CSV should have won out, except that keyboards have a comma key and almost never a ^A key.
Being unable to deal with this is laziness on the developer's part that spilled over into the user being unable to deal with it. This is nothing a user can't be trained on, and I'd argue it makes more sense than weird escaping sequences in the event you actually do want a ",".
P.S.: But then again, with proper editors the escaping issue vanishes - and no, I do not mean IDEs. Lots of people decided it was worth it to support RTF; I figure supporting 2 additional characters in a user-friendly way is a far easier decision.
Something like:
5:hello
2:pi
Maybe with one blank line with no delimiter as a record separator.
All fields on the same line could work, and would be more greppable, but harder to read for humans.
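A rough sketch of that length-prefixed layout, assuming one field per line written as `<length>:<value>`, lengths counted in characters, and a blank line ending each record. The function names are hypothetical.

```python
def encode_record(fields):
    """One 'length:value' line per field, then a blank line ends the record."""
    return "".join(f"{len(f)}:{f}\n" for f in fields) + "\n"


def decode(text):
    """Parse records by reading each length prefix, never by scanning values."""
    records, fields, pos = [], [], 0
    while pos < len(text):
        if text[pos] == "\n":            # blank line: end of record
            records.append(fields)
            fields = []
            pos += 1
            continue
        colon = text.index(":", pos)
        length = int(text[pos:colon])
        start = colon + 1
        fields.append(text[start:start + length])
        pos = start + length + 1         # skip the newline after the value
    if fields:                           # tolerate a missing final blank line
        records.append(fields)
    return records


data = encode_record(["hello", "pi"]) + encode_record(["a:b\n\nc", ""])
```

Because the reader trusts the prefix, values may contain colons, newlines, or even blank lines without any escaping, at the cost of being awkward to write by hand.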
https://en.wikipedia.org/wiki/Hollerith_constant
As soon as DJB pays up:
harder to write too - very easy to get the length wrong, and tiresome to have to count the length
Technically, XML is superior for data representation on many fronts. But likewise, it is an absolute PITA to maintain without significant editor support.
It is no accident that CSV/tabs 'won'.
Dealing with the variety of formats certainly isn’t the bottleneck in my productivity. Is it for others? I’d be curious why.
I typically use PowerShell to convert files from an unknown CSV format to a known one so they're easier to work with, and I've found it easy to iterate on.
1. Pandas is more mature with much better batch reading of larger than memory CSV files than Polars. But it’s slower and the syntax is worse.
2. Polars is my go-to for one-off analysis of CSV files that fit in memory. When max performance isn’t a concern, sometimes I’ll iterate through the CSV using Pandas to get it in batches, then immediately convert to Polars to do any analysis. ChatGPT has been poisoned by Polars’ early syntax changes so it often makes mistakes, but Polars’ syntax is so clean and consistent that this often doesn’t matter much, as it’s easy to fix.
3. DuckDB is a different beast obviously as it’s a full database, not just a single dataframe. It’s slightly more setup, but it has a CSV sniffer, does out of memory processing really well (no need to batch iterate) and lets you use SQL. I’m not too experienced at SQL yet, and it’s nice that ChatGPT is really pretty good at creating complex SQL queries. I am now gravitating to DuckDB for any larger than memory processing that can be handled in SQL. If line by line streaming is needed for the algorithm I’m implementing then I still use pandas or the pandas+polars approach.
At work we settled on using ^G (0x07) as a delimiter instead of TABs for file transfers and loading data into various databases.
The reason was Excel. People/systems who create these files sometimes source from Excel. And Excel can have a habit of placing odd characters in text fields. We found the one character never encountered was BEL.
For text fields we tend to replace TABs with a single space and then remove any other embedded whitespace.
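A sketch of what a BEL-delimited file could look like with Python's `csv` module, which accepts any single character as the delimiter. The `clean` helper is one possible reading of the whitespace handling described above (tabs become a space, runs of whitespace collapse to one space), not the poster's actual pipeline.

```python
import csv
import io

BEL = "\x07"


def clean(text: str) -> str:
    """Assumed normalisation: tabs to spaces, then collapse whitespace runs."""
    return " ".join(text.replace("\t", " ").split())


rows = [["id", "comment"], ["1", clean('He said "hi",\ttwice\n')]]

# Write with BEL as the field delimiter.
buf = io.StringIO(newline="")
csv.writer(buf, delimiter=BEL, lineterminator="\n").writerows(rows)

# Read it back with the same delimiter.
loaded = list(csv.reader(io.StringIO(buf.getvalue(), newline=""), delimiter=BEL))
```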
Two invisible ASCII characters that evidently no one has really used for anything else (other than next value and next line) sound like an order of magnitude less pain than parsing the escaped escapes, along with the whole paste of LLM-generated markdown+code junk probably pasted in there.
The bigger problem with CSV is all the inconsistent implementations. For example, some people want semicolons instead of commas because their culture uses commas as decimal points, so I suppose semicolon should really be the standard, if there was one.
Someone mentioned XML, but for most use cases XML is stupidly over-engineered. JSON is simpler - the entire specification is just a dozen or so pages.
For the unlikely event that you are dealing with data with the metacharacters: qsv will use some other control character as the “quote” character to deal with that.
I think CSV or TSV are good enough. People keep trying to find a format where you can separate the records and fields with a simple string.split and there's no need to contemplate escapes.
But that's not possible, no matter the format you'll have to parse it right. And then, a format that uses visual delimiters has the obvious advantage of being editable with any text editor.
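A small illustration of the point about parsing, using an example line chosen for this sketch: a plain split breaks as soon as a quoted field contains the delimiter, while a real CSV parser handles it.

```python
import csv

line = 'a,"b,c",d'

# Naive approach: split on every comma, ignoring quoting.
naive = line.split(",")

# Proper approach: let a CSV parser interpret the quotes.
parsed = next(csv.reader([line]))

print(naive)   # four pieces, with stray quote characters
print(parsed)  # three fields, the middle one containing a comma
```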
This is a great example of not understanding what “the problem” actually is, and then assuming that because part of a technical solution exists, that everyone should be using it and if they’re not it’s because of ignorance rather than choice. I think we all do this, at least I know I’m sometimes guilty, but it’s amusing when faced with what happens in the real world at scale, to jump to the conclusion that the world is wrong rather than to first question our own assumptions.
Personally, I think it’s funny to assume that ASCII == text. Obviously not all ASCII is “text” in the sense that most people will assume. When people say “text file” I assume it contains nothing that you can’t type on a physical typewriter, other than the annoying and persistent difference between LF and CRLF. ASCII has lots of characters you can’t type on a typewriter, and are not intended to print as a character.
But if you want to invent new “text” characters for a “text” file, the problem suddenly becomes not just having a char code, but how to easily type it, how to easily display it, how to teach everyone to recognize and use it, and how to standardize these things so everyone knows them. Personally at this point I probably wouldn’t call a file with ASCII chars 28..31 in them “text”. The ASCII characters haven’t solved the overall problem, they have created several more and bigger problems that remain unsolved, and are much easier to solve in practice by using a comma instead, which is why people aren’t using the special ASCII characters in practice.
> We tried using the control characters, and also tried configuring various editors to show the control characters by rendering the control picture characters.
> First, we encountered many difficulties with editor configurations, attempting to make each editor treat the invisible zero-width characters by rendering with the visible letter-width characters.
> Second, we encountered problems with copy/paste functionality, where it often didn't work because the editor implementations and terminal implementations copied visible letter-width characters, not the underlying invisible zero-width characters.
> Third, users were unable to distinguish between the rendered control picture characters (e.g. the editor saw ASCII 31 and rendered Unicode Unit Separator) versus the control picture characters being in the data content (e.g. someone actually typed Unicode Unit Separator into the data content).
https://github.com/SixArm/usv/tree/main/doc/faq#why-use-cont...
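To make the third point in the quote concrete, here is a sketch of rendering C0 control characters as Unicode control pictures, which sit at U+2400 plus the control code. The function name `to_picture` is made up for this example.

```python
def to_picture(text: str) -> str:
    """Replace each C0 control character (0x00-0x1F) with its control picture."""
    return "".join(
        chr(0x2400 + ord(ch)) if ord(ch) < 0x20 else ch
        for ch in text
    )


real_separator = "a\x1fb"      # ASCII 31 between the fields
typed_picture = "a\u241fb"     # someone typed the picture character itself

# Both render identically, so a reader cannot tell them apart on screen.
same_on_screen = to_picture(real_separator) == to_picture(typed_picture)
```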
I really liked this as it allowed me to add the glossary as an array in one of the columns. I wrote the parser myself, which searches through the text structure, and it was simple enough. The reason I opted not to use a CSV or a TSV was that I didn’t want to deal with escaping surprise commas or tabs I would find in the dictionary data, plus the extra dimension was nice. Since the file is generated, I didn’t have to type the characters myself, so it honestly had none of the downsides of this format.
Not being a "developer", I have been productively using these non-printing separators for personal use as a UNIX-like OS and text-only internet user for close to three decades. Of course I have a bias for ASCII and against Unicode and I only use the English language for computing. Perhaps this is why using the ASCII characters, including the record and file separators, works so well for me.
Using ASCII non-printing separators might not work for everybody but it would be false to assume it will not work for anybody.
Historically ASCII worked for some computer users. It still does today, for those who still use it, like myself.
The author states, "The most anoying[sic] thing about the whole problem is that it was solved by design in the ASCII character set."
"Developers" might not use the ASCII solution but that does not prevent other computer owners from using it.
This lack of precision in writing is annoying.