Better HN

0 points · rswail · 4d ago · 0 comments

Things that have bugged me for 40 years...

* NUL terminated strings (and now, non UTF-8 encoded strings on input/output)

* Using LF or CR or CRLF as line terminators, and pipe/comma-delimited fields, when there were other unambiguous ASCII characters that could have been used (e.g., GS, FS, RS). That would have made the encoding/decoding of line termination an I/O concern, keeping HT/VT/CR/LF/FF as literally print-related codes.

22 comments · 5 top-level

flohofwoe4d ago· 11 in thread

> non UTF-8 encoded strings on input/output

UTF-8 on stdin/stdout works perfectly fine (unless you are on Windows of course, which is stuck in the early 90s when it comes to international text encoding).

> Using LF or CR or CRLF as line terminators

This is also an operating system convention, and it would be better if programming languages didn't try to "guess" the correct line endings, since this causes more problems than it solves. But again, this is mostly a Windows-specific problem, and it's Microsoft's job to finally bring Windows into the current century.

rswailOP4d ago

No, it was an Apple, Unix, and Microsoft problem.

Unix used LF, Apple used CR, Microsoft used CRLF.

They are all ASCII carriage movement codes, which is about driving the paper feed and print head of an ASR-33 or equivalent.

So they all made the "wrong" decision about what to store in a file.

They just chose different wrong characters.

flohofwoe4d ago

> Apple used CR

Apple hasn't been using CR since the release of OSX (26 years ago). Microsoft could have made the switch at any time too (just as they could have switched to UTF-8 as the universal text encoding on Windows); they just chose not to.

In the end it's not the job of programming languages to clean up Microsoft's mess ;)

rswailOP3d ago

We're literally talking about two decades before that.

bobmcnamara4d ago

The switch sure sucked though. I doubt Microsoft would risk their reputation for backwards compatibility.

parineum4d ago

> In the end it's not the job of programming languages to clean up Microsoft's mess ;)

Why is it Microsoft's fault? They just stayed on their legacy implementation; Linux and Apple chose to move from the legacy implementation to another legacy implementation. That seems dumb.

kps4d ago

> They just chose different wrong characters.

Unix followed Multics. Multics chose right. ASCII/ECMA-6/ISO 646 drafts discussed this at least as early as 1963¹: “For equipment which uses a single combination (called New Line) [...] NL will be coded at FE₂ [Field Effector 2 = 0x0A].”

¹ doi/10.1093/comjnl/7.3.197

Parodper4d ago

UNIX's LF precedes them by at least half a decade, probably more.

bobmcnamara4d ago

CRLF is Baudot, predating UNIX by what, a century?

JdeBP3d ago

rswail said ASCII, which definitely pre-dated Unix, not the other way around. And there was some to and fro about the equivalence of LineFeed and NewLine in the 1960s.

skywhopper4d ago

What programming languages try to guess line endings? Or are even aware of them?

flohofwoe4d ago

Ok, technically not the programming languages, but their stdlibs. On MSVC at least, opening a file in text mode via fopen will translate CRLF into LF on read, and LF into CRLF on write, which has been a neverending source of confusion since at least the 1990s.
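
The same kind of silent translation shows up outside C. As a rough illustration only (this is Python's text mode, not MSVC's C runtime, so treat it as an analogy), the sketch below writes CRLF bytes to a throwaway temp file, reads them back in default text mode, and then again with `newline=""`, which turns the translation off. The helper name `read_both_ways` is made up for this example.

```python
import os
import tempfile

def read_both_ways(raw):
    """Return (translated, untranslated) text for the same bytes on disk."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:   # binary mode: bytes are stored verbatim
            f.write(raw)
        # Default text mode uses universal newlines: CRLF and lone CR become "\n".
        with open(path, "r", encoding="utf-8") as f:
            translated = f.read()
        # newline="" hands back the line endings exactly as stored.
        with open(path, "r", encoding="utf-8", newline="") as f:
            untranslated = f.read()
        return translated, untranslated
    finally:
        os.remove(path)

translated, untranslated = read_both_ways(b"one\r\ntwo\r\n")
```

The two results differ only in their line endings, which is exactly the kind of difference that goes unnoticed until a byte count or checksum stops matching.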

EvanAnderson4d ago· 3 in thread

I did a project to translate data framed in the ASCII field/record separator characters and it was gloriously easy. All the ugly escaping considerations with comma-delimited data went away and it became much easier.
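
For anyone who hasn't tried it, a minimal sketch of this style of framing might look like the following. It is not the code from that project; it assumes US (0x1F) between fields and RS (0x1E) between records, and the function names `encode` and `decode` are invented for illustration. Fields that already contain a separator are rejected rather than escaped.

```python
US = "\x1f"  # ASCII Unit Separator, used here between fields
RS = "\x1e"  # ASCII Record Separator, used here between records

def encode(records):
    """Join rows of string fields; reject any field containing a separator."""
    for row in records:
        for field in row:
            if US in field or RS in field:
                raise ValueError(f"separator character in field: {field!r}")
    return RS.join(US.join(row) for row in records)

def decode(blob):
    """Inverse of encode(). An empty blob is read as "no records", so a lone
    record consisting of one empty field cannot be represented."""
    if blob == "":
        return []
    return [record.split(US) for record in blob.split(RS)]

# Commas, quotes and embedded newlines need no escaping at all.
rows = [["ACME, Inc.", 'say "hi"', "line1\nline2"], ["42", "", "ok"]]
blob = encode(rows)
```

Decoding is two `split` calls, with no quoting state machine, which is where the "gloriously easy" part comes from.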

ptx4d ago

What happens when the data contains the record or field separator characters?

I suppose you could document that it's unsupported, and just drop or reject such values, but then the system couldn't be used to handle test data for such systems, for example.

EvanAnderson4d ago

In the case of this system (a quasi-EDI interface used to move records from a fleet fueling point-of-sale system to the ERP software) those characters were forbidden by the source application. My code would have exploded in a fireball if they had been present, but the specification said they couldn't be.

bobmcnamara4d ago

Easy - don't

Parodper4d ago· 3 in thread

LF makes the most sense, but they're all fine for text files. The issue is that CSV isn't text.

Last time I had to handle CSV files in bash, I converted them internally to RS and FS.
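
A sketch of that kind of conversion, written in Python rather than bash and assuming US (0x1F) for fields and RS (0x1E) for records (the comment above says "FS", which may mean either the ASCII File Separator or simply "field separator"). The function name `csv_to_separated` is hypothetical.

```python
import csv
import io

US = "\x1f"  # between fields
RS = "\x1e"  # between records

def csv_to_separated(csv_text):
    """Parse CSV (quotes, embedded commas and newlines) and re-frame it with RS/US."""
    # newline="" leaves line endings untouched so the csv module can handle them itself.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    framed_rows = []
    for row in reader:
        for field in row:
            if US in field or RS in field:
                raise ValueError("field already contains a separator character")
        framed_rows.append(US.join(row))
    return RS.join(framed_rows)

sample = 'name,note\r\n"Smith, J.","two\nlines"\r\n'
framed = csv_to_separated(sample)
```

After this step every later stage can split on single characters and never has to think about quoting again.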

Ekaros4d ago

Line feed resetting position really makes no sense. It should just continue text from where the cursor was but on the next line. Like a staircase. You need CR to go back to the start.
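
To make the staircase concrete, here is a toy model of a bare teletype in which LF only moves down and CR only moves left. The `render` function is invented for this illustration, and it simplifies one thing: a real printer would overstrike characters, while this sketch overwrites them.

```python
def render(stream):
    """Render text as a device would if LF only moved down and CR only moved left."""
    lines = [[]]          # each line is a list of characters
    row, col = 0, 0
    for ch in stream:
        if ch == "\r":            # carriage return: column 0, same line
            col = 0
        elif ch == "\n":          # line feed: next line, column unchanged
            row += 1
            if row == len(lines):
                lines.append([])
        else:
            line = lines[row]
            while len(line) <= col:   # pad with spaces up to the print position
                line.append(" ")
            line[col] = ch            # simplification: overwrite instead of overstrike
            col += 1
    return ["".join(line) for line in lines]

staircase = render("ab\ncd\nef")   # LF alone
aligned = render("ab\r\ncd\r\nef") # CR followed by LF
```

With LF alone each line starts where the previous one ended; with CR LF the lines start at the left margin.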

Parodper4d ago

Yes, if you're talking to a terminal. But an on-disk file doesn't have a carriage to return.

ryandrake4d ago

Modern computer text output devices don’t have a “carriage” or a “feed” mechanism. I’d argue both CR and LF are legacy, anachronistic characters whose purpose was too device specific to make sense as a text encoding.

brewmarche4d ago

Now with Unicode we actually have even more:

NL Next line (from EBCDIC?)

LS Line separator (invented by Unicode)

PS Paragraph separator (same)

The Unicode standard says that in addition to CR, LF, CRLF and the above, vertical tabs and form feeds should also be treated as line separators.
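
Python's `str.splitlines()` is one place where this broader set is easy to observe. The sketch below checks each boundary listed above (with "NL" taken to be U+0085); the dictionary and helper names are made up for the example.

```python
BOUNDARIES = {
    "LF": "\n",
    "CR": "\r",
    "CRLF": "\r\n",
    "VT": "\x0b",
    "FF": "\x0c",
    "NEL": "\x85",    # U+0085 Next Line
    "LS": "\u2028",   # U+2028 Line Separator
    "PS": "\u2029",   # U+2029 Paragraph Separator
}

def count_lines(sep):
    """Number of lines str.splitlines() sees when sep sits between two words."""
    return len(f"left{sep}right".splitlines())

results = {name: count_lines(sep) for name, sep in BOUNDARIES.items()}

# bytes.splitlines() is narrower: it only recognises LF, CR and CRLF.
vt_in_bytes = b"left\x0bright".splitlines()
```

Relevant to the rest of this thread: `str.splitlines()` also treats FS, GS and RS (0x1C to 0x1E) as line boundaries, but not US (0x1F), so data framed with those separators should be split explicitly rather than with `splitlines()`.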

codedokode4d ago

> non UTF-8 encoded strings on input/output

I would just use UTF-8 everywhere.
