Avernar8y ago

What do you mean by "expose UTF-8"? Because nothing about UTF-8 requires that you give byte access to the string.

As for indexing, strings shouldn't require indexing, period. That's the ASCII way of thinking, especially with fixed-width columns and such. You should be thinking relatively. For example: find the first space, then, from that point in the string, require that the next character be a letter. When you build your code that way you don't fall into the trap of byte indexing, or take the performance hit of code point indexing (UTF-8) or grapheme indexing (all encodings).
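A minimal Python sketch of the "think relatively" idea above: walk the string once, stop at the first space, and test whatever comes next. The name `letter_follows_first_space` is made up for illustration, and the walk is over code points, not graphemes.

```python
def letter_follows_first_space(text: str) -> bool:
    """True if the character right after the first space is a letter."""
    chars = iter(text)
    for ch in chars:
        if ch == " ":
            # The iterator now sits just past the space; no index is needed.
            following = next(chars, None)
            return following is not None and following.isalpha()
    return False  # no space at all
```

No position is ever computed; the iterator itself remembers where the scan stopped.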


ubernostrum8y ago

There are real-world textual data types for which your idealized approach simply does not work. As in, it would be impossible or impossibly unwieldy to validate conformance to the type using your approach, because they require indexing to specific locations, or determining length, or both.

For example, I work for a company that does business in the (US) Medicare space. Every Medicare beneficiary has a HICN -- Health Insurance Claim Number -- and HICNs come in different types which need to be identified. Want to know how to identify them? By looking at prefix and suffix characters in specific positions, and the length of what comes between them. For example, the prefix 'A' followed by six digits means the person identified is the primary beneficiary and was first covered under the Railroad Retirement Board benefit program. Doing this without indexing and length operations is madness.
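For illustration, the single rule quoted above (prefix 'A' followed by six digits) can be written with plain indexing and length checks. This is only a sketch of that one rule; the function name is hypothetical and real HICN validation involves many more formats than this.

```python
ASCII_DIGITS = frozenset("0123456789")

def is_a_prefix_six_digits(hicn: str) -> bool:
    """Check the one example rule: 'A' followed by exactly six ASCII digits."""
    return (
        len(hicn) == 7
        and hicn[0] == "A"
        # Compare against ASCII digits explicitly; str.isdigit() would also
        # accept digits from other scripts.
        and all(ch in ASCII_DIGITS for ch in hicn[1:])
    )
```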

These data types can and should be subjected first to some basic checks to ensure they're not nonsense (i.e., something expected to be a numeric value probably should not contain Linear B code points), and it's probably a good idea to at least throw a regex at it first, but then applying regex to Unicode also has quirks people don't often expect at first...
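One such regex quirk, shown in Python as an example: on `str` patterns, `\d` matches any Unicode decimal digit unless the `re.ASCII` flag is given, so a "digits only" check can pass input that is not 0-9.

```python
import re

arabic_indic = "\u0661\u0662\u0663"  # ARABIC-INDIC DIGITS ONE, TWO, THREE

# Default Unicode matching: \d accepts decimal digits from any script.
unicode_match = re.fullmatch(r"\d+", arabic_indic)

# With re.ASCII, \d is restricted to [0-9].
ascii_match = re.fullmatch(r"\d+", arabic_indic, flags=re.ASCII)

print(unicode_match is not None, ascii_match is not None)
```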

AvernarOP8y ago

I don't see why this would be hard with iterators. You have an iterator to the start of the HICN, either at the start of a string or deep inside it. Take a second iterator and set it equal to the first. Loop six times, advancing that iterator and checking that each character is a digit. Then check whether the next position is a space.

For the prefix and suffix and how many characters between them you do the above but use the second iterator to find the suffix. Then you either keep track of how many characters you advanced or ask for how many characters between the two.

It's very easy to think about it this way, as that's how a normal (non-programmer) human would do it. Basically the code literally does what you wrote in English above.
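A rough Python rendering of the iterator approach described above, assuming the iterator is already positioned at the start of the candidate HICN. The function name is hypothetical, and treating end-of-input as equivalent to a trailing space is my assumption, not something stated in the thread.

```python
from itertools import islice

ASCII_DIGITS = frozenset("0123456789")

def matches_a_plus_six_digits(chars) -> bool:
    """Consume from an iterator: 'A', six digits, then a space or end of input."""
    it = iter(chars)
    if next(it, None) != "A":
        return False
    digits = list(islice(it, 6))  # advance six positions
    if len(digits) != 6 or not all(d in ASCII_DIGITS for d in digits):
        return False
    following = next(it, None)
    return following is None or following == " "
```

Because it only calls `next`, the same function works on any character source, whether that is a `str`, a decoder over UTF-8 bytes, or a grapheme iterator.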

My point is that iterators are much faster than indexing when the underlying string system uses graphemes. You can do pretty much anything just as easily, or more easily, with iterators than with indexing. The big exception is fixed-width columnar text files. I've seen a lot of these in financial situations, but fortunately those systems are ASCII-based so it's not an issue.

ubernostrum8y ago

You're not really changing anything, though; you're basically saying that instead of indexing to position N, you're going to take an iterator and advance it N positions, and somehow say that's a completely different operation. It isn't a different operation, and doesn't change anything about what you're doing.

If you want to argue that there should be ways to iterate over graphemes and index based on graphemes, then that is a genuine difference, but splitting semantic hairs over whether you're indexing or iterating doesn't get you a solution.

AvernarOP8y ago

If the string is stored as ASCII characters or fixed-width Unicode code points (UCS-2 or UCS-4) then you are correct that not much changes. But if the string is in UTF-8 or UTF-16, or the string system uses graphemes, then indexing goes from O(1) to O(N). Every index operation would have to start a linear scan from the beginning of the string to get to the correct spot. With an iterator it is a quick operation to access what it's pointing to, and very quick to advance it.
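To make the cost difference concrete, here is a small sketch over raw UTF-8 bytes. Both function names are made up, and the input is assumed to be valid UTF-8. `iter_codepoints` advances in one pass; `codepoint_at` has to rescan from the start on every call.

```python
def iter_codepoints(utf8: bytes):
    """Yield each code point of valid UTF-8 input in a single pass."""
    start = 0
    for i in range(1, len(utf8) + 1):
        # A byte of the form 10xxxxxx continues the current code point;
        # anything else (or end of input) closes it.
        if i == len(utf8) or utf8[i] & 0xC0 != 0x80:
            yield utf8[start:i].decode("utf-8")
            start = i

def codepoint_at(utf8: bytes, n: int) -> str:
    """Index by code point: a linear scan from the beginning, O(n) per call."""
    if n < 0:
        raise IndexError(n)
    for i, cp in enumerate(iter_codepoints(utf8)):
        if i == n:
            return cp
    raise IndexError(n)
```

Visiting every code point with `codepoint_at(data, 0)`, `codepoint_at(data, 1)`, and so on is quadratic overall, while a single `for cp in iter_codepoints(data)` loop is linear.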

My argument is that iterators are far superior to indexing when using graphemes (or code points stored as UTF-8 but grapheme support is superior). And they don't hurt when used on ASCII or fixed width strings either so the code will work with either string format. No hairs, split or otherwise here.


mjevans8y ago

int_19h's approach is still valid for this; you're asking for whole displayed characters, each composed of some (you don't need to know) number of bits in memory across several units of the memory segment(s) that hold the string.

Based on your description, the correct solution is probably to use a structure or class of a more regular format to store the decoded HICN in pre-broken form. If they really only allow numbers in runs of text you might save space and speed comparison/indexing by doing this.
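A sketch of what such a pre-broken form might look like in Python, covering only the "prefix 'A' plus six digits" example from earlier in the thread. The class name, field names, and parse function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ParsedHicn:
    prefix: str   # the leading type letter
    number: int   # the digit run, stored as an integer

def parse_prefixed_hicn(raw: str) -> Optional[ParsedHicn]:
    """Decode once at the boundary; return None if the shape is wrong."""
    prefix, digits = raw[:1], raw[1:]
    if prefix != "A" or len(digits) != 6:
        return None
    if any(d not in "0123456789" for d in digits):
        return None
    return ParsedHicn(prefix=prefix, number=int(digits))
```

After parsing, comparisons and lookups work on the structured value and never touch string positions again. Since the digit run has a known width, leading zeros can be restored when formatting.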

ubernostrum8y ago

It's more that I get tired of people declaring that indexing and length operations need to be completely and utterly and permanently forbidden and removed, and then proposing that they be replaced by operations which are equivalent to indexing and length operations.

Doing these operations on sequences of code points can be perfectly safe and correct, and in 99.99%+ of real-world cases probably will be perfectly safe and correct. My preference is for people to know what the rare failure cases are, and to teach how to watch out for and handle those cases, while the other approach is to forbid the 99.99% case to shut down the risk of mishandling the 0.01% case.
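One example of the rare failure case, in Python: the same visible character can be one code point or two depending on normalization, so code point length and indexing give different answers for text that looks identical.

```python
import unicodedata

composed = unicodedata.normalize("NFC", "\u00e9")    # 'é' as one code point
decomposed = unicodedata.normalize("NFD", "\u00e9")  # 'e' + COMBINING ACUTE ACCENT

print(len(composed), len(decomposed))
# Indexing the decomposed form splits the accent from its base letter.
base_only = decomposed[0]
```

Normalizing to a single form before length or position checks is one common way to handle this case.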

mjevans8y ago

When people say they should be removed they mean primitive operations (like a standard 'length' attribute/function, or an array index operator) shouldn't exist for that type.

Just like it is better to have something like .nth(X) as a function for stepping to a numbered node, so too does a language's string type demand operations like .nth_printing(X), .nth_rune(X), and .nth_octet(X), to make it clear to any programmer working with that code what the intent is.
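A toy Python wrapper showing what such explicitly named accessors could look like. The class `Text` is hypothetical. Note the big assumption in `nth_printing`: it approximates a "printing character" as a base code point plus any following combining marks, which is far short of real grapheme cluster segmentation (Unicode UAX #29).

```python
import unicodedata

def _approx_clusters(s: str):
    """Group each base code point with the combining marks that follow it."""
    cluster = ""
    for ch in s:
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ch
        else:
            cluster += ch
    if cluster:
        yield cluster

def _nth(items, n: int):
    """Step to the n-th item of an iterable; there is no O(1) shortcut."""
    for i, item in enumerate(items):
        if i == n:
            return item
    raise IndexError(n)

class Text:
    def __init__(self, s: str):
        self._s = s

    def nth_octet(self, n: int) -> int:
        return self._s.encode("utf-8")[n]   # n-th byte of the UTF-8 encoding

    def nth_rune(self, n: int) -> str:
        return _nth(self._s, n)             # n-th code point

    def nth_printing(self, n: int) -> str:
        return _nth(_approx_clusters(self._s), n)  # n-th approximate cluster
```

With `Text("e\u0301x")`, position 1 is the byte `0xCC`, the code point U+0301, or the letter `x`, depending on which accessor the caller names.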

AvernarOP8y ago

Semantically equivalent, yes; equivalent in access time for variable-width strings, no. One of the reasons for Python 3's odd internal string format is that they wanted to keep indexing and have indexing be O(1). The reason I favor replacing indexing with iterators is that it removes this restriction: they could have made the internal format UTF-8 and/or easily added support for graphemes.
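The format in question can be observed from Python itself. This sketch assumes CPython 3.3 or later, where (per PEP 393) a string is stored with 1, 2, or 4 bytes per code point depending on the widest code point it contains; exact sizes are implementation details, so only their ordering is relied on here.

```python
import sys

ascii_text = "a" * 100            # fits in 1 byte per code point
bmp_text = "\u20ac" * 100         # EURO SIGN needs 2 bytes per code point
astral_text = "\U0001F600" * 100  # outside the BMP, needs 4 bytes per code point

sizes = [sys.getsizeof(s) for s in (ascii_text, bmp_text, astral_text)]
print(sizes)

# Indexing stays O(1) and always returns a whole code point.
middle = astral_text[50]
```

The fixed width per string is what keeps indexing constant-time, at the cost of up to four bytes for every character once a single wide one appears.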

I prefer to have a system where 100% of the cases are valid and teaching people corner cases is not required. We all know how well teaching people about surrogate pairs went. And we're not forbidding the 99.99% case but providing an alternative way to accomplish the exact same thing. The vast majority of code uses index variables as a form of iterator anyways so it's not that big of a change.

The main reason people keep clinging to indexing strings is that's all they know. Most high level languages don't provide another way of doing it. People who program in C quickly switch from indexing to pointers into strings. Give a C programmer an iterator into strings and they'll easily handle it.

int_19h8y ago

By "expose UTF-8" I mean exposing the underlying UTF-8 representation of the string directly on the object itself, instead of going through a separate byte array (or byte array view, to avoid copying).

AvernarOP8y ago

Ah, I see. I agree that it would be a bad idea to give access to the UTF-8 representation.

As for length in bytes, a good way to handle most use cases regarding that is to have a function that truncates the string to fit into a certain number of bytes. That way you can make sure it fits into whatever fixed buffer and the truncation would happen on a grapheme level.
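A sketch of such a truncation function in Python. The name `truncate_to_bytes` is hypothetical, and the "grapheme" here is only approximated as a base code point plus trailing combining marks; a real implementation would use full Unicode grapheme cluster rules.

```python
import unicodedata

def approx_clusters(text: str):
    """Group each base code point with the combining marks that follow it."""
    cluster = ""
    for ch in text:
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ch
        else:
            cluster += ch
    if cluster:
        yield cluster

def truncate_to_bytes(text: str, max_bytes: int) -> str:
    """Keep whole clusters while the UTF-8 encoding still fits in max_bytes."""
    kept = []
    used = 0
    for cluster in approx_clusters(text):
        size = len(cluster.encode("utf-8"))
        if used + size > max_bytes:
            break  # never cut inside a cluster
        kept.append(cluster)
        used += size
    return "".join(kept)
```

The caller gets a string guaranteed to fit the buffer without ever asking for the byte length or slicing at a byte offset.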
