- A ULID is "sortable". But the whole point of a UUID is a random, unique ID. Non-guessable. Sortable is not a feature
you normally want from a UUID. UUIDs are still sortable, but the resulting order doesn't have any meaning.
- A ULID also encodes time; a UUID doesn't.
- A UUID has less chance of clashing: it's 128 bits vs 80 bits for a ULID.
- A UUID is also URL safe.
- A ULID is case insensitive. I don't see how this is an advantage.
As for less chance of clashing, 128 vs 80 bits: the 80 bits only apply within the same millisecond, while the 128 bits have to last for the lifetime of your application. Based on my basic understanding of the birthday paradox, there is roughly a 50% chance of a collision once you have on the order of 2^40 ULIDs generated during the same millisecond, and roughly a 50% chance of a collision once you have on the order of 2^64 UUIDs generated during the lifetime of the application.
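To put numbers on the birthday estimate, here is a minimal sketch using the standard approximation p ≈ 1 - exp(-n(n-1) / 2N) for n random values drawn from a space of size N. The function names are mine, and it assumes all bits are uniformly random (80 for the ULID random part, 128 as the figure quoted above for a UUID).

```python
import math

def collision_probability(n: int, bits: int) -> float:
    """Approximate chance of at least one collision among n random `bits`-bit values."""
    space = 2.0 ** bits
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-n * (n - 1) / (2.0 * space))

def n_for_half(bits: int) -> float:
    """Approximate number of values needed for a 50% collision chance."""
    # Solve exp(-n^2 / 2N) = 1/2  ->  n = sqrt(2 * ln 2 * N)
    return math.sqrt(2.0 * math.log(2.0) * 2.0 ** bits)

if __name__ == "__main__":
    print(collision_probability(2**40, 80))   # same millisecond, ULID random part
    print(collision_probability(2**64, 128))  # application lifetime, 128 random bits
    print(n_for_half(80) / 2**40)             # how far past 2^40 the 50% point sits
```

Under this approximation, exactly 2^40 values in an 80-bit space gives about 39%, and the 50% point sits at roughly 1.18 × 2^40, so "2^40" is the right order of magnitude rather than the exact threshold.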
Case insensitivity and the chosen character set are an advantage if you want to use this ID as a filename without having to worry about the limits of the filesystem.
It's helpful if you've ever had to read something like a ULID over the phone. Base64 isn't fun in that situation.
Actually... that's 80 bits for the "random" part of the ULID, which only disambiguates within the same millisecond.
Why so? As the name says they just need to be universally unique. Even relying purely on randomness seems bad, because you are at the mercy of the entropy source. Mess that up and you can end up with nasty problems throughout your system.
Can anyone please elaborate? I don't get the confusion or abuse potential.
I, L, and O can be confused with the digits 1, 1, and 0, respectively.
He omits U to avoid "accidental obscenity."
As for U, I don't know about that, but I recently stumbled upon the issue that Google's OCR software seems to incorrectly recognize U as II relatively often.
As for the 'U', I'm not sure why the original ULID spec left it out. Thanks for raising this. I'll investigate.
See http://www.crockford.com/wrmg/base32.html, Ctrl-F, search for obscenity.
Edit: Interestingly, since it already omits the letter I, omitting U as well covers all of George Carlin's 7 Dirty Words: https://en.wikipedia.org/wiki/Seven_dirty_words
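For anyone who wants to see what the I/L/O handling looks like in practice, here is a rough sketch of a forgiving decoder in the spirit of the Crockford page linked above: input is case insensitive, I and L are read as 1, O is read as 0, hyphens are ignored, and U is rejected. The `normalize` helper is my own name, not something from the ULID spec.

```python
# Crockford-style base32 alphabet: digits plus letters, without I, L, O, U.
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

# Map every accepted input character to its 5-bit value.
DECODE = {ch: value for value, ch in enumerate(ALPHABET)}
DECODE.update({"I": 1, "L": 1, "O": 0})  # look-alikes fold onto 1 and 0

def normalize(text: str) -> str:
    """Return the canonical spelling of a hand-typed or dictated ID."""
    out = []
    for ch in text.upper():
        if ch == "-":  # hyphens are only there for readability
            continue
        if ch not in DECODE:
            raise ValueError(f"invalid character: {ch!r}")
        out.append(ALPHABET[DECODE[ch]])
    return "".join(out)

if __name__ == "__main__":
    print(normalize("o1lI-abc"))  # 0111ABC
```

The point is that the person typing never has to know whether they saw a 1, an I, or an l; all three end up as the same symbol.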
Sibling posts explain why better. This isn't something that you forget once it has bitten you.
We came to the conclusion that universal ids should represent identity only, and explicitly not have 'metadata'. What is the use of 48 bits of time? It reduces the overall entropy, for what? If time is important then why not make the id literally time (i.e. UnixNano), and if it isn't then why not make all bits random?
Also, while I think speak-ability is actually very important (many disagree), I'd assert that capitalization is better addressed from the opposite direction (i.e. UI). Instead of removing capital letters from the ID itself affecting all cases, address the human factors in the few cases it comes up.
I think this is good advice in general:
When spoken aloud we suggest that you don't indicate capitalization at first. In many cases, such as search, human validation etc, this is more than enough precision, then one can add capitalization for disambiguation as needed.
For example:
"ab2Cd3Ef1g"
Spoken becomes: "a b two c d three e f one g. Capitalize 4 and 7, c and e"
With the capitalization part optional depending on context.
For some designs it's useful to have identifiers have other properties than uniqueness. In this case, this property is relative lexicographical (and binary) order based on time so that you can leverage the order between the things the identifiers identify without looking at the things. The entropy is there to satisfy the uniqueness property (with some acceptable degree of collision, application dependent). The time is there to satisfy the ordering property.
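A minimal sketch of how that ordering property falls out of the layout, assuming the usual ULID shape of a 48-bit millisecond timestamp followed by 80 random bits, written as 26 base32 characters. The function names are mine and this is an illustration, not a reference implementation.

```python
import os
import time

# Characters are in ascending ASCII order, so string order matches numeric order.
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def encode_ulid(timestamp_ms: int, randomness: bytes) -> str:
    """Pack a 48-bit timestamp and 80 random bits into 26 base32 characters."""
    if not 0 <= timestamp_ms < 2**48:
        raise ValueError("timestamp must fit in 48 bits")
    if len(randomness) != 10:
        raise ValueError("randomness must be exactly 10 bytes (80 bits)")
    value = (timestamp_ms << 80) | int.from_bytes(randomness, "big")
    chars = []
    for _ in range(26):  # 26 * 5 = 130 bits, the top two are always zero
        chars.append(ALPHABET[value & 31])
        value >>= 5
    return "".join(reversed(chars))

def new_ulid() -> str:
    """Generate an ID from the current wall clock and the OS entropy source."""
    return encode_ulid(time.time_ns() // 1_000_000, os.urandom(10))

if __name__ == "__main__":
    earlier = encode_ulid(1000, b"\xff" * 10)
    later = encode_ulid(1001, b"\x00" * 10)
    print(earlier < later)  # True: the timestamp dominates the comparison
    print(new_ulid())
```

Because the timestamp occupies the most significant bits, an ID from a later millisecond always sorts after one from an earlier millisecond, regardless of the random part. Within the same millisecond the order is just whatever the random bits happen to give.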
Some applications may be able to tolerate inconsistencies in ordering, others may not. Are IDs being generated on multiple machines? Are they in sync? What happens if the system clock is adjusted, or a container/VM is restarted on different hardware?
This design implies that these IDs are being generated in different locations, but this usage leads to the least reliable time. How many bits of approximate time does one really need? Surely not 48.
On the other hand if you generate the IDs in the most reliable model, a single host with persistent storage to prevent regression, you've basically made an unnecessarily complicated vector clock. A simple incremental counter would work at least as well, and be far simpler.
As you said, don't include information beyond identity, for the simple reason that nothing is guaranteed to never change; any information you put in there may become outdated. One may think you can surely put the birth country of the customer into their customer number, but what if the customer made a mistake when signing up?
It may be useful to have some information in identifiers, for example a prefix C or O to indicate a customer or order number, but that can be implicit most of the time and just be added in the user interface or when printing labels.
To guarantee global uniqueness one has essentially two choices: enough randomness, or unique identifiers for space and time. The first option requires no coordination, no synchronizing of clocks and no making sure every process has a unique identifier; the second option allows you to have locality in identifiers generated at similar times or places.
Locality can be useful when identifiers are used as keys in databases or other data structures because it may avoid having every insert happen at a random place. On the other hand this may easily leak information, whether the number of your customers or orders per month or the number of servers you are using, especially if you are using counters, see further down.
One can use identifiers with locality internally but encrypt them before exposing them to the outside, but this comes with an entirely new set of problems. One may have to include initialization vectors in the identifiers, possibly making them longer. One may have to think about key rotation, at least in case keys get compromised, possibly making identifiers even longer because they have to include key identifiers. One will have to do key management anyway, and encryption of course costs clock cycles.
If one wants to include a time component, one can choose between actual time and counters. If one uses actual time, one has to ensure the clock never goes backwards and may have to account for limited resolution, possibly falling back to counters or randomness when identifiers are produced faster than the clock ticks. If one uses counters, one has to ensure they never run backwards, for example by persisting them across application restarts, including crashes.
If several processes share counters, synchronization overhead between cores may become an issue in high performance scenarios. Modern processors offer high frequency hardware timestamp counters that can be synchronized across cores. This may be especially interesting if the timestamps are also used for purposes like operation or transaction ordering.
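As a sketch of the "clock never goes backwards, fall back to a counter" idea for a single process: keep the last timestamp handed out, and if the clock stalls or regresses, reuse that timestamp and bump a counter instead. The class name and the injectable `clock` parameter are hypothetical, and this deliberately ignores the harder parts mentioned above (persistence across restarts, multiple hosts).

```python
import threading
import time
from typing import Callable, Optional, Tuple

class MonotonicStamp:
    """Hands out (timestamp_ms, counter) pairs that never decrease."""

    def __init__(self, clock: Optional[Callable[[], int]] = None) -> None:
        # Default clock: wall time in milliseconds. Injectable for testing.
        self._clock = clock or (lambda: time.time_ns() // 1_000_000)
        self._last_ms = -1
        self._counter = 0
        self._lock = threading.Lock()

    def next(self) -> Tuple[int, int]:
        with self._lock:
            now = self._clock()
            if now > self._last_ms:
                # Clock advanced: adopt the new time and reset the counter.
                self._last_ms = now
                self._counter = 0
            else:
                # Same tick, or the clock went backwards: keep the old
                # timestamp and disambiguate with the counter.
                self._counter += 1
            return self._last_ms, self._counter

if __name__ == "__main__":
    ticks = iter([5, 5, 3, 6])  # simulated clock with a stall and a regression
    gen = MonotonicStamp(clock=lambda: next(ticks))
    print([gen.next() for _ in range(4)])  # [(5, 0), (5, 1), (5, 2), (6, 0)]
```

Note the lock: this is exactly the kind of shared state that turns into synchronization overhead once many threads or processes hammer the same generator.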
Identifiers should be readable if only for debugging by developers. Similar characters like O, 0, 1, I, l, U and V should be avoided. One can consider a checksum or even error correcting codes for user facing identifiers.
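A checksum for user-facing identifiers can be quite small. Here is a sketch modeled on the optional check symbol described on the Crockford base32 page linked elsewhere in this thread: the value modulo 37 is appended as one extra symbol, drawn from the normal alphabet plus five check-only symbols. Helper names are mine, and I'm going from my reading of that page, so treat the details as an assumption.

```python
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
# 37 check symbols: the 32 regular ones plus five used only for the check.
CHECK_SYMBOLS = ALPHABET + "*~$=U"

def encode(n: int) -> str:
    """Encode a non-negative integer in base32 without padding."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    chars = []
    while n:
        chars.append(ALPHABET[n & 31])
        n >>= 5
    return "".join(reversed(chars))

def with_check(n: int) -> str:
    """Encode n and append a check symbol for n mod 37."""
    return encode(n) + CHECK_SYMBOLS[n % 37]

def verify(text: str) -> bool:
    """Return True if the trailing check symbol matches the body."""
    if len(text) < 2:
        return False
    body, check = text[:-1], text[-1]
    n = 0
    for ch in body:
        if ch not in ALPHABET:
            return False
        n = n * 32 + ALPHABET.index(ch)
    return CHECK_SYMBOLS[n % 37] == check

if __name__ == "__main__":
    print(with_check(1234))  # 16JD
    print(verify("16JD"), verify("16KD"), verify("61JD"))  # True False False
```

One extra character is enough to catch a single mistyped symbol or two neighbouring symbols swapped, which covers most human transcription errors.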
Capitalization is a tradeoff between length and ease of use, but I like your idea of dealing with it behind the scenes. One probably loses on average less than one bit per character when ignoring capitalization, so that for sufficiently long identifiers collisions are still very improbable, at least if one does not use identifiers with a lot of locality. But I fear users will just tend to include the capitalization because they are not aware it can probably be ignored. It surely adds complexity to the application, if only to the user interface, because you always have to be able to handle collisions.
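A quick back-of-the-envelope check of the "less than one bit per character" guess, assuming the identifiers use a 62-symbol alphabet (a-z, A-Z, 0-9) with every symbol equally likely, and that ignoring case folds it down to 36 symbols. The alphabet sizes are my assumption; the thread does not specify them.

```python
import math

def bits_per_char(alphabet_size: int) -> float:
    """Entropy per character for a uniformly random symbol."""
    return math.log2(alphabet_size)

case_sensitive = bits_per_char(62)  # a-z, A-Z, 0-9
case_folded = bits_per_char(36)     # letters folded to one case, plus digits
loss = case_sensitive - case_folded

if __name__ == "__main__":
    print(f"{case_sensitive:.2f} bits -> {case_folded:.2f} bits, loss {loss:.2f} per character")
    # For a 10-character ID such as "ab2Cd3Ef1g":
    print(f"{10 * case_sensitive:.1f} bits vs {10 * case_folded:.1f} bits when case is ignored")
```

Under those assumptions the loss is a bit under 0.8 bits per character, so a case-blind lookup on a 10-character ID still has a little over 51 bits to work with.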
.. and it appears to be working
It took me a little more than 5 minutes, so it wasn't that hard after all. It appears to run at only about 35% the speed of the Go version, but hopefully there's still room for some little improvement.
If it's part of a spec, then it's simpler, but it's still a potential point of surprise.
[1] https://connect.microsoft.com/SQLServer/feedback/details/475...