Universally Unique Lexicographically Sortable Identifier in Go (opens in new tab)

(github.com)

89 pointstsenart9y ago52 comments

52 comments

37 comments · 7 top-level

TeeWEE9y ago· 9 in thread

Note. You cant compare ULID with UUID's

  - An ulid is "sortable". But the whole point of an UUID is a random unique ID. Non guessable. Sortable is not a feature 
    you normally want from an uuid. And still UUIDs are still sortable. But it doesnt have any meaning. 
  - An ulid also encodes Time, an uuid doesn't. 
  - An uuid has less change of clashing: Its 128 bit vs 80 bit for ulid.
  - An uuid is also url safe.
  - An ulid is case insensitive. I don't see how this is an advantage.

_ak9y ago

You just compared UUIDs with ULIDs.

cloudhead9y ago

Maybe he means programatically?

1 more reply

sp3329y ago

That's contrasting, not comparing.

1 more reply

clinta9y ago

If you sort a list of ULIDs they sorted by time of creation, with random order only among ULIDs created in the same millisecond. As far as guessable, you still have 80 random bits. Though I'll admit I'm not qualified to comment on whether the difference between 80 and 128 bits is significant when you're trying to make sure your identifiers are not guessable.

As for less chance of clashing, 128 vs 80 bits. 80 bits is only within the same millisecond. The 128 bits is for the lifetime of your application. Based on my basic understanding of the birthday paradox, there is a 50% chance of collision if you have 2^40 ULIDs generated during the same millisecond. And there is a 50% chance of collision if you have 2^64 UUIDs generated during the lifetime of the application.

Case insensitivity and the chosen character set is an advantage if you want to use this ID as a filename without having to worry about the limits of the filesystem.

tyingq9y ago

>An ulid is case insensitive. I don't see how this is an advantage.

It's helpful if you've ever had to read something like a ULID over the phone. Base64 isn't fun in that situation.

jdmichal9y ago

I've never seen a UUID in base-64 instead of hex...

1 more reply

vertex-four9y ago

> An uuid has less change of clashing: Its 128 bit vs 80 bit for ulid.

Actually... that's 80 bits for the "random" part of the ULID, which only disambiguates within the same millisecond.

tsenartOP9y ago

ULIDs are meant to be compared with Time UUIDs in particular. See: https://www.famkruithof.net/guid-uuid-timebased.html

foobarian9y ago

> But the whole point of an UUID is a random unique ID. Non guessable. Sortable is not a feature you normally want from an uuid.

Why so? As the name says they just need to be universally unique. Even relying purely on randomness seems bad, because you are at the mercy of the entropy source. Mess that up and you can end up with nasty problems throughout your system.

cfv9y ago· 7 in thread

> This alphabet excludes the letters I, L, O, and U to avoid confusion and abuse.

Can anyone please ellaborate? I don't get the confusion or abuse potential

pradn9y ago

Douglas Crockford goes into why he chose those 4 letters in this page: http://www.crockford.com/wrmg/base32.html.

I, L, and O can be confused with the digits 1, 1, and 0, respectively.

He omits U to avoid "accidental obscenity."

_ak9y ago

Not the author, but presumably because 1, I and l are visually hard to distinguish in some fonts. Same goes for O and 0.

As for U, I don't know about that, but I recently stumbled upon the issue that Google's OCR software seems to incorrectly recognize U as II relatively often.

cloudhead9y ago

I'm guessing the following pairs can be confused visually: O/0, I/l, U/V..

tsenartOP9y ago

Depending on the font used, 'I' and 'L' can be easily confused by humans. 'O' can be read as '0' too. This is meant to prevent that.

As for the 'U', I'm not sure why the original ULID spec left it out. Thanks for raising this. I'll investigate.

tyingq9y ago

Typically, 'U' is omitted to avoid accidentally putting the word FUCK into the generated output. No, not kidding.

See http://www.crockford.com/wrmg/base32.html, Ctrl-F, search for obscenity.

Edit: Interestingly, since it already omits the letter I, adding U covers all of George Carlins' 7 Dirty Words: https://en.wikipedia.org/wiki/Seven_dirty_words

1 more reply

Guvante9y ago

Well you need to leave 4 characters out to get to 32, so unless there is a better character the selection is arbitrary.

taeric9y ago

I think this is a problem that was accented quite well for people that ever had to use passwords in video games.

Sibling posts explain why better. This isn't something that you forget once it has bitten you.

joshuak9y ago· 4 in thread

I've been working on a closely related problem of universal identification in distributed computing for a few years now. It's now in standards review and hopefully publishable soon.

We came to the conclusion that universal ids should represent identity only, and explicitly not have 'metadata'. What is the use of 48 bites of time? It reduces the overall entropy, for what? If time is important then why not make the id literally time (i.e. UnixNano), if it isn't they why not make all bits rand?

Also, while I think speak-ability is actually very important (many disagree), I'd assert that capitalization is better addressed from the opposite direction (i.e. UI). Instead of removing capital letters from the ID itself affecting all cases, address the human factors in the few cases it comes up.

I think this is good advice in general:

When spoken aloud we suggest that you don't indicate capitalization at first. In many cases, such as search, human validation etc, this is more than enough precision, then one can add capitalization for disambiguation as needed.

For example:

"ab2Cd3Ef1g"

Spoken becomes: "a b two c d three e f one g. Capitalize 4 and 7, c and e"

With the capitalization part optional depending on context.

tsenartOP9y ago

> What is the use of 48 bites of time? It reduces the overall entropy, for what? If time is important then why not make the id literally time (i.e. UnixNano), if it isn't they why not make all bits rand?

For some designs it's useful to have identifiers have other properties than uniqueness. In this case, this property is relative lexicographical (and binary) order based on time so that you can leverage the order between the things the identifiers identify without looking at the things. The entropy is there to satisfy the uniqueness property (with some acceptable degree of collision, application dependent). The time is there to satisfy the ordering property.

joshuak9y ago

Yes very good point. However, as @danbruc rightly points out this raises all sorts of other concerns. A user of these IDs my not realize that the reliability of the ordering can be substantially reduced depending on where the IDs are generated.

Some applications may be able to tolerate inconsistencies in ordering, others may not. Are IDs being generated on multiple machines? Are they in sync? What happens if the system clock is adjusted, or a container/VM is restarted on different hardware?

This design implies that these IDs are being generated in different locations, but this usage leads to the least reliable time. How many bits of approximate time does one really need? Not 48 surly.

On the other hand if you generate the IDs in the most reliable model, a single host with persistent storage to prevent regression, you've basically made an unnecessarily complicated vector clock. A simple incremental counter would work at least as well, and be far simpler.

1 more reply

Glyptodon9y ago

I'm not sure how I feel about time, but I feel like case sensitivity is asking for trouble if you're expecting anything to be vocalized/shorthand-written, particularly without knowing individual use cases. It also seems like the sort of thing that might occur so infrequently nobody remembers to cover the "fringe" case, particularly for things a couple layers removed, assuming the set size is as huge as it seems intuitively.

danbruc9y ago

Having also spent some time on the topic, let me share my thoughts.

As you said, don't include information beyond identity for the simple reason that nothing is guaranteed to never change, every information you put in there may become outdated. One may think you can surly put the birth country of the customer into its customer number, but what if the customer made a mistake when signing up?

It may be useful to have some information in identifiers, for example a prefix C or O to indicate a customer or order number, but that can be implicit most of the time and just be added in the user interface or when printing labels.

To guarantee global uniqueness one has essentially two choices, enough randomness or unique identifiers for space and time. The first option requires no coordination, not synchronizing clocks and not making sure every process has a unique identifier, the second option allows you to have locality in identifiers generated at similar times or places.

Locality can be useful when identifiers are used as keys in databases or other data structures because it may avoid having every insert happen at a random place. On the other hand this may easily leak information, whether the number of your customers or orders per month or the number of servers you are using, especially if you are using counters, see further down.

One can use identifiers with locality internally but encrypt them before exposing them to the outside, but this comes with an entire new set of problems. One may have to include initialization vectors in the identifiers possibly making them longer. One may have to think about key rotations, at least for the case keys get compromised, possibly making identifiers even longer having to include key identifiers. One will have to do key management anyway and encryption of course costs clock cycles.

If one wants to include a time component, one can choose between actual time and counters. If one uses actual time, you have to ensure your clock never goes backwards and you may have to account for limited resolution, possibly falling back to counters or randomness when identifiers are produced faster than the clock ticks. If one use counters, you have to ensure it never runs backwards, for example persisting it across application restarts including crashes.

If several processes share counters, synchronization overhead between cores may become an issue in high performance scenarios. Modern processors offer high frequency hardware timestamp counters that can be synchronized across cores. This may be especially interesting if the timestamps are also used for purposes like operation or transaction ordering.

Identifiers should be readable if only for debugging by developers. Similar characters like O, 0, 1, I, l, U and V should be avoided. One can consider a checksum or even error correcting codes for user facing identifiers.

Capitalization is a tradeoff between length and ease of use but I like your idea of dealing with it behind the scenes. One probably loses on average less than one bit per character when ignoring capitalization so that for sufficiently long identifiers collisions are still very improbable, at least if one does not use identifiers with a lot of locality. But I fear users will just tend to include the capitalization because they are not aware it can probably be ignored. It surly adds complexity to the application, if only to the user interface, because you have always to be able to handle collisions.

pritambarhate9y ago· 4 in thread

Will the ulid be unique even if it is generated from 2 different machines at the same millisecond?

drfuchs9y ago

Yes, if your source of entropy is worth its salt.

vlowther9y ago

That was bad and you should feel bad. :)

tsenartOP9y ago

Yes if you use a good source of entropy. In Go, the easiest and most secure is the `crypto/rand.Reader`. But it's also slow. If you don't need the security properties, an instance of `math/rand.Rand` should suffice. Otherwise, you can implement your own. It's just an io.Reader!

Freak_NL9y ago

Doesn't that mean that it will usually — but not guaranteed always — be unique? That is, won't there always be a (minute and negligible depending on the intended use) chance of collision?

1 more reply

Dowwie9y ago· 2 in thread

Seems that someone ported the javascript ulid library to Python: https://github.com/mdipierro/ulid

.. and it appears to be working

aruggirello9y ago

Out of curiosity, I just wrote a PHP ulid() function. [1]

It took me a little more than 5 minutes, so it wasn't that hard after all. It appears to run at only about 35% the speed of the Go version, but hopefully there's still room for some little improvement.

[1] https://github.com/phptools/ulid

tsenartOP9y ago

Fellow PHP implementation: https://github.com/Lewiscowles1986/ulid

lobster_johnson9y ago· 2 in thread

I wonder why you would ever want it to be case-insensitive? Surely that increases the possibility of clashes, and violates the very good principle of least surprise (pretty much every other ID in the world is case-sensitive).

quink9y ago

The hex representation of GUIDs/UUIDs is case insensitive in most interpretations, I'd say that counts enough.

lobster_johnson9y ago

I'd be surprised if many developers these days actually compared UUIDs case-insensitively, though, even if they're intended to be. Microsoft, which have relied on GUIDs for years going back to their version of DCE/RPC, probably does it right.

If it's part of a spec, then it's simpler, but it's still a potential point of surprise.

lazulicurio9y ago· 2 in thread

Word of caution to anybody looking to use a similar method working with SQL Server: the SQL Server uniqueidentifier type is stored differently on disk than binary(16). Instead of being ordered by bytes 0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15, uniqueidentifier values are ordered by bytes 10-11-12-13-14-15-8-9-7-6-5-4-3-2-1.

zamalek9y ago

MSSQL does have a similar identifier, though - NEWSEQUENTIALID.

lazulicurio9y ago

Yes, but NEWSEQUENTIALID isn't globally ordered, it's only ordered for the computer on which it was generated. For that matter, on a single computer the ordering isn't guaranteed[1]. This makes NEWSEQUENTIALID fairly useless if you're looking to merge records from disparate sources and sort them independently of origin.

[1] https://connect.microsoft.com/SQLServer/feedback/details/475...

1 more reply

j / k navigate · click thread line to collapse

52 comments

37 comments · 7 top-level

TeeWEE9y ago· 9 in thread

Note. You cant compare ULID with UUID's

  - An ulid is "sortable". But the whole point of an UUID is a random unique ID. Non guessable. Sortable is not a feature 
    you normally want from an uuid. And still UUIDs are still sortable. But it doesnt have any meaning. 
  - An ulid also encodes Time, an uuid doesn't. 
  - An uuid has less change of clashing: Its 128 bit vs 80 bit for ulid.
  - An uuid is also url safe.
  - An ulid is case insensitive. I don't see how this is an advantage.

_ak9y ago

You just compared UUIDs with ULIDs.

cloudhead9y ago

Maybe he means programatically?

1 more reply

sp3329y ago

That's contrasting, not comparing.

1 more reply

clinta9y ago

Case insensitivity and the chosen character set is an advantage if you want to use this ID as a filename without having to worry about the limits of the filesystem.

tyingq9y ago

>An ulid is case insensitive. I don't see how this is an advantage.

It's helpful if you've ever had to read something like a ULID over the phone. Base64 isn't fun in that situation.

jdmichal9y ago

I've never seen a UUID in base-64 instead of hex...

1 more reply

vertex-four9y ago

> An uuid has less change of clashing: Its 128 bit vs 80 bit for ulid.

Actually... that's 80 bits for the "random" part of the ULID, which only disambiguates within the same millisecond.

tsenartOP9y ago

ULIDs are meant to be compared with Time UUIDs in particular. See: https://www.famkruithof.net/guid-uuid-timebased.html

foobarian9y ago

> But the whole point of an UUID is a random unique ID. Non guessable. Sortable is not a feature you normally want from an uuid.

cfv9y ago· 7 in thread

> This alphabet excludes the letters I, L, O, and U to avoid confusion and abuse.

Can anyone please ellaborate? I don't get the confusion or abuse potential

pradn9y ago

Douglas Crockford goes into why he chose those 4 letters in this page: http://www.crockford.com/wrmg/base32.html.

I, L, and O can be confused with the digits 1, 1, and 0, respectively.

He omits U to avoid "accidental obscenity."

_ak9y ago

Not the author, but presumably because 1, I and l are visually hard to distinguish in some fonts. Same goes for O and 0.

As for U, I don't know about that, but I recently stumbled upon the issue that Google's OCR software seems to incorrectly recognize U as II relatively often.

cloudhead9y ago

I'm guessing the following pairs can be confused visually: O/0, I/l, U/V..

tsenartOP9y ago

Depending on the font used, 'I' and 'L' can be easily confused by humans. 'O' can be read as '0' too. This is meant to prevent that.

As for the 'U', I'm not sure why the original ULID spec left it out. Thanks for raising this. I'll investigate.

tyingq9y ago

Typically, 'U' is omitted to avoid accidentally putting the word FUCK into the generated output. No, not kidding.

See http://www.crockford.com/wrmg/base32.html, Ctrl-F, search for obscenity.

Edit: Interestingly, since it already omits the letter I, adding U covers all of George Carlins' 7 Dirty Words: https://en.wikipedia.org/wiki/Seven_dirty_words

1 more reply

Guvante9y ago

Well you need to leave 4 characters out to get to 32, so unless there is a better character the selection is arbitrary.

taeric9y ago

I think this is a problem that was accented quite well for people that ever had to use passwords in video games.

Sibling posts explain why better. This isn't something that you forget once it has bitten you.

joshuak9y ago· 4 in thread

I've been working on a closely related problem of universal identification in distributed computing for a few years now. It's now in standards review and hopefully publishable soon.

I think this is good advice in general:

For example:

"ab2Cd3Ef1g"

Spoken becomes: "a b two c d three e f one g. Capitalize 4 and 7, c and e"

With the capitalization part optional depending on context.

tsenartOP9y ago

joshuak9y ago

This design implies that these IDs are being generated in different locations, but this usage leads to the least reliable time. How many bits of approximate time does one really need? Not 48 surly.

1 more reply

Glyptodon9y ago

danbruc9y ago

Having also spent some time on the topic, let me share my thoughts.

pritambarhate9y ago· 4 in thread

Will the ulid be unique even if it is generated from 2 different machines at the same millisecond?

drfuchs9y ago

Yes, if your source of entropy is worth its salt.

vlowther9y ago

That was bad and you should feel bad. :)

tsenartOP9y ago

Freak_NL9y ago

Doesn't that mean that it will usually — but not guaranteed always — be unique? That is, won't there always be a (minute and negligible depending on the intended use) chance of collision?

1 more reply

Dowwie9y ago· 2 in thread

Seems that someone ported the javascript ulid library to Python: https://github.com/mdipierro/ulid

.. and it appears to be working

aruggirello9y ago

Out of curiosity, I just wrote a PHP ulid() function. [1]

[1] https://github.com/phptools/ulid

tsenartOP9y ago

Fellow PHP implementation: https://github.com/Lewiscowles1986/ulid

lobster_johnson9y ago· 2 in thread

quink9y ago

The hex representation of GUIDs/UUIDs is case insensitive in most interpretations, I'd say that counts enough.