Sequence numbers are per-thread, and worker numbers are chosen at startup via ZooKeeper (though that's overridable via a config file).
https://blog.twitter.com/engineering/en_us/a/2010/announcing...
It would be nice to see ULID recognized by a standards body other than itself, though unlike the UUID versions it doesn't strictly need to be. New UUID versions must remain backward and forward compatible with the existing standards documents "universally", whereas ULID is designed to be entirely self-contained.
This is a bad approach. If you know one ULID, you can deduce the next one with high probability. Don't use it.
Why with high probability? Because generating several IDs within the same millisecond is an extremely common case when you're doing batching.
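A toy sketch (class and field names are mine, not from any ULID library) of why increment-based monotonicity is guessable: within one millisecond, the next ID is exactly the previous one plus one.

```java
// Hypothetical sketch of a ULID-style generator that handles
// same-millisecond collisions by incrementing the random part
// (the approach criticized above). Only the 64-bit random half
// is modeled, for simplicity.
import java.security.SecureRandom;

public class IncrementingUlid {
    private static final SecureRandom RNG = new SecureRandom();
    private long lastMs = -1;
    private long lastRandom;

    // Returns the random half of the next ID for a given timestamp.
    public long next(long nowMs) {
        if (nowMs == lastMs) {
            lastRandom += 1; // predictable: next ID = previous ID + 1
        } else {
            lastMs = nowMs;
            lastRandom = RNG.nextLong();
        }
        return lastRandom;
    }

    public static void main(String[] args) {
        IncrementingUlid gen = new IncrementingUlid();
        long first = gen.next(1_700_000_000_000L);
        // An attacker who sees `first` can guess the next ID issued
        // in the same millisecond: it is exactly first + 1.
        long second = gen.next(1_700_000_000_000L);
        System.out.println(second == first + 1); // prints "true"
    }
}
```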
I recently gave this some thought and actually implemented several algorithms and compared them against each other. I don't care about standards, sorry (I think "standard UUID" is an oxymoron; a UUID is 128 bits and that's about it).
So the best approach I've found is:
The first 48 bits are unix milliseconds.
6 bits go to the version/variant (if you don't care about standards, you can use those bits for extra randomness).
Then you have 12 more bits in the first 64 bits. There are two approaches to using them:
either use them as a sub-millisecond fraction derived from nanoseconds (nanoseconds_part * 4096 / 1_000_000),
or use them as a counter when several IDs are generated within the same millisecond, which allows for 4096 values per millisecond. The counter approach should be used when you can't access a nanosecond timer, as in browser JavaScript.
Then you have 2-3 bits for the variant and the remaining 61-62 bits for pure randomness, or just use all 64 bits for randomness. That is enough for security.
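A minimal sketch of the layout described above (names are mine; the 4 bits where the UUID version would normally sit are filled with randomness, as the comment suggests for those who don't care about standards):

```java
// Hypothetical sketch: high 64 bits = 48-bit unix millis,
// 12-bit sub-millisecond fraction, 4 random bits in the slot
// where the UUID version bits would go. Low 64 bits: randomness.
import java.security.SecureRandom;

public final class AscendingId {
    private static final SecureRandom RNG = new SecureRandom();

    // Maps 0..999_999 nanoseconds within the millisecond onto 0..4095.
    static long fraction(long nanosWithinMilli) {
        return nanosWithinMilli * 4096 / 1_000_000;
    }

    // High 64 bits of the ID.
    static long high(long unixMillis, long nanosWithinMilli) {
        return (unixMillis << 16)
             | (fraction(nanosWithinMilli) << 4)
             | (RNG.nextLong() & 0xF);
    }

    // Low 64 bits: pure randomness, enough for security per the comment.
    static long low() {
        return RNG.nextLong();
    }
}
```

Because the timestamp occupies the most significant bits, comparing the high words as unsigned 64-bit integers sorts IDs by creation time down to roughly a quarter of a microsecond.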
If you need to generate more than one ID per quarter microsecond (approach one) or more than 4096 IDs per millisecond (approach two), you can just keep regenerating the random part until it's greater than the previously generated one. This slightly reduces randomness, but not by much.
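A sketch of that fallback (again, names are mine): redraw the random part until it exceeds the last issued value. Order stays ascending, but unlike a plain increment, the step size is unguessable.

```java
// Hypothetical sketch of the redraw-until-greater strategy.
// In a real generator, `last` would be reset whenever the
// timestamp portion of the ID advances.
import java.security.SecureRandom;

public final class MonotonicRandom {
    private static final SecureRandom RNG = new SecureRandom();
    private long last = Long.MIN_VALUE;

    // Returns a random value strictly greater than the previous one.
    long next() {
        long candidate;
        do {
            candidate = RNG.nextLong();
        } while (candidate <= last); // slight entropy loss, as noted above
        last = candidate;
        return candidate;
    }
}
```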
I'm pretty sure this is the best approach. It allows for high generation speed (around 10,000,000 IDs/second with the nanosecond approach in unoptimized Java) and good security.
If you insist on using the ULID approach, I suggest applying the technique above: don't just increment the random bits; generate random bits until you get a higher value.
Of course, one should only use this ascending UUID thing (that's what I call it) when you need it. Random UUIDs should be the default; use ascending UUIDs only when they serve as primary keys in an RDBMS.
And of course, keep in mind that you're leaking the generation timestamp, which might be a bad thing.
(Though yes, anyone opting into that behavior should beware that it reduces the entropy strength of generated IDs.)
Obviously, applications exist where you do need a total order of IDs/events, and ULID may not be the best choice for those; but don't underestimate the usage scenarios for partially ordered, stable sorts.
ULIDs have a single, consistent sort that matches both byte patterns and string representation. That's a huge semantic difference.
Sure, ULIDs make no claims to accurate sorting or total ordering or monotonicity beyond a single machine, but ULIDs aren't designed to be a Snowflake/Thrift replacement; they are designed to be a UUID replacement. You are correct that they make no more guarantees than UUIDs, but they don't have to: that was out of scope of their design. I can understand how that makes it less useful for some of your applications, but that doesn't make it not useful for all sorts of applications. (Including many applications that once used UUIDs successfully but want something with a cleaner string representation and fewer cross-platform sorting headaches.)
Giving the appearance of being sortable to those who are less familiar with how they are generated is potentially dangerous and misleading. And whilst it is true that these IDs will consistently sort in the same order, that is equally true of standard GUIDs etc. - the difference being that the latter does not lead people to believe the order has inherent meaning, which the former does.
It's a little similar to how the designers of Go noticed that people were relying on the ordering of keys in maps matching the order items were added. So range iteration over keys was specifically changed to start from a 'random' point in the sequence (not truly random, but enough to stop people relying on it). They understood that the appearance of consistency without the fact of consistency leads to errors.
They have their uses I'm sure, they just need to be carefully considered and clearly understood uses.