Having any information, specifically time information, leak from your systems may have unanticipated security or business implications (e.g. revealing when session tokens or accounts were created).
128 bits -> 128 bits
If AES-128 is an acceptable external UUID (and likely an acceptable internal one), then you might as well just stick with a faster RNG.
I’m afraid you won’t ever be able to rotate that key, will you? Since its result is used externally as an identifier, you would have to rotate the external identifiers, too.
Why not just hash it with pretty much any hash function?
It is true that your encryption key is now very long lived and effectively part of your public interface, but depending on your situation that could be an acceptable tradeoff, and there are quite a few pragmatic reasons why that might be true, as other comments have described.
Edit: you can even do 64bit snowflakes internally to 128bit AES encrypted externally, doesn’t have to be 128-128 obvs
To that end, I think it's neat to be able to improve indexing on UUIDs, but it's not a security solution.
This was used in the war to estimate the number of German tanks based on the sequential IDs
https://en.wikipedia.org/wiki/German_tank_problem
So just for business intelligence you don't want to leak your IDs.
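For anyone who wants to see the arithmetic, here is a minimal sketch of the usual frequentist estimator from that Wikipedia article. It assumes IDs start at 1, increase by 1, and that the observed IDs are a uniform random sample without repeats; the function name is made up for illustration.

```python
def estimate_total(observed_ids):
    """Estimate how many sequential IDs exist from a sample of observed ones.

    Uses N ~= m + m/k - 1, where m is the largest observed ID and k is the
    number of observations. Assumes IDs start at 1 and the sample is uniform.
    """
    k = len(observed_ids)
    if k == 0:
        raise ValueError("need at least one observed ID")
    m = max(observed_ids)
    return m + m / k - 1


if __name__ == "__main__":
    # Four order numbers seen in the wild
    print(estimate_total([19, 40, 42, 60]))
```

Even a handful of leaked sequential IDs gives a usable estimate, which is the business intelligence concern.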
We do a lot of computer vision and in his project, each processed object is assigned a UUID and he wanted to save images to files for each one.
So we took some time to go over various timestamp formats to be embedded into the filename to make the files sort chronologically. UUIDv7 is just spot-on solving our problem. In this use case, there are no real security considerations.
It may not be technically security, but e.g. knowing your competitor just added N products to their shop, might be a security issue for the business.
It hardens, completes and complements other measures.
Examples of every day security using obscurity: every password and encryption key
EDIT: Thanks for the replies.
Ignore above!
Obscurity is the low bit of security. But when it’s convenient, it still helps.
Obscurity can be helpful as part of defence in depth, to reduce the impact when someone does something stupid, or to make it more difficult to extract information that might be helpful as a means to attack the system from another angle.
If you're already thinking about the implications, you can likely ensure people don't jump to the conclusion that the IDs can be trusted just because they look complex.
They are compact, don't leak information, and make a good case why k-sortable IDs are unnecessary, or even harmful for performance.
I'm using sequential integers and created_at/updated_at timestamps for internal use, and Cuid2 IDs externally.
> Cuid2 has been audited by security experts and artificial intelligence, and is considered safe to use for use-cases like secret sharing links.
I'm getting some snake oil vibes from this... There absolutely shouldn't be anything like a random ID that is 'too fast' to compute. You might need a rate limit to stay within your collision bounds, but CPU usage is a poor way to do it.
And there is currently no publicly available "artificial intelligence" that would be useful in a security audit, unless you want to call fuzzers "AI".
> One reason for using sequential keys is to avoid id fragmentation, which can require a large amount of disk space for databases with billions of records.
Disk is cheap but not free at higher tiers. But more importantly, record fragmentation means more pages (unless you take the time to do a full table lock and rewrite it, and who’s doing that?) which means more index bloat. I assure you, that adds up once you’re into the billions of records level.
> the ids will be generated in a sequential order, causing the tree to become unbalanced, which will lead to frequent rebalancing.
Given the width of B+trees used in DBs, I doubt they generally need to go more than one or at most two levels up. I’ll take the ability to rapidly follow the leaf nodes and have a good shot at sequential reads in cache from prefetch, thanks.
I don't think this is really true? These are not serially incrementing, they just indicate the time it happened. If you have an ID that you know exists, having the ability to know _when_ it was created is very rarely meaningful.
What could present more of a risk is being able to predict a large part of IDs that will be created. Even then though, you shouldn't depend on your IDs for secrecy - best to ensure the IDs are never used as protection by themselves (ie treat them like they're just a simple autoincrementing number, even if they're not)
There are certainly mitigations that can be made and not all things are equally valuable as they age. (Plus many public APIs include created/modified timestamps anyway. The information is often easy to discover even when not embedded in an ID.) I don't find it a strong reason to avoid timestamp-based IDs for the threat models of that many things beyond user accounts and other things susceptible to social engineering, but it is something to stay aware of.
With UUIDv7, you are reasonably sure that there won’t be collisions (check your use-case first), and can just generate them wherever on-demand (no locks required).
I’d argue batching IDs is actually more complicated than UUIDv7 for most use-cases.
/s
Persistent IDs are a security and information risk. If that's a concern, don't persist IDs.
Here uuidv7 will just re-order that. So the content of the uuid in itself does not change.
We even sharded on these columns, because of this (our business case made it so that hardly ever did people need data over multiple months)
But we never encountered distribution issues. I don't think the locality issue will be solved, as postgres doesn't consider other columns when distributing data, only the primary key IIRC. I don't know why we never saw this, though.
(for folks who don't get it, mullets are a 1980s haircut (think MacGyver) with a short front but a long tail in the back. A funny description of them is "business in the front, party in the back")
If you're using a system which is built for distribution, random is great.
When you're leaning on a Postgres database which has powered your startup through scaling but expects right-leaning btree indexes, it's a bad time.
Rearchitecting to use a new data store is ideal, but often impractical as an immediate step. UUIDv7 is a great increment walking that road via sharding etc.
Somebody posted an interesting article about the Instagram IDs, which do something similar. They use 41 bits for a time from a custom epoch, followed by two more groups of bits for a shard id and a sequential number. Each shard has an incrementing sequence for the sequential bits, which guarantees that things on a shard are sorted by time.
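A rough sketch of that kind of layout, to make the bit packing concrete. The 41-bit time field is from the comment above; the 13-bit shard and 10-bit sequence widths, the epoch value, and the function names are assumptions for illustration only.

```python
EPOCH_MS = 1_300_000_000_000  # hypothetical custom epoch, in ms since the Unix epoch
TIME_BITS, SHARD_BITS, SEQ_BITS = 41, 13, 10  # shard/sequence widths are assumed; total is 64


def make_id(now_ms, shard_id, seq):
    """Pack time, shard id and per-shard sequence into one 64-bit integer."""
    ts = (now_ms - EPOCH_MS) & ((1 << TIME_BITS) - 1)
    shard = shard_id % (1 << SHARD_BITS)
    counter = seq % (1 << SEQ_BITS)
    return (ts << (SHARD_BITS + SEQ_BITS)) | (shard << SEQ_BITS) | counter


def split_id(value):
    """Recover (ms since custom epoch, shard id, sequence) from a packed ID."""
    counter = value & ((1 << SEQ_BITS) - 1)
    shard = (value >> SEQ_BITS) & ((1 << SHARD_BITS) - 1)
    ts = value >> (SHARD_BITS + SEQ_BITS)
    return ts, shard, counter
```

Because the time sits in the most significant bits, a later millisecond always sorts after an earlier one, whatever the shard or sequence values are.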
This UUIDv7 is slightly weaker than that but sorting things published in the same millisecond is mostly going to be very light work. The lack of a dedicated sharding group of bits is not that important as you could just take the n least significant bits at the end for that without too much effort. Those are random so you end up with nice consistent hashing. Having 48 instead of 41 bits for the time means we won't run out of time any time soon (nearly 9K years vs. 70 years).
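The numbers above are easy to check. This is just a small sketch of the arithmetic, plus the "take the n least significant bits" idea for picking a shard; the helper names are invented, and the shard helper assumes the low bits of the ID are random.

```python
MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365.25


def years_of_range(bits):
    """How many years a millisecond counter with this many bits can cover."""
    return (1 << bits) / MS_PER_YEAR


def shard_for(id_int, shard_bits):
    """Pick a shard from the lowest bits of an ID (assumes those bits are random)."""
    return id_int & ((1 << shard_bits) - 1)


if __name__ == "__main__":
    print(round(years_of_range(41), 1))  # a bit under 70 years
    print(round(years_of_range(48)))     # a bit under 9000 years
```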
Picking the N least significant bits gives good distribution and sort qualities only within a single table; there are no cross-table properties.
This is especially useful when your underlying database stores data in large "chunks", such as LSM-trees you find with e.g. rocksdb.
As a sibling comment says, you ideally want to shard on some other key to get "just enough" distribution that all your machines/disks have work to do, but you are still only hitting a limited number of hot sectors on each disk that can be effectively cached. But that requires active monitoring and rebalancing of your data as it grows. Totally random keys are a safe default that will scale with any kind of data distribution and access patterns.
In both cases I'm melding highly disjointed data into a single schema. There are no large consecutive sets of records.
If you're using UUIDs, there's probably a reason. And that reason invalidates the justifications for not using them.
(Or just reverse the bits, take the last n, etc)
I still think that graph databases are way better for this sort of thing.
So really, what are you trying to optimize?
They're also often used as part of a URL parameter:
http://myservice/orders/<uuid> etc etc
The easiest is probably to just base64 the binary representation of the 128 bit number, which results in a ceil(128/6) = 22 character string (without padding), which is a bit smaller.
If glyph-length and not byte-length is more important you could go even smaller but I'm less sure if that's a good idea.
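A quick sketch of the base64 idea, using the URL-safe alphabet from Python's standard library and dropping the padding. The function names are just for illustration.

```python
import base64
import uuid


def uuid_to_short(u):
    """Encode the 16 raw bytes as URL-safe base64 without '=' padding (22 chars)."""
    return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode("ascii")


def short_to_uuid(s):
    """Restore the two padding characters and decode back to a UUID."""
    return uuid.UUID(bytes=base64.urlsafe_b64decode(s + "=="))


if __name__ == "__main__":
    original = uuid.uuid4()
    short = uuid_to_short(original)
    print(original, "->", short, len(short))
```

The mapping is reversible, so nothing extra has to be stored to get back to the canonical form.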
If you want to store UUIDs as compactly as possible you'd use 16 bytes.
If you want to store them as text, mapping them to Unicode would be a terrible idea because: many characters are from scripts you've never heard of, many characters look identical (Α vs A), many characters are decomposed and it can change the encoding if they're decomposed[1], &c.
Or better yet, only decorate one after it has been clicked by the user, that way when it appears again elsewhere, it stands out. If you make each one pretty you'll have made all of them ugly when viewed together.
When you take that UUID and go start sniffing around internal systems you're going to copy the UTF-8 string representation.
> values generated are practically sequential
These statements aren’t strict enough to be relied on. Maybe you have engineered the hell out of your distributed clock scheme, and your IDs actually are completely monotonic, which is great. But you probably haven’t done that, which means conflicts will surely happen and you must handle them gracefully.
None of our systems require perfect ordering of IDs generated across our distributed system. Most of the system was built with random UUIDv4 identifiers so no code assumes the ID ordering is significant.
However, in much of our system recent data is frequently accessed while old data is rarely accessed. In that world, just having the IDs *approximately* clustered in creation order has been a huge performance boost for many queries, and we've seen significant reduction in postgres Write Ahead Log rates, because writes to UID indexes happen in a smaller number of pages.
Thank you. I’m so tired of seeing the same groupthink on UUIDv4 trotted out - “it only matters if you have a clustered index, Postgres is immune!” The hell it is.
Maybe there are applications where the monotonicity matters, but in my experience reasoning by surrogate key is rather coarse grained and you manually scrutinize the boundaries, so unless your clocks are quite wrong, your worries are probably better placed elsewhere.
where ts between txn_start and txn_end
order by ts
and not even realize that what they’re seeing is incomplete and misleading. Clock skew is very common, and we shouldn’t sweep that under the rug to promote time ordering, because people want to believe this works the way they think.
Out of curiosity, are you into hybrid logical clocks?
Yeah, though I’m more likely to go with a region ID and monotonic version number to compare-and-set and verify gapless data, where versions from different regions aren’t comparable. Actually I think earlier UUID RFCs talk about a “clock sequence” to distinguish timestamps from separate monotonic sources, but this paper doesn’t bring that up (or mention multiple clocks at all).
In other words: Sorting by millisecond-or-so is just as good as sorting by picosecond in most situations. The reason you have to deal with conflicts gracefully isn't particularly because timestamps can be imperfect.
For providing better query locality, though, it probably doesn't matter significantly, and that seems to be the main benefit here, while preserving the other benefits UUIDs provide.
I mean that you can’t rely for correctness on time(X) < time(Y) when X happened before Y. It’s damn hard to keep two commodity server clocks within ±1 ms of each other even within a single LAN, and across production you’re more likely to see ±10 ms, or worse if your sysadmins don’t realize you intend to bet the farm on no clock skew.
https://github.com/VADOSWARE/pg_idkit
There are a lot of options for UUID extensions (lots of great pure SQL ones!), but I wanted to get as many ID generation strategies in one place
Also note that native UUID v7 is slated to land in pg17:
Half the point of these things is that they’re treated as opaque identifiers.
So then just a simple validation server side to ensure the data isn't malicious.
All other versions, including the new v7, attach meaning to certain bits of the identifier. That cat has been out of the bag for a long time, so now everyone needs to maintain code to ensure that some rogue node doesn't spew back-dated identifiers belonging to the wrong department.
For the curious:
* UUIDv4 are 128 bits long, 122 bits of which are random, with 6 bits used for the version and variant. Traditionally displayed as 32 hex characters with 4 dashes, so 36 alphanumeric characters, and compatible with anything that expects a UUID.
* UUIDv7 are 128 bits long, 48 bits encode a unix timestamp with millisecond precision, 6 bits are for the version and variant, and 74 bits are random. You're expected to display them the same as other UUIDs, and should be compatible with basically anything that expects a UUID. (Would be a very odd system that parses a UUID and throws an error because it doesn't recognise v7, but I guess it could happen, in theory?)
* ULIDs (https://github.com/ulid/spec) are 128 bits long, 48 bits encode a unix timestamp with millisecond precision, 80 bits are random. You're expected to display them in Crockford's base32, so 26 alphanumeric characters. Compatible with almost everything that expects a UUID (since they're the right length). Spec has some dumb quirks if followed literally but thankfully they mostly don't hurt things.
* KSUIDs (https://github.com/segmentio/ksuid) are 160 bits long, 32 bits encode a timestamp with second precision and a custom epoch of May 13th, 2014, and 128 bits are random. You're expected to display them in base62, so 27 alphanumeric characters. Since they're a different length, they're not compatible with UUIDs.
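To make the UUIDv7 layout in the list above concrete, here is a minimal hand-rolled sketch: 48 bits of millisecond timestamp, 4 version bits, 12 random bits, 2 variant bits, 62 random bits. It is an illustration only, not a vetted generator; the function names are made up, and it makes no attempt at monotonicity within a millisecond.

```python
import os
import time
import uuid


def uuid7_sketch(ts_ms=None):
    """Build a UUIDv7-shaped value: timestamp first, then version, variant, random bits."""
    if ts_ms is None:
        ts_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits, 74 of them are used
    rand_a = rand >> 68                 # top 12 bits
    rand_b = rand & ((1 << 62) - 1)     # low 62 bits
    value = (ts_ms & ((1 << 48) - 1)) << 80  # bits 127..80: unix time in ms
    value |= 0x7 << 76                       # bits 79..76: version 7
    value |= rand_a << 64                    # bits 75..64: random
    value |= 0b10 << 62                      # bits 63..62: variant
    value |= rand_b                          # bits 61..0: random
    return uuid.UUID(int=value)


def timestamp_ms(u):
    """Read the embedded millisecond timestamp back out of the top 48 bits."""
    return u.int >> 80
```

Anyone holding such an ID can call something like `timestamp_ms` on it, which is the time leak people are discussing elsewhere in the thread.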
I quite like KSUIDs; I think base62 is a smart choice. And while the timestamp portion is a trickier question, KSUIDs use 32 bits which, with second precision (more than good enough), means they won't overflow for well over a century. Whereas UUIDv7s use 48 bits, so even with millisecond precision (not needed) they won't overflow for something like 8000 years. We can argue whether 100 years is future proof enough (I'd argue it is), but 8000 years is just silly. Nobody will ever generate a compliant UUIDv7 where any of the first several bits aren't 0. The only downside to KSUIDs is the length isn't UUID compatible (and arguably, that they don't devote 6 bits to a compliant UUID version).
Still feels like there's room for improvement, but for now I think I'd always pick UUIDv7 over UUIDv4 unless there's a very specific reason not to. Which would be, mostly, if there's a concern over potentially leaking the time the UUID was generated. Although if you weren't worrying about leaking an integer sequence ID, you likely won't care here either.
I wish UUIDv7 pulled the version/variant bits up front, though, just to make sure that the identifiers don't all start with null bytes.
"100 years should be enough" is what led us to a mountain of Y2K issues, because when would a two digit year ever be ambiguous?
But I guess it's a psychological issue. Unless you're a megalomaniac, it's just natural to assume that your decisions won't matter much outside of your life and lifetime. And in that case, 100 years totally is enough because I probably won't live that long. And even more, in a lot of cases, it's also the correct assumption and the project won't live longer than a few years.
So, thinking about it, unless you are developing a novel standard or something that you want the world to adopt, 100 years probably IS fine. Unfortunately, KSUID wants to be a novel standard, so there's an issue.
https://www.ietf.org/archive/id/draft-peabody-dispatch-new-u...
Also, to all future historians of 2150, sorry about the mess, but yes we knew this was going to happen. Whatever it was.
v1: mac address + time + random
v4: completely random
v5: input + seed (consistent, derived from input)
v7: time + random (distributed sortable ids)
Unless you consider users being able to extract the generation time from the id to be an issue, of course.
Which, despite the fact that it really shouldn't be, still seems to occur every so often. Even in situations where the ids are very much not random.
Honestly if I have to read one more article about a 'hacker' who 'leaked' some secret government piece ahead of time because they thought to increment the date in the url of some yearly report, I'm going to lose my mind.
The performance benefits of index friendly user IDs seem like they would apply even if all user info is secret and requires a token to access... The application still has to look up the user by ID after all?
If I imagine a basic authenticated "get information about me" style endpoint, that would take a user ID and an authentication token. Checking if the token is valid is faster if the user ID is index friendly. Getting the requested information is faster if the user ID is index friendly. Yet a user of the API still needs both the user ID and a token to access anything.
The external key is base64 encoded for use in URLs which results in an 11 byte string.
This hides any information about the size of the data, the creation date of customer accounts (which would be sort of visible with UUIDv7) and prevents anyone from attempting to enumerate data by changing the integer in URLs.
I thought about using UUIDs as external keys but the only compelling use case seems to be the ability to generate keys from many decoupled sources that have to be merged later.
64 bit should be enough for most things https://youtu.be/gocwRvLhDf8?si=QBheJCG21bAAV0Z7
It's similar to UUIDv7 (it leaks the creation time), but it's not an issue for me.
So I am able to have a single 64 bit key, which can easily be formatted into a small string for user-facing urls.
[^1]: https://instagram-engineering.com/sharding-ids-at-instagram-...
As Lazare points out in this thread they're basically the same thing, except with ULIDs you get those 6 extra bits of randomness back that UUIDs have to use for metadata.
ULID isn't an "official" standard like UUID. Having a real standard usually promotes interoperability and makes it easier to use. Additionally as others have pointed out you can already use UUIDv7 with some databases since it's just 16 opaque bytes and the database doesn't care what's actually in the UUID field.
Maybe if we were starting from scratch ULIDs would have been an option, but given where we were UUIDv7 was a much easier transition.
I mean this is a similar concern to sequential IDs: many apps do not want to leak them, and in some cases it might cause issues, but in general it doesn't matter.
I can therefore easily generate a new UUID in a trusted backend service which just accepts the command received from the untrusted client and then forwards the request for asynchronous processing while returning the UUID to the client. This is a typical architecture and the only change is that I can now create UUIDs which may have performance benefits, depending on the data storage technology of my read models.
If you need to create the UUIDs on the client side to support specific requirements such as offline-first, then I would indeed consider adding some reconciliation which replaces the IDs provided by the client-side by new ones generated by a trusted component as soon as synchronizing takes place.
It might be insignificant, but to me it makes UUID feel tainted, dirty. 11.1% of a UUID are dashes. 15.3% of a UUID are wasted bits if you count version and variant bits.
Anecdote: I worked for a company that used numeric primary ids internally and externally and increased the primary key by TWO to THREE for each new customer to make it appear to the outside world we had twice to three times the rate of customer growth.
Couldn't that be solved with incremented serial numbers, rather than leaking time data?
This solution attempts to solve the sort-ability issue of current uuids by moving the timestamp to the most significant bits.
GET /filter?a_id=X&b_id=Y&c_id=Z&d_id=w
But in practice we were using POST and passing the ids in the body payload. Why? Because my old team said "the UUIDs are long, so we may reach the maximum URL length if we pass them as parameters". I didn't like it, and I still don't like it at all.
Note that you can also use GET with a body; it’s not spec compliant (a body is allowed but not supposed to have any meaning) but it is used by products such as Elasticsearch. If you control both clients and servers that’s something you can safely do (and use an etag header for idempotency).
Another thing is that you don’t necessarily need to encode uuids canonically. They are just u128’s. It’s relatively straightforward to find a url friendly string representation that is shorter.
Are we talking about shortening the whole URL or shortening specific UUIDs? If the latter then I imagine one would still need to keep track of the mapping UUID <-> shortened version, somewhere, right? If so, why not just add yet another field/column for a good old numeric integer that can be used for filtering? Would that work?
https://blog.devgenius.io/analyzing-new-unique-identifier-fo...
If you need to sort by insert order, use an autoincrementing integer, if you need uniqueness, UUIDv4 is fine, if you need both use both.
Use timestamps when you need to record the time, just don't commit the sin of presuming that clock time will never run backwards, I assure you, it does.
I hold to the principle that relational data should be normal, and combining uniqueness with a timestamp doesn't do that. To do any of the calculations we use timestamps for, you have to strip off the entropy, this complicates pushing it down to the database level, where the libraries don't expect such conflation.
You're going to have a bad time writing something like a join across tables with a restricted range of time if your time is embedded in UUIDv7.
I maintain this is good advice: if you need index locality and insert order, use an autoincrement. If you need to record and work with time, use a timestamp. If you need global uniqueness, you can use any of the UUIDs, but v4 is the one that doesn't conflate uniqueness with unrelated properties, and should be preferred.
If you think you need data locality but not insert order, think long and hard about what you're doing, because odds are you're wrong. If it turns out you're right, and the OP might be in that situation, sure, go ahead and use UUIDv7.
Just, please, for the sake of your future self and everyone you work with, don't use a timestamp for insert order. Ever.
Later I was excited about the power and expressiveness of SQL and its extensions. There is a ton of leverage and you can make it so that interfacing with it directly becomes much more useful.
However now I’m in a different phase. I see it as a durable data structure. I think in terms of “what does it provide to make the overall system better?”
The issues around indexing and UUIDs that are discussed in the article fit nicely into this line of thinking.
In web development, database access and performance often dominates and infects the whole system.
$ date -ud @$(( 256 ** 6 / 1000 ))
Tue Aug 2 05:31:50 AM UTC 10889
Good question.
Won't random 128-bit numbers actually be superior to UUIDs in every way except predictability?
For another project, I've also used sortable 64-bit snowflake-like identifiers; they have the added benefit of being able to use 64-bit integer representation in code and database identifiers, even if you might want to externally represent them in base58 or similar encoding.
The original UUID types aren't as useful as they once were, so it'd be worth writing a new RFC and extending those original types.
a) UUID4, CreatedTime/UpdatedTime.
b) Bigint, CreatedTime/UpdatedTime.
c) UUID7 internal (which also includes time badly), UUID4 external/whatever short ID.
How exactly does this help if you need external ids (which you usually do today)? It doesn't even make it a short ID.
Even if there is a corner case, are we just saving a few bytes while adding more complication?
Clustered Index is a myth in PostgreSQL, not practical since you have to run a special program to reorder. So, a regular index might suffer but not really. Why? Because I am not ordering by the ID most of the time, I am ordering by "Created Date/Updated Date" or Name or whatever. Who cares about ordering IDs?
WAIT!!! But what about Next Tokens? ok, these are painful, but easily solved: Next can be (>=Created Date,>ID). Same result. Pagination, stays the same since it is sorted by Created Date.
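A small sketch of that kind of "next token", using SQLite so it runs anywhere. The table and column names are invented, and the condition is written out as "later created date, or same created date and larger ID", which is one common way to express the (Created Date, ID) cursor.

```python
import sqlite3


def make_demo_db():
    """In-memory table with a tie on `created`, to show why the ID tiebreak matters."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY, created INTEGER)")
    conn.executemany(
        "INSERT INTO items (id, created) VALUES (?, ?)",
        [("d", 3), ("b", 1), ("c", 2), ("a", 1)],
    )
    return conn


def fetch_page(conn, after=None, limit=2):
    """Return the next page ordered by (created, id); `after` is the last row seen."""
    if after is None:
        cur = conn.execute(
            "SELECT created, id FROM items ORDER BY created, id LIMIT ?", (limit,)
        )
    else:
        created, last_id = after
        cur = conn.execute(
            "SELECT created, id FROM items "
            "WHERE created > ? OR (created = ? AND id > ?) "
            "ORDER BY created, id LIMIT ?",
            (created, created, last_id, limit),
        )
    return cur.fetchall()
```

The token handed to the client is just the last (created, id) pair of the previous page, so the ID itself never needs to be time ordered.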
The external Id is used instead of Bigint because you don't want your external users to query 1, then 2, then 3 (IDOR)... But the random part of the Uuid7 makes this impossible.
Uuid7 isn't a substitute for Created/Updated, but a substitute for the dual field Uuid4/Bigint.
Analyzing New Unique Identifier Formats (UUIDv6, UUIDv7, and UUIDv8) (2022) https://news.ycombinator.com/item?id=36438367
We've seen some amazing benefits, especially around improving the speed of batch inserts.
You set the node field to a broadcast MAC address, and use that as a namespace/prefix. This inches close to the boundary of the RFC, but is arguably compliant.
As an example, you may generate demo or “canary” data items that are UUIDv1s with a well known node field, which then lets you do distributed “isDemoData()” checks by just looking at the UUID.
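A minimal sketch of what such a check could look like with Python's `uuid` module, which lets you pass the node explicitly to `uuid1`. The all-ones broadcast address as the marker and the function names are assumptions for illustration.

```python
import uuid

DEMO_NODE = 0xFFFFFFFFFFFF  # broadcast MAC, used here as a hypothetical "demo data" marker


def new_demo_id():
    """Generate a UUIDv1 whose node field carries the well-known marker."""
    return uuid.uuid1(node=DEMO_NODE)


def is_demo_data(u):
    """True only for version 1 UUIDs whose node field equals the marker."""
    return u.version == 1 and u.node == DEMO_NODE
```

The check needs no lookup, but it is only a convention: anyone who can mint IDs can set the same node value.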
I would assume that `serial` would solve this problem too.
[1]: https://github.com/segmentio/ksuid which has very similar use cases.
For bigger/public projects I'd like to be able to add a sequence, node, and data center id to the UUID too.
IE, take a 32 or 64 bit int that's the primary key, encrypt it, and then use that as the public ID in a web application, URL, API, etc.
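A rough sketch of the shape of this, for a 64 bit key. Python's standard library has no AES, so this swaps in a toy 4-round Feistel permutation built on HMAC-SHA256 purely to keep the example self-contained. It is not AES and not a vetted cipher; in practice you would use a real block cipher from a maintained library. The key and function names are hypothetical.

```python
import base64
import hashlib
import hmac

MASK32 = 0xFFFFFFFF


def _round(key, round_no, half):
    """Round function: first 32 bits of HMAC-SHA256 over (round number, half)."""
    msg = bytes([round_no]) + half.to_bytes(4, "big")
    return int.from_bytes(hmac.new(key, msg, hashlib.sha256).digest()[:4], "big")


def encrypt_id(key, n, rounds=4):
    """Map a 64-bit integer to another 64-bit integer, reversibly (toy Feistel)."""
    left, right = (n >> 32) & MASK32, n & MASK32
    for r in range(rounds):
        left, right = right, left ^ _round(key, r, right)
    return (left << 32) | right


def decrypt_id(key, c, rounds=4):
    """Undo encrypt_id by running the rounds in reverse order."""
    left, right = (c >> 32) & MASK32, c & MASK32
    for r in reversed(range(rounds)):
        left, right = right ^ _round(key, r, left), left
    return (left << 32) | right


def to_public(key, n):
    """Internal integer key -> 11 character URL-safe string."""
    raw = encrypt_id(key, n).to_bytes(8, "big")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def from_public(key, s):
    """11 character public string -> internal integer key."""
    raw = base64.urlsafe_b64decode(s + "=")
    return decrypt_id(key, int.from_bytes(raw, "big"))
```

The database keeps its plain sequential integers, and only the boundary code converts to and from the public form. As noted earlier in the thread, the key effectively becomes part of the public interface, since rotating it changes every external ID.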
Hope it won’t bite me in the future.
> As a result, retrieving the most recent data from a large dataset will require traversing a large number of database index pages, leading to a poor cache hit ratio (how many requests a cache is able to fill successfully, compared to how many requests it receives).