You know, you think that, but it's never that simple. The field was added incorrectly and nobody noticed until the value was in countless tables that you now need to update simultaneously. Or the value is something that's supposed to be semi-secret, so now low-level support staff can't reference the row when dealing with a request. Or the table's requirements change and now you need to track two different kinds of data, or data that is missing the field.
Me, I always just have the table make its own ID. It is just simpler, even when you think it is overkill.
Except that I knew a coworker who had a duplicate ID. An extremely rare event: they messed up the pre-assignment, and there is another dude somewhere with the same ID. So from time to time, some system would tell him that his ID was already registered. A lot of banks, and things like private healthcare systems, like to use the DNI as a username.
He tried to get his ID changed, but that was such a foreign concept to any of the involved institutions, that he had to give up because there simply is no such procedure. I guess he could have taken it to court, but the guy decided to just live with it (the justice system is quite slow here).
For instance, a driver's license number is printed on the card itself, so a human sees it. Therefore, it's a natural key, just like a name.
When you decide that whatever natural keys already exist aren't good enough for your organization, and you make a new key, it's not good to think of that as a surrogate key. The number will make it out somehow (as a "record locator" in a customer support call or something), and eventually become a natural key.
It's best to just plan for any new key to be a natural key, which means using best practices for natural keys. That means it should be something reasonable to print, read, say, and hear; and it should also follow a pattern so it can be distinguished from other special numbers.
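As an illustration of those best practices, here's a minimal sketch of what such a key might look like: a fixed prefix so it's distinguishable from other special numbers, an alphabet without easily confused characters so it survives being printed, said, and heard, and a check character to catch typos. The prefix, alphabet, and checksum scheme are my own choices, not anything prescribed above.

```python
import secrets

# Unambiguous alphabet: no 0/O or 1/I/L, so the key survives being
# printed, read aloud, and typed back in.
ALPHABET = "23456789ABCDEFGHJKMNPQRSTUVWXYZ"

def make_key(prefix="C", length=8):
    """Human-friendly key: fixed prefix + random body + check character."""
    body = "".join(secrets.choice(ALPHABET) for _ in range(length))
    # Position-weighted checksum so most typos and transpositions are caught.
    total = sum((i + 1) * ALPHABET.index(ch) for i, ch in enumerate(body))
    return prefix + body + ALPHABET[total % len(ALPHABET)]

def is_valid(key, prefix="C"):
    """Reject keys whose check character doesn't match."""
    if not key.startswith(prefix) or len(key) < len(prefix) + 2:
        return False
    body, check = key[len(prefix):-1], key[-1]
    if any(ch not in ALPHABET for ch in body) or check not in ALPHABET:
        return False
    total = sum((i + 1) * ALPHABET.index(ch) for i, ch in enumerate(body))
    return ALPHABET[total % len(ALPHABET)] == check
```

A support rep can read a key like `C7JQM2WPRX` over the phone, and a mistyped character fails validation immediately instead of silently pointing at the wrong row.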
Auto-increment is a shortcut, but usually not great in the long term unless it's something that will be well-contained inside the database as an implementation detail (e.g. a join key designed to refer to rarely-accessed fields of a wide table).
> You know, you think that, but it's never that simple.
It’s that simple if you’re the Social Security Administration and it’s a table of Social Security accounts, not people.
Other than that, using SSNs as a primary key is just plain wrong.
Nah, they keep track of duplicate usage: https://www.nbcnews.com/technolog/odds-someone-else-has-your...
> The IRS often knows when this happens, when the imposter pays taxes. The Social Security Administration knows, too, for the same reason. And the nation's credit bureaus usually know, because the imposter often ends up applying for some form of credit. Plenty of financial institutions also have access to this information.
We moved to a new HR system and they have a set of "reserved" employee numbers that cannot be used and we have employee numbers in that range. Arg!
And we were supposed to teach our customers good design principles...
A projection in ES (event sourcing) is more of a cache, not your primary store. The primary store is the event log. The latter should, obviously, never use natural IDs.
Essentially, they observed sizeable performance improvements by using UUID generators that are tweaked to produce more sequential results. That yields better indexes. The article compares sequences, random UUIDs, and two kinds of sequential-ish UUID generators.
There's another benefit to UUIDs: you can generate them anywhere, including on the application side. Doing so has tremendous benefits for batching, and for inserting objects with relationships at the same time (vs. waiting for the first insert to return an ID to be used in the FK).
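A quick sketch of that batching benefit (the table shape and column names here are made up): because the client mints the IDs itself, parent and child rows can be constructed together and sent in one batch, with the foreign keys already filled in.

```python
import uuid

# The client mints the IDs itself, so parent and child rows can be
# built together and sent to the database in one batch -- no waiting
# for the parent insert to return a generated key.
order_id = uuid.uuid4()
order_row = {"id": order_id, "status": "new"}
item_rows = [
    {"id": uuid.uuid4(), "order_id": order_id, "sku": sku}
    for sku in ("A-100", "B-200")
]
# Both INSERTs can now go out in a single batch/transaction.
```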
There are a few mitigations but my favourite is the "casino chips" approach: pregenerate them server side, and allocate to clients on demand, including en masse if need be ("here kid, have a few million UUIDs to get you started"). Verify with whatever simple signature scheme comes with your application server framework, or at small scale just toss them in a crude LRU store.
Or, remember where the UUID came from, and apply their organisational scope to any trust you place upon it. This might work particularly for multi-tenanted SaaS. However it requires that all usage is tenant-bounded end-through-end throughout your application. This may be in conflict with a) your framework, b) your zenlike contemplation of simplicity in data management, or c) programmers in a hurry forgetting to scope their queries properly.
Ultimately, relying on UUIDs as intrinsically unguessable security tokens is probably not a great idea, but it's one that remains thoroughly embedded in the programming zeitgeist. As ever, nothing useful comes without a compromise. Keep your eyes open to the systemic consequences of design choices, and don't leave traps for your fellow developers.
Allowing simple caching, easy async, easy handling of offline, or far simpler clientside code for dealing with those objects. etc.
For mobile app development which relies on an online (http) backend, clientside generatable UUIDs offer almost only benefits.
You can also use one sequence for everything on the server, and then you can also pre-create ID based relationships client side.
Clocks aren't reliable enough for timestamps anyways so garbage collection is the only thing you kinda wanna rely on them for.
A good sweet spot seems to be 32 bits of milliseconds + 96 bits of entropy. This overflows approximately every 50 days, allowing for a 50-day rolling data retention.
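A minimal sketch of such an ID (my own construction of what's described): 4 bytes of wrapping millisecond timestamp followed by 12 random bytes, for 128 bits total.

```python
import os
import time

def rolling_id() -> bytes:
    """128-bit ID: 4 bytes of wrapping millisecond timestamp followed
    by 12 random bytes. The timestamp wraps every 2^32 ms (~49.7 days),
    which is what allows a ~50-day rolling retention window."""
    ms = int(time.time() * 1000) & 0xFFFFFFFF
    return ms.to_bytes(4, "big") + os.urandom(12)
```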
I ended up deciding to use sharded bigints because it enables better indexing, even though there are drawbacks when first writing the instances; The benefit of faster search was more important for me.
[1]: https://instagram-engineering.com/sharding-ids-at-instagram-...
[2]: http://www.livinginthepast.org/2016/02/24/sharding-into-bigi...
They are not serially incrementing but are still sortable, so they prevent the index fragmentation issues observed with UUIDs. They are 8 bytes in length, so the index size is smaller compared to UUIDs. You get all the benefits of serial IDs, but they are not easily guessable, which prevents sequential-access attacks.
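This describes a Snowflake-style scheme. A simplified, single-threaded sketch (real implementations also handle sequence exhaustion within a millisecond and clock skew; the epoch constant here is Twitter's, but any fixed epoch works):

```python
import time

class SnowflakeLike:
    """Sketch of a Snowflake-style 64-bit ID:
    41-bit ms timestamp | 10-bit worker id | 12-bit per-ms sequence.
    Time-sortable, fits in a bigint, not trivially guessable in sequence."""
    EPOCH = 1288834974657  # Twitter's custom epoch in ms; any fixed epoch works

    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024  # must fit in 10 bits
        self.worker_id = worker_id
        self.last_ms = -1
        self.seq = 0

    def next_id(self) -> int:
        ms = int(time.time() * 1000) - self.EPOCH
        if ms == self.last_ms:
            self.seq = (self.seq + 1) & 0xFFF  # 12-bit sequence within one ms
        else:
            self.last_ms, self.seq = ms, 0
        return (ms << 22) | (self.worker_id << 12) | self.seq
```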
What unimaginable scale
Today, a decade later, they're at 1.074B
I don't see how that's true. From reading the article you linked, you only need a valid shard ID (which you can extract from known IDs), the millisecond (which is guessable) and a 10-bit sequence (which you can easily brute-force).
(And that's completely fine if their security model doesn't require unguessable IDs.)
It will result in a very high number of 404s. These can be monitored and the origin IPs banned.
So it's good to know that performance is not bad.
Yes, they are guessable but your application should not rely solely on the "secrecy" of the ID to authorize access to a record. If you are worried about someone crawling your public API with wget or curl and an incrementing counter you should re-think whether your data are really public or not, or maybe rate-limit anonymous users, etc.
They also reveal something about the total number of records in your database, I guess that could matter in some contexts but it's never really been an issue in practice for me.
I have definitely used the technique of defining non-overlapping sequences in different shards (with Oracle, not Postgres, but they are very similar in this regard). It worked very well and was easy to reason about.
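The usual trick (in Postgres or Oracle, a `CREATE SEQUENCE` per shard with a different start and an increment equal to the shard count) can be sketched with plain arithmetic: each shard draws from its own disjoint arithmetic progression, so IDs never collide across shards.

```python
from itertools import count, islice

def shard_sequence(shard: int, num_shards: int):
    """Equivalent of CREATE SEQUENCE ... START WITH shard+1 INCREMENT BY
    num_shards: each shard gets a disjoint arithmetic progression of IDs."""
    return count(start=shard + 1, step=num_shards)

# Four shards; no ID can ever be issued by two shards:
ids = [list(islice(shard_sequence(s, 4), 3)) for s in range(4)]
# ids == [[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]]
```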
As a developer, the big issue I have with UUIDs is that they are impossible to read or type. You have to copy/paste and it isn't easy to look at two UUIDs and tell if they are different.
I use integers in general unless the specific use case demands something else.
Any information you give to a potentially malicious actor can help them attack you. If you have a choice between leaking information and not leaking information, I can’t imagine why you would ever intentionally go with the former, unless you didn’t actually have a choice (feasibility, etc.).
As an example, maybe I needed to CSRF someone but needed their ID (say, in a b2b app where IDs are not generally public) - with sequential IDs I have a decent chance of cracking it with a small number of requests, especially if it’s a small userbase. Sure, the CSRF was the main issue, but this is a contrived example to illustrate the point.
Admittedly, IDs are oftentimes public information by necessity - but there’s no need to allow them to leak extra information.
Of course, there are many ways to solve that situation, but UUIDs is one.
They can inherently leak how much data you do or don't have, which you may not want your competitors to know.
> Doing this on application side would have tremendous batching benefits or inserting objects with relationships at the same time (Vs waiting first insert to return an ID to be used in the FK).
By separating that out you can get a lot:
1. You can extend your delegate system by modifying the delegate table, rather than having to muddy your data model with authority information
2. You can TTL the mappings in the delegate table
3. You can have many mappings to the same data, but each one can have its own restrictions.
It's a bit more complex, but you end up with a system that really homes in on the benefit you're bringing up.
Use integer primary keys internally for identifiers and relationships.
Use English/Other Language permalinks for URL's
Use UUID's in places like API's one-time action links and "private" links that you only want to share with other people.
Worked fine for me for many, many years.
IME it's much more often I've quickly made a table with a serial PK and later wished it were uuid; just about never made a uuid and later wished for the compactness or natural clustering of bigint. Maybe for a table of millions and millions of time-ordered events.
Unless you're changing a foreign key, joins will always be correct.
Unless I'm doing something wrong in the last 30 years of using SQL.
Side question: can I get Postgres to throw an error if I try to join on two IDs where neither of the IDs have a foreign key reference to the other?
I end up doing this a lot when I'm trying to figure out how my applications are currently being used.
Strictly incrementing UUIDs can offer the same benefit.
I used to work on a reconciliation system which inserted all its results into the database. Only the most recent results were heavily queried, with a long tail of occasional lookups into older results. We never had a problem with primary key indexes (though this was in MySQL, which uses a clustered index on the primary key for row storage, so it's an even bigger benefit); the MD5 column used for identifying repeating data, on the other hand, would blow out the cache on large customers' instances.
PG will say it's doing a hash lookup and you'd think it'd be fast, but it will take quite some time relative to joining two large tables with integer IDs. With UUIDs, PG will sometimes give up on the hash lookup and do table scans instead, unless you adjust random_page_cost.
In general joining on UUIDs for large tables is a bad idea. It can be great if you are joining a single row to another row.
I eventually came across the idea of using maximal period linear-feedback shift registers to transform an integer variable through every possible value (minus one), but in a non-incremental sequence that depends on the LFSR arrangement.
I never ended up putting the idea to use, but I've always been curious about people who have and how it worked out for them. [Edit to clarify: It was meant for obfuscation, not security against a determined attacker.]
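For the curious, a minimal sketch of the idea using the standard 16-bit maximal-period Galois LFSR (taps at bits 16, 14, 13, 11, i.e. feedback mask 0xB400): each step moves the state to the next value in a fixed but scrambled order, and the cycle visits every nonzero 16-bit value exactly once before repeating.

```python
def lfsr_step(state: int) -> int:
    """One step of a 16-bit Galois LFSR with feedback mask 0xB400
    (taps at bits 16, 14, 13, 11 -- a maximal-period arrangement)."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= 0xB400
    return state

# Hand out successive LFSR states as IDs: sequential to generate,
# but scrambled from the outside. Obfuscation only -- the next ID is
# trivially computable by anyone who knows the tap arrangement.
state = 1
first_ids = []
for _ in range(5):
    state = lfsr_step(state)
    first_ids.append(state)
```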
This works well against the German Tank Problem when there's no oracle allowing an attacker to guess lots of IDs quickly (such as when there are reasonable rate limits). It does not provide enough entropy when such an oracle exists (especially an offline one).
For something like a password reset token, it still needs to be paired with suitably random bytes.
So far I haven't encountered any problems in the short term by using the approach described.
1 https://blog.twitter.com/engineering/en_us/a/2010/announcing...
I was thinking of generating random character strings and simply retry when the db throws duplicate key error on insert. No sharding is necessary and I’d like to have efficient foreign keys. Any thoughts?
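That approach is workable as long as the keyspace is large enough that retries stay rare. A sketch of the retry loop (using an in-memory SQLite database; the table and helper names are hypothetical):

```python
import secrets
import sqlite3
import string

ALPHABET = string.ascii_uppercase + string.digits

def insert_with_retry(conn, name, key_len=8, max_tries=5):
    """Insert a row under a fresh random key; on a duplicate-key
    error, generate a new key and try again."""
    for _ in range(max_tries):
        key = "".join(secrets.choice(ALPHABET) for _ in range(key_len))
        try:
            conn.execute("INSERT INTO items (id, name) VALUES (?, ?)", (key, name))
            return key
        except sqlite3.IntegrityError:
            continue  # collision -- vanishingly rare if the keyspace is large
    raise RuntimeError("too many key collisions; use a longer key")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY, name TEXT)")
key = insert_with_retry(conn, "first")
```

With 36 characters and length 8 there are about 2.8 trillion keys, so a retry essentially never fires until the table is enormous; and plain short strings still index and join reasonably as PKs.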
However, I cannot imagine creating table entries without a datestamp. No matter what else you are doing, or what you index by, I would want YYYY-MM-DD_HH-MM-SS in every row.
Maybe I'm just weird that way ...
Recently I built a new system (a typical business-type backend) and was forced to use SQLite + C# + Dapper. With this combination I cannot use GUIDs/UUIDs, as Dapper cannot properly map them back to C# from SQLite, and my dislike of ints got me thinking. I have a random string generator (I've used it for years for things like OTPs and other reference numbers) where I give it an alphabet plus the length of the desired string. Using 8 to 12 characters, I can get a few million unique permutations; that is, if used as a primary key, a few million per database table. Then I hear, in the back of my head, the guys from work arguing that I would run out of unique combinations or would have to do lookups to see if a key already exists. So I decided to slap the year and month on as a prefix, so a key might look like this: 2105HSUAMWPA. This gets indexed really well too, and there is some inherent information visible in the key: year 21, month 5, then the unique bits. It's basically 4 lines of code that gets called on every new database entity. I think it will be easy to shard/partition the data too, if the need arises in the future, by simply looking at the first 4 digits.
Thus to summarize:
Data is sliced by entity type (customer, invoice, etc), then by date (2105 for May 2021) then by unique string.
What do you guys think about this approach? Anyone been burnt by something like this?
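For reference, here's a guess at roughly the four lines described (the exact alphabet is an assumption; only the shape of the output matches the 2105HSUAMWPA example given):

```python
import secrets
from datetime import datetime, timezone

# Assumed alphabet -- the comment doesn't say which characters are used.
ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZ"

def generate_key(length=8, now=None):
    """Key like '2105HSUAMWPA': two-digit year + two-digit month,
    followed by `length` random characters."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("%y%m") + "".join(
        secrets.choice(ALPHABET) for _ in range(length)
    )
```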
Are you sure about this? That would be a pretty poor showing for such a well-known solution in the ORM world, SQLite or not.
If you were going for sortability/understandability then I understand slapping your own together, but why not generate v1/v4/v6[0] UUIDs in your application and then send them along to the database, possibly prefixed with whatever you want it to be sorted by (though IMO you should just add that metadata to the thing being saved and sort on that properly)?
Bit of a rabbit hole: https://github.com/DapperLib/Dapper/pull/1082
It would be nice to see real benchmarking on millions of rows to compare the three, but my gut tells me you use int by default, bigint if you outgrow int, and UUID if you have plenty of money for hardware and need distribution capabilities a UUID would enable.
I'm generally fairly comfortable using int for business relationships, and bigint (long) for transaction data.
For performance, insertion speed often seems to be dominated by 'commit latency' to sync to the disk; rather than by record size. I would agree that record size affects table scan, but for many datamodels keying may often be a relatively small proportion compared to the size of text fields and other data.
I like to model keyspaces to work for 200 years, for the largest forseeable market growth, times at least a factor of 10 for safety.
One of the most memorable anecdotes of my professional career is a production environment going down because we hit maxint on an important (and busy) table. The dirty hack we used to get the site back up (hint: int is _signed_), and the weeks it took to plan, test, and execute the migration.
in one sense I agree with the author that things are generally just easier when you use surrogate primary keys, however they really should note here that the FOREIGN KEY constraint itself is not a problem at all as you can just use ON UPDATE CASCADE.
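To see ON UPDATE CASCADE fixing a natural key in one place (sketched here with SQLite's in-memory database and made-up tables; Postgres and others behave the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires opting in to FKs
conn.execute("CREATE TABLE countries (code TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE cities (
        name TEXT,
        country_code TEXT REFERENCES countries(code) ON UPDATE CASCADE
    )""")
conn.execute("INSERT INTO countries VALUES ('UK')")
conn.execute("INSERT INTO cities VALUES ('London', 'UK')")
# Fixing the natural key in one place updates every referencing row:
conn.execute("UPDATE countries SET code = 'GB' WHERE code = 'UK'")
```

After the UPDATE, `cities.country_code` reads 'GB' without any manual fan-out, which is exactly the "countless tables to update simultaneously" problem handled declaratively.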
(Actually, all serials are bigserials, but the “base type” they add to the table differs, and it’ll always come back to bite you later. Ask me how I know…)
[1] https://en.wikipedia.org/wiki/Birthday_problem#Probability_t...
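The linked birthday-problem math reduces to the handy approximation p ≈ 1 − exp(−n²/2^(b+1)) for n uniformly random b-bit IDs:

```python
import math

def collision_probability(n: int, bits: int) -> float:
    """Birthday-bound approximation: chance of at least one collision
    among n uniformly random `bits`-bit IDs."""
    return 1.0 - math.exp(-(n * n) / (2.0 * 2.0 ** bits))

# A billion IDs with 122 random bits (a v4 UUID) are effectively
# collision-free; a million random 32-bit IDs collide almost surely.
```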
It's an interesting tradeoff. The UX of the smaller YouTube video id links is probably of some benefit to them. Plus they have private videos for when you really don't want your video to be viewed, with unlisted being the middle ground of easy sharing but also keeping it exclusive.
What’s the use case for this where UUIDv4 or sequential ID isn’t better? Because it sounds like a solution in search of a problem.
> Are there any downsides besides making the application side a tiny bit more complex?
Are there any upsides to warrant the complexity?
Seems too easy to screw up.
https://www.postgresql.org/docs/current/plpgsql-control-stru...
For your use-case you could use incremental IDs.
For one thing you can represent a range of data more efficiently by just storing offsets. This means that instead of having to store a 'start' and 'end' at 8 + 8 bytes you can store something like 'start' and 'offset', where offset could be based on your window size, like 2 bytes.
You can leverage those offsets in metadata too. For example, I could cache something like 'rows (N..N+Offset) all have field X set to null' or some such thing. Now I can query my cache for a given value and avoid the db lookup, but I can also store way more data in the cache since I can encode ranges. Obviously which things you cache are going to be data dependent.
Sequential ints make great external indexes for this reason. Maybe I tombstone rows in big chunks to some other data store - again, I can just encode that as a range, and then given a lookup within that range I know to look in the other datastore. With a uuid approach I'd have to tombstone each row individually.
These aren't universal optimizations but if you can leverage them they can be significant.
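A sketch of that range encoding (the helper is hypothetical): contiguous sequential IDs collapse into (start, length) pairs, so tombstoning a big chunk becomes one entry instead of one per row.

```python
def encode_ranges(ids):
    """Collapse sequential integer IDs into (start, length) pairs --
    the compact range encoding that sequential ints make possible."""
    ranges = []
    for i in sorted(ids):
        if ranges and i == ranges[-1][0] + ranges[-1][1]:
            # Extends the current run: just bump its length.
            ranges[-1] = (ranges[-1][0], ranges[-1][1] + 1)
        else:
            ranges.append((i, 1))
    return ranges

# Tombstoning rows 1000..9999 is one (1000, 9000) entry, not 9000 keys;
# a UUID keyspace offers no such compression.
```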
- isntauuid[1] (mentioned in this thread, I've given it a name here)
- timeflake[2]
- HiLo[3][4]
- ulid[5]
- ksuid[6] (made popular by segment.io)
- v1-v6 UUIDs (the ones we all know and some love)
- sequential interval based UUIDs in Postgres[7]
Just add a UUID -- this almost surely isn't going to be what bricks your architecture unless you have some crazy high write use case like time series or IoT or something maybe.
[0]: http://gh.peabody.io/uuidv6/
[1]: https://instagram-engineering.com/sharding-ids-at-instagram-...
[2]: https://github.com/anthonynsimon/timeflake
[3]: https://en.wikipedia.org/wiki/Hi/Lo_algorithm
[4]: https://www.npgsql.org/efcore/modeling/generated-properties....
[5]: https://github.com/edoceo/pg-ulid
[6]: https://github.com/segmentio/ksuid
[7]: https://www.2ndquadrant.com/en/blog/sequential-uuid-generato...