Some of the details escape me (this was a decade ago) but it was a fun combination of statistical inference and CS knowledge that I don't get to use often. Whenever integer overflow comes up in a systems engineering context I get a little tickled.
The integer generation was pretty simple: there was a fixed id for each server, and unless I am mistaken we had 5 servers per datacenter. Each id was basically <time><offset><id>, where time was a millisecond timer, offset was the number of ids generated in that same millisecond by the same server, and id was the machine's unique identifier. When we first talked about this process I thought that offset was going to roll over continuously, with every id incrementing it by one. It was changed to reset every millisecond specifically so that it would obscure tweet volumes.
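A minimal sketch of that scheme (the field widths and the <time><offset><id> ordering here are my assumptions based on the description above, not Twitter's actual layout):

```python
import time

class SnowflakeIds:
    """Toy <time><offset><id> generator. Field widths are assumed
    (41/12/10 bits), not Twitter's real layout."""
    TIME_BITS, OFFSET_BITS, MACHINE_BITS = 41, 12, 10

    def __init__(self, machine_id):
        self.machine_id = machine_id
        self.last_ms = -1
        self.offset = 0

    def next_id(self, now_ms=None):
        ms = int(time.time() * 1000) if now_ms is None else now_ms
        if ms == self.last_ms:
            self.offset += 1   # same millisecond: bump the offset
        else:
            self.offset = 0    # new millisecond: reset (this is what obscures volume)
            self.last_ms = ms
        return (ms << (self.OFFSET_BITS + self.MACHINE_BITS)) \
             | (self.offset << self.MACHINE_BITS) \
             | self.machine_id
```

Resetting the offset each millisecond means the counter never reveals a running total, only per-millisecond activity.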
At the time I remember reading a LOT of articles estimating tweet volume and most of them were way, way off. I don't know that we ever really put effort into correcting them though. =)
* - Does not account for changes in the system post 2012.
The offset was actually how we calculated volume, because millisecond collisions become a variant of the German tank problem[1]. A few times when y'all made tweet volumes public it mapped pretty closely with our estimates.
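The estimator fits in a couple of lines (illustrative sketch, not Twitter's exact method; note the offsets described above are 0-based, so you'd shift them to 1-based serials first). Treating the offsets seen in colliding ids as serial numbers sampled from 1..N, the classic frequentist estimate of N is m + m/k - 1:

```python
def german_tank_estimate(serials):
    """Minimum-variance unbiased estimator of the population maximum N,
    given k serial numbers sampled uniformly without replacement from
    1..N: N_hat = m + m/k - 1, where m = max(serials)."""
    k, m = len(serials), max(serials)
    return m + m / k - 1
```

With one observed serial of 19, the estimate is 19 + 19/1 - 1 = 37, the textbook tank-counting example.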
This was around 2011, so your knowledge should be relevant.
But IEEE 754 doubles have a significand that supports a 53-bit range. What am I missing?
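Presumably the ids in question exceed 2^53; above that point a double can no longer represent every integer exactly, so a 64-bit id that round-trips through a double (e.g. JSON parsed in JavaScript) can get corrupted. The cliff is easy to demonstrate:

```python
# Doubles have a 53-bit significand, so every integer up to 2**53 is
# representable exactly, but above that consecutive integers collide:
assert float(2**53 - 1) != float(2**53 - 2)   # still exact below the cliff
assert float(2**53) == float(2**53 + 1)       # two distinct ids, one double
```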
In other words, what do you mean that it was done by doing mod 3 of a signed int32? If it was a monotonically increasing or random int32, I don't see how that unevenness would manifest in a meaningful way.
I don’t get how mod 3 affects anything if you’re just incrementing…
Also what number are they using to modulo and where is that happening? Because at that point don’t they already have an incrementing ID before generating another one?
0->A
1->B
2->C
3->A
4->B
5->C
6->A
7->B
A and B get hit three times while C only twice, so C will see 66% utilization compared to A and B.

EDITED: s/once/twice/ thanks CyberDildonics
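The skew is easy to reproduce: take any window of 8 consecutive ids, as in the list above, and bucket them by `i % 3`:

```python
from collections import Counter

# Bucket a snapshot of 8 consecutive ids across 3 shards by id % 3:
shards = Counter(i % 3 for i in range(8))
# shards 0 and 1 each receive 3 ids, shard 2 only 2
```

Of course, over a window whose length is a multiple of 3 the buckets even out again; the imbalance only shows up when the snapshot size isn't divisible by the shard count.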
Eventually, we learned to treat Twitter as a lead generation tool for off-platform activity and apply old school funnel mechanics to it. The next problem became how to build a follower count. Sadly, that problem is what I think led to extremism on the platform. Hence: https://madrox.substack.com/p/yet-another-quitting-twitter
Also collecting data like this can be useful if you want to beat markets.
Seems this was less likely a "someone else will deal with it" problem, and more of a development / QA testing problem.
Then 10 or so years went by...
Whenever I write code like that which may break in, say, 5 years, I'll sign it in the comments and put my personal email and phone number, inviting future people to call me and I'll fix it for free (because I take responsibility for my code pretty seriously). Nobody has ever taken me up on it though...
So am I being naive when I use 64 bit values for unique IDs? Or is it actually plausible that 64 bits is plenty of information for centuries to come?
Edit: Also, technically reddit was using signed int32s. So they really only had 2^16 unique ids. If they used unsigned int32s, then that would have bought them a lot of time.
Small correction: Signed int32s means you have 31 bits for the integer value (2^31 values), not 16 bits (2^16 values). There are 32 bits; 1 bit is used up for the sign, and 31 remain for the number.
I should also probably point out some ambiguity in your analysis. To be clear, under the faulty assumption of linear consumption, if Int32s get exhausted in 10 years, switching to UInt32s will last 20 years of which 10 are already spent. So, under the faulty assumption of linearity, switching to UInt32s buys you an extra 10 years.
64-bit identifiers will last quite a while. Another defensible choice would be to switch to 128-bit unguessable identifiers (such as assigning them via AES-256 in counter mode) in order to provide defence in depth. If something goes wrong with your authentication, but the attacker is likely to spend millennia hammering your REST API with invalid IDs before exploiting an actual asset, then the consequences aren't so bad.
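AES in counter mode needs a crypto library, so as a simpler stdlib sketch (my substitution, and it trades AES-CTR's guaranteed uniqueness for simplicity; at 128 bits, random collisions are astronomically unlikely anyway):

```python
import secrets

def unguessable_id():
    """128-bit random identifier from the OS CSPRNG. Unlike AES-CTR
    over a counter, uniqueness is only probabilistic -- but with 2**128
    values the collision risk is negligible for any realistic workload."""
    return secrets.randbits(128)
```

The defence-in-depth point stands either way: an attacker who can't enumerate ids can't walk your object space even if an authorization check slips.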
Signed integers have 2^31 positive values (well, just about). It doesn't take 16 bits to store the negative sign :-)
That said, it would still take quite a while to get to 64 bits.
Infinite options at that point.
Isn't the answer to your question buried in your assumptions? For example, you seem to assume the rate of consumption of unique IDs is static.
All I'm saying is the jump from 2^32 to 2^64 is astronomical. I don't see using 64 bit integers for uids in my hobby code as something to be concerned about. In production code for a company I would use something more significant, but even then I feel like 64 bits will last a very long time.
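A quick back-of-the-envelope supports that: even at an assumed allocation rate of a billion ids per second (deliberately aggressive), 64 bits last centuries:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
IDS_PER_SECOND = 1_000_000_000  # assumed, deliberately aggressive rate

years_to_exhaust = 2**64 / IDS_PER_SECOND / SECONDS_PER_YEAR
# comes out to roughly 584 years
```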
[0]: https://bernardmarr.com/how-much-data-do-we-create-every-day...
2^31 seconds is 68 years, a middling human lifetime.
2^63 seconds is roughly 292 billion years.
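Both figures are easy to verify (the second comes out closer to 292 billion):

```python
SECONDS_PER_YEAR = 365.2425 * 24 * 3600  # mean Gregorian year

assert round(2**31 / SECONDS_PER_YEAR) == 68          # years
assert round(2**63 / SECONDS_PER_YEAR / 1e9) == 292   # billions of years
```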
It appears that the original developer thought that for 32 bits :)
I say that as someone who inherited a system that allowed for 2^32 session references stored in the db. Let's just say that sessions were created a lot faster than anticipated for the amount of traffic we had.
So one fine Sunday morning in a November a long time ago we ran out.
So, everything worked great until it didn't, and they spent a lot of time future-proofing it.
If you open a transaction, INSERT into a table with an AUTO_INCREMENT column, then roll back the transaction, no data is saved, but the auto-generated id is still consumed and the next INSERT uses id+1.
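A toy model of that allocator makes the gap obvious (this mimics InnoDB's behavior, where the auto-increment counter is advanced at INSERT time and never rewound on rollback; names here are made up for illustration):

```python
import itertools

class AutoIncrementTable:
    """Toy model of an auto-increment counter: ids are reserved the
    moment an INSERT happens, and a rollback discards the row but
    never returns the id, leaving a gap in the sequence."""
    def __init__(self):
        self._counter = itertools.count(1)
        self.rows = {}

    def insert(self, value):
        new_id = next(self._counter)   # id reserved immediately
        self.rows[new_id] = value
        return new_id

    def rollback(self, row_id):
        self.rows.pop(row_id, None)    # row gone, counter untouched
```

So heavy insert/rollback traffic burns through the id space much faster than the row count suggests.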
0: https://arstechnica.com/information-technology/2014/12/gangn...
Well, I say that, but actually, "adding an extra bit" is basically what going from signed to unsigned would do. So maybe they just added an extra (32nd) bit?