I believe this behavior is changing in the 2024 edition: https://doc.rust-lang.org/edition-guide/rust-2024/temporary-...
Past tense: the 2024 edition stabilized in Rust 1.85 (and has been the default edition for `cargo new` ever since).
Their writing is so good, always a fun and enlightening read.
This tier approach makes a lot of sense to mitigate the scaling limit per corrosion node. Can you share how much data you wind up tracking in each tier in practice?
How large is the entry for each application -> [regions] table? Does the constraint of running this on every node mean that this creates a global limit on the number of applications? It also seems like the region-level database would impose a regional limit on the number of Fly Machines too?
So is this a case of wanting to deliver a differentiating feature before the technical maturity is there and validated? That's an acceptable strategy if you're building a lesser product, but if you're selling Public Cloud, maybe having a better strategy than waiting for problems to crop up makes more sense. Consul, missing watchdogs, certificate expiry, CRDT backfilling nullable columns: sure, in a normal case these are not very unexpected or to-be-ashamed-of problems, but for a product that claims to be Public Cloud you want to think of these things and address them before day 1. Cert expiry, for example: you should be giving your users tools so a cert can never expire, not fixing it for your own stuff after the fact! (Most CAs offer an API to automate all this; there's no excuse.)
I don't mean to be dismissive or disrespectful; the problem is challenging and the work is great. I'm merely thinking of the loss of customer trust: people are never going to trust a newcomer that has issues like this, and for that reason "move fast, break things, and fix what you find" isn't a good fit for this kind of product.
The "decision that long predates Corrosion" is precisely the point I was trying to make: was it made too soon, before understanding the ramifications and/or having a validated technical solution ready? IOW, maybe the feature requiring the problem's solution could have come later? (I don't know much about fly.io and its features, so apologies if some of this is unclear or wrongly assumes things.)
Huge pet peeve. At least this one has a date somewhere (at the bottom, "last updated Oct 22, 2025").
Is this a typo? Why does it backfill values for a nullable column?
https://github.com/vlcn-io/cr-sqlite/blob/891fe9e0190dd20917...
To ensure every instance arrives at the same “working set” picture, we use cr-sqlite, the CRDT SQLite extension.
Cool to see cr-sqlite used in production!
vlcn-io/cr-sqlite definitely built by someone who doesn't understand the fundamentals of the space
> As of cr-sqlite 0.15, the CRDT for an existing row being updated is this: (1) Biggest col_version wins
col_version is definitely something, but it isn't a logical timestamp!
--
https://github.com/superfly/corrosion/blob/main/doc/crdts.md
> Crsqlite specifically uses a "lamport timestamp" which, if you squint at from a distance, could be most concisely boiled down to a monotonically increasing counter.
lamport clocks can be boiled down to monotonically-increasing counters _per physical node in the system_, not per logical row/entity in the data model
so if you want to do conflict resolution based on logical (lamport) clocks you need to evaluate/resolve concurrent modifications according to site-specific logical clocks and their histories -- not just raw integers
which 100% vlcn.io does not do
> destroyed comes before started and so started is "bigger"
eep. good luck!
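The per-node vs. per-row distinction above can be sketched in a few lines. This is an illustrative toy, not cr-sqlite's or Corrosion's actual code; the names (`LamportClock`, `tick`, `observe`) are mine:

```python
# Toy sketch of a Lamport clock: a counter kept PER NODE, advanced on
# every local event and bumped past any timestamp seen from a peer.
class LamportClock:
    def __init__(self):
        self.counter = 0

    def tick(self):
        # Local event: advance our own counter.
        self.counter += 1
        return self.counter

    def observe(self, remote_ts):
        # On receiving a message, jump past the remote timestamp so the
        # happened-before relation is preserved.
        self.counter = max(self.counter, remote_ts) + 1
        return self.counter

# Two nodes making concurrent edits to the same row:
a, b = LamportClock(), LamportClock()
a.tick()             # node A writes once: its ts is 1
b.tick(); b.tick()   # node B writes twice: its ts is 2
# A naive "biggest col_version wins" rule would let B's edit win purely
# because B happened to write more often; the raw integers are not
# comparable across nodes until messages have actually been exchanged.
a.observe(b.counter)  # only now does A's clock reflect B's history
assert a.counter == 3
```

The point of the sketch: comparing the bare counters of two nodes that haven't communicated tells you nothing about causality, which is exactly the "destroyed vs. started" hazard described above.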
But they have to. Physically, no solution can be instantaneous; that's not how the speed of light or relativity works. Even two events right next to each other cannot find out about each other instantaneously. So the question becomes "how long can I wait for this information?" And that's the part that I feel isn't answered: e.g., if the app dies, the TCP connections die, and in theory that information travels as quickly as anything else you send.

It's not reliably detectable, but conceivably you could have an eBPF program monitoring death and notifying the proxies. That's the part that's really not explained in the article: why you need to maintain an eventually consistent view of the connectivity. I get why that could be useful, but noticing app connectivity death seems wrong, considering I believe you're more tracking machine and cluster health, right? I.e., not noticing that one app instance goes down, but noticing that all app instances on a given machine are gone, and consensus deciding globally where the new app instance will be as quickly as possible?
Did you ever consider envoy xDS?
There are a lot of really cool things in envoy like outlier detection, circuit breakers, load shedding, etc…
What we (think we) know won't work is a topologically centralized database that uses distributed consensus algorithms to synchronize. Running consensus transcontinentally is very painful, and keeping the servers central (so that update proposals are local and the protocol can run quickly) subjects large portions of the network to partition risk. The natural response (what I think a lot of people do, in fact) is just to run multiple consensus clusters, but our UX includes a global namespace for customer workloads.
I was thinking I'll just have to bite the bullet and migrate to PostgreSQL, but perhaps rqlite can work.
This blog is not impressive for an infra company.
Makes you think, that's all.
> in case people don't read all the way to the end, the important takeaway is "you simply can't afford to do instant global state distribution"
This is what people saw as the key takeaway. If that takeaway is news to you then I don’t know what you are doing writing distributed systems.
While this message may not be what was intended it was what was broadcast.
it would be super cool to learn more about how the world's largest gossip systems work :)
We're actually keeping the global Corrosion cluster! We're just stripping most of the data out of it.
We are probably past the size of the entirety of fly.io, for reference, and maintenance is very painful. It works because we are doing really strange things with Consul (batch txn cross-cluster updates of static entries) on really, really big servers (4 Gbps+ filesystems, 1 TB of memory, hundreds of big, fast cores, etc.).
Nice.
and I think the intended webfont is loaded, because the font is clearly weird-ish and non-standard, and the text is invisible for a good 2 seconds at first while it loads :)
In what sense do you think we need specialty routers?
How would you deploy Postgres to address these problems?