I don't really see how a transaction that performs reads and writes can be replayed like this with any sort of guarantee about anything, though.
And there's no opportunity to build any sort of conflict resolution in there because the replay is automatic.
Maybe I missed something though?
My experience of building web apps, where most business logic runs within the scope of a single HTTP request, suggests that a quite impressive number of common cases could be served by the replay-request-against-the-leader pattern.
Also just for what it's worth: we agree with 'foobarbazetc. There's a section about that in the post. If you're saying there are important classes of applications this doesn't work well for, that's true.
1) Read current state of riders/drivers from DB (slightly expensive)
2) Solve a vehicle routing problem with the new rider request added to the current state (can be VERY expensive)
3) If there’s a good new solution, commit the changes that the VRP solutions suggest (this is the DB write, and it’s only slightly expensive)
The approach in this blog post would have us duplicating the most expensive thing our app does (step 2 above), for most requests - not good. Much more load on our system, and these already slow writes would take ~twice as long.
I’d hope Fly also lets you configure the load balancer - i.e. have a way to send certain requests to the “writer” nodes by default, vs. in a retry.
For most of the applications I build the HTTP verb is good enough for this - so I would add a tiny piece of Django middleware which looks for a POST to a non-primary region and sends fly-replay straight away at that point.
- Deletes, puts and patches are basically guaranteed to be writes
- Get is GENERALLY read only, but it often updates caches - and your cache may have similar concerns. Also, sometimes gets write to the DB, i.e. updating a “last seen/last activity” type field
- Posts are generally writes, but definitely not always. Any time you really need a body for a read (big request coming from a browser, or just too much structure in the args to encode in HTTP query params), that read will be a post
I like your approach of short-circuiting and immediately sending fly-replay, but instead of doing it by HTTP method, I’d probably do it by manually marking endpoints as write endpoints. And then I’d also have a catch-all similar to the blog post, based on DB errors, that both sends fly-replay and logs. And then keep track of that log, if it ever happens that’s a sign that you need to mark a new endpoint as a writer.
I'm in SJC (San Jose / Silicon Valley) and https://fly-global-rails.fly.dev/regions/syd shows that my replay time to their region in Sydney Australia is between 25 and 45ms. That's pretty quick!
Although even with REST you should be able to do the same thing, assume GETS can go to the read replicas, should work if you aren’t doing anything weird.
It didn't work cross region because no one builds an app expecting latency between their app server code and database. What we'd see on write requests was something like:
1. Query for data from read replica, perhaps for validation (0ms)
2. Do a write to the primary in a different region (20-400ms)
3. Query primary for consistency (20-400ms)
4. More queries against primary for consistency (20-400ms)
5. Maybe another write (20-400ms)
6. repeat
You can actually use our postgres this way if you want! But it breaks for most apps. It is much, much faster to ship the whole HTTP request where it needs to be than the move the database away from an app instance.
Does that mean that the app server would just fail, and then your network replays the same HTTP request (made by the client to the edge), toward an instance running the app server inside the region where there the Postgres primary instance is running? If so, you should then be able to figure out which client request provoked the SQL error and the app server would report some errors, right? Otherwise, the app server should be able to tell you if it failed because the Postgres server does not support write statements, which would require some minor adjustment in the app server code, isn't it?
I am sorry if I missed something from your description, but I am trying to figure out what the flow would look like. Thank you!
If you're already routing at the HTTP level then why not just route the request based on the HTTP method? (assuming the handlers for those are properly implemented.)
Why not just route a POST request to a backend connected to a writable database? And GET to a read-only database? I don't understand how it's any kind of an optimization to be continually bouncing queries out of the read-only replicas because Postgres threw an error.
"Most GETs are reads, but not all of them. The platform has to work for all the requests, not just the orthodox ones."
So I think it all depends on which level of the application you take the decision to gracefully fail and replay the request elsewhere. In some case, it is trivial: as soon as you see a POST/UPDATE/DELETE, because you trust your app; sometimes you may need to take the decision later, or at a lower level.
In the simplest scenario, fly.io could just forward the request to the right region, without even bothering the app server to reply with an error, but that would work only if GET requires no writes.
We can give people a library that catches Postgres readonly errors and make it a reasonably standard experience. We can't ensure that peoples' apps have good write hygiene. We _can_, though, educate people and tell them what to look for when they're trying to optimize performance.
There's also the graphql problem (and really any kind of non-rest RPC). It's somewhat rare that applications use HTTP verbs appropriately, APIs tend to bypass HTTP methods.
I'd worry about a POST route that does some expensive/slow reads/computations in the first half of the request, and then only writes at the end – lots of lost time! Would have been much better to say, hey, for this route (or for any POST route in _my_ application), please bounce to the region with the primary db instance.
Actually, while that could be done at the application level with reasonable latency, it'd probably still be better to allow the user to write some rules at the proxy layer for the "first guess".
Yes, that's what the code fragment does:
> if e.cause.is_a?(PG::ReadOnlySqlTransaction)
> YugabyteDB is the open source, distributed SQL database for global, internet- scale applications with low query latency, extreme resilience against failures. *
However Fly is nicer from a developer's point of view, because it doesn't require you to learn a new query language, you can write good old Postgres.
> Successive YugabyteDB releases honor PostgreSQL syntax and semantics, although some features (for example those that are specific to the PostgreSQL monolithic SQL database architecture) might not be supported for distributed SQL. The YSQL documentation specifies the supported syntax and extensions.
It is psql compatible, but you may run into missing features, so you will have to find a way around that. The documentation says that their latest release is based on psql 11.2, and psql 14 will be released soon.
1. The first thing we did was give them CockroachDB. Modeling data for geo distributed Cockroach is conceptually similar to yugabyte. It's not easy, though, to do that modeling. For read heavy apps most devs didn't want to do the work.
2. We also experimented with cross regions writes at the DB level (both with Postgres and Cockroach). The failure cases on those were bad. Apps are pretty naive about queries, so interleaving writes with >10ms of latency between reads resulted in really slow requests. These were pretty frequent! Most apps broke if the devs didn't carefully optimize that sequence.
The free we're going for here is "free for most developers". Changing a data model or optimizing how an app talks to a database wasn't quite free enough for our purposes, and neither were necessary for the types of workloads most people deal with.
They _definitely_ need to use something that's not-just-postgres to do multi region writes or distribute write heavy workloads. That's why we wrote the whole section on graduating to Cockroach. Graduating from Postgres to yugabyte also works!
create tablespace with replica placement, create table with tablespace, give the role explicit access to a single talespace
Yugabyte does everything else. It’s all standard sql with a couple of extra keywords.
This is just a neat hack you can do with totally standard Postgres. It's a thing other people do outside of Fly; we just like how easy you can make it if you can program both the CDN and the applications.
So your computer has memory banks that (as a rough first approximation) each get ~10x bigger, but with ~10x greater latency. The neat part though is that global consistency for writes happens with L1 and L2 working together, so pretty damn high up in the hierarchy. In contrast distributed applications typically at best will have a write through cache where global consistency happens all the day down at the actual data store. Exploring the idea of distributed MOESI where one client, because they had the permissions to write something in the first place, can then be the owner for that database row as it's still being flushed out seems like a great basis for a distributed system that might not even need the dedicated datastore at all anymore, but a sea of clients participating in coherency and replication. Albeit this has oodles of consistency and availability problems that very well might kill the whole concept, like how MOESI would absolutely fall over when trying to hotplug CPUs at arbitrary times.
Modern commercial distributed file systems do the type of caching you’re talking about. More generally, a Distributed Lock Manager [1] gets you halfway there. You’re right that a write-through cache is still necessary to achieve redundancy requirements, but especially in the context of file systems you can often coalesce local edits until the user explicitly issues stage equivalent of a flush operation, and only do the write-through then.
[1] https://en.m.wikipedia.org/wiki/Distributed_lock_manager
My advice is don't do those things in the blog and keep the DB and the app in the same region it will save you so many problems. If you can't just use a DB that was designed for that like Spanner or Cockroach.
I'm a big fan of "pinning" mechanisms where any time a user performs a write to your database you set a cookie that lasts for a few seconds and causes any subsequent requests from that user to go to the lesser, not a replica.
This solves the stale-read-after-write problem and ensures users will see any changes that they have personally made.
This does assume a mostly-reads, occasional-write application (which defines most of the applications I've worked on in my career).
So given that kind of application, what are the problems I'm not considering that I should look out for with a read-replica on a different continent?
Also long distances will make replication latency increase so unless _all_ requests are sent to the primary region you will likely have a read after write state data situation unless you go out of your way to ensure consistency
One thing about running thousands of DB clusters (which we did at Compose.com before this) is, these problems exist even without geo replication. And they exist at such a scale that you have to be good at handling them. Geo replicas are _more_ prone to these kinds of issues but we already had to be good at handling them.
We try to go out of our way to say that this stuff isn't a perfect fit for every application or a "best" way to do things. I think if a deployment like this is attractive for your application, you probably know it. I'm certainly not going to evangelize against people scaling up single-region databases!
Databases are like that, the place to get serious.
Sure I won’t let it deter me from a proper eval, but I could definitely see it scaring away suits.
I'd have a hard time calling something I was partially responsible for elegant, though, so this might just be our self deprecating nature rearing its head.
As a developer the way it was written scared me off. It reads like one of those recipe sites that game Google by adding a nonsensical story to the recipe. It seems like it took half the article to get to your "one trick". And ultimately you didn't deliver what the title teased.
But it really does result in globally distributed postgres.
I think it's safe to say that while we still love edge apps and are committed to them long term, we've switched gears towards full-stack applications (that is, we've gotten a lot more ambitious about what we want to be a good platform for). So this stuff is shaping up pretty quickly.
Postgres (and volume support) happened at about the same time as private networking did (Postgres was our first internal application for private networks). In January, we started doing user-mode WireGuard, so that `flyctl` runs its own TCP/IP stack and speaks WireGuard directly, without involving the OS (so you don't need root to use it). User-mode WireGuard gave us `flyctl ssh console`, and is also the way we do direct `psql`. There's more stuff in that vein coming; hopefully, it's gotten a lot easier to manage Postgres as `flyctl` has gotten more capable.
Where we want to be is for it to be trivial to run something like Postico.app to talk to a Fly Postgres database. You can do that now, but it's not yet _totally_ trivial. Hopefully you can sort of look at `flyctl` (it's open source) and see how that'll work.
Between you and me, we tried really hard not to build a Postgres service. We're committed now, it's gonna be good. For future DBs we plan to get big enough that people like Timescale to ship their products on Fly infrastructure.
You could run your own PG by modifying that app. Right now we're calling it "automated" and not managed, though. All alerts about health and other issues go straight to customers, we don't have DBAs that will touch these things yet.
We buffer up to 1mb of request data to be possibly replayed. We can set that arbitrarily high, it just seemed like a reasonable starting point.
Our infrastructure is all dedicated hosts, mostly running in Equinix facilities (we buy a lot of servers from what used to be packet.com).
New regions are hard in one particular way. Since we accept anycast traffic, ever new region has to be anycast-able for us. We outsource anycast management so we're kind of stuck waiting on other providers to opt in to a region.
When we get to a scale where we can manage our own anycast and decide we want to, we will be able to ship new regions relatively quickly. Some places are harder to get servers to than others, but we don't actually need a very large footprint to get started.
Soon anyone will be able to connect data sources by adding their read only credentials in the web (already possible via CLI but undocumented). The idea is to make exposing a database to the DDN as simple as exposing a website to a CDN.
We’ve designed all this to be multi-tenant and horizontally scalable, but we’re not actually running it on a distributed network yet. Personally, I’ve followed Fly for a long time and always loved the angle you’re taking. If any of you at Fly read this and want to potentially collaborate on a solution in this space, my email is in my profile.
We expose a pretty limited subset of sql through our APIs, and an even more limited form to public APIs, so I'm interested in how you're doing things providing joins to your users. "It exists but is super slow" is a reasonable answer as well :)
By default we don't copy any data into Splitgraph (although we do cache query results keyed by AST). When a query arrives, we parse it for any table identifiers, and "mount" a temporary foreign table for each, using the appropriate FDW. Foreign tables are mostly transparent to the Postgres query planner, so a JOIN will work (example [2]), and an FDW can even optionally implement aggregate pushdown. But yes, especially in the case of cross-dataset JOINS, there are some pathological cases. When your use case outgrows live querying (e.g. due to pathological JOIN queries), you can ingest any data source into a Splitgraph image, a versioned snapshot of data similar to a Docker image. [3]
Our focus right now is on the internal enterprise use case. We use the Socrata plugin as a demo, but our customers are more interested in plugins for data sources they use internally, e.g. Postgres, Snowflake, Google Search Console, CSV in S3, etc. We support all those too; you can mount them all locally, and you can also mount them on Splitgraph.com (or an internal deployment), but still thin docs / no web UI yet.
[0] https://www.splitgraph.com/blog/foreign-data-wrappers
[1] https://github.com/splitgraph/splitgraph/blob/master/splitgr...
[2] https://www.splitgraph.com/workspace/ddn?layout=hsplit&query...
I am now slightly disappointed.
Question about fly.io - how strict is the docker image requirement? How easy would it be to deploy a go application + systemd service?
We don't use Docker to run containers; in fact, we don't run "containers" at all. Instead, we transmogrify OCI containers into root filesystems for Firecracker micro-vms. Your "container" runs as a virtual machine with its own Linux kernel. You can do whatever you'd like with it.
Building the go binary is not a problem, but determining how the go binary runs, what it needs in order to be functional, etc. is where systemd comes into our setup.
And they are allowing multi-master writes? What is the update collision resolution? cell timestamps? vector clocks?