There's a very solid solution to this that isn't as widely known as it should be.
Read-after-write consistency is extremely important. If a user makes an edit to their content and then can't see that edit in the next page they load, they will assume things are broken and that the site has lost their content. This is really bad!
The best fix for this is to make sure that all reads from that user are directed to the leader database for a short period of time after they make an edit.
The Fly replay header is perfect for this. Here's what to do:
Any time a user performs a write (which should involve a POST request), set a cookie with a very short expiry - 5s perhaps, though monitor your worst-case replica lag to pick the right value.
I have trust issues with clocks in users' browsers, so I like to do this by making the cookie's value the server time at which it should expire.
In your application's top-level middleware, look for that cookie. If a user has it and that server time has not yet been reached, send a Fly replay header that internally redirects the request to the leader region.
This guarantees that users who have just performed a write won't see stale data from a lagging replica. And the implementation is a dozen or so lines of code.
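Those dozen-or-so lines might look something like this minimal, framework-agnostic Python sketch. The cookie name, region code, and helper names here are placeholders, not anything Fly prescribes; the only real mechanism is the `fly-replay` response header:

```python
import time
from typing import Optional

PRIMARY_REGION = "iad"          # placeholder: your Fly primary region
COOKIE_NAME = "pin_to_primary"  # hypothetical cookie name
PIN_SECONDS = 5                 # tune to your worst-case observed replica lag

def replay_header_for(cookies: dict) -> Optional[dict]:
    """If the pin cookie carries a server-side expiry time that hasn't
    passed yet, return a fly-replay header that sends this request to
    the primary region; otherwise return None and serve locally."""
    pinned_until = cookies.get(COOKIE_NAME)
    if pinned_until and float(pinned_until) > time.time():
        return {"fly-replay": f"region={PRIMARY_REGION}"}
    return None

def pin_cookie_after_write():
    """After handling a POST, return the cookie to set: its value is the
    server time at which the pin expires, so we never have to trust the
    clock in the user's browser."""
    return COOKIE_NAME, str(time.time() + PIN_SECONDS)
```

In a real app, `pin_cookie_after_write` runs after any successful write, and `replay_header_for` runs in top-level middleware: when it returns a header, respond with it immediately and let Fly's proxy re-run the request in the primary region.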
Obviously this won't work for every product - if you're building a chat app where every active user writes to the database every few seconds, implementing this will send almost every piece of traffic to your leader, leaving your replicas with not much to do.
But if your application fits the common pattern where 95% of traffic are reads and only a small portion of your users are causing writes at any one time I would expect this to be extremely effective.
Fly replay headers are explained in detail here: https://fly.io/blog/globally-distributed-postgres/
Chris McCord describes how Elixir does that with PostgreSQL here: https://news.ycombinator.com/item?id=31434094
Wikipedia implements this trick on top of PHP and MySQL global transaction IDs (GTIDs) so it definitely scales!
SELECT WAIT_FOR_EXECUTED_GTID_SET($gtidArg, $timeout)
https://github.com/wikimedia/mediawiki/blob/434c333d9b2be817...

I wonder if there's a PostgreSQL equivalent of this?
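There's no single built-in that I know of, but the same effect can be approximated with LSNs: capture `pg_current_wal_insert_lsn()` on the primary right after the write, then poll `pg_last_wal_replay_lsn()` on the replica until it catches up. A hedged sketch, where the `replica_caught_up` callable stands in for whatever DB driver you use:

```python
import time

def wait_for_lsn(replica_caught_up, timeout=5.0, poll_interval=0.05):
    """Block until the read replica has replayed past a write's LSN.

    `replica_caught_up` is a placeholder for your DB driver: it should run
    something like
        SELECT pg_last_wal_replay_lsn() >= %(target)s::pg_lsn
    on the replica, where target is the pg_current_wal_insert_lsn() value
    captured on the primary right after the write, and return the boolean.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if replica_caught_up():
            return True           # replica has caught up; safe to read
        time.sleep(poll_interval)
    return False                  # timed out; fall back to the primary
```

Unlike MySQL's server-side `WAIT_FOR_EXECUTED_GTID_SET`, this polls from the client, so it costs a round trip per check.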
I think FoundationDB does something really interesting with this problem. When you make changes, you do it via a transaction. But all the client reads are using the previous version, until the transaction changes have propagated across the nodes, then the new value is returned.
For something like this to be useful I think the code would need to be running on the user's network. That would drop server ping to sub 1 ms and open up a whole lot of interesting possibilities. But I don't see what changing server ping from 80 ms to 15ms gets me.
If you are using Phoenix, then LiveView is the obvious approach to dynamically updating a page based on server-side state. It's a similar-ish architecture to HTMX, but integrated into the framework. The page is rendered on the server as normal; then, when it loads on the client, a WebSocket is opened to a process on the server (the page includes the LiveView JS). When something changes on the server, new HTML is generated, and the parts that have changed are sent down the WebSocket to the client to insert into the page. LiveView is part of Phoenix, leverages Elixir's concurrency, is very performant, and a joy to use.
HTMX is a way of getting similar functionality but for a conventional server-rendered framework like Django, which doesn't have any of this stuff built in. It would be challenging to build it in anyway, because the concurrency isn't as powerful. Simplistically, Phoenix exists because Chris McCord was trying to build a LiveView equivalent in Ruby, had issues, went on a search, and discovered Elixir.
So either use:
Elixir + Phoenix + Phoenix LiveView
Or:
Python + Django + HTMX (Python and Django can be substituted for other frameworks like Rails)
In both cases, Alpine can then be useful to sprinkle in some clientside only UI features.
The HTMX and Alpine libs were intended to be sprinkled onto existing web apps (my usual Python/Flask stack), whereas Phoenix would be for building all-new projects.
I recently started playing with Phoenix and the intro to channels and LiveView has been a bit confusing. E.g. a few days ago I wondered if it was worth using something like Svelte for the frontend and then realised I could just use LiveView. As a newbie to the ecosystem, it’s taking a while to get the lay of the land and start understanding the options.
> It seems like this would add a whole new class of bugs, like “I just submitted a form to change a setting and when the page reloaded, it still showed my previous value in the form” – since the write hadn’t propagated to the local read replica yet.
Elixir is distributed out of the box, so nodes can message each other. This allowed us to easily ship a `fly_postgres_elixir` library that guarantees read-your-own-writes: https://github.com/superfly/fly_postgres_elixir
It does this by sending writes to the primary region over RPC (via distributed Elixir). The write is performed on a primary instance adjacent to the DB, then the result, along with the Postgres log sequence number (LSN), is sent back to the remote node. When the library gets the result of the RPC write, it blocks locally until its local read replica reaches an LSN >= the write LSN, then the result is returned to the caller.
This gives us read-your-own-writes for the end-user, and the calling code remains unchanged for standard code paths. This doesn't solve all classes of race conditions – for example you may broadcast a message over Phoenix.PubSub that causes a read on the remote node for data that isn't yet replicated, but typically you'd avoid an N query problem from pubsub in general by populating the data in the message on the publisher beforehand.
There's no completely avoiding the fact you have a distributed system where the speed of light matters, but it's Fly's (and Phoenix's) goal to push those concerns back as far as possible. For read heavy apps, or apps that use caching layers for reads, developers already face these kinds of problems. If you think of your read-replicas as cache with a convenient SQL interface, you can avoid most foot guns.
I'm happy to answer other questions as it relates to Phoenix, Fly or what Phoenix + Fly enables from my perspective.
There have been many posts hitting the HN frontpage regarding fly.io recently. Is it healthy to have so much content about a single PAAS platform showing up here so often now?
> I wish more startups would achieve this, YC or not. Whenever I run across one that's trying to succeed on HN, I try to help them do so (YC or not)—why? because it makes HN better if the community finds things it loves here. Among the startups of today, I can think of only two offhand who are showing signs of maybe reaching darling status—fly.io (YC), and Tailscale (not YC).
Personally, I think both these companies are doing a lot of incredible things. I also love Litestream, phoenixframework and other things they are doing.
Beyond the western borders of this little town, the tech gold rush has both expanded to epic proportions, affecting all the economies in the world, and also gone through enough booms and busts that the phrase "gold rush" seems somehow off.
As more and more young'uns join and jaded veterans return to throng the tavern alike, it often seems to be on the brink of either exploding with the largest gun fight in history, or jumping the shark.
And yet, against all odds, it retains its original magnetism - drawing throngs that grow in number and diversity while seers like https://news.ycombinator.com/user?id=patio11 and https://news.ycombinator.com/threads?id=tptacek continue to return - dispensing worldly wisdom worth its weight in gold from corner tables.
The secret is the man at the corner of the bar @dang, always around with a friendly smile and a towel on his shoulder. The only sheriff in the west who still doubles as the friendly bartender: always polite, always willing to break up a fight with kind words and clean up messes himself.
Yes, a cold hard look from him is all it takes to get most outlaws to back down; yes, his Colt-45 "moderator" edition is feared by all men. But the real secret to his success is his earnest passion (some call it an obsession) for the seemingly sisyphean task of sustaining good conflict - letting it simmer but keeping it at all times below the boiling point, based on "the code":
"Conflict is essential to human life, whether between different aspects of oneself, between oneself and the environment, between different individuals or between different groups. It follows that the aim of healthy living is not the direct elimination of conflict, which is possible only by forcible suppression of one or other of its antagonistic components, but the toleration of it—the capacity to bear the tensions of doubt and of unsatisfied need and the willingness to hold judgement in suspense until finer and finer solutions can be discovered which integrate more and more the claims of both sides. It is the psychologist's job to make possible the acceptance of such an idea so that the richness of the varieties of experience, whether within the unit of the single personality or in the wider unit of the group, can come to expression."
May the last great tavern in the West and its friendly bartender-sheriff live long and prosper.
Same reason the world seemed full of Rust articles for a while - at that point in time there wasn't (speaking as a C++ programmer) a pile of quality C++ articles around which the Rust articles were pushing out.
If there aren't new and different articles to fill the HN front page, then there has to be something.
And that something ends up being a base layer of blogs about "current stuff".
If it's getting posted to HN but not getting traction, for god's sake someone please let me know that too.
Literally the only thing we're trying to do is have HN be as interesting as possible. Missing out on the best content is disastrous for that goal—sort of like missing out on the best startups is disastrous for an investor.
We have a big announcement/technical post queued up --- we'd planned to run it on Monday --- and we're holding off on it because of the "organic" attention we're getting this week. We'd much rather talk about things like Litestream, app pentests, hiring processes, and how we replaced Nomad in our architecture.
But we're as aware as everyone else is that the front page has limited bandwidth, and we can't be on it all the time, so we're waiting for this (hopefully short) wave of attention to crest before we post our own stuff.
I never look at the person's name when replying or voting, only the content. For example, I remembered this tptacek not because I remembered his posts, but because they get frequently mentioned in other people's posts.
There's also a lot of momentum for the Elixir/Phoenix right now, and they're pretty tightly integrated with that community.
There are an awful lot of programming languages and methods that receive no hype. I have wondered about this; the most likely reason is that the majority of the crowd at this site come from a shared sphere and, to some degree, share the same type of priorities.
Personally, I like to stay away from the bleeding edge technology.
That does not constitute much of a problem for a lot of companies.
The annoying part is that recruiters often cram all sorts of technology into requirements for a CV and a job, even if the client has no need for those things, at least not yet.
There are millions, or at least hundreds of thousands, of "enterprise" software projects out there.
I admit it is not as sexy as the latest and greatest at startups, but it is a field where a whole lot of devs are working.
They perhaps do not spend as much time on HN, or they are quiet.
Most applications, or at least the bigger chunk of them, are transactional, and enterprise software is all about consistency and accuracy.
Nevertheless, I think it's a great engineering feat in and of itself, and that could well explain why it gets discussed so often.
The usecase I had for my startup (a few years ago, before fly.io) was "I have an Elixir/Phoenix application - stuff working in the background plus web frontend". I would like to host it with as little thinking about individual servers, load balancers etc. I went with GAE at the time and it was fine.
Fly.io seems like a much more streamlined version of the same thing, with the addition of "global load balancing" stuff on top, if I got to the point of caring about international customers.
Fly.io has a great combination of user virality/momentum and fundamentally technically interesting content on a wide range of topics.
Typically this type of thing goes in phases, and I wouldn’t worry about it, assuming you’re already OK with HN being biased towards YC-funded startups.
I think the lesson is partly that the typical somewhat-deranged writing style/topics are popular. More companies should try to write engaging blog posts and be more open if they want to be successful.
It seems to have paid off for them as I would guess at least some of the people trying it out are learning about fly.io from HN.
How does a container/database-as-a-service platform compete with a WireGuard-as-a-service platform?
Assessing whether to stand on the shoulders of the new giants who rise up every few years is a difficult problem that I'm interested in, and I appreciate the amount of context given here considering different use cases.
There are other alternatives like Render.com, railway.app, etc but it is clear that fly.io is unsurprisingly overhyped by the HN crowd, especially if you are looking for a Heroku alternative.
It’s like asking a barber if you need a haircut.
fly.io spends a tremendous amount of time on creating interesting technical content that attracts this type of attention. The company is intentional about this as a customer acquisition strategy. They have an illustrator on staff for their unique art style, for example. Their founder and senior technical staff engage with these posts and answer questions, etc.. It's not YC favoritism, it's a deep understanding of the developer first mindset / ecosystem and targeting it as a company strategy.
When we launched, we didn't do persistent storage for instances, so it didn't make as much sense to run ordinary apps here; rather, the idea was that you'd run your full-stack app somewhere like us-east-1, and carve off performance-sensitive bits and run them on Fly.io. That's "edge computing".
But a bit over a year ago, we added persistent volumes, and then we built Fly Postgres on top of it. You can store files on Fly.io or use a bunch of different databases, some of which we support directly. So it makes a lot more sense to run arbitrary applications, like a Rails or Elixir app, which is not something we would have said back in March 2020.
Worth noting that you don't have to use the distributed aspect. I have my site hosted on a single one of fly.io's smallest instances (which you can get 3 of for free), and even like this the performance is excellent (50ms response times), and it doesn't have the problem of spinning down when not in use like Heroku's free tier.
It's nice to at least get a choice of regions. For example, the company I work for (not hosted on fly.io currently) only has customers in the UK and Ireland, so it would be nice to be able to put our servers there with a simple config setting.
I'm working on big and small projects/companies and that has never been any concern of ours.
I always imagined it to be something only the very very big players care about. And as a big player I would usually bet on a big partner like AWS, GCP, Azure. Or am I missing something?
I've built 3 adtech companies including all the tech and it's one of the few cases where data needs to be spread across global regions for latency and regulations. It's a lot of effort regardless of the underlying provider and not worth it unless you really have the scale and latency requirements.
You can receive an HTTP response from the other side of the planet in less than a second so server-side rendering and sending a single HTML page works just fine. The problem is actually all these client-side SPAs that make a dozen requests and are actually much slower because of it.
I've looked at implementing this in the past and always found it to be SO difficult that the benefit would not be worth the cost.
Fly has changed that equation for me. It has moved this problem from "I'd love to do it if I could but it's just too hard" to "This is a thing I could do with small enough engineering effort that it would be worthwhile".
This is my favourite type of technology: I love things that move something from the "too expensive" to the "now feasible to implement" bucket!
Like you said, it's mostly snake oil except for very big players.
If my app has a handful of users that are split between the US and Europe or Asia, and the app is 90% reads, then the distributed DB approach of fly.io or Cloudflare makes a lot of sense. It also adds considerable complexity though, so it's obviously a tradeoff.
While we encourage our customers to try and use us asynchronously, we have a number of enterprises that don't and therefore demand incredibly fast response times with low latency. They pay us accordingly, so as a result we have geolocated databases (in our case though, we are using AWS Aurora replication).
If so, locality jumps up to the top of the performance bottlenecks pretty quick and there is no amount of performance optimization you can do to fix it.
Fly.io also has a clean, highly usable CLI and minimal set of services unlike the hundreds of options on other providers. But that’s just icing on top—the volume support is the big advantage for me.
No doubts there are plenty of more niche uses (if I were serving users internationally, I’d probably use Fly.io), but the use case just doesn’t seem as broad as the Heroku/PaaS comparisons make it out to be.
And even if your page becomes really popular, 3 locations (Europe, US, East Asia) are enough to be <200ms to any user in the world. And it keeps your setup and cost much lower.
One region works just fine for some apps. Some are worth going to three.
All of this tech sounds cool, but like the author, I'm unsure when it's called for.
Current SPA trends are about deploying your app separate from the backend, often CDN-style close to the user (because speed of light matters). Most apps at scale use caching for reads on hot-code paths, so now we have "eventual consistency" in the mix.
Elixir is distributed out of the box, so while "global distribution" sounds fanciful, it's literally baked into the Virtual Machine, Fly simply gives us a private ipv6 network across the globe. All Elixir sees is a cluster of hosts that it can connect to, and it's off to the races.
What I'm getting at is all this snark actually describes most application folks build today at any scale, and we build distributed apps with Elixir because it's a distributed platform.
> All of this tech sounds cool, but like the author, I'm unsure when it's called for.
Imagine if you could write your dynamic UI with realtime updates, and you didn't have to bootstrap JSON apis, or GraphQL schemas for it. Imagine doing `PubSub.broadcast(room, "new_message", ...)` and it gets sent globally to all your instances – with no external dependency. Want to show some activity on the page when something happens on the cluster? Broadcast the event, then write 3 lines of code to update your UI. Imagine writing "naive" template code that renders some markup, but what falls out is smaller payloads on the wire than your carefully typed and specified GraphQL schemas that require serialization rules for all your objects. Imagine doing away with all that and gaining all the benefits of payload size.
If that sounds interesting, Phoenix + LiveView would be called for any time you wanted a dynamic UI or realtime updates and bonus points if you care about writing less code and killing layers of abstraction. Fly would be called for for the same reason folks use CDN's today, to serve resources of the app close to the user, except we just serve the app there instead.
For the long tail of Rails/Django/Laravel apps sitting on Heroku or a pair of EC2s in Virginia, who are looking at SPAs with trepidation, I think the case is less obvious.
Sorry if the snark was excessive and thanks for your reply!
There are 3 relevant (for this comment) "performance layers" in building software:
- Cycle time of a team or of the project - this is affected the most by language/framework choice, DevOps infrastructure, and team working style - this should be measured in days/weeks
- Feedback loop for an individual dev working on a new ticket - this is based on the team's cycle time but in addition is really about the dev environment, team collaboration, how the team maintains quality, and how well-defined work is before being started - this should be measured in minutes/hours
- Performance of the software deployed in terms of response time to end users - milliseconds
Fly.io helps the most with category #3. But how often is that really the most important issue in choosing where to deploy your app? If an alternative made small sacrifices there (for example, went from 99.99% performance to 99%) but gained velocity for individual devs and the team to be able to ship better product more quickly, would the company/project be better off?
At Coherence (www.withcoherence.com) - disclosure that I'm a cofounder - we're laser-focused on a post-Heroku development platform that goes further than Heroku on categories 1 & 2 above (where I'd argue Heroku is still the gold standard) rather than focusing on category 3.
We're super early but in closed beta - if it sounds exciting please check us out and request a demo on the site!
The distributed features are there for when you need them – I don't think you have to use them. Or am I missing something?
If you are set up to do some kind of round-robin read from the read replicas, you can often get a different read from what you wrote, as the value hasn't replicated to your read replicas yet. The solution is to use the write endpoint when reading after a write.
He says that here but just wanted to point out that it can happen inside an api and cause real issues with data.
Some databases are cleverer about this. Things like Spanner and FoundationDB work differently so as to be fast to both read and write, but they’re much more complex to operate and use.
There is another quick trick though… if a client performs a write, set a bit in their session that causes all of their reads to come from the primary database for a short period, maybe a few seconds, just enough to cover the replication latency. This is a hack. It’s got a lot of downsides, but it’s a quick way to patch the problem if truly necessary.
Here's a post that benchmarked multi-region Postgres (Elixir/Phoenix on fly.io): https://nathanwillson.com/blog/posts/2021-09-25-fly-multi-db...
According to the post, for some users (residing in Japan, with the primary instance located in Amsterdam), a query could take ~200ms (median). If multiple queries are performed for each request, that could mean 1 second or more per API call - not so great if that's the case for multiple seconds after each write. I think this would eventually lead to putting more code in stored procedures, raising the question: why not use a distributed DB like Fauna in the first place?
Alternatively, the replication problem could be accounted for in the app itself. E.g. the SPA or the edge instance could retry reads following a write until the change from the primary instance has propagated, and up until then pretend that everything went fine. In case a write isn't replicated within 10 seconds or so, show an error to the user and let them retry the write action. This could lead to duplicate entries, but I'd estimate the chance for that to be quite low.
For this ACK that you talk about: it's AWS Aurora MySQL specifically here; do you know if that's a setting you can set up?
Redis instances are single-region single-replica, for example.
On another note, as soon as they offer serverless functions and solid redundant Redis + SQL I'll be thinking about moving some of our production services over there for a test run.
- you need an ELB - and this is not cheap (if you have an efficient backend, the ELB will dwarf compute costs)
- no persistent volumes, and you are encouraged to use GCS or Firestore
- each region requires a new deployment. No big deal, but certainly not super easy to automate, esp. given the need to run behind an ELB (which you need on GCP to have a WAF)
- Google SDKs for some languages suck big time. Most of the Python SDK is not async friendly, unbelievable as it seems.
I do use Cloud Run for projects big and small, and rather like it, but it's hardly a competitor to fly.io imho.
Cold start latency kinda ruins that no?
It's generally fast as is, and if you want to pre-allocate a minimum # of instances, cold starts are less of a problem (basically only if you are suddenly stampeded by a spike in traffic). But the whole smart-routing and localized-storage concepts are left for you to implement. You can have a bunch of Cloud Run services behind an ELB that does geo-proximity-based routing, but Firestore is region-bound, and Cloud Spanner can be very expensive. Not saying there are no workarounds, but it seems to me fly.io offers a much lower cost of entry here.
When a request on a read server attempts a DB write, the request is aborted and replayed on the main write server.
With some clever assumptions, such as "GET requests rarely write to the db" and "POST requests usually do", much of the write traffic can skip the read VMs entirely.
They created a Ruby Rack middleware[2] to standardize this pattern for Ruby on Rails.
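Stripped of the Rack specifics, that routing heuristic amounts to something like this Python sketch. The region names are placeholders; the `fly-replay` header is the real Fly mechanism, but the helper and its signature are invented for illustration:

```python
# HTTP methods we assume will write to the database
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}
PRIMARY_REGION = "iad"  # placeholder: the region holding the write DB

def route_request(method, current_region):
    """Decide whether a replica should handle a request itself or hand it
    back to Fly's proxy (via fly-replay) to re-run on the primary."""
    if current_region != PRIMARY_REGION and method in WRITE_METHODS:
        # Abort locally; the proxy replays the request in the write region.
        return {"fly-replay": f"region={PRIMARY_REGION}"}
    return None  # safe to serve from the local read replica

# usage:
assert route_request("GET", "syd") is None
assert route_request("POST", "syd") == {"fly-replay": "region=iad"}
assert route_request("POST", "iad") is None
```

The catch, as noted above, is the GET-requests-rarely-write assumption: any GET that does write still needs the abort-and-replay fallback.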
I think the author is talking about the complexity of dealing with read after write situations.
I just dropped DigitalOcean because of their price hike. No hard feelings. I was barely using it, and the product is growing more towards full-featured apps and teams, which is not as good a fit for me, an individual just screwing around. I don't fault them. I'm not their target customer.
Fly.io is very much designed for use primarily via their CLI tool. Their web interface needs some polish. But it does everything it says on the tin, for a price that's more than reasonable.
I only used Heroku briefly so I can't comment on similarities or differences with any authority.
As someone who is already very comfortable with container-based development, I'm happy with fly.io.
One big miss, though, is that you'll still need a database and S3, so I'm not sure I totally understand the value.
I plan to use fly.io + PlanetScale and I'm hoping to still get low latency between those two services, but it's nowhere near the low latency Cloudflare can achieve with their new edge Redis/DB offerings (or fly.io's DB-at-edge strategies). After looking into fly.io's DB strategies, I really feel hesitant to take on that level of devops/additional engineering when something like PlanetScale provides so much value out of the box.
Hope the fly.io team has something in the works either way! (And I'd love if they chime in with any input here in terms of performance between fly.io and existing DBaaS providers that are regionally replicated by default.)
html { -webkit-font-smoothing: antialiased; }
10 years later, still relevant - https://usabilitypost.com/2012/11/05/stop-fixing-font-smooth...
Secondly, the issue may be the font being used. I don't recognize it, but it's probably not optimized for modern screens, or it's a "converted" typeface originally designed for print.