BTW consider donating if you use lichess.
It looks like the servers are individually managed via OVH or a similar provider, rather than running their own gear in a co-location facility. Wonder why?
I wonder what the "Misc dev salaries" is for - only curious because it's a flat $5k
I really appreciate the benefits package for patrons. Thibault is zee best.
It's also weird seeing that they are still waiting on their provider to tell them exactly what was done to the hardware to get it going again. That's usually one of the first things a tech mentions: "ok, we replaced the optics in port 1" or "I replaced that cable after seeing increased error rates", something like that.
There are many red flags that raise questions.
That said, I stopped taking them at their word years ago. This isn't the first time they've had dubious announcements following entirely preventable failures. In my mind, they really don't have any professional credibility.
People in the business of System Administration would follow basic standard practices that eliminate most of these risks.
The linked post isn't a valid post-mortem; if it were, it would contain unambiguous timelines and specifics of both the failure domains and the resolutions.
As you say, a network connector could mean any number of things. It's ambiguous, and ambiguity in technical material is most often used to hide or mislead, which is why professionals writing a post-mortem remove every ambiguity they can.
It is common professional practice to have a recovery playbook and a disaster recovery plan for business continuity, tested at least every 6 months and usually quarterly. This is true of both charities and businesses.
Based on their post, they don't have one and they don't follow this well-known industry practice. You really cannot call yourself a System Administrator if you don't follow the basics of the profession.
TPOSNA (The Practice of System and Network Administration) covers these basics for those not in the profession. It's roughly two decades old now and well established, so ignorance of the practices isn't a valid excuse.
Professional budgets also always have a fund for emergencies based on these BC/DR plans. Additionally, resilient design is common practice; single points of failure are not excusable in production failure domains, especially when zero downtime must be achieved.
Automated deployment is standard practice as well, and it factors into RTO and capacity planning improvements. Cattle, not pets.
Also, you don't ever wait on a vendor to take action. You make changes, and revert when the issue gets resolved.
The first thing I would have done is set the domain's DNS TTL to 5 minutes as soon as failure alerts came in (as a precaution), and then, if needed, point the DNS at a viable alternative server (either deployed temporarily or running in parallel).
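To make the DNS step concrete, here is a rough Python sketch of the decision only. It assumes a hypothetical priority-ordered list of candidate servers and a caller-supplied health check; the function names, the record layout, and the example addresses (taken from documentation IP ranges) are all made up for illustration, and nothing here talks to a real DNS provider's API.

```python
from typing import Callable, Optional, Sequence

FAILOVER_TTL_SECONDS = 300  # the 5-minute TTL mentioned above


def choose_dns_target(
    candidates: Sequence[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first healthy address in priority order, or None if all are down."""
    for address in candidates:
        if is_healthy(address):
            return address
    return None


def plan_a_record(
    name: str, candidates: Sequence[str], is_healthy: Callable[[str], bool]
) -> Optional[dict]:
    """Build the A record you would publish (hypothetical layout, not a real API payload)."""
    target = choose_dns_target(candidates, is_healthy)
    if target is None:
        return None  # nothing healthy to point at; leave DNS alone and escalate
    return {"name": name, "type": "A", "ttl": FAILOVER_TTL_SECONDS, "value": target}
```

With the primary marked down, `plan_a_record("example.org", ["192.0.2.10", "198.51.100.20"], check)` would return a record pointing at the second address. Bear in mind that resolvers which cached the old record keep serving it until the old TTL expires, so lowering the TTL only helps for lookups made after the change.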
Failures inevitably happen, which is why you manage the risk with a topology of load balancers and servers set up in HA groups, eliminating any single provider as a single point of failure.
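One way to sanity-check a topology for this is sketched below in Python. It assumes an inventory expressed as hypothetical (server, provider, role) tuples, and it flags any provider whose outage would leave some role with no server at all. The inventory format and names are invented for the example and say nothing about any real deployment.

```python
from typing import Iterable, Set, Tuple

Server = Tuple[str, str, str]  # (server name, provider, role), all hypothetical


def provider_spofs(servers: Iterable[Server]) -> Set[str]:
    """Return the providers whose loss would leave at least one role uncovered."""
    servers = list(servers)
    all_roles = {role for _, _, role in servers}
    providers = {provider for _, provider, _ in servers}

    spofs = set()
    for provider in providers:
        # Roles still served if every machine at this provider disappears.
        surviving_roles = {role for _, p, role in servers if p != provider}
        if surviving_roles != all_roles:
            spofs.add(provider)
    return spofs
```

For an inventory with load balancers at two providers but the only app server at `provider-a`, this reports `provider-a` as a single point of failure. It only checks that each role is covered somewhere else, not that the surviving servers have enough capacity.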
This is so basic that any junior admin knows these things.
Outlandish workarounds only happen when you do not have a plan and you are dredging the bottom of the barrel.
The people behind lichess are very much professionals, have worked in companies before, and know about everything you're writing. However, instead of building a business, they decided to run a completely free and ad-free non-profit living off donations.
You don't get the same budget doing that as you would with a subscription-based or ad-supported service. That's true for the number of people maintaining it as well as the cloud cost you can afford.
If you look at their track record, uptime has been pretty good. Shit happens, but if you ask me it's worth it to have a service like Lichess that can exist completely on donations.
Is it that complicated for big tech to reply politely with the above statement when they suddenly disable your account for no obvious reason?
It is much more difficult for corporate cogs to have that level of care compared to someone who does their work with passion.
If a commercial provider told me they're dependent on a single physical server, with no real path or plans to fail over to another server if they need to, I would consider it extremely negligent.
It's fine to not use big cloud providers, but frankly it's pretty incompetent to not have the ability to quickly deploy to a new server.
By increasing the complexity you multiply the failure points and increase ongoing maintenance, which is the bottleneck (even more than money) for volunteer-driven projects.
Disclaimer: I'm not a network engineer so I may be misunderstanding the practicality and complexity of such a workaround.