All that being said, I really respect the detailed response from a technical perspective as well as owning up to (and the decisions that went into) a spell of downgraded performance.
Later edit because I don't want to spam the comments: I'd love some context (maybe from cperciva himself?) around the performance enhancement of integrating the new Intel AESNI instructions. This is well beyond my depth, and while Colin mentions that it didn't necessarily increase performance, I'm wondering if the hope is that it would long-term? Or were there other benefits to such an integration?
I was using OpenSSL for that (which was using a software implementation). The code (you can see it in spiped) now detects the CPU feature and selects between AESNI or OpenSSL automatically. Given that the tarsnap server code was spending about 40% of its time running AES, it's a nontrivial CPU time saving.
I should probably have been clearer in my writeup though -- using AESNI was never a "once I roll this out everything will be good" fix. Rather, it was a case of "I have this well-tested code available which will help a bit while I finish testing the real fixes".
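For anyone curious what that runtime selection looks like, here is a minimal sketch of the detection idea in Python. (The real spiped code does this in C via CPUID; reading /proc/cpuinfo is just a Linux-only illustration, and the function name is mine.)

```python
def have_aesni() -> bool:
    """Linux-only sketch: report whether the kernel saw the 'aes' CPU
    feature flag. A program can use this to dispatch to hardware AES
    and fall back to a software implementation otherwise."""
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    return "aes" in line.split()
    except OSError:
        pass
    return False
```

On non-Linux systems this simply reports False, which is the safe fallback direction: you never select the hardware path without positive evidence it exists.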
This ties in to the last lesson I mentioned at the bottom:
5. When performance drops, it's not always due to a single problem; sometimes there are multiple interacting bottlenecks.
Every time I identified a problem, I was correct that it was a problem -- my failing was in not realizing that there were several things going on at once.
Very common! One thing that's been helpful for us is establishing predefined system performance thresholds that, if exceeded, initiate the chain of events that will lead to customer communication. "If X% of requests are failing, then we had better advertise that the system is degraded." Discussing and setting these thresholds in advance and the expectation that they'll result in communication helps drive the right outcome. It's not perfect, because one is always tempted to make a judgment call in the circumstance, which is vulnerable to the same effect, but it's a good start.
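As a sketch of what a "predefined threshold" can mean in practice (the 5% cutoff and the function name here are illustrative, not from the comment):

```python
# Agreed in advance, in calm conditions -- not decided mid-incident.
FAILURE_THRESHOLD = 0.05  # "If X% of requests are failing..."

def should_announce_degradation(failed: int, total: int) -> bool:
    """Return True when the failure rate crosses the agreed threshold,
    taking the judgment call out of the heat of the moment."""
    if total == 0:
        return False
    return failed / total >= FAILURE_THRESHOLD
```

The point is that the function is dumb on purpose: crossing the line triggers the communication chain, with no "just one more fix first" escape hatch.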
Thanks for sharing!
My workaround has been to make something else responsible for sending the email. In a team, this could be a manager setting a cut-off point after which communication must be made. When working on my own, I set an alarm for X minutes. When that alarm goes off I ignore the internal voice which says "just try one more thing, then send the email", and send an update to let the relevant people know my current progress, ETA to fix, and when they can expect the next update.
I think this is similar to how GTD encourages us to use systems for storing to-do lists instead of trying to remember them - our fragile human brains are not always to be trusted.
Much of the time I feel, "If I knew what the problem[s] [was|were], it'd be solved by now!" That's not exactly true, of course, but diagnosis is a large part of the total solution.
This type of answer that Colin gave above does not exactly win friends and influence people in most situations where you're part of a team or hierarchy. Can anyone share what they've done to give better answers in these cases? I understand why people want the answers, but I don't have them to give right away, particularly when it's Someone Else's system.
That is, as engineers we tend to want details. All the details. We want to know what happened, why it happened, how it's going to be fixed, and how long that will take. Because we want all that detail for ourselves, we hesitate to contact our customers/boss until we have all the details. Combine that with a desire to fix problems as they come up, and you end up with, "I never told you there was a problem because I was always one fix away from the solution."
But most people are not engineers. They want to be acknowledged. They want to feel informed, even if they have fewer details than what you would like to provide for them. Sometimes, something as simple as, "We've noticed that there is an issue and are currently working on a fix," goes a long way. Also don't be afraid to pull out, "Users have been reporting issues with backup performance. We do not currently believe this represents a service failure, but we are working to return performance to normal levels."
Your users trust you (otherwise they wouldn't pay you). If you "believe" something, they will too.
n.b. Our backups run outside of the hotspot times for Tarsnap, so we may have had less performance impact than many customers. I have an old habit of "Schedule all cron jobs to start predictably but at a random offset from the hour to avoid stampeding any previously undiscovered SPOFs." That's one of the Old Wizened Graybeard habits that I picked up from one of the senior engineers at my last real job, which I impart onto y'all for the same reason he imparted it onto me: it costs you nothing and will save you grief some day far in the future.
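One hedged sketch of that "predictable but randomly offset" habit: derive the minute from a stable per-host value, so each host always fires at the same time, but the fleet is spread across the hour (the function name and crontab usage are mine):

```python
import socket
import zlib

def backup_minute() -> int:
    """Pick a minute offset (0-59) that is stable for this host but
    spread across a fleet, so nightly jobs don't all stampede at :00."""
    return zlib.crc32(socket.gethostname().encode()) % 60
```

The returned minute then goes into the host's crontab entry in place of :00; because it's derived from the hostname rather than drawn fresh each run, the job still starts at the same predictable time every night.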
"AccuracySec=" in *.timer files lets you specify the amount of slack systemd has in firing timers. To quote the documentation "Within this time window, the expiry time will be placed at a host-specific, randomized but stable position that is synchronized between all local timer units."
You may still want to randomize timers locally on a host too, but the above makes automated deployment of timers that affects network services very convenient.
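For illustration, a hypothetical timer unit using that setting (the unit name, schedule, and windows are made up):

```ini
[Unit]
Description=Nightly backup (illustrative)

[Timer]
OnCalendar=*-*-* 03:00:00
# Let systemd place the firing time anywhere in a 30-minute window;
# per the docs, the chosen position is host-specific but stable.
AccuracySec=30min

[Install]
WantedBy=timers.target
```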
Yes, that sounds about right. I had maybe half a dozen people write to me who had noticed performance problems, and after the initial "backups failed because the server hit its connection limit" issue, it was people whose backups were already very long-running -- if your daily backups normally take 20 hours to complete, a 40% slowdown is painful.
FWIW, I live in Australia (so an 'off-peak' timezone), and schedule my cron job at an odd minute offset, so it may not have been an issue for me anyway!
Add some metadata telling tarsnap that it should expect a once-a-day/week/month backup from a given machine, and have it send you an email if one doesn't arrive?
Until the day when Colin considers it in-scope for Tarsnap, I recommend Deadman's Snitch for this purpose. I literally spend more on DMS to monitor Tarsnap than I spend on Tarsnap. No, I don't think that is just, either.
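The parent's idea is essentially a dead-man's switch. A toy sketch in Python (the machine names, intervals, and the in-memory last-seen table are all illustrative; a real service would persist check-ins and send the email):

```python
# machine -> maximum expected seconds between backups (illustrative)
EXPECTED_INTERVAL = {
    "web01": 24 * 3600,       # daily
    "db01": 7 * 24 * 3600,    # weekly
}

def overdue(last_seen: dict, now: float) -> list:
    """Return machines whose most recent backup is older than expected.
    Machines that have never checked in count as overdue."""
    return [m for m, interval in EXPECTED_INTERVAL.items()
            if now - last_seen.get(m, 0.0) > interval]
```

Anything this returns is what you'd get the alert email about.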
mental note: think harder, next time.
Years later, you've also become a cause célèbre for holding true to a clear business and lifestyle vision (again, perceived at a distance), in spite of the recommendations and 'support' provided by Patrick and others, including myself. Keep being true, and I suspect the community will keep learning from you, Colin.
Hey Thomas, are you listening here?
In all seriousness, the picodollars do an excellent job of attracting exactly the sort of customers I want... and turning away the customers I don't want. They were originally part joke and part a way to avoid arguments with customers who don't understand that 1 GB < 1 GiB, but now it's way more than that.
in spite of the recommendations and 'support' provided by Patrick and others
Don't be too harsh on Patrick. His vision for Tarsnap is not my vision for Tarsnap, but he has helped me to orient myself: The projection of "business" onto the subspace "geek" doesn't look very much like "business", but it's not the same as "kid right out of university who has never had a real job" either, and that's what you would see if I hadn't had advice (from Patrick, Thomas, various YC people, and the rest of HN).
Advice can be very valuable even if you don't follow it to the letter.
Happy to help if you'd like.
Yeah, that has been a work in progress for a long time. FWIW, I started using piops volumes when they were the only SSD option available -- they beat the crap out of spinning ephemeral disks.
I backup some VPS servers to my NAS at home using attic over an SSH tunnel. Incremental backups are quite small and it's easy to automate with a simple cron job.
It's also got more efficient deduplication, because it doesn't use rsync's naïve algorithm.
The downsides: it requires the agent to be installed remotely (a la rsync: no "dumb" backends), and it supports fewer storage backends to boot.
YMMV :-)
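To make the dedup comparison above concrete: attic-style tools cut the data stream into chunks based on content rather than at fixed offsets, then store each distinct chunk only once. A toy illustration of content-defined cutting -- the hash and parameters here are simplified far below what real tools use, so this only shows the shape of the idea:

```python
def chunk(data: bytes, mask: int = 0x3FF, min_len: int = 16) -> list:
    """Toy content-defined chunking: cut wherever a cheap running hash
    of the bytes so far hits a boundary pattern (mask), subject to a
    minimum chunk length. Real tools use a proper rolling hash."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF
        if (h & mask) == 0 and i - start >= min_len:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because boundaries are chosen by content, identical regions of data tend to produce identical chunks, which is what lets the backend store them once.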
At 2015-04-01 00:00 UTC, the Amazon EC2 "provisioned I/O" volume on which most of this metadata was stored suddenly changed from an average latency of 1.2 ms per request to an average latency of 2.2 ms per request. I have no idea why this happened -- indeed, I was so surprised by it that I didn't believe Amazon's monitoring systems at first -- but this immediately resulted in the service being I/O limited.
A sudden doubling of latency can have dire consequences for any system. Knowing that such unexpected changes are possible makes it hard to trust your environment, even if it is running fine today.
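Back-of-envelope on why that jump was so dire for an I/O-limited service: sustainable IOPS at a fixed queue depth scale inversely with per-request latency, so 1.2 ms to 2.2 ms takes away nearly half the throughput. (The queue-depth-1 model is an idealization, not a claim about Tarsnap's actual I/O pattern.)

```python
def max_iops(latency_s: float, queue_depth: int = 1) -> float:
    """Idealized peak requests/sec: requests in flight / time per request."""
    return queue_depth / latency_s

before = max_iops(0.0012)        # ~833 requests/sec at 1.2 ms
after = max_iops(0.0022)         # ~455 requests/sec at 2.2 ms
lost = 1 - after / before        # ~45% of capacity gone overnight
```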
This is why I don't use AWS for anything non-trivial, and I am wary of people who put critical infrastructure on it. (E.g., I don't care about Netflix; that service can run on AWS fine. But take Coinbase: if I were their customer and they ran on AWS, I would stop being their customer.)
Whenever AWS problems come up people talk about how "AWS is so much more efficient, you just outsource that stuff to the experts".
But that seems to imply that hosting on your own hardware in your own office is the only alternative. Of course we stopped doing that in the 1990s.
With AWS you have to know Linux and have ops people -- that's true everywhere. With AWS you have the additional burden of learning the AWS APIs and learning how to use AWS, which isn't transferrable, so that's a higher cost. With AWS you have to architect around the limitations of the way AWS is built, and your architecture becomes AWS-specific if you use those APIs, so that's an additional cost. You don't need fewer ops people -- probably more -- than going with another hosting service like Digital Ocean or Rackspace. And if you go with something like Hetzner, you pay 1/5th to 1/10th for machines with a lot more performance and local storage. (Though you get the additional latency of being located in Europe, if your primary customers are in the USA.)
Of course, I'm also prejudiced. I worked at Amazon and saw how the sausage was made and was not impressed. When AWS was announced as "running on the same infrastructure that powers Amazon.com!!!" as if it was a feature, I cringed. Amazon.com was having outages of parts or major components on a weekly basis at that time. Much of AWS is actually running on bespoke software (so not actually tested by Amazon.com when introduced, though I'm sure portions have been moved over at gunpoint) ... which actually makes it worse. People were trusting their data to a service that pretended to be backing a major e-commerce site but was actually untested outside of the company at the time.
And what have we seen since? An unacceptable level of failures. (in my opinion, of course)
But people seem to be very forgiving. When it's happening, everyone's in "how can we fix this" mode, and then when it's fixed everyone forgets and goes back to thinking of AWS as always running.
Ultimately, though, even with Azure or AWS you're going to need people knowledgeable enough to administer your compute instances anyway. So why not just run your full stack on a bunch of VMs from DigitalOcean or Linode, or rent a couple of dedicated servers and throw oVirt on them, saving yourself a significant chunk of money at the same time?
If you absolutely need guaranteed IO performance, use an instance store or move to dedicated hardware. Them be the breaks of cloud computing.
http://en.wikipedia.org/wiki/Fallacies_of_distributed_comput...
For more data, why not just use one of the many compressed, deduplicated, encrypted, incremental backup systems (attic comes to mind, I'm sure there are others) then just sync to S3 at a tenth the cost?
Edit: not to mention they offer actual support not just "contact the author" email link as a last resort.
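A hedged sketch of the setup being described, as crontab entries (the repository path, bucket name, and times are all illustrative; attic and the AWS CLI are assumed to be installed, and note that % must be escaped in crontab):

```
# m h dom mon dow  command                       (illustrative entries)
17  3  *  *  *  attic create /backups/repo.attic::nightly-$(date +\%Y\%m\%d) /home /etc
47  3  *  *  *  aws s3 sync /backups/repo.attic s3://example-bucket/repo.attic
```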
"While the Tarsnap code itself has not been released under an open source license, some of the "reusable components" have been published separately under a BSD license"
http://www.tarsnap.com/oss.html
The source code for tarsnap is available to view, so you could audit/inspect it yourself, but it is not under an open source license.
$300/year at tarsnap
$36/year at S3
Other than that, Attic is pretty excellent too.
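Those two figures are consistent with storing roughly 100 GB at Tarsnap's $0.25/GB-month versus S3's then-current ~$0.03/GB-month -- the 100 GB workload is my assumption, not stated above:

```python
stored_gb = 100                      # assumed workload, not from the comment
tarsnap_per_gb_month = 0.25          # Tarsnap storage price ($/GB-month)
s3_per_gb_month = 0.03               # S3 storage price circa 2015 ($/GB-month)

tarsnap_yearly = stored_gb * tarsnap_per_gb_month * 12   # = $300/year
s3_yearly = stored_gb * s3_per_gb_month * 12             # ~ $36/year
```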
Maybe you're thinking of my brother (Graham)? He was teaching cello around that time period, I think.
My mind is just completely blown right now.