* Our users were very understanding of what happened. We have received nothing but encouragement to keep on working.
* Some comments on HN were nasty. I'm glad to be 33 and not 23. Otherwise, I could have been driven away from building my product because of my own incompetence.
* Many commented on devs vs ops. The way I see it, I can ask a dev to supervise the work of an ops contractor. I can't hire ops + dev at this stage.
Any start-up has three main constraints: time, money and talent. These are not set in stone: you can use time to produce money (consulting), you can use money to buy talent (hiring), and you can even convert time into talent (training).
So, when people say "let professionals handle it": well, no, my particular set of constraints won't allow me to do this. My budget for this is around $100/mo. If I completely ran out of money, I'd have to take down the site indefinitely, which has the same effect as an HD loss.
It is clear now that I lack enough resources to run a complex app reliably. My focus in the next months is procuring those resources (money) so I can put it back into the product (ops and devs).
1) Design talent - presenting the coolness in a way that others can understand
2) Technical talent - taking great ideas and composing systems to make them real.
3) Legal talent - to keep you covered from the folks who would want to kill you when you are successful.
4) Operations talent - making sure that you can keep doing what you're doing over time.
5) Sales talent - communicating what you are doing in a way that the picture appears in someone else's head, and the value is clear.
You need coverage on all 5 of those skill sets. You might find someone who can cover two or three (they will be in high demand), or you may need to recruit to fill them, but unless you have them all, your highest risk of failure will be that undefended flank.
well, drat. I was talking about launching a "tested backups" service, but it's just not really worth my time until you get to the $500/month level or so, and I'd probably want a setup fee on top of that.
(For that, I'd give you a full working replication of your production site, hosted on my stuff- something that, in case of emergency, you could cut over your dns and just run with. Something that you could go to at any time and check on by going to yourdomain.backups.prgmr.com or something. Obviously, this would take me setting up some sort of replication of your database. Obviously, this also means that I'd need to know your application well enough to figure out how to make the running backup not conflict with the primary, and how to cut over to the backup /as/ a primary and how to cut back.)
I mean, I can do basic backups really cheaply, but testing them? that... that takes effort. Effort and understanding the application. And untested backups, meh, there's no reason for you to pay me to do it (maybe you pay me for space, but that's the cheap part.) there are thousands of services that will cheaply give you a place to hold files.
Huh. Most of the work, on my end, would be up front. What if I charged you $100/month, but made you pre-pay a year in advance or something? that might be worth it for me. (assuming I had the option to back out and refund your money within the first X days should your application prove to be too difficult to replicate.)
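To make the "testing them takes effort" point concrete, here is a minimal sketch of the smallest possible restore test: archive a directory, restore it into a scratch location, and compare checksums. It only covers plain files, which is an assumption on my part; the full replicated site described above would also need the database and application to come up. The function names are hypothetical.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def make_backup(source: Path, archive: Path) -> None:
    """Write a gzip-compressed tar archive of everything under source.

    The archive must live outside source, or it would try to include itself.
    """
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=".")


def verify_backup(source: Path, archive: Path) -> bool:
    """Restore the archive into a scratch directory and compare it to source."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)
        return file_digests(Path(scratch)) == file_digests(source)
```

A check like this only proves the files come back intact at the moment it runs; it says nothing about whether the application can actually start from them, which is the expensive part.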
A programmer can create in an afternoon a feature that can be sold for 100K+ dollars. And get the credit, glory and chicks. Or at least part of the credit.
To properly value an admin, you have to have had your ass pulled out of the fire by a good one a couple of times. Which comes with experience.
In a brief email to the board, boasting of another hard day's work making the company leaner and faster, he says: "The office is already squeaky clean, why would we need cleaners?"
Not valuable my ass.
Although I guess people still forget that anything can and will fail at some point in time. That automated backup won't work one day: it'll be corrupted or just won't run.
To randomly pick an example - sure, automatic filesystem snapshots are a cakewalk these days, and a decade ago they were rather expensive. It seems logical to assume that things that we needed admins to do and rig up somewhat delicate systems for a decade ago are so easy now, we don't need people to focus on that...
This overlooks the fact that the baseline has just moved ahead. Sure, you don't need dedicated people for stuff you used to, but there is new stuff out there that your competitors are hiring dedicated people to work on and push the envelope. If you're okay with doing what you could have done 10 years ago, just using reduced staff, that's great, but it's not going to win you much.
"I'm a sysadmin by profession but when I write code I don't
want to have to worry about backups, scaling, etc. Those
things get in the way of creating the product."
This is how you get yourself into the type of problems OP is talking about. You can't rely on infrastructure to intelligently save you. You can't rely on SaaS options to solve your design issues. If you aren't asking questions from the beginning like:
"How will I scale my product?"
"How will it handle failures?"
"How is it going to work at scale?"
Then you're going to be dealing with complex troubleshooting issues in a fragile infrastructure.
Don't try to replace the engine mid-flight.
It's entirely possible to write code in ways that make it really difficult to back up or scale, and it's your job to avoid this and any other gotchas.
An obvious example would be taking a binary copy of a database as a means of backup. I've seen this a ton of times, though most good admins know there are better ways to back them up. How do they know this? I'd hazard a guess: because the programmers were aware of the problem, documented it, and wrote the tools to get around the issue.
I imagine you really meant to say "I don't want to manage...", which is fair enough as long as someone else is doing it. :)
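On the binary-copy point above: copying a live database file can catch it mid-write, which is why databases ship their own backup tools (`mysqldump` and `pg_dump` are the usual ones for MySQL and PostgreSQL). The thread doesn't name a database, so the sketch below assumes SQLite purely because Python's standard library can drive it; the function names are hypothetical.

```python
import sqlite3


def backup_sqlite(source_path: str, backup_path: str) -> None:
    """Copy a possibly in-use SQLite database via the online backup API.

    Unlike copying the file, this yields a consistent snapshot even if
    other connections are writing to the source.
    """
    source = sqlite3.connect(source_path)
    target = sqlite3.connect(backup_path)
    try:
        source.backup(target)
    finally:
        target.close()
        source.close()


def dump_sql(db_path: str) -> str:
    """Return a logical dump: the SQL statements that recreate the database."""
    conn = sqlite3.connect(db_path)
    try:
        return "\n".join(conn.iterdump())
    finally:
        conn.close()
```

The logical dump is slower to restore than a file-level copy but is plain text, so it is easy to inspect and tends to survive version upgrades.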
As I understand it, they sell VMs running some form of Linux that can have more RAM/storage/bandwidth dynamically allocated, which takes out hardware-related worries.
In terms of software though, there are still quite a few problems that need to be thought about.
For example:
* How do we back up files and databases? Where to, and how often? Are we just duplicating every so often or do we want snapshots at certain periods that we can revert to?
* How do we deal with software failures, like FS corruption?
* How do we update our software stack, when do we update it and how do we test that an update hasn't broken anything?
* What about if we want to concurrently run 2 versions of the same framework for different apps?
* How can we configure firewalling etc to allow trusted people to connect to the database, but block the people who spam the login form every 5 seconds?
* How do we make sure the software is configured correctly? Like charset encodings in the database, making sure that we have the correct modules installed into apache/php or the right gems installed etc?
* How do we manage background tasks, like cronjobs etc?
* How do we manage alerts when things fall over? Nagios etc.
A lot of the answers to these are going to depend on specific requirements for the project so are going to require some ops know-how to set up correctly. Or is it more the case that a cloud provider gives you a specific set up with limited options and you make everything fit around that?
Or is there some magic that goes on which I am missing?
Things like dealing with software failures, updating stack versions, and configuring the DB are pretty much the same with EC2 as they are with dedicated hosting. There are Amazon products that help (the CloudWatch monitoring service), but mostly you will be doing it yourself the same way you would on your own hardware. Because EC2 is a cloud VM host, some of these things are more convenient to handle than if you were on your own hardware. For example, everything can be done from the EC2 API, so you can programmatically spin up/down instances (machines) as things go down to keep everything working. But of course you need to set up this failover/auto-scaling system yourself (the API just lets you control the infrastructure).
This is the case with IaaS services like EC2 and Rackspace (you only get bare VMs with some extras), but if you move to a more hand-holding PaaS service such as Heroku, where you get the entire deployment system and failure handling system, then your software stack management and failovers are mostly handled by the service provider. Of course these services cost a lot more for an equivalent amount of compute power than IaaS services.
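The "set up the failover yourself" part on IaaS usually boils down to a reconcile loop: look at what is running, kill what is unhealthy, start replacements until you are back at the desired count. Below is a rough sketch of that loop. `FakeCloud` is an in-memory stand-in I made up so the example runs anywhere; it is not the EC2 API, and a real version would call the provider's API and health checks instead.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class FakeCloud:
    """In-memory stand-in for a provider API (hypothetical, not EC2)."""
    healthy: dict[str, bool] = field(default_factory=dict)
    _ids: count = field(default_factory=lambda: count(1))

    def launch(self) -> str:
        """Start a new instance and return its id."""
        instance_id = f"i-{next(self._ids)}"
        self.healthy[instance_id] = True
        return instance_id

    def terminate(self, instance_id: str) -> None:
        """Stop an instance and forget about it."""
        del self.healthy[instance_id]


def reconcile(cloud: FakeCloud, desired: int) -> list[str]:
    """Terminate unhealthy instances, then launch until `desired` are running."""
    unhealthy = [i for i, ok in cloud.healthy.items() if not ok]
    for instance_id in unhealthy:
        cloud.terminate(instance_id)
    while len(cloud.healthy) < desired:
        cloud.launch()
    return sorted(cloud.healthy)
```

In practice something like this runs on a timer, and the hard part is the health check rather than the loop itself.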
I did evaluate S3 for backups for one project, but concluded that an rsync script would be simpler and more portable.
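For a sense of what such an rsync script might look like, here is a small sketch that builds a dated snapshot command. The paths and function names are made up for illustration, and it assumes `dest_root` is an absolute path on a locally mounted disk, because `--link-dest` needs a path as seen from the receiving side.

```python
import subprocess
from typing import Optional


def build_rsync_command(source: str, dest_root: str, stamp: str,
                        previous: Optional[str] = None) -> list[str]:
    """Build an rsync command that writes a snapshot to dest_root/stamp/."""
    command = ["rsync", "-a"]  # -a: archive mode (recursive, keeps perms/times)
    if previous is not None:
        # Files unchanged since the previous snapshot are hard-linked to it,
        # so each snapshot looks complete but only changes take up space.
        command.append(f"--link-dest={dest_root}/{previous}")
    # A trailing slash on the source copies its contents, not the directory.
    command += [f"{source.rstrip('/')}/", f"{dest_root}/{stamp}/"]
    return command


def run_backup(source: str, dest_root: str, stamp: str,
               previous: Optional[str] = None) -> None:
    """Run the snapshot; raises CalledProcessError if rsync fails."""
    subprocess.run(build_rsync_command(source, dest_root, stamp, previous),
                   check=True)
```

Keeping the command construction separate from running it makes the script easy to dry-run and log before pointing it at real data.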
In the case of things like Heroku, how are software updates handled? Do you contact them and say "I want to update to Rails version x.x, do it and run these automated tests" or do they just do everything on a schedule?
In other words, if you want an extra feature that is only present in a newer version, is this possible? Also, what I would worry about is them doing a random upgrade at an inconvenient time (like during a launch of something) and it breaking something subtle.
- Using other AWS services significantly decreases the number of subjects one has to figure out how to operate. Use SimpleDB/DynamoDB/RDS as the data store, Route 53 for DNS, SQS as the queue service, ELB as the load balancer, SES for emails, etc. In addition, there are services provided by third parties that one can use: New Relic and App First for monitoring, MailChimp for email newsletters, etc.
It's not that you can't do these things yourself, but installing, configuring and maintaining each of these solutions takes time, and most startups don't have the bandwidth and/or the expertise in all of these. Letting AWS handle at least some of them makes it possible to do the rest of the tasks properly.
- The pay-as-you-go structure makes it easy to create testing and staging environments that are almost the same as production with minimal additional cost.
- I find that we don't spend much time troubleshooting OS-related issues in production anymore. If there is a problem with one of the instances (and not the others), it often gets killed, and another one gets started within minutes.
The crux of the problem is that the napkin math was wrong because he obviously forgot to a) include his billable time spent learning such stuff and b) include the price of further staff to manage such things.
Personally I think there are tons of devs who can be good at ops, you just need to quickly come to the realisation that many of the aspects of ops are fairly essential non-optional functions of your job role, and should sit behind the writing of new code if your company has anything of value.
Once you come to these realisations, you can quickly see that although a dedicated server is cheaper than paying for an email service, the time sink and required technical knowledge will quickly more than even out the cost. That makes this not really about ops at all, other than that an ops guy wrote it.
If "opening ports" is on the pain list, running your own mail server is going to feel like running a marathon while having a seizure.
1) Not all developers are good at ops. Issues aren't always related to the code. Platform choices come into play and all the characteristics of a particular platform aren't necessarily known to developers.
2) In companies where developers are pushed hard -- It's a startup!, We must deliver! (the typical death march) -- after 60+ hour weeks, one's troubleshooting skills aren't the best.
In the last 3-4 startups I've been involved with, I've been one of a small handful of people who can do ops as well as code (across a myriad of technologies) and have pushed for engineering to provide as much logging, documentation, and guidance to ops when the company deems them relevant; when ops is considered part of dev, I pushed for getting some basic ops in house (the two instances I can think of were the 'weekends are in the schedule' type startups).
The trend in software development has always been to abstract policy, process, and infrastructure into modular components when possible, and to allow experts to manage them. I think the demand for such services largely proves their efficacy in the marketplace.
The argument over whether a developer should learn ops is interesting, as the answer differs depending on what she intends to get out of building the application.
Remember, most applications (though not all) are intended as business endeavors. I think you need to look in the mirror and ask yourself: "Am I building this application to serve a consumer's need? Or to become a better programmer / operations / systems engineer?"
In the case you're building something for a customer, time is your biggest and most important resource. Don't squander it by prematurely optimizing things. While it is admirable (and sometimes scalable) to invest in expanding your knowledge sphere, this often isn't the smartest business decision. Truth is, no matter how good you get with AWS, there is almost always someone else out there who is better than you and is offering their knowledge and experience as a service. And I can almost guarantee you your time (as a founder) will always be worth more than what this service costs.
I am an interaction designer, for example, but working with email got me thinking about ways of using it to support the interface. Instead of sending a reminder email for inactive users ("we miss you!"), you can send them a little interaction ("is this challenge good? yes or no"). I doubt that insight would have come if I hadn't worked with email on the technical side.
Those are not backups, they are merely information stores. I have seen time and time again that restoring a site from zero is extremely painful due to how poor most developer tools are. In fact, some tools that are designed to "make things easier" actually don't make them easier for ops folks when things break. A personal bugaboo is startup scripts that aren't "portable" between shells. That might sound overly annoying, but when you discover that your SW isn't starting due to differences in environment variables in a dev vs prod environment that should never have been there, it'll make more sense.
Your business could potentially be dead overnight without backups.
This doesn't just apply to the operations of your site; it also applies to every bit of data related to your business. Do you have backups of everything on your local machine? Do you also have backups in a remote location in case your local backup gets destroyed/stolen? If not, then stop everything until you've put a plan in action.
Who needs backups anyway? Everyone does, unless you don't care about your data.