The major culprit here is MySQL load and we've been working on optimizing this (those interested can refer to my previous thread on this: https://news.ycombinator.com/item?id=6348903).
For now we've found that restarting the httpd and mysqld services brings things back to normal almost immediately.
While we continue to work on a - more elegant - solution, we're thinking of writing a bash/shell script that runs on hourly cron, checks load average (via uptime or /proc/loadavg) and if found higher than a threshold, restarts the services.
Can anyone think of any downside to this (used at least as a temporary measure)?
Also, consider creating a buffer in the application layer that buffers inserts/updates and executes them once as a single transaction, if they don't need to be executed immediately. It puts less stress on the database. Of course, this would require a lot of rewriting in your app, so not sure if you want to go through this route.
Indices are another area. I'm sure plenty of people have told you to optimize your indices, but also consider REMOVING unnecessary indices. Do you have an index on a text column, or multiple varchar columns? Those can be killer after awhile because inserts will slow down. Consider changing indices on varchar columns to indices on an int column by hashing those strings.
A quick suggestion: Install NewRelic (it's free for a certain period), and check out the database transactions that are taking up the most CPU load. Sometimes there's that 1 query you overlooked that is table scanning and could be the main culprit.
Also, are you using Rails by any chance? If so, there are other areas I can suggest.
And please post your server specs. Maybe your VPS just does suck (no offense), and the easiest route is just to upgrade your server.
MySQL performance tuning and table structure/query optimization-wise, we've left few stones unturned (but I'm sure there still is a scope for further improvement - isn't there always?).
Not using Rails.
Server specs: 4 cores, 3.2GB RAM
>and the easiest route is just to upgrade your server.
Please refer to the other thread for more on this.
Thanks again.
And how many reads are there usually a min? writes?
If you're doing something that makes any money you should get a real server before wasting days of effort or intentionally causing repeated outages.
I think your first problem is that you are using a VPS. You should never use a VPS in a high load situation like this - buy a dedicated server! They only cost about $70/month, which you should be able to afford if you have a successful site. Ideally you should get as much RAM as you can afford and/or an SSD drive.
I know you said you didn't want to throw hardware at the problem, but there are limits - you can't run a massive database on crappy hardware and expect it to work smoothly.
As mentioned in OP, it is MySQL.
>you can't run a massive database on crappy hardware and expect it to work smoothly.
Absolutely... but we do wish to milk the VPS for every cent of its worth and we're not sure we've got there yet.
Thanks for your input.
That's not specific enough.
Yes, the problem is manifesting itself via MySQL, but the bottleneck will be somewhere else; whether RAM, CPU, or I/O-wait.
You need to examine the restart process and analyze why it resolves the issue. If the reason is the abandonment of dead parasitic processes and memory leaks, you need to find out why and correct them. If the reason is that the restart unceremoniously drops all the current transactions, you need to increase capacity.
> Can anyone think of any downside to this (used at least as a temporary measure)?
I certainly can -- a bunch of really irritated visitors, whose transactions are abandoned. But that's only true if that is actually what's going on. Make sure you don't have software issues that are preventing efficient operation. If that's not the issue, you need to grow with your customer base -- increase server capacity.
Like I said in OP, we've identified MySQL to be the primary bottleneck and are already working on resolving this.
>whose transactions are abandoned.
At least some saving grace here... we're not selling anything... it is purely a content site
Thanks for your input.
Ah, yes. I remember from your prior post that you have very large databases and table sizes and are considering (or have begun) partitioning the largest tables. It turns out there is an innate partitioning scheme built into the most recent MySQL versions, but it has to be compiled into the running binary by way of a compiler flag:
http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
I apologize if you've already heard this, or if it's already in place as a partial solution.
> SHOW PROCESSLIST
This will show all active queries and the time they have taken to execute. The fact that the server seems to churn to a halt and then work its way through the problem suggests the issues are related to specific queries you can catch this way. Then use the EXPLAIN command on the slow queries to figure out why they are hanging your server and add indexes or tweak that part of your code (avoid joins on large tables, etc.) as necessary.
Is it prohibitively expensive/time consuming to get (or borrow) a bigger machine (on EC2, or in your colo, or what have you) to run MySQL on until you've figured out how to shard / scale out your application?
For the rest, this thread might give you some more info on the issue: https://news.ycombinator.com/item?id=6348903
As is being discussed in the other thread.