undefined | Better HN

0 pointsscanr13y ago0 comments

We went old school. We have a whole bunch of MySQL instances that we've sharded (using consistent hashing). Queries are sent to all of the shards and then aggregated on one of our query servers.

We also aggregate as we go which means there's just less data around to have to store (the disadvantage being that we need to decide up front what we want to aggregate but that has been less of an issue). We can't do accurate unique counts on pre aggregated data but we've added hyperloglog into the mix as most of our use cases can tolerate a small amount of error.

We partition our MySQL tables, which means we can archive old data easily and don't need to run any repair jobs.

Disadvantages:

* Column oriented is a better fit for this kind of data (somewhat mitigated if the data isn't sparse and we store id's instead of values e.g. city_id rather than city_name).

* Schema changes are not pleasant. We found some deadlock with partitions and running alter table statements which locks up the entire server.

Advantages:

* It performs well enough for our needs

* Having SQL as a query language is very pleasant

* MySQL is well understood, which makes looking after it quite easy

0 comments

6 comments · 3 top-level

vosper13y ago· 2 in thread

Thanks for sharing this info! Of all the points you listed I think the most important for me is "MySQL is well understood, which makes looking after it quite easy". Personally I think that our analytics requirements can be satisfied by at least half a dozen different systems out there, so management, optimisation and maintenance will be major factors. For my part, these are my thoughts on the systems we've looked at:

1. Riak: K/V stores are conceptually simple, secondary indexes look nice, and the consensus from RICON was that scaling by adding nodes pretty-much Just Works. However, no-one at RICON seemed to use MapReduce, particularly not for real-time analytics. No SQL-like querying.

2. HBase: Built on popular technologies. Amazon provides it as a service, but altogether it's very complex (a lot of components, and we're not a Java shop) and we have had trouble achieving the performance we need - although it seems from various blogs and books that it must be possible.

3. Hypertable: Still testing it, hard to find much information about it outside of the main site.

4. Postgres: Everyone knows and loves Postgres. Scaling (and performance after scaling) are concerns - we might have to go down the sharding route, and may never be able to store raw events data as it comes in.

Haven't looked at Cassandra yet, but most likely will do that soon.

karterk13y ago

The trick to hbase performance almost always relies on getting the row key and column qualifier design right. Your keys and qualifiers should be chosen to exploit bulk scans. If you have any specific questions, shoot me an email I can give you some pointers (email in profile).

ismarc13y ago

For scaling postgres, have a look at PL/Proxy. You can basically partition tables across servers without knowledge of the fact that the other servers are even there. However, it doesn't easily cover you for fault tolerance and recovery.

amalag13y ago· 1 in thread

Did you try infobright?

scanrOP13y ago

We did run a test some time ago. We weren't in a position to pay for their enterprise edition and the community edition didn't have the right feature set for us.

Differences between the two here:

http://www.infobright.org/Learn-More/ICE_IEE_Comparison/

taligent13y ago

SAP HANA might be an option. Expensive but nicely implemented.

j / k navigate · click thread line to collapse