Scaling to 100M: MySQL Is a Better NoSQL

(blog.wix.engineering)

394 points · andreyvit · 10y ago · 175 comments

koolba · 10y ago · 15 in thread

So much to disagree with here ...

> Locks limit access to the table, so on a high throughput use case it may limit our performance.

Then use a proper database that implements MVCC.

> Do not use transactions, which introduce locks. Instead, use applicative transactions.

Or just use a database that handles transactions more efficiently.

> `site_id` varchar(50) NOT NULL,

Why varchar(50)? UUIDs are 16 bytes. The best way to store them would be as the binary bytes (which is how postgres stores them). If it's hex without dashes, it'll be varchar(32). If it's hex with dashes, it'll be varchar(36). Why did they pick 50? Future growth? Smart keys? Schema designer doesn't know what a UUID actually is?
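To make the sizes above concrete, here is a minimal sketch using Python's standard `uuid` module. The GUID value is an arbitrary example, not one from the article, and `storage_sizes` is just an illustrative name.

```python
import uuid

# An arbitrary example GUID; any valid UUID would behave the same way.
guid = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")

as_binary = guid.bytes   # raw 16 bytes, what a BINARY(16) column would hold
as_hex = guid.hex        # 32 hex characters, no dashes
as_text = str(guid)      # 36 characters, canonical dashed form

storage_sizes = {
    "binary": len(as_binary),
    "hex": len(as_hex),
    "dashed": len(as_text),
}
print(storage_sizes)

# Round trip: the 16-byte form loses no information.
assert uuid.UUID(bytes=as_binary) == guid
```

None of the three representations needs 50 characters, which is the point of the question above.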

> Do not normalize.

Bullshit. Normalize as much as is practical and denormalize as necessary. It's much easier to denormalize and it greatly simplifies any transaction logic to deal with a normalized model.

> Fields only exist to be indexed. If a field is not needed for an index, store it in one blob/text field (such as JSON or XML).

This is terrible advice. Fields (in a table) exist to be read, filtered, and returned. If everything is in a BLOB then you have to deserialize that BLOB to do any of those. That doesn't mean you can't have JSON "meta" fields, but if your entire schema is (id uuid, data json) you're probably doing it wrong. It's next to impossible to enforce proper data constraints, and all your application logic becomes if/then/else/if/then/else... to deal with the N+1 possibilities of data. Oh, and when you finally add a new one, you have to update the code in M+1 places.

spotman · 10y ago

>> Locks limit access to the table, so on a high throughput use case it may limit our performance.

> Then use a proper database that implements MVCC.

InnoDB does implement MVCC. MVCC is not a silver bullet.

>> Do not use transactions, which introduce locks. Instead, use applicative transactions.

> Or just use a database that handles transactions more efficiently.

Easy to say, hard to implement at this scale. If you do a lot of writes and reads concurrently to a hot dataset, it's really quite hard to beat this architecture. This is why it's such a popular and battle-tested solution for many extremely high-scale applications with workloads like this. Not to mention extremely well understood.

>> Do not normalize.

> Bullshit. Normalize as much as is practical and denormalize as necessary. It's much easier to denormalize and it greatly simplifies any transaction logic to deal with a normalized model.

But we are talking about performance... Having something in a single table that is denormalized is always going to be faster than having an elegant data model with "Everything In Its Right Place".

>> Fields only exist to be indexed. If a field is not needed for an index, store it in one blob/text field (such as JSON or XML).

> This is terrible advice.

So facebook/friendfeed, uber, dropbox, and many more are wrong then. Ok.

This is really all best practice for running something like this.

Of course it flies in the face of best practice for running a smaller system. Are there tradeoffs? Absolutely! Would it be smart to do this if the need for this scale is not obvious? Probably not.

You end up having more logic in your application and coordination layers, but this is all pretty good advice for people at this scale, and certainly not bad at all.

koolba · 10y ago

> Of course it flies in the face of best practice for running a smaller system. Are there tradeoffs? Absolutely! Would it be smart to do this if the need for this scale is not obvious? Probably not

From the article:

> The routes table is of the order of magnitude of 100,000,000 records, 10GB of storage.

> The sites table is of the order of magnitude of 100,000,000 records, 200GB of storage

That's tiny. Both of those easily fit in memory on modern hardware. This isn't cough web scale, this is peanuts.

The savings from having a simpler system that operates transactionally, without the disparate CASE/IF logic, would win over this monstrosity of a design.

For a counterpoint where this type of model makes more sense, check out Uber's data model[1]. It's a similar setup but a more applicable use case, and (without having any inside intel on it) I'd wager it is justified.

[1]: https://eng.uber.com/schemaless-part-one/

calpaterson · 10y ago

> Having something in a single table that is denormalized is always going to be faster than having an elegant data model with "Everything In Its Right Place"

This is one piece of folk wisdom that is untrue. There are significant speed disadvantages relating to large blobs of data the database doesn't understand. Serialisation time makes returning large JSON/XML objects expensive when you only need a small part. Overwriting a whole object to increment a counter is an unnecessary source of IO. Duplicating JSON keys in every record bloats the size of your working set, making it more difficult to fit into memory (or the fast part of your SAN).
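A rough sketch of the counter point, using Python's built-in `sqlite3` as a stand-in for MySQL (an assumption made only so the example runs anywhere). The table and column names (`sites_blob`, `sites_cols`, `views`) are hypothetical and not taken from the article.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sites_blob (site_id TEXT PRIMARY KEY, data TEXT)")
conn.execute("CREATE TABLE sites_cols (site_id TEXT PRIMARY KEY, title TEXT, views INTEGER)")
conn.execute("INSERT INTO sites_blob VALUES (?, ?)",
             ("s1", json.dumps({"title": "Home", "views": 41})))
conn.execute("INSERT INTO sites_cols VALUES (?, ?, ?)", ("s1", "Home", 41))

# Blob layout: read the whole document, deserialize, modify,
# serialize, and write the whole document back.
(raw,) = conn.execute("SELECT data FROM sites_blob WHERE site_id = ?", ("s1",)).fetchone()
doc = json.loads(raw)
doc["views"] += 1
conn.execute("UPDATE sites_blob SET data = ? WHERE site_id = ?", (json.dumps(doc), "s1"))

# Column layout: the database increments the value in place.
conn.execute("UPDATE sites_cols SET views = views + 1 WHERE site_id = ?", ("s1",))

blob_views = json.loads(
    conn.execute("SELECT data FROM sites_blob WHERE site_id = 's1'").fetchone()[0]
)["views"]
col_views = conn.execute("SELECT views FROM sites_cols WHERE site_id = 's1'").fetchone()[0]
print(blob_views, col_views)
```

Both layouts end up with the same count, but the blob path moves the entire document over the wire twice, and as a read-modify-write it can lose updates between concurrent writers unless the application adds its own locking or versioning.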

99% of denormalisation out there is unnecessary and has inferior performance. The best route to performance with row store SQL databases (any database?) is two fold: 1) get an expert on your database to help you write your code and 2) get an expert on your hardware to help you choose wisely. Denormalisation is typically a way to take a performance problem and make it worse while introducing data corruption and race conditions.

viraptor · 10y ago

> But we are talking about performance... Having something in a single table that is denormalized is always going to be faster than having an elegant data model with "Everything In Its Right Place"

Unless you specify the workload, that's anywhere between completely true and exactly incorrect. Do you have big values you're always interested in and a couple of tiny ids? That's probably going to be faster in one table.

Are you querying only the metadata most of the time and the big value is multiple KB, almost never accessed? You're just killing your readahead and multiple levels of caches for no reason. "always going to be faster" is always incorrect ;)

wvenable · 10y ago

To be fair, this article is about using MySQL as a NoSQL storage and so all of this advice is geared towards that use-case. I'd kill for so much traffic that any of this would be necessary as opposed to any RDBMS best-practices.

I do agree that UUIDs should be stored differently -- the use of varchar rather than a fixed length type for a primary key will hurt performance.

agentgt · 10y ago

We use varchar for UUID (on postgres), which surprisingly hasn't been that terrible performance-wise. And yes, we do use varchar(36), although on postgres it doesn't really matter because I think almost all varchars are text.

I would love to switch to native UUID someday though.

calpaterson · 10y ago

This article is pretty terrible but just in point of fact: it looks like they are using MySQL's InnoDB backend - which does support transactions and MVCC. If they're even talking about avoiding transactions for speed purposes (no matter how stupidly) they must be talking about Inno because in MyISAM BEGIN and COMMIT are no-ops.

nerfhammer · 10y ago

The article says "Note that a transaction is using a DB-level lock that prevents concurrent writes—and sometimes reads—from the affected tables."

In InnoDB, locks are row-level. MyISAM supports table-level locks, though that's not a transaction and shouldn't be confused with one. I don't know what a "database-level" lock is supposed to mean; are they really saying they're locking all tables to do a write? It doesn't sound like this author understands what a transaction is.

justizin · 10y ago

> If they're even talking about avoiding transactions for speed purposes (no matter how stupidly)

As many have pointed out, so much criticism of this article ignores that it is comparing MySQL to other key-value stores, which are not transactional. Many of the systems this would compete with are AP, with consistency not guaranteed.

It really sounds like they should be talking about MySQL cluster, which is protocol compatible but a completely separate implementation and essentially a key-value store with RDBMS attributes atop it. It supports many-master mode like mongo and other distributed systems, which is fairly mandatory for replacing them. It's hard to argue you can replace HDFS with anything that's not distributed, and if you didn't, why wouldn't you just use .. the actual FS? The author may not really understand that HDFS is optimized for storing large-ish files.

justinhj · 10y ago

There is nothing terrible or stupid about avoiding joins/transactions for speed. The technique of using blobs of data in MySQL rows in this article is perfectly valid and widespread at this point, as long as you understand the trade-offs.

jdiscar · 10y ago

The point of this article is showing how MySQL could be used to get a lot of what a NoSQL solution provides. NoSQL certainly has a place, but a lot of people don't really understand what that is and simply use NoSQL because it's popular, which cuts them off from a lot of useful features a SQL solution could provide them. That said, you're right that more care could have been put into the details of the article, but a lot of the points could be correct for their situation.

For example, 'Do not normalize.'

This was in the context of a read heavy table that competes with NoSQL. In that context, I think this is accurate. We noticed a big difference after denormalizing when we went from millions of rows to billions of rows.

The general advice of SQL solutions being as useful as NoSQL to a certain scale is good. I don't think the individual examples are horrible, but they aren't universal advice to achieve NoSQL performance.

tagrun · 10y ago

>> `site_id` varchar(50) NOT NULL,

> Why varchar(50)? UUIDs are 16 bytes.

Why do you think it's a UUID?

> The best way to store them would be the binary bytes (which is how postgres stores them).

Is it actually better than a pair of BIGINTs?

wtetzner · 10y ago

>Why do you think it's a UUID?

Because of this:

>Also notice that we are not using serial keys; instead, we are using varchar(50), which stores client-generated GUID values

viraptor · 10y ago

> If it's hex without dashes, it'll be varchar(32)

Or just char(32); no need to store the length if it's always the same.

return0 · 10y ago

This is not terrible advice in general. It is also not good advice in general. To make things scale, you obviously have to "break the norms". They give an insight into how their specific case works. It's "watch, learn and pick what fits you" material.

tiffanyh · 10y ago · 8 in thread

Since Wix is using MySQL as a key-store ... I wonder why they didn't look at using Postgres HStore [1].

HStore is a key value store built directly in the RDBMS of Postgres.

[1] http://www.postgresql.org/docs/9.6/static/hstore.html

luka-birsa · 10y ago

We drank the NoSQL Kool-Aid, mostly because we preferred a schemaless approach to our database and CouchDB looked like a cool thing to use. Tested, deployed in production, and after a while we figured out that most of the promises about performance, stability, etc. were mostly bull.

HStore was released, we migrated to PG, and we couldn't be happier. Zero issues so far.

suneilp · 10y ago

CouchDB isn't exactly the best thing to judge NoSQL by today, especially by its old performance issues, lack of automatic compaction, indexing on demand instead of proactive background indexing, etc.

gshulegaard · 10y ago

Just curious, but why use HStore instead of JSONB?

BinaryIdiot · 10y ago

> Tested, deployed in production, and after a while we figured out that most of the promises about performance, stability, etc. were mostly bull.

Well, to be fair, you would typically expect a database that's the equivalent of a remote hash table to be pretty much as fast as you can get. Now, I don't have any experience with Couch, but most of the other key-value stores I've used scream with performance. But if you're doing anything beyond basic manipulations, then it's going to require a lot of tuning, depending on the solution you went with.

But an RDBMS can be very similar. Both are useful tools when used correctly and there is a huge amount of overlap in terms of capability.

bpicolo · 10y ago

Hstore has a lot of issues in my experience. Very hard to query through most ORMs, expensive indexes, strings only, not nested data, lots of unoptimized parsers out there relative to json. Use jsonb if you want postgres KV storage.

Hstore is more k->k->v, which is the same but different, and also leads inexperienced developers to model entire relationships in a single column

atomic77 · 10y ago

Presumably they preferred to stay within the mysql world. I didn't read the article in detail, but I couldn't help but wonder why not use innodb more directly [1]? Imagine how much CPU is wasted using a full blown RDBMS SQL engine on top of InnoDB just to do key-value read/writes.

I have some experience with this, having worked on an experimental storage engine for mysql that we connected to a transactional in-memory k-v store. The performance penalty for a simple k-v workload through mysql was quite substantial, though our storage engine code was probably not sufficiently optimized. It would be interesting to explore this for innodb though.

[1] https://dev.mysql.com/doc/refman/5.6/en/innodb-memcached.htm...

agentgt · 10y ago

IIRC, if you are using JDBC (which is pretty much the only choice on Java), HStore is a pain to use since its query operators conflict with JDBC's parameter syntax ("?").

yoava · 10y ago

At the time we started, MySQL was more mature. Today, with a lot of MySQL installations at Wix, we have no need to try HStore as well. Having said that, if we had started today, we might have considered it, among other options.

BinaryIdiot · 10y ago · 6 in thread

> Use client-generated unique keys. We use GUIDs.

Minor note, but wouldn't UUIDs be better since they're time-based? Sure, it's really unlikely to hit an already-used GUID, but a UUID makes it impossible.

In fact is there a use case where it's better to use GUIDs over UUIDs? I couldn't think of one but I could be omitting something from my thinking so I'm curious.

Edit: apparently GUID and UUID are the same thing and GUID is simply Microsoft's original implementation of UUID. All this time I had no idea...

tqwhite · 10y ago

They are the same thing. Both should include time as well as the server address, etc.

Vendan · 10y ago

Note there are like 5 versions of UUID, and only v1 and v2 include time and server address. It's also considered bad practice to use them, as it makes your UUIDs guessable.
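For anyone who wants to see the difference, a small sketch with Python's standard `uuid` module. The node and clock sequence values passed to `uuid1` are made up for illustration; by default the library would use the host's hardware address.

```python
import uuid

# Hypothetical node (48-bit "server address") and clock sequence,
# supplied explicitly so the embedded fields are easy to recognize.
v1 = uuid.uuid1(node=0x123456789ABC, clock_seq=0x1234)
v4 = uuid.uuid4()

# Version 1 exposes the node and a 60-bit timestamp it was built from.
print(v1.version, hex(v1.node), v1.clock_seq, v1.time)

# Version 4 is random apart from the fixed version and variant bits.
print(v4.version, v4)
```

Anyone holding a v1 value can read the node and timestamp back out of it, which is the guessability concern mentioned above; a v4 value carries no such structure.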

BinaryIdiot · 10y ago

Ah you're right. GUID is Microsoft's implementation of UUID. I guess, much like how many refer to tissues nowadays as Kleenex, the terms kinda got mixed around. At least in my experience from seeing how they're used.

rspeer · 10y ago

UUIDs of all formats are universally unique, for all practical purposes.

Consider UUID4, the one with 122 random bits. The birthday paradox says that you would need about 2^61 UUIDs before you expect even one duplicate. If this concerns you, you might not recognize how big 2^61 is.

(edited because I was originally talking about 2^64, but there are 6 non-random bits in UUID4)

dbenhur · 10y ago

What does "expect even one" mean? This is a probability equation, so what p does "expect" correspond to?

Let's make this real concrete: with 122 random bits, you can issue a million UUIDv4s every second for the next 100 years and still have a less than one in a million chance that you issued a duplicate.

https://lazycackle.com/Probability_of_repeated_event_online_...

n = 5316911983139663491615228241121378304 (2^122), p = 0.000001 => m = 3260955271619137

3260955271619137 / (1000000 * 86400 * 365) => 103
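The figures above can be re-derived without the online calculator. This is a sketch that assumes the usual birthday approximation p ≈ 1 - exp(-m^2 / (2n)); the linked calculator may use a slightly different formula, so small differences in the last digits are possible.

```python
import math

n = 2 ** 122   # number of possible version-4 UUIDs (122 random bits)
p = 1e-6       # target probability of at least one duplicate

# Solve p = 1 - exp(-m^2 / (2n)) for m:  m = sqrt(2n * ln(1 / (1 - p)))
m = math.sqrt(2 * n * -math.log1p(-p))

per_year = 1_000_000 * 86_400 * 365   # one million UUIDs per second for a year
years = m / per_year

print(f"m = {m:.4g} UUIDs, about {years:.1f} years at 1M/s")
```

Under that approximation m comes out on the order of 3.26e15, in line with the number quoted above, and dividing by a year's worth of UUIDs at a million per second gives a bit over 100 years.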

rco8786 · 10y ago

GUID == UUID

jlas · 10y ago · 6 in thread

> When someone clicks a link to a Wix site... That server has to resolve the requested site from the site address by performing a key/value lookup URL to a site.

So Wix uses MySQL to resolve site routes internally? Is this the best way to do it? Would it be possible to use internal domain names and rely on DNS to resolve everything?

CraigJPerry · 10y ago

Define best but it's a pretty reasonable approach. Some DNS servers (powerdns is an example from memory) use mysql over more traditional back ends like dbm which aren't so hot for high volumes of zone changes. I imagine a site like wix could be pretty tough on DNS.

Re: nosql, I'm coming at that with really positive experiences in Cassandra but I can't imagine what kind of DNS system it would be a good fit for. The ability to tune CAP to fit DNS may be useful but in general I think of Cassandra as the solution you think about when your 2 node vertically scaled monster can't keep up.

jlas · 10y ago

It sounds reasonable, but with DNS you'd get geographic distribution for free, right? Won't you have to do something like sharding to achieve a similar thing with MySQL?

backslash_16 · 10y ago

I'm pretty new to this and just learned some more about routing requests in a cloud service at scale. For a lot of services where the location of the resources might change (frequently sometimes), you want to handle the routing internally and not use DNS because of the lag times and complexity of TTL and caching.

For companies other than Wix I don't know what is used to handle it on the back end, but I imagine it's either some specialized piece of hardware that can handle an insane load or some commodity hardware / cloud service & in-house software like here.

At the end of the day, and I'm sure I'm missing some edge cases, I think it's basically a service that provides a mapping of domain.com/user-resource-or-website to the location of their resources with no lag time when changes are made.

vkjv · 10y ago

IIRC, Github does this for Pages via an nginx module that queries MySQL.

http://githubengineering.com/rearchitecting-github-pages/

tracker1 · 10y ago

When I was at GoDaddy, we used a distributed Cassandra cluster to handle similar work... it worked very nicely for a few key lookups (site, resource)... most endpoints were static resources stored in C*, cached in local redis, and served via a load balanced application cluster.

golergka · 10y ago

On a site constructor with custom URLs created by millions of users for their websites every day?

rantanplan · 10y ago · 6 in thread

And PostgreSQL is a better MySQL so... all is settled?

stplsd · 10y ago

Is it? MySQL improved a lot since 5.1 days, you know.

matthewmacleod · 10y ago

That's true, and I don't doubt that many developers' opinion of MySQL is tainted by some of the issues in earlier versions. It's still difficult to see what newer versions offer over Postgres though - and Postgres has a lot of bonus features too (like the JSON storage types, which are sublime)

Buttons840 · 10y ago

Has MySQL added support for Common Table Expressions (CTE) yet? CTE was added to the SQL standard in 1999.

I ask because I always miss this feature when querying MySQL.

TylerE · 10y ago

The latest Postgres releases have improved tons too. Seems like we're seeing 10% speedups with every point release. Some of the new parallelization stuff in 9.6 is really sexy.

Thaxll · 10y ago

MySQL is faster than pg, you should ask why Facebook is running the largest MySQL shop.

pmontra · 10y ago

Maybe because they started with a tiny LAMP system in 2004 and got stuck there to the point they invested considerable resources to write their own PHP interpreter and optimize MySQL.

From https://www.percona.com/blog/2014/03/27/a-conversation-with-...

"we had the MySQL engineering talent we needed to work with the Oracle team to get 5.6 ready for production at our scale."

"We all worked hard to adapt 5.6 to our scale and ensure that it would be production-ready. We found some issues after production deployment, but in many cases we could fix the problem and deployed new MySQL binary within one or two days"

"Performance regression of the CPU intensive replication was a main blocker for some of our applications" followed by a description of how they addressed that.

So it's not vanilla MySQL vs vanilla PostgreSQL. They tailored MySQL to their needs and keep honing it. What they do has little resemblance with what the other 99.9999% of companies do, and I'm probably missing a few 9s. Another excerpt from that post highlights the differences:

"For example, typical MySQL DBA at small companies may not encounter master instance failure during employment, because recent mysqld and H/W are stable enough. At Facebook, master failure is a norm and something the system can accommodate."

My take: if they started with and stuck to PostgreSQL they'd have to work on it as they did on MySQL.

mh-cx · 10y ago · 5 in thread

I have not heard about "Wix" before, but maybe they should have done some more research before picking this name. To a German this sounds like "wichsen" which means, well, "wank"[1].

[1] http://dict.leo.org/ende/index_de.html#/search=wichsen

lucb1e · 10y ago

As a neighbor (Dutch) I can totally understand it and I still laugh pretty much every time I hear flickr[1]. Still, if I finally found a cool name for my project after a long search (it often is), I'm not going to cancel it just because it "sounds like" penis in Arabic or something.

[1] http://www.woorden.org/woord/flikker (tl;dr: homosexual)

chises · 10y ago

They played on the similarity to the German word "wichsen" in their campaign in Germany: https://www.youtube.com/watch?v=4AKDZmsy5yo It says "every day millions of people are wanking – wanking changed my life – when my girlfriend has fallen asleep, I go wank – I love wanking – my wife convinced me to wank – I'm wanking after my training – to be honest, we wank together most of the time – wanking is the future", and the worst part is the last sentence: "make it yourself – be a wanker".

narrowrail · 10y ago

Not only is this completely off-topic, they are a publicly traded company NASDAQ: WIX and they are currently valued at ~$1B.

mh-cx · 10y ago

Then please forgive my ignorance. I was not aware that they are so big. Even more so, though, I'm surprised by their name. Reading it really feels weird to me.

yoava · 10y ago

We did hear about it, if a bit late.

Our response - https://vimeo.com/138432267

bchociej · 10y ago · 4 in thread

Kinda begging the question aren't we? I turn to nosql for things that aren't key-value, generally.

dyeje · 10y ago

What do you mean? Isn't NoSQL inherently key-value?

tremon · 10y ago

All structured data can be represented as key-value, that includes SQL. They just differ in what constraints are used for the keys and values.

As for your question, NoSQL datastores can be grouped into multiple categories:

- column stores (like hadoop, cassandra, informix), which optimize for sharded and distributed storage of related data elements

- document stores (like elasticsearch), which focus on metadata organization for large opaque (binary) objects

- key-value stores (like redis, openldap), which are basically unstructured, associative arrays (hash maps). They allow the most storage freedom, and are hardest to optimize.

- graph databases (like neo4j, trinity), where more information is carried in annotated inter-object links than in the objects themselves.

LionessLover · 10y ago

Are graph databases key/value? Are document databases key/value?

Try this:

https://en.wikipedia.org/wiki/NoSQL#Types_and_examples_of_No...

https://www.youtube.com/watch?v=qI_g07C_Q5I

afandian · 10y ago

No. Triplestores and graph databases aren't key-value stores. There's more to NoSQL than key-value stores, although most examples seem to be.

TheGuyWhoCodes · 10y ago · 3 in thread

This is basically "We made it work, easy, all the rest are wrong". Wix is 10 years old; they probably started with MySQL and stuck with it. Is that wrong? Maybe, maybe not. If they were to start today, would they have used MySQL as well or gone with another solution? Did they spend the last 10 years building tools to help them scale MySQL (at which point it's easy for them to operate) rather than use a tool that had multi-server or multi-DC in mind?

Oh, and citing statistics without details is plain lying: how many servers, how much RAM, SSD-based or HDD....

whalesalad · 10y ago

Reddit does something very similar. https://kev.inburke.com/kevin/reddits-database-has-two-table...

I think the important thing to note here is that there are lots of different ways to use any given tool that can fit your use case without being an atrocity.

TheGuyWhoCodes · 10y ago

And they did it because of maintenance problems not (just) performance (from the link you provided). But they also said "Postgres was picked over Cassandra for the key-value store because Cassandra didn’t exist at the time. Plus Postgres is very fast and now natively supports KV." Which isn't patronizing like the article OP linked.

http://highscalability.com/blog/2013/8/26/reddit-lessons-lea...

sciurus · 10y ago

> If they were to start today, would they have used MySQL

Uber built something similar in the last couple of years.

https://eng.uber.com/schemaless-part-one/

So did Dropbox.

https://blogs.dropbox.com/tech/2016/05/inside-the-magic-pock...

jjoe · 10y ago · 2 in thread

Scalability is like an abstract painting. It's unique to one's infrastructure. Its writing or sometimes postmortem makes good brain fertilizer. Not so much more. Beyond that I wouldn't rush to implement scalability du jour.

A setup that works for a certain service won't necessarily work for another unless yours is a very close replica. Based on my experience in this area, and I'm a performance seeking nut, each platform, and even each traffic pattern, needs its own thinking hat.

That's what makes it so fun!

return0 · 10y ago

Spot on! I bet there are hundreds of different stories like this with unconventional uber-hacks for performance.

sriram_sun · 10y ago

That is an awesome way to describe scalability! Can you give a couple of examples?

fapjacks · 10y ago · 2 in thread

This may be an unpopular perspective, but here goes. For many years I ran a business doing web development. I had many clients approach me who were using Wix, and who I could not help, because Wix had effectively taken hostage their images. Because of those years of bad experiences (telling clients that they are screwed unless they keep paying Wix), I do not trust Wix, and so I do not trust this post. Should those clients have trusted that Wix would make their data available in the future? No, totally not. But that is the cost of doing shady things. Everything with your name on it now gets taken with a grain of salt.

grossvogel · 10y ago

By "taken hostage their images," do you mean that literal graphic files uploaded to Wix servers were somehow made inaccessible to the user?

fapjacks · 10y ago

Yes, exactly. I don't know how they do it now, but Wix did the "one big Flash blob as a website" and did not make data available to clients to download once they had been uploaded. So images and other data that had been "compiled" into the Flash blob were erased or something. There was no warning about this, and it took many by surprise. This effectively forced people to renew their subscription to Wix who otherwise wanted to use something else. The worst was a friend of a friend whose elderly mother had used Wix to upload old family photos thinking that Wix was a safe place to store them, not knowing any better. I felt so bad for that woman. I have no love for Wix at all.

electrotype · 10y ago · 2 in thread

A little bit off topic, but I would like to hear more about using Solr [1] instead of any "real" NoSQL databases.

I don't have experience with MongoDb and such, but I've always asked myself why someone wouldn't use Solr as a distributed NoSQL database... Am I wrong or, with Solr, you get that key/value scalable storage AND you get advanced search features as an extra?

Why would I want to use MongoDb instead of Solr? What killer feature doesn't Solr have?

[1] http://lucene.apache.org/solr/

ddorian43 · 10y ago

Haven't worked with Solr, but I have with Elasticsearch; both are based on Lucene.

Some issues are:

async indexes, unable to modify/remove indexes, unable to grow/shrink the number of shards, etc. (basically, search for "why not use ES as a primary data store").

y0ghur7_xxx · 10y ago

We evaluated Solr for a project just a week ago. It does not have the authn & authz that mongo has, and that was a feature we needed. Other than that, if all you need is a kv store, Solr is great.

Illniyar · 10y ago · 2 in thread

Two things that are sorely missing in this comparison to NoSQL are:

How are they performing horizontal scaling? I'm guessing they aren't. Without addressing the issue of sharding and scaling, they can't really compare the solution to NoSQL; it is the number 1 feature that NoSQL has over an RDBMS.

If they are achieving 1ms response time, then they almost certainly have the entire table in memory cache. What happens when the data grows beyond the size of the memory and it's not financially feasible to get a larger memory instance?

ysleepy · 10y ago

1. They probably don't need sharding, since the dataset is small enough to just replicate it in mirrors.

2. 1ms is achievable with SSDs, but 200K q/minute seems slow, my gut feeling tells me.

This post is more like "ha we don't need NoSQL for this special use case" - Once you need scaling and some sort of atomics, you quickly have to use HBase for row-level atomicity and scaling.

Redis is probably better suited for the posted usecase anyway.

Illniyar · 10y ago

Why HBase? why not just shard your keys?
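One simple reading of "shard your keys" is hash-based routing in the application layer, sketched below. The shard names and the `shard_for` helper are hypothetical; nothing in the article says Wix does this.

```python
import hashlib

SHARDS = ["db0", "db1", "db2", "db3"]   # hypothetical shard names


def shard_for(key: str) -> str:
    """Map a key to a shard with a stable hash.

    Python's built-in hash() is randomized per process for strings,
    so a cryptographic digest is used to get the same answer everywhere.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]


for site_id in ("site-a", "site-b", "site-c"):
    print(site_id, "->", shard_for(site_id))
```

Plain modulo routing like this remaps most keys whenever the number of shards changes; consistent hashing or a lookup table are the usual ways around that, at the cost of more moving parts.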

adenverd · 10y ago · 2 in thread

100M? Of course you'd scale an RDBMS for that, especially if you want searchability and analytics. It's way easier than a Hadoop -> Elasticsearch pipeline (or pick your flavor).

NoSQL databases are for BIG data. As in, billions of rows big.

tracker1 · 10y ago

I think it depends on the shape of your data... if your data is mostly collected in sets (as a single object base), and mostly key lookups, then a document store may be the best solution... Example, used to work for a mid sized classifieds site... most of the data was used as a single-listing query, and pulled in from a single base record. The SQL database was over-normalized and required a couple dozen joins if you wanted to see it flat... the system was crumbling...

Replicating the data to mongodb at the time, with a thin API for search and read queries, and omg, it was a staggering difference. Beyond just caching, all the search queries. Today, I'd be more inclined to use ElasticSearch (there was an issue with geo indexing at the time iirc)... just the same, it really depends on the shape of your data.

I feel that the storage shape should serve the needs of the application. SQL databases encourage normalization to such a degree, that it's costly to construct a semi-complex object in memory, especially when your queries will run across many tables for many thousands of users. Joins kill performance at scale... If you can get around that, you're often better off.

Duplicating data to a system that is a better fit for mostly-read/query scenarios is a good idea. There's nothing that says you can't have your data in two places, and it's easy enough to setup services that copy on update.

giaosudau · 10y ago

totally agree with that. 100M rows doesn't make any sense.

stevesun21 · 10y ago · 2 in thread

System design 101: keep business logic in the layer above the database layer rather than relying on a specific DB system to implement it. Designed this way, there shouldn't be any difference between using MySQL and using NoSQL; their role is just that of a storage engine. So you don't need to follow relational database practices such as foreign keys, constraints, and normalization anymore.

beachy10y ago

Trying to avoid using foreign key constraints in a relational database is not "system design 101", it's an instant fail.

When I cast my eye over a table with foreign key constraints, I am 100% certain that every single row conforms to those constraints, and always will.

By contrast, when the same table does not have constraints, but instead relies on some business logic layer to enforce them, then I have to consider whether there might be corrupt rows put in there by:

- bad business logic code

- bad import scripts

- some contractor who used to work for us 5 years ago and briefly used his PHP script to push up some data
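To illustrate the point about constraints, here is a minimal sketch that uses Python's built-in `sqlite3` as a stand-in for a relational database (the thread is about MySQL, so treat this only as an analogy). The `sites` and `routes` table names are borrowed loosely from the article; the columns are hypothetical. With the foreign key declared, the database itself rejects an orphan row regardless of which script tries to insert it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE sites (site_id TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE routes ("
    " route TEXT PRIMARY KEY,"
    " site_id TEXT NOT NULL REFERENCES sites(site_id))"
)

conn.execute("INSERT INTO sites VALUES ('site-1')")
conn.execute("INSERT INTO routes VALUES ('/home', 'site-1')")  # parent exists


def try_insert_orphan():
    """Attempt to insert a route that points at a non-existent site."""
    try:
        conn.execute("INSERT INTO routes VALUES ('/bad', 'no-such-site')")
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


orphan_result = try_insert_orphan()
print(orphan_result)
```

Without the `REFERENCES` clause the same insert would succeed, and every reader of the table would have to wonder whether such rows exist.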

stevesun2110y ago

Sadly, you remind me of the DBA-is-everything style of system design. In modern system design, a system needs more than just a database to store business state, and encapsulating business logic in a higher layer buys not just flexibility but also scalability. Take some time to think about the following three scenarios:

Scenario 1: what if a system needs to migrate to a different database system? Then the whole business logic needs to be re-implemented in the destination system's DSL.

Scenario 2: what if the system needs more than one storage system to persist business state? For example, I use a DB to store image metadata and S3 to store the image. I don't believe foreign key constraints will still work.

Scenario 3: what if the system needs to process business state asynchronously, for example with a message queue?

Also think about how to do unit tests (this is also how we keep the business logic correct) and how to do CI/CD. System design is more than just ERD design.

1 more reply

okigan10y ago· 2 in thread

>An active-active-active setup across three data centers.

Any info how "active-active-active" (I assume 3 aws regions) is accomplished?

grossvogel10y ago

Some kind of master-master replication, probably: https://dev.mysql.com/doc/refman/5.7/en/mysql-cluster-replic...

jarnix10y ago

In my company we are using a mysql cluster based on galera, (percona xtradb server) and it's a master-master solution that is rather easy to deploy and maintain. The only limit was that we had to use a single server for writes (that would make the "cluster" thing kind of useless but it's not in fact, we are using a load balancer on top of the cluster and the load balancer decides of the "master" where writes go so it's transparent to our application), definitely worth a try.

jondubois10y ago· 2 in thread

The problem with SQL DBs is that they just weren't designed for distributed computation to begin with. SQL doesn't take into account CAP theorem - So it lets you write queries which work on a single machine but which cannot scale to multiple machines.

On the other hand, many NoSQL databases like MongoDB and RethinkDB have a query language which was designed to run on both single-machines and distributed infrastructure (in a homogeneous way); the same queries which work on a single machine will also work at scale on multiple machines - No need to rewrite your queries as your app grows.

You CAN scale with SQL but you have to know what queries to avoid (E.g. table joins, nested queries...) but with NoSQL, you don't have to avoid any queries; if it's in the Docs, it's safe to use.

Finally, a major difference between SQL vs NoSQL is the typed vs untyped structure. Most SQL databases were designed in a time when statically typed languages were mainstream; so it made sense for SQL databases to enforce static typing on their data.

On the other hand, NoSQL was designed in a time when dynamically typed languages were popular and gaining more popularity (E.g. Ruby, Python, JavaScript); when using these languages, having to add SQL-specific types to data feels like an unnecessary step. With NoSQL you can still enforce a schema in the application code but your schema logic doesn't have to abide by any type constraints from the DB layer - Your schema is the ultimate authority on typing of your DB - It gives you the flexibility to be lazy with type-checking in the areas which are low-importance (where errors are tolerable) and strict where data type consistency is paramount.

Generally, NoSQL DBs impose constraints to query expressiveness in order to free you from architectural constraints. SQL DBs impose few constraints on query expressiveness but because of this, they add architectural constraints to your system.

timruffles10y ago

There's so much wrong with the above.

To pick a quick one: "query language which was designed to run on both single-machines and distributed infrastructure". Mongo has no fewer than THREE query syntaxes: standard, map-reduce[1], and the aggregate pipeline.

'homogeneous', lol.

[1] which even Mongo employees recommend people avoid like the plague https://www.linkedin.com/pulse/mongodb-frankenstein-monster-...

gaius10y ago

enforce a schema in the application code

HA HAHA HAHA

zzzcpan10y ago· 2 in thread

I thought the nosql movement was about distributed systems, CAP and all that. What does this "active-active-active" even mean? No consistency and no availability guarantees I presume?

cnlwsu10y ago

I may be reading this wrong, but I think they are proposing a not C not A not P solution... that's "ok" fast? They explain how to make a single MySQL instance run as key-value but I don't understand how it becomes master-master or cross DCs. Wonder if they run Jepsen or do any partition tolerance tests given their mentioning it.

viraptor10y ago

AFAIK the only way to do 3 masters in vanilla mysql is ring replication. That means C is B's slave, B is A's, A is C's. If that's what they do, then yeah, it's a noCAP deployment. No consistency if you insert the same UUID at the same time into 2 masters. No availability unless you implement it yourself by retry to another master. No partition tolerance, because if you break one replication link, half the writes are not replicated between other servers and you can't really both reconfigure the ring and replay the transactions.
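The "half the writes" claim can be checked with a small model. This is only a sketch of the ring topology described above (A feeds B, B feeds C, C feeds A), not of MySQL's actual replication machinery; the function names are made up for illustration.

```python
RING = ["A", "B", "C"]  # writes flow A -> B -> C -> A


def replicas_reached(origin, broken_link=None):
    """Servers that eventually receive a write made on `origin`."""
    reached = []
    current = origin
    while True:
        downstream = RING[(RING.index(current) + 1) % len(RING)]
        # Stop when the write is back at its origin or hits the broken link.
        if downstream == origin or (current, downstream) == broken_link:
            break
        reached.append(downstream)
        current = downstream
    return reached


def lost_pairs(broken_link):
    """(origin, destination) pairs whose writes never arrive."""
    lost = []
    for origin in RING:
        reached = replicas_reached(origin, broken_link)
        lost += [(origin, s) for s in RING if s != origin and s not in reached]
    return lost


print(lost_pairs(("A", "B")))
```

With the A -> B link down, writes made on A reach nobody and writes made on C never reach B, so 3 of the 6 origin/destination pairs are cut off, while a healthy ring loses none.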

(Yes, they say active-active-active, not master-master-master, but then they say across DCs... It could be just M-S-S with config switch on failover, but for me the post suggests it's not that)

ph33t10y ago· 1 in thread

I hate these stupid "my db is better than whatever db" articles. 1) What db to be used depends on the situation AND MORE IMPORTANTLY 2) what experience your staff has

I can say that 10 years ago, I would have chosen M$SQL over MySQL and it would have been the correct choice. At the time I had almost 10 years experience with M$SQL and almost none with MySQL. Now I have more than 10 years of MySQL under my belt. AND the MySQL experience is more current. Right now I could choose between the two based on specific features and performance characteristics. For me to pick PostgreSQL because of a specific feature would be insane because I don't have experience with it. No knock on PostgreSQL ... maybe I'll spend time with it and pick it for some down-the-road project.

I have implemented couchdb as a caching solution. I know how to manage, backup, and restore the database server. I have managed a 5 node cluster. If you ask me to implement NoSQL, it would be my choice for 2 reasons: 1. It can do the job. 2. I have experience making it do the job.

I'm sure there are 10 million people out there who would choose Mongo in the same situation. They would not be wrong and they may come up with a superior solution. For me to implement Mongo today would be wrong - I would almost certainly come up with an inferior solution. For them, it would be stupid not to.

I'm not saying "don't learn anything new". I'm saying "don't gamble your business on technology with which you're not familiar".

It's a bit like backups ... the most important thing about a backup is not the technology you use, but whether you are capable of restoring and maintaining the backups.

partiallypro10y ago

First off, I can't take you seriously after using a dollar sign in Microsoft. That's just a conversation killer for anyone talking seriously in tech. This isn't an IRC channel for 13 year olds in 2004.

Secondly, the article isn't about what is "better" overall. It's about scaling SQL, and how noSQL isn't always necessary. The "in" thing to do right now is to have noSQL in your stack, blindly, without looking at your project. Or doing expensive migrations to noSQL solutions when you already have an expansive infrastructure built on SQL but need to scale. Wix is just giving insight into their techniques with MySQL and how in the end it made more sense for them than going with something like Mongo.

TL;DR: You didn't read the article.

xrstf10y ago· 1 in thread

I wonder if using the memcached plugin for InnoDB[1] would speed things up even more, at the expense of not having flexible queries (and thereby introducing multiple roundtrips) anymore. Presumably, they are using simple "SELECT * FROM table WHERE id = ?" in most places anyway, so that could be an okay tradeoff to make.

[1] https://dev.mysql.com/doc/refman/5.6/en/innodb-memcached.htm...

stplsd10y ago

I wonder about this myself, anyone has experience using this? MySQL 5.6 brought many long awaited features, like schema update without locking tables https://dev.mysql.com/doc/refman/5.6/en/innodb-online-ddl.ht... (bye bye Percona tools).

sjwright10y ago· 1 in thread

I've been running a rather large website with MySQL for the past fifteen years. There was a period when I regretted that choice and used something else. Today I'm using MariaDB and the TokuDB storage engine, and I'm so thankful that I never migrated to Postgres.

Like many people I investigated the NoSQL movement for potential applicability, and almost swallowed the hype. As I investigated more, I realised:

1. There are some specific instances where a NoSQL engine makes good sense. They're a valid option and should be considered depending on the application. In my experience though, well formed RDBMS structures are the better option in the vast majority of applications.

2. Most of the hype and growth came from people who (a) were using the abomination known as ORMs which are the canonical example of a round peg in a square hole; and/or (b) didn't know how to build performant RDBMS schemas. For these people, the NoSQL engine was fast because it was the first engine they actually learned how to optimise correctly.

wainstead10y ago

> were using the abomination known as ORMs which are the canonical example of a round peg in a square hole

Indeed, the "Vietnam of computer science."

https://blog.codinghorror.com/object-relational-mapping-is-t...

chucky_z10y ago· 1 in thread

Everyone should try PostgreSQL with hstore (and JSONB now, too!).

This is a key/value store inside an RDBMS that just works, and it works great!

I converted a crappy sloppy super messy 1000+ column main table in a ~800GB database to use hstore, it was, in real world benchmarks, between 7x and 10,000x (yes, really, ten thousand times) faster.

The CEO of the company who had a technical say in everything, and was very proud of his schema "wasn't excited" and it never happened in any production instance.

I've left since then, and the company has made very little advancement, especially when it comes to their database.

Really, just use hstore. Try it out. The syntax is goofy, but... I mean, SQL itself is a little bit goofy, right?

combatentropy10y ago

> SQL itself is a little bit goofy, right?

The proper pronunciation of SQL is SQuirreL.

jamiequint10y ago· 1 in thread

Wouldn't Aerospike be a cheaper, lower maintenance, and more robust solution to this problem?

manigandham10y ago

Yes, a single instance would more than handle all of their load. 2 for HA/redundancy and they're all set. Setup some more pairs elsewhere else with active/active replication.

This is basically them failing to do enough research into existing solutions that would work far better.

markhops10y ago· 1 in thread

Stupid question here: what are serial keys, and how do they impose locks?

markhops10y ago

Ah, did he mean "SERIAL" as in "BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE" ... and the reason why this locks a table is because the database, on insert, needs to figure out a valid key?

peter_d_sherman10y ago

Aside from the MySQL vs. Other DB debate (which I refuse to take part in, although I'm willing to ascribe good points to all camps), this article is absolutely excellent with respect to acting as a guide for people who want to use MySQL as a Key/Value store. Absolutely stellar article! All of the points are dead-on. I applaud the author for putting together so much specific information about tuning MySQL for Key/Value in one place, and the ridiculous speed and scalability you can get if you do it correctly. (That being said, NoSQL Key/Value databases are good too.)

gshulegaard10y ago

For those interested in NoSQL (particularly MongoDB) you may find this an interesting read:

https://www.linkedin.com/pulse/mongodb-32-now-powered-postgr...

But over time I am finding less and less reason to _not_ use PostgreSQL when contemplating a NoSQL document store.

wefarrell10y ago

I was expecting a comparison but they only presented one side.

jamesblonde10y ago

I thought this article would be about the true MySQL NoSQL system: MySQL Cluster (or NDB). It scales to 200m transactional reads per second - per second! http://highscalability.com/blog/2015/5/18/how-mysql-is-able-... We have got 16m reads/sec on our commodity rack with MySQL Cluster, so it's not a fantasy result.

fleaflicker10y ago

From 7 years ago, by Bret Taylor (who went on to become CTO at Facebook after acquisition):

How FriendFeed uses MySQL to store schema-less data https://backchannel.org/blog/friendfeed-schemaless-mysql

Edit to add the HN discussion at the time: https://news.ycombinator.com/item?id=496946

EGreg10y ago

Here is my (albeit limited experience) advice:

1. Use PostgreSQL, or MySQL with InnoDB for row level locking

2. Huge tables should be sharded with the shard key being a prefix of the primary key.

If you need to access the same data via different indexes then denormalize and duplicate the index data in one or more "index" tables.

3. Do not use global locks. Generate random strings for unique ids (attempt INSERT and regenerate until it succeeds) instead of autoincrement.

4. Avoid JOINs across shards. If you use these, you won't be able to shard your app layer anymore.

5. For reads, feel free to put caches in front of the database, with the keys same as the PK. Invalidate the caches for rows being written to.
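Point 3 above (random ids, retry the INSERT on collision) can be sketched roughly as follows. This uses Python's built-in `sqlite3` purely so the example is self-contained; the `items` table, the `insert_with_random_id` helper, and the retry limit are all hypothetical, and a MySQL client would raise its own duplicate-key error type instead of `sqlite3.IntegrityError`.

```python
import secrets
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY, payload TEXT)")


def insert_with_random_id(conn, payload, max_attempts=5):
    """Insert a row under a client-generated random id, retrying on collision."""
    for _ in range(max_attempts):
        candidate = secrets.token_hex(16)  # 32 hex characters, 128 random bits
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO items (id, payload) VALUES (?, ?)",
                    (candidate, payload),
                )
            return candidate
        except sqlite3.IntegrityError:
            continue  # id already taken: regenerate and try again
    raise RuntimeError("could not find a free id")


new_id = insert_with_random_id(conn, "hello")
print(new_id)
```

The primary key does the uniqueness check, so no counter shared between writers is needed; with 128 random bits the retry branch should almost never run.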

It's actually pretty easy to model. You have the fields for the data. Then you think by which index will it be requested? Shard by that.

Note that this will still lead you to a huge centralized datacenter!! Because your authentication happens at the webserver level and then you just have all the servers trust each other. While it is a nice horizontal architecture, it leads to crazy power imbalances like we have today. Consider instead making it a distributed architecture, where the shards turn into domains, and each user on each domain has to auth with every other domain. But your network can then be distributed without a single point of failure. What's more, local area networks will be able to host your app and be quick without the signal bouncing halfway around the world.

hifier10y ago

It seems to me that one of the core differences between MySQL/Postgres and distributed stores like Cassandra / Hbase is that with the former your data and your write workload have to fit onto a single host. If either one cannot fit then you have to partition at the application level or use a real distributed data store. Partitioning at the app level is an operational burden and complexity that would be best avoided, but there are always exceptions.

cmenge10y ago

So MySQL is great if you use none of its features, but then it's really hardly different from all the other databases. So it's not the implementation, but the very promises that databases make which can't be held, but if you know that, you are just fine. Great insight, and pretty much the definition of NoSQL...

krosaen10y ago

related: friendfeed used a similar approach https://backchannel.org/blog/friendfeed-schemaless-mysql

mrmrcoleman10y ago

Very interesting post. I seem to remember that you were featured in a MongoDB 'success story' last year but they seem to have removed it now.

Does that mean you've stopped using Mongo altogether?

languagehacker10y ago

Just about as wrong as the day it was posted here in December

neeleshs10y ago

Awesome! Now let's do it for 1B rows, and then 10 billion and then some. It is well known that for small datasets NoSQL is no better, if not worse, than an RDBMS.

ai_ja_nai10y ago

MySQL looks great when used as K-V because it avoids the bad planner (when you have a primary key as the only searching key, a planner is useless) and denormalizing avoids expensive JOIN ops.

But there is the awkward replication model, the lack of native data structures as column type and the lack of sharding support.

bechampion10y ago

I love MySQL, it has saved my ass many times, but this article doesn't mean anything... it just says that you can use subqueries and joins to do "nosql"... we all know that... you can also use a text file. I'd like it if MySQL copied what Postgres has done with hstore.

ecolak10y ago

"Many developers look at NoSQL engines—such as MongoDB, Cassandra, Redis, or Hadoop" Noone uses Hadoop as a database. On the other hand, HBase which uses HDFS as underlying storage is a great NoSQL database that we use in production.

return010y ago

This seems specific to their use case. He shows an example with a subquery. I wonder why they don't break that into two queries. They should be fast enough if cached, and would prevent the need for both tables to be unlocked during the query.

eblanshey10y ago

> Do not perform table alter commands. Table alter commands introduce locks and downtimes. Instead, use live migrations.

Care to elaborate more on this? What do you mean by live migrations?

0n34n710y ago

Mongo comes with geospatial indexing baked right in. Never mind map / reduce. It comes down to the data structures of our times, which are increasingly not relational.

Sarki10y ago

So if I got it straight the message is: "Don't fall for the sirens of hype but instead make sure that your choice of technologies suits your needs"?

tacone10y ago

The first thing that comes to mind is that they write about read throughput, when the write throughput is a big selling point of many NoSQLs.

HolyHaddock10y ago

Does anyone know what they mean by `Instead, use applicative transactions.`?

andradejr10y ago

Anybody else thinks Postgres would have been a better comparison here?

1024core10y ago

I'm surprised there's no mention of HandlerSockets.

tomphoolery10y ago

Does this mean PostgreSQL is an even better NoSQL? ;-)

SliderUp10y ago

Is a 100 million 'scaling up' these days?

meshko10y ago

Upon reading this i have three questions for them: 1) Do you do backups? 2) Do you use source control versioning system? and last but not least 3) Why do you kill so many kittens?

return010y ago

Glad to see the nosql hype blowing down to reasonable levels. Next up: imperative languages back in vogue.

meeper1610y ago

I'm a purist and also need the absolute fastest lookups without SQL overhead so I go straight for MDB (Sleepycat Berkeley DB) - faster than LevelDB or any others.

rubenolivares10y ago

Why can't autists just use whatever they want and stop trying to convince people that their way is best.


koolba10y ago· 15 in thread

So much to disagree with here ...

> Locks limit access to the table, so on a high throughput use case it may limit our performance.

Then use a proper database that implements MVCC.

> Do not use transactions, which introduce locks. Instead, use applicative transactions.

Or just use a database that handle transactions more efficiently.

> `site_id` varchar(50) NOT NULL,

> Do not normalize.

Bullshit. Normalize as much as is practical and denormalize as necessary. It's much easier to denormalize and it greatly simplifies any transaction logic to deal with a normalized model.

> Fields only exist to be indexed. If a field is not needed for an index, store it in one blob/text field (such as JSON or XML).

spotman10y ago

>> Locks limit access to the table, so on a high throughput use case it may limit our performance.

> Then use a proper database that implements MVCC.

InnoDB does implement MVCC. MVCC is not a silver bullet.

>> Do not use transactions, which introduce locks. Instead, use applicative transactions.

> Or just use a database that handle transactions more efficiently.

>> Do not normalize.

> Bullshit. Normalize as much as is practical and denormalize as necessary. It's much easier to denormalize and it greatly simplifies any transaction logic to deal with a normalized model.

But we are talking about performance... Having something in a single table that is denormalized is always going to be faster than having an elegant data model with "Everything In It's Right Place"

>> Fields only exist to be indexed. If a field is not needed for an index, store it in one blob/text field (such as JSON or XML).

> This is terrible advice.

So facebook/friendfeed, uber, dropbox, and many more are wrong then. Ok.

This is really all best practice for running something like this.

Of course it flies in the face of best practice for running a smaller system. Is there tradeoffs? Absolutely! Would it be smart to do this if the need for this scale is not obvious? Probably not.

You end up having more logic in your application and coordination layers, but this is all pretty good advice for people at this scale, and certainly not bad at all.

koolba10y ago

> Of course it flies in the face of best practice for running a smaller system. Is there tradeoffs? Absolutely! Would it be smart to do this if the need for this scale is not obvious? Probably not

From the article:

> The routes table is of the order of magnitude of 100,000,000 records, 10GB of storage.

> The sites table is of the order of magnitude of 100,000,000 records, 200GB of storage

That's tiny. Both of those easily fit in memory on modern hardware. This isn't cough web scale, this is peanuts.

The savings from having a simpler system that operates both transactional and the lack of disparate CASE/IF logic would win over this monstrosity of a design.

[1]: https://eng.uber.com/schemaless-part-one/

2 more replies

calpaterson10y ago

> Having something in a single table that is denormalized is always going to be faster than having an elegant data model with "Everything In It's Right Place"

1 more reply

viraptor10y ago

> But we are talking about performance... Having something in a single table that is denormalized is always going to be faster than having an elegant data model with "Everything In It's Right Place"

1 more reply

wvenable10y ago

I do agree that UUIDs should be stored differently -- the use of varchar rather than a fixed length type for a primary key will hurt performance.

agentgt10y ago

I would love to switch to native UUID someday though.

3 more replies

calpaterson10y ago

nerfhammer10y ago

The article says "Note that a transaction is using a DB-level lock that prevents concurrent writes—and sometimes reads—from the affected tables."

justizin10y ago

> If they're even talking about avoiding transactions for speed purposes (no matter how stupidly)

justinhj10y ago

1 more reply

jdiscar10y ago

For example, 'Do not normalize.'

tagrun10y ago

>> `site_id` varchar(50) NOT NULL,

> Why varchar(50)? UUIDs are 16-bytes.

Why do you think it's a UUID?

> The best way to store them would be the binary bytes (which is how postgres stores them).

Is it actually better than a pair of BIGINTs?

wtetzner10y ago

>Why do you think it's a UUID?

Because of this:

>Also notice that we are not using serial keys; instead, we are using varchar(50), which stores client-generated GUID values

1 more reply

viraptor10y ago

> If it's hex without dashes, it'll be varchar(32)

Or just char(32) no need to note the length if it's always the same.
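For reference, the sizes being argued about in this subthread can be checked with Python's standard `uuid` module. This is only a sketch of the representations (16 raw bytes, 32 hex characters, 36 characters with dashes, or two 64-bit integers); it says nothing about which one a given database stores fastest.

```python
import uuid

u = uuid.uuid4()

as_binary = u.bytes   # raw form, what a BINARY(16) column would hold
as_hex = u.hex        # hex without dashes, fits CHAR(32)
as_dashed = str(u)    # canonical form with dashes, fits CHAR(36)

# The "pair of BIGINTs" alternative: split the 128-bit value in two halves.
high, low = u.int >> 64, u.int & (2**64 - 1)

sizes = (len(as_binary), len(as_hex), len(as_dashed))
print(sizes)  # (16, 32, 36)
assert (high << 64) | low == u.int  # the two halves round-trip losslessly
```

Note that each half is an unsigned 64-bit value, so a signed BIGINT column would need an extra conversion step.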

return010y ago

tiffanyh10y ago· 8 in thread

Since Wix is using MySQL as a key-store ... I wonder why they didn't look at using Postgres HStore [1].

HStore is a key value store built directly in the RDBMS of Postgres.

[1] http://www.postgresql.org/docs/9.6/static/hstore.html

luka-birsa10y ago

HStore was released, we've migrated to PG and we can't be happier. Zero issues so far.

suneilp10y ago

Couchdb isn't exactly the best thing to judge NoSQL by today especially by its old performance issues, lack of automatic compaction, indexing on demand instead of proactive background indexing, etc.

1 more reply

gshulegaard10y ago

Just curious, but why use HStore instead of JSONB?

1 more reply

BinaryIdiot10y ago

> Tested, deployed in production abd after a while we figured out that most of the promises about performance, stability, etc we're mostly bull.

But RDBMS can be very similar. Both are useful tools when used correctly and there is a huge amount of overlap in terms of capability.

1 more reply

bpicolo10y ago

Hstore is more k->k->v, which is the same but different, and also leads inexperienced developers to model entire relationships in a single column

atomic7710y ago

[1] https://dev.mysql.com/doc/refman/5.6/en/innodb-memcached.htm...

agentgt10y ago

IIRC, if you are using JDBC (which is pretty much the only choice on Java) HStore is a pain to use since its query operators conflict with JDBC's parameter syntax ("?").

yoava10y ago

BinaryIdiot10y ago· 6 in thread

> Use client-generated unique keys. We use GUIDs.

Minor note but wouldn't UUIDs be better since they're time based? Sure it's really unlikely to hit an already used GUID but an UUID makes it impossible.

In fact is there a use case where it's better to use GUIDs over UUIDs? I couldn't think of one but I could be omitting something from my thinking so I'm curious.

Edit: apparently GUID and UUID are the same thing and GUID is simply Microsoft's original implementation of UUID. All this time I had no idea...

tqwhite10y ago

They are the same thing. Both should include time as well as the server address, etc.

Vendan10y ago

note there are like 5 versions of UUID, and only v1 and v2 include time and server address. It's also considered bad practice to use them, as it makes your UUID's guessable.
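The difference between the versions is easy to see with Python's standard `uuid` module. In this sketch the node value passed to `uuid1` is an arbitrary made-up number, used only so the output is predictable; by default the module derives it from the host.

```python
import uuid

# Version 1: built from a timestamp plus a node id (normally the MAC address).
FAKE_NODE = 0x123456789ABC  # hypothetical node id for illustration
v1 = uuid.uuid1(node=FAKE_NODE)

# Version 4: random apart from the fixed version and variant bits.
v4 = uuid.uuid4()

print(v1.version, v4.version)  # 1 4
print(hex(v1.node))            # the node id can be read straight back out
print(v1.time)                 # so can the embedded 60-bit timestamp
```

Because both the node and the time are recoverable from a version 1 value, anyone who sees one id can narrow down what neighbouring ids look like, which is the guessability concern raised above.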

1 more reply

BinaryIdiot10y ago

1 more reply

rspeer10y ago

UUIDs of all formats are universally unique, for all practical purposes.

(edited because I was originally talking about 2^64, but there are 6 non-random bits in UUID4)

dbenhur10y ago

What does "expect even one" mean? This is a probability equation, so what p does expect correspond to?

https://lazycackle.com/Probability_of_repeated_event_online_...

n = 5316911983139663491615228241121378304 (2^122)

p = 0.000001 => m = 3260955271619137

3260955271619137 / (1000000 * 86400 * 365) => 103
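The figures quoted above can be reproduced with the usual birthday-problem approximation, m ≈ sqrt(2 · n · ln(1 / (1 − p))). The sketch below assumes 122 random bits per version 4 UUID and reads the divisor as one million UUIDs per second for a 365-day year; that reading of the rate is an interpretation of the numbers, not something the comment states.

```python
import math

n = 2 ** 122   # distinct version-4 UUIDs (122 random bits)
p = 1e-6       # target probability of at least one collision

# Number of UUIDs to generate before a collision has probability p.
m = math.sqrt(-2 * n * math.log1p(-p))

rate_per_year = 1_000_000 * 86_400 * 365  # one million UUIDs per second
years = m / rate_per_year

print(f"{m:.4g} UUIDs, about {years:.0f} years")
```

In other words, at a million ids per second it takes roughly a century before the chance of a single collision reaches one in a million.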

rco878610y ago

GUID == UUID

jlas10y ago· 6 in thread

> When someone clicks a link to a Wix site... That server has to resolve the requested site from the site address by performing a key/value lookup URL to a site.

So Wix uses MySQL to resolve site routes internally? Is this the best way to do it? Would it be possible to use internal domain names and rely on DNS to resolve everything?

CraigJPerry10y ago

jlas10y ago

It sounds reasonable, but with DNS you'd get geographic distribution for free, right? Won't you have to do something like sharding to achieve a similar thing with MySQL?

1 more reply

backslash_1610y ago

vkjv10y ago

IIRC, Github does this for Pages via an nginx module that queries MySQL.

http://githubengineering.com/rearchitecting-github-pages/

tracker110y ago

golergka10y ago

On a site constructor with custom URLs created by millions of users for their websites every day?

rantanplan10y ago· 6 in thread

And PostgreSQL is a better MySQL so... all is settled?

stplsd10y ago

Is it? MySQL improved a lot since 5.1 days, you know.

matthewmacleod10y ago

3 more replies

Buttons84010y ago

Has MySQL added support for Common Table Expressions (CTE) yet? CTE was added to the SQL standard in 1999.

I ask because I always miss this feature when querying MySQL.

1 more reply

TylerE10y ago

The latest Postgres releases have improved tons too. Seems like we're seeing 10% speedups with every point release. Some of the new parallelization stuff in 9.6 is really sexy.

Thaxll10y ago

MySQL is faster than pg, you should ask why Facebook is running the largest MySQL shop.

pmontra10y ago

Maybe because they started with a tiny LAMP system in 2004 and got stuck there to the point they invested considerable resources to write their own PHP interpreter and optimize MySQL.

From https://www.percona.com/blog/2014/03/27/a-conversation-with-...

"we had the MySQL engineering talent we needed to work with the Oracle team to get 5.6 ready for production at our scale."

"Performance regression of the CPU intensive replication was a main blocker for some of our applications" followed by a description of how they addressed that.

My take: if they started with and stuck to PostgreSQL they'd have to work on it as they did on MySQL.

4 more replies

mh-cx10y ago· 5 in thread

I have not heard about "Wix" before, but maybe they should have done some more research before picking this name. To a German this sounds like "wichsen" which means, well, "wank"[1].

[1] http://dict.leo.org/ende/index_de.html#/search=wichsen

lucb1e10y ago

http://www.woorden.org/woord/flikker (tl;dr: homosexual)

chises10y ago

narrowrail10y ago

Not only is this completely off-topic, they are a publicly traded company NASDAQ: WIX and they are currently valued at ~$1B.

mh-cx10y ago

Then please forgive my ignorance. I was not aware that they are so big. Even more though I'm surprised about their name. Reading it really feels weird to me.

yoava10y ago

We did hear about it, if a bit late.

Our response - https://vimeo.com/138432267

bchociej10y ago· 4 in thread

Kinda begging the question aren't we? I turn to nosql for things that aren't key-value, generally.

dyeje10y ago

What do you mean? Isn't NoSQL inherently key-value?

tremon10y ago

All structured data can be represented as key-value, that includes SQL. They just differ in what constraints are used for the keys and values.

As for your question, NoSQL datastores can be grouped into multiple categories:

- column stores (like hadoop, cassandra, informix), which optimize for sharded and distributed storage of related data elements

- document stores (like elasticsearch), which focus on metadata organization for large opaque (binary) objects

- key-value stores (like redis, openldap), which are basically unstructured, associative arrays (hash maps). They allow the most storage freedom, and are hardest to optimize.

- graph databases (like neo4j, trinity), where more information is carried in annotated inter-object links than in the objects themselves.

LionessLover10y ago

Are graph databases key/value? Are document databases key/value?

Try this:

https://en.wikipedia.org/wiki/NoSQL#Types_and_examples_of_No...

https://www.youtube.com/watch?v=qI_g07C_Q5I

afandian10y ago

No. Triplestores and Graph databases aren't key-value stores. There's more to NOSQL than key value stores although most examples seem to be.

TheGuyWhoCodes10y ago· 3 in thread

Oh and citing statistics without details is plain lying: how many servers, how much RAM, SSD based or HDD....

whalesalad10y ago

Reddit does something very similar. https://kev.inburke.com/kevin/reddits-database-has-two-table...

I think the important thing to note here is that there are lots of different ways to use any given tool that can fit your use case without being an atrocity.

TheGuyWhoCodes10y ago

http://highscalability.com/blog/2013/8/26/reddit-lessons-lea...

sciurus10y ago

> If they were to start today would they have used Mysql

Uber built something similar in the last couple of years.

https://eng.uber.com/schemaless-part-one/

So did Dropbox.

https://blogs.dropbox.com/tech/2016/05/inside-the-magic-pock...

jjoe10y ago· 2 in thread

That's what makes it so fun!

return010y ago

Spot on! I bet there are hundreds of different stories like this with unconventional uber-hacks for performance.

sriram_sun10y ago

That is an awesome way to describe scalability! Can you give a couple of examples?

fapjacks10y ago· 2 in thread

grossvogel10y ago

By "taken hostage their images," do you mean that literal graphic files uploaded to Wix servers were somehow made inaccessible to the user?

fapjacks10y ago

electrotype10y ago· 2 in thread

A little bit off topic, but I would like to hear more about using Solr [1] instead of any "real" NoSQL databases.

Why would I want to use MongoDB instead of Solr? What killer feature does Solr lack?

[1] http://lucene.apache.org/solr/

ddorian4310y ago

I haven't worked with Solr, but I have with Elasticsearch; both are based on Lucene.

Some issues are:

async indexes, unable to modify/remove indexes, unable to grow/shrink the number of shards, etc. (basically, search for "why not use ES as a primary data store")

y0ghur7_xxx10y ago

We evaluated Solr for a project just a week ago. It does not have the authn & authz that Mongo has, and that was a feature we needed. Other than that, if all you need is a KV store, Solr is great.

Illniyar10y ago· 2 in thread

Two things that are sorely missing in this comparison to NoSQL are:

ysleepy10y ago

1. They probably don't need sharding, since the dataset is small enough to just replicate it in mirrors.

2. 1ms is achievable with SSDs, but 200K q/minute seems slow, my gut feeling tells me.

This post is more like "ha we don't need NoSQL for this special use case" - Once you need scaling and some sort of atomics, you quickly have to use HBase for row-level atomicity and scaling.

Redis is probably better suited for the posted usecase anyway.

Illniyar10y ago

Why HBase? why not just shard your keys?
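
For readers unfamiliar with the idea, key sharding can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names (`shard_for`, `shard_name`) and the shard list are hypothetical, and plain modulo hashing is used for brevity even though it makes adding shards later expensive.

```python
import hashlib

# Hypothetical shard names; a real deployment would map these to connection settings.
SHARDS = ["db0", "db1", "db2", "db3"]


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce modulo the shard count.
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_name(key: str) -> str:
    """Return the (hypothetical) shard that should hold this key."""
    return SHARDS[shard_for(key, len(SHARDS))]
```

The same key always lands on the same shard, so a primary-key lookup touches exactly one database. Changing the number of shards remaps most keys, which is why schemes such as consistent hashing or a lookup table are often preferred in practice.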

adenverd10y ago· 2 in thread

100M? Of course you'd scale an RDBMS for that, especially if you want searchability and analytics. It's way easier than a Hadoop -> Elasticsearch pipeline (or pick your flavor).

NoSQL databases are for BIG data. As in, billions of rows big.

tracker110y ago

giaosudau10y ago

totally agree with that. 100M rows doesn't make any sense.

stevesun2110y ago· 2 in thread

beachy10y ago

Trying to avoid using foreign key constraints in a relational database is not "system design 101", it's an instant fail.

When I cast my eye over a table with foreign key constraints, I am 100% certain that every single row conforms to those constraints, and always will.

- bad business logic code

- bad import scripts

- some contractor who used to work for us 5 years ago and briefly uses his php script to push up some data
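
The guarantee described above can be demonstrated with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in for a relational database (not MySQL), and the `sites` and `pages` tables are hypothetical names chosen for illustration. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3


def orphan_insert_is_rejected() -> bool:
    """Return True if the database refuses a row that points at a missing parent."""
    conn = sqlite3.connect(":memory:")
    # SQLite leaves foreign key enforcement off unless it is enabled per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE sites (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE pages ("
        " id INTEGER PRIMARY KEY,"
        " site_id INTEGER NOT NULL REFERENCES sites(id))"
    )
    conn.execute("INSERT INTO sites (id) VALUES (1)")
    # A valid child row: site 1 exists.
    conn.execute("INSERT INTO pages (id, site_id) VALUES (10, 1)")
    try:
        # An invalid child row: site 999 does not exist.
        conn.execute("INSERT INTO pages (id, site_id) VALUES (11, 999)")
    except sqlite3.IntegrityError:
        return True
    finally:
        conn.close()
    return False
```

Whatever code path attempts the bad insert, the constraint is checked by the database itself, so no application-side validation has to be trusted.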

stevesun2110y ago

Scenario 1: what if a system needs to migrate to a different database system? Then the whole business logic needs to be re-implemented in the destination system's DSL.

Scenario 3: what if a system needs to process business state asynchronously, for example through a message queue?

Also think about how to do unit tests (this is also how we keep the business logic correct) and how to do CI/CD. System design is more than just an ERD design.

okigan10y ago· 2 in thread

>An active-active-active setup across three data centers.

Any info on how "active-active-active" (I assume 3 AWS regions) is accomplished?

grossvogel10y ago

Some kind of master-master replication, probably: https://dev.mysql.com/doc/refman/5.7/en/mysql-cluster-replic...

jarnix10y ago

jondubois10y ago· 2 in thread

You CAN scale with SQL but you have to know what queries to avoid (E.g. table joins, nested queries...) but with NoSQL, you don't have to avoid any queries; if it's in the Docs, it's safe to use.

timruffles10y ago

There's so much wrong with the above.

'homogeneous', lol.

[1] which even Mongo employees recommend people avoid like the plague https://www.linkedin.com/pulse/mongodb-frankenstein-monster-...

gaius10y ago

enforce a schema in the application code

HA HAHA HAHA

zzzcpan10y ago· 2 in thread

I thought the NoSQL movement was about distributed systems, CAP and all that. What does this "active-active-active" even mean? No consistency and no availability guarantees, I presume?

cnlwsu10y ago

viraptor10y ago

(Yes, they say active-active-active, not master-master-master, but then they say across DCs... It could be just M-S-S with config switch on failover, but for me the post suggests it's not that)

ph33t10y ago· 1 in thread

I hate these stupid "my db is better than whatever db" articles. 1) What db to be used depends on the situation AND MORE IMPORTANTLY 2) what experience your staff has

I'm not saying "don't learn anything new". I'm saying "don't gamble your business on technology with which you're not familiar".

It's a bit like backups ... the most important thing about a backup is not the technology you use, but whether you are capable of restoring and maintaining the backups.

partiallypro10y ago

TL;DR: You didn't read the article.

xrstf10y ago· 1 in thread

[1] https://dev.mysql.com/doc/refman/5.6/en/innodb-memcached.htm...

stplsd10y ago

sjwright10y ago· 1 in thread

Like many people I investigated the NoSQL movement for potential applicability, and almost swallowed the hype. As I investigated more, I realised:

wainstead10y ago

> were using the abomination known as ORMs which are the canonical example of a round peg in a square hole

Indeed, the "Vietnam of computer science."

https://blog.codinghorror.com/object-relational-mapping-is-t...

chucky_z10y ago· 1 in thread

Everyone should try PostgreSQL with hstore (and JSONB now, too!).

This is a key/value store inside an RDBMS that just works, and it works great!

I converted a crappy, sloppy, super messy 1000+ column main table in a ~800GB database to use hstore; it was, in real-world benchmarks, between 7x and 10,000x (yes, really, ten thousand times) faster.

The CEO of the company, who had a technical say in everything and was very proud of his schema, "wasn't excited", and it never happened in any production instance.

I've left since then, and the company has made very little advancement, especially when it comes to their database.

Really, just use hstore. Try it out. The syntax is goofy, but... I mean, SQL itself is a little bit goofy, right?

combatentropy10y ago

> SQL itself is a little bit goofy, right?

The proper pronunciation of SQL is SQuirreL.

jamiequint10y ago· 1 in thread

Wouldn't Aerospike be a cheaper, lower maintenance, and more robust solution to this problem?

manigandham10y ago

Yes, a single instance would more than handle all of their load. 2 for HA/redundancy and they're all set. Set up some more pairs elsewhere with active/active replication.

This is basically them failing to do enough research into existing solutions that would work far better.

markhops10y ago· 1 in thread

Stupid question here: what are serial keys, and how do they impose locks?

markhops10y ago

Ah, did he mean "SERIAL" as in "BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE" ... and the reason why this locks a table is because the database, on insert, needs to figure out a valid key?

peter_d_sherman10y ago

gshulegaard10y ago

For those interested in NoSQL (particularly MongoDB) you may find this an interesting read:

https://www.linkedin.com/pulse/mongodb-32-now-powered-postgr...

But over time I am finding less and less reason to _not_ use PostgreSQL when contemplating a NoSQL document store.

wefarrell10y ago

I was expecting a comparison but they only presented one side.

jamesblonde10y ago

fleaflicker10y ago

From 7 years ago, by Bret Taylor (who went on to become CTO at Facebook after acquisition):

How FriendFeed uses MySQL to store schema-less data https://backchannel.org/blog/friendfeed-schemaless-mysql

Edit to add the HN discussion at the time: https://news.ycombinator.com/item?id=496946

EGreg10y ago

Here is my advice (albeit from limited experience):

1. Use PostgreSQL, or MySQL with InnoDB for row-level locking

2. Huge tables should be sharded with the shard key being a prefix of the primary key.

If you need to access the same data via different indexes then denormalize and duplicate the index data in one or more "index" tables.

3. Do not use global locks. Generate random strings for unique ids (attempt INSERT and regenerate until it succeeds) instead of autoincrement.

4. Avoid JOINs across shards. If you use these, you won't be able to shard your app layer anymore.

5. For reads, feel free to put caches in front of the database, with the keys same as the PK. Invalidate the caches for rows being written to.

It's actually pretty easy to model. You have the fields for the data. Then you think by which index will it be requested? Shard by that.
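
Point 3 (random ids with insert-and-retry instead of autoincrement) can be sketched as follows. This is a minimal illustration, not anyone's production code: it uses Python's built-in `sqlite3` as a stand-in database, and the `items` table plus the helpers `make_items_table` and `insert_with_random_id` are hypothetical names.

```python
import secrets
import sqlite3


def make_items_table(conn: sqlite3.Connection) -> None:
    """Create a hypothetical table keyed by an application-generated string id."""
    conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY, payload TEXT NOT NULL)")


def insert_with_random_id(conn, payload, make_id=lambda: secrets.token_hex(16), max_attempts=5):
    """Insert a row under a random id, regenerating the id if it collides."""
    for _ in range(max_attempts):
        row_id = make_id()
        try:
            conn.execute(
                "INSERT INTO items (id, payload) VALUES (?, ?)", (row_id, payload)
            )
            conn.commit()
            return row_id
        except sqlite3.IntegrityError:
            # The primary key already exists: try again with a fresh id.
            continue
    raise RuntimeError("could not find a free id after %d attempts" % max_attempts)
```

With 128 random bits per id, collisions are vanishingly rare, so the retry loop almost never runs more than once. The uniqueness check is left to the primary key index, so no shared counter has to be coordinated across writers.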

hifier10y ago

cmenge10y ago

krosaen10y ago

related: friendfeed used a similar approach https://backchannel.org/blog/friendfeed-schemaless-mysql

mrmrcoleman10y ago

Very interesting post. I seem to remember that you were featured in a MongoDB 'success story' last year but they seem to have removed it now.

Does that mean you've stopped using Mongo altogether?

languagehacker10y ago

Just about as wrong as the day it was posted here in December

neeleshs10y ago

Awesome! Now let's do it for 1B rows, and then 10 billion and then some. It is well known that for small datasets NoSQL is no better, if not worse, than an RDBMS.

ai_ja_nai10y ago

MySQL looks great when used as K-V because it avoids the bad planner (when you have a primary key as the only searching key, a planner is useless) and denormalizing avoids expensive JOIN ops.

But there is the awkward replication model, the lack of native data structures as column type and the lack of sharding support.
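
The key-value usage pattern discussed here (one indexed primary key, everything else in a serialized blob) looks roughly like this. The sketch uses Python's built-in `sqlite3` rather than MySQL, the `kv` table and the `open_store`/`put`/`get` helpers are hypothetical, and `INSERT OR REPLACE` is SQLite syntax (MySQL would use something like `REPLACE INTO` or `INSERT ... ON DUPLICATE KEY UPDATE`).

```python
import json
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory table with a primary key and one opaque JSON column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE kv (id TEXT PRIMARY KEY, data TEXT NOT NULL)")
    return conn


def put(conn: sqlite3.Connection, key: str, value: dict) -> None:
    """Serialize the value to JSON and upsert it under the key."""
    conn.execute(
        "INSERT OR REPLACE INTO kv (id, data) VALUES (?, ?)", (key, json.dumps(value))
    )
    conn.commit()


def get(conn: sqlite3.Connection, key: str):
    """Fetch by primary key only; return the decoded value or None if absent."""
    row = conn.execute("SELECT data FROM kv WHERE id = ?", (key,)).fetchone()
    return json.loads(row[0]) if row else None
```

Every read is a single primary-key lookup with no joins, so there is very little for a query planner to decide. The trade-off is that the database cannot filter on, or enforce constraints over, anything inside the blob.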

bechampion10y ago

ecolak10y ago

return010y ago

eblanshey10y ago

> Do not perform table alter commands. Table alter commands introduce locks and downtimes. Instead, use live migrations.

Care to elaborate more on this? What do you mean by live migrations?

0n34n710y ago

Mongo comes with geospatial indexing baked right in. Never mind map / reduce. It comes down to the data structures of our times, which are increasingly not relational.

Sarki10y ago

So if I got it straight the message is: "Don't fall for the sirens of hype but instead make sure that your choice of technologies suits your needs"?

tacone10y ago

The first thing that comes to mind is that they write about read throughput, when the write throughput is a big selling point of many NoSQLs.

HolyHaddock10y ago

Does anyone know what they mean by `Instead, use applicative transactions.`?

andradejr10y ago

Anybody else think Postgres would have been a better comparison here?

1024core10y ago

I'm surprised there's no mention of HandlerSocket.

tomphoolery10y ago

Does this mean PostgreSQL is an even better NoSQL? ;-)

SliderUp10y ago

Is 100 million 'scaling up' these days?

meshko10y ago

Upon reading this I have three questions for them: 1) Do you do backups? 2) Do you use a source control system? and last but not least 3) Why do you kill so many kittens?

return010y ago

Glad to see the nosql hype blowing down to reasonable levels. Next up: imperative languages back in vogue.

meeper1610y ago

I'm a purist and also need the absolute fastest lookups without SQL overhead, so I go straight for MDB (Sleepycat Berkeley DB) - faster than LevelDB or any others.

rubenolivares10y ago

Why can't autists just use whatever they want and stop trying to convince people that their way is best.
