This is not to say you couldn't turn Postgres into something similar. Put the data on a ramfs, relax various write guarantees (lots of knobs in PG) etc.
It depends in which NoSQL database. Depending on your problem, you can probably find a NoSQL DB optimised for it. It will often be unclear if that NoSQL DB is actually better than a relational database until you try it.
Examples:
High write throughput: Cassandra
Simple key-value: Redis
Text search: Solr/Elastic Search
etc..
That said, i would hesitate to describe Redis as a database at all. A key characteristic of databases is that they store every write in a durable way. Redis can checkpoint its state periodically, but as i understand it, it either can't or typically isn't used to safely keep every write. Redis is something in between a database and memcached. I doubt there's ever a situation where you have to choose between PostgreSQL/Cassandra/CouchDB and Redis; Redis is something you would use in addition to a database.
As for text search - RDBMSs have full text search, and at least in the case of PostregSQL, it seems pretty good - see slide 49 in http://es.slideshare.net/billkarwin/full-text-search-in-post... from 2009. You might not want to be leaning on your database for text search when you're at scale (for operational reasons more than performance ones), but it's a plausible way to start.
For areas other than structure, the difference can be more like the difference between a general purpose programming language and a domain-specific language. SQL databases are general purpose, so they offer a large, nicely integrated package of features, while NoSQL databases often sacrifice some of those features to optimize for some niche. From this perspective, the SQL database can be the best choice for a startup trying to find its niche with the NoSQL chosen as an optimization once that niche has been identified.
The fact that a new business can't be sure what its business will end up being is an argument both for and against using a SQL database.
Each object has a field that keeps track of what fields were updated, so I can construct an update query from that.
When I read a row from the DB, I first check if it's in the write through cache, if so I retrieve it there.
Of course, this means that all writes go through my cache. There can't be any other process that updates that table that avoids writing to the cache, or else we have database inconsistency.
This improved the load in my DB server by more than ten-fold, as well as the indexing time to my ElasticSearch server.
Now, if you are doing anything more mundane (like, for example, logging every bank operation in a country) that's overkill, and non-local constrains are a great thing to have. Thus, I'd advise not using it unless you have a clear necessity.
And always keep in mind that changing yor data is hard, whatever technology you are usig. Relational databases making the change itself hard, while schemaless databases make maintaining the software that reads it hard. Schemaless does give you a small upfront benefit, for a huge cost later.
RDBMS = swiss army chainsaw, can do anything although maybe painfully and slowly.
NoSQL = 5/16th inch torque pre-set wrench for tightening sma connectors to 10 inch-lbs torque. Can't do much else, but its near perfect at its tiny little area of expertise.
On a meta level its a battle between hardware and applications, where app demands have (momentarily, of course) fallen behind commodity hardware ability.
[1] http://technet.microsoft.com/en-us/library/ms191497.aspx
[2] http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.j...
[3] http://docs.oracle.com/cd/B28359_01/appdev.111/b28369/xdb_in...
Of course, it is possible to implement all of this on top of existing database technology. E.g. Oracle's Berkeley DB XML is implemented on top of Berkeley DB. But, a relational database with some indexing of XML does not provide the same functionality as an XML database.
But again different NoSQL DBs have different purpose.
Many businesses (including startups) have moved to using document stores for high read environments and scraping nightly drops to their backend analytics systems. This is smart - you don't want to run summing/aggregation on a live transactional system for (hopefully) obvious reasons.
EDIT: it's also worth noting that map/reduce is typically much more powerful when aggregating large datasets. When trying to run analytics on top of a transactional system, developers like Ray here would end up with multiple joins and groupings - all of which slow everything down. Map/reduce certainly isn't perfect, but the author dismisses it as difficult witchcraft when, in practice, parallel execution of MR queries can greatly decrease resources and time to information.
I sort of think we've moved beyond this discussion.
I think this article is targeted at the latter: pre-MVP and just post-MVP. For those startups, having two databases with one dedicated to a backend analytics system reeks of premature optimization.
If you grow to be more than 2-people, then taking a mirror dump of the prod db to run queries against is pretty trivial effort.
Set up a slave of your MySQL or Postgres database (no the slave is not a backup, but dumping it regularly is an easy first step to a very basic backup setup), hosted in a different data centre, and if you're a two person company you now also have somewhere to run analytics whenever you feel ready.
It can be <1 hour effort and a few tens of dollars a month in extra costs for a small system, yet makes a tremendous difference in resilience and gives you that db to run analytics against "for free" whenever you do need it.
Its free software, you don't have to pay for two instances of Oracle.
One thing that will quickly kill a biz is combining the functions of PROD and DEV/TEST. Making the DEV/TEST box the DEV/TEST/REPORTS box is not a big deal, and you can't run a (real) biz without a DEV/TEST box.
Eventually, nothing can match the performance of storing binary blobs on a cluster. But that only becomes worthwhile if you database is significantly larger than a terabyte. And I'm only talking about the operational "core" database, not your "data warehouse" (the log dumping ground, which should be split off when your database gets to be a few dozen gigs).
Meanwhile, mysql has big advantages :
1) can do basic optimization with "ALTER TABLE", even (mostly) live.
2) you can mix PROD and DEV/TEST (though obviously you need to use good judgement). Obviously you should also have a DEV/TEST instance for actual testing. Sometimes you want to run a test quickly against PROD though. Adding a slave, having it sync and then running against the slave is a joy.
3) creating reports is quick, customizable and everything you want.
4) It's "idiot-friendly". Employees can ramp up to the structure in a mysql db in 2 weeks flat. Try that with custom document stores.
5) It's typesafe and relational safe (if correctly designed), with the advantages that brings : significantly less weirdness in the database.
6) Phpmyadmin. Mysql workbench. Django. Php ...
I'm even going to argue that the GP's argument, that running analytics on PROD can get you fired, is not just wrong, it's actually an advantage of using mysql. (And the open source SAP database can run "live" analytics. You just can't believe how great that is for dashboards)
You can use a SQL query for this or a simple map/reduce - either way my argument is to focus on transactional design that works best for your system - don't conflate reporting needs with your live system (i.e. BA queries).
It is not true that relational databases must have a single point of failure (SPoF) or must use failover: MySQL Cluster is a sharded multi-master distributed database without a SPoF.
On the other hand Redis, for example, is a master-slave failover NoSQL datastore.
Especially if they spend their early days fucking with a Cassandra cluster instead of talking to customers.
And it should be noted, you made it anyway.
When you make it by the skin of your teeth, that means you probably timed it right.
Preempting a problem far-ahead of time in startups means time and effort was wasted, especially if it was done before the existence of the problem was established.
It is unreal to me that people still can't figure out how to apply Maslow's hierarchy to startups.
Also, those mistakes in modelling the domain and in enforcing the constraints are often there to stay and slowly become impossible to fix, once you have 10000 records that do not fall into a few well specified states, it's hard to go through all of them, find some common denominators, and migrate the database. Not to mention that with the mess people can do in the code, and with the messy stack in use today, it's easy to introduce bugs that might be hard for anyone to notice but seriously harm your business.
The amount of fashionable nonsense in software engineering seems to be higher than ever, unfortunately.
This is a startup. There is no problem domain. There is no spec. There's none for a startup. You may not even have code.
If you're doing a "startup" but have this ironed out, great, everything you said then applies, but it's not a startup, since you found your business model, it's a small company.
And I would love to enforce constraints on the DB but unfortunately, I already had "primary keys" that repeat, unbeknownst to the project customers.
Is the startup just a game developers do on their free time? Because if it is trying to solve a problem, there is a problem domain, even if it's the wrong problem.
Reinventing the wheel is great fun. Without constraints you can appear to be amazingly productive. But it is all a fantasy because without constraints and a rigorous process you are probably piling up technical debt faster than anything else.
Relational DBs may even be the wrong choice for some specific problems.
And about this quote from the article: "At some point, you will need to ask your primary database questions. If you chose the wrong database, this is where things get tricky. "
Yes, this is correct. However, I know how to read the manual of whatever db I'm using and maybe code something simple to process the output to the format of my liking.
Just because you're storing a huge amount of one specific type of data, that doesn't prevent you from taking advantage of the features of a relational database.
Why not?
1) Relational databases typically aren't optimised for write-throughput. It's quite possible to do it, but you'll need fast and large disks (eg, FusionIO in a SAN or something).
2) Location-tracking applications typically don't require interactive queries - generally it is more a batch-based system that can be run offline.
Saying you are not going to be able to use a relational database is overstating it a bit in my view.
Clearly you can make it work, but something like Cassandra will give you better write thoughput, won't force you to rely on a SAN/NAS for data storage and will let you use Map/Reduce to batch process the data.
If you have a billion customers each with a million records you are the Google Maps location thingy and are not a startup anymore and a relational solution may, or may not, work.
If you have ten users the ideal database is probably ten interns and some whiteboards. I'm not kidding.
I observe there is a common claim brought up every week on HN about fake it till you make it. You don't automate (insert menial task here) until you have ten customers.
I observe there is another common claim brought up every week on HN about how you need a massively scalable design which can never change from day one, because your schema, either formal or informal, is perfect and unchanging, LOL.
Those two stylistic outlooks are not compatible.
Start with something that has too many features, too many abilities, too much room for expansion, like a relational DB, and then later on if you need to, put some stuff on another platform. IF you need to. IF your company survives. Lots of IF.
But one can argue that not careful with schema design can break api and make codebase messy.
I guess I will stick with the hard work now... I guess not careful with schema will definitely bite me.
I find that a poor argument. One can use an ORM that automatically creates a schema based on classes. E.g. I like Ebean with DDL generation. You just write classes and add @Entity annotations. Ebean automatically creates the schema. Combine this with an embedded database, such as h2, and there is virtually nothing to set up.
Once you are out of the rapid iteration phase, you can take the latest Ebean generated schema and use to proper migrations for later changes.
So you don't have to create bullshit m2m tables with tedious joins for a fucking tagging system
So you properly normalized your entire system, customer billing transaction records all the way up to article tags. Then article tags gets too huge. So next version looks at RDBMS and Redis, and the next version after that only looks at Redis. Customer billing transactions remains on a "real" DB and the tag cloud lives on redis. And the problem with that is... what exactly?
Its obsolete thinking. I can't have two databases because we're a poor startup and the only databases that exist are DB2 and Oracle and everyone knows they're super expensive so super expensive times two is unaffordable. Dude, its almost 2014 not 1980, Postgres/mysql/redis its all free.
It is the same thing for choosing Linux distribution and JDK mode. See the references here:
http://bingobo.info/blog/table-of-contents.jsp
BTW, your title should have "relational" instead of "relation".
As far as I know, most graph databases support transactions and offer great scalability. Such databases are also schema-less and can be queried with Gremlin, a powerful graph traversal language (see www.tinkerpop.com).
With respect to scalability and transactions, Titan (http://thinkaurelius.com/) looks very promising: it supports various backends for storage (Cassandra, HBase, etc.) and indexing (currently Elastic Search and Lucene). Graph analytics can be done via Faunus (http://thinkaurelius.github.io/faunus/), backed by Hadoop.
There are other vendors out there (Neo4J, OrientDB, etc.) which offer interesting solutions worth looking at - I'm just a bit less familiar with them.
The major downside I see with graph databases is that most of them are fairly recent and their ecosystem is tiny (though growing). Should a startup venture on such young technologies, or stick to mature and battle-tested solutions (ie. relational databases)?
Could startups use this kind of graph "NoSQL" databases? I don't see why not. If your startup is some kind of social network, graph databases are certainly an option worth considering. If I were to create a startup, I'd hardly use a document database like MongoDB but I will really consider using a graph database. In the end, it's all about having the right tool in hand, and knowing how to assert what is "right" for you.
Thing is. Most SQL database at scale is a bit of a horror too. Have you seen real-life production relational databases? Gawd. Hacks on hacks. Then you add another database. And another analytics database. And a bunch of point to point data feeds. Argh.
But hey. That's data.
If you think choosing SQL will solve your analytics woes down the line -- it's just not true. You're in for some pain no matter what you do.
... That's unless you get a porcelain schema first time. Which, if you're in a startup, probably means you're working on the wrong problem.
That's not an argument for using NoSQL (I used MongoDB daily, but I've got plenty of love for PostgreSQL). It's a rebuttal that SQL magically solves a different problem.
A relational database might be just as wrong as a nosql one if all you need is redis.
Then you get a little closer to the truth.
"All the while moving work onto the developers to standardize how they handle different migration cases."
I know a startup is fast and bla bla... BUT your team should know the tools that you are using... For me SQL DB's force me to add a new field and some kind of value and i don't like to be forced to a solution.
"In document stores, you have two choices: store related data as sub-documents, or store related data as separate documents with references. It is up to the developers to understand the trade-offs of both approaches. Selecting one over the other can lead to performance gains or issues, scalability issues and above all, make asking certain questions of the data a lot harder."
Again know the tools you are using. And for example MongoDB has good ORM's too.
"But that takes much more forethought and is dependent on a particular problem."
If your startup is doing something new and shiny you don't have the knowledge and forethought and you often dont know what particular problem will come at you.
Most of the point's look like: You learned at your University SQL now you know it(but in really life you don't) and now use it because you know how to normalize a Database. This argumentation is often used to say why java is so great or why javascript is bad.
I personally started with php then moved to rails and now to meteor(uses MongoDB) and we never before meteor could make so fast a good prototype which for a startup is very important.
So yeah if you are comfy with SQL use it if your comfy with NoSQL use it.
However relational databases will have hard time with big data because your dataset is bigger than your database and you have no relational integrity.