Comparing Database Types (opens in new tab)

(prisma.io)

521 pointsironcutter6y ago170 comments

170 comments

> Relational databases get their name from the fact that relationships can be defined between tables.

This is a widespread misconception. Relational databases get their name from relations in the mathematical sense[1], i.e. sets of tuples containing facts. The basic idea of the relational model is that logical predicates can be used to query data flexibly without having to change the underlying data structures.

The basic paper by Codd[2] is really worth reading and describes, among other things, the problems of hierarchical and network databases that the relational model is meant to solve.

[1] https://en.wikipedia.org/wiki/Finitary_relation

[2] https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf

tlarkworthy6y ago

What is the relation? The table? (I.e. the tuples describing the rows). Or is it the joins? (In which case the article is correct). The columns?

woolcap6y ago

One way to think about it is with a mathematical relation, like 'X > Y'. A relational database relation representing this relation would consist of a header tuple, <X, Y>, and a set of tuples whose values satisfy the relation, such as <10, 2>, <8, 3>, <9, 4>. In more common terms, the rows of this table would contain pairs of numbers in which the value of the X attribute is greater than the value of the Y attribute. This table describes the relation(ship) of certain pairs of numbers.

"Each tuple in a relation represents an n-ary relationship...among a set of n values..., and the full set of tuples in a given relation represents the full set of such relationships that happen to exist at some given time--and, mathematically speaking, that's a relation."[1]

[1] Chris Date, Database in Depth, page 46

talaketu6y ago

This points to the dual view of a relation - intensional vs extensional. The beauty of the relational model of data is the morphism: by evaluating relational operations on the extensional model (data) we gain can answer questions corresponding to intensional model (concept).

For example given the extensional data ALBUM(TITLE, ARTIST) corresponding to the intention "the albums, each with a title and artist", we can compute "the eponymous albums, each with a title" via EPONYMOUS_ALBUM = ALBUM where (TITLE = ARTIST)

We started with some data for a relation corresponding with a concept, and were able to operate on the data to produce a new relation - data corresponding to a new concept.

baddox6y ago

The relation is essentially all the rows in a given table. In relational algebra, a relation is a set of tuples of a fixed length. Each position in each tuple is associated with some attribute (essentially the name of the column) and each element in a given position is a value of a certain "data domain" (essentially a data type, like "integer").

taffer6y ago

The table is the relation. A join is just an operation that combines two relations to a superrelation.

Pheebau6y ago

A join in an intersection in relational algebra

3 more replies

skybrian6y ago

The value in a table at a particular time is a relation, but so is the value of a view, the result of a subquery, the result of a select statement, and the value represented by the contents of a CSV file. In some API's this is called a row set.

It's essentially a table-shaped value. Conceptually it's immutable, and relational algebra is about calculating new values from old ones. A select statement does this too.

dsego6y ago

In a sense it reminds me of sentence structure. The table/relations is like the predicate and the rows contains all the subjects/objects that the predicate applies to.

ashelmire6y ago

A relation is a tuple (H, B) with H, the header, and B, the body, a set of tuples that all have the domain H.

jrapdx36y ago

As a non-mathematician I've always conceived of the relations among data as existing in the queries. In a SQL query the "relation" is specified by matching columns in the tables containing the data we're looking. Conceptually indexes, etc., are system implementation details.

That's certainly not a rigorous definition but helped me keep my head straight about what I was doing.

nikolasburk6y ago

Thanks for the hint, we'll update the article! :)

triska6y ago

Related to the logical view, it would also be great to include deductive databases:

https://en.wikipedia.org/wiki/Deductive_database

Deductive databases derive logical consequences based on facts and rules.

Datalog and its superset Prolog are notable instances of this idea, and they make the connection between the relational model and predicate logic particularly evident.

Codd's 1979 paper Extending the Database Relational Model to Capture More Meaning contains additional information about this connection. For example, quoting from Section 3 Relationship to Predicate Logic:

"We now describe two distinct ways in which the relational model can be related to predicate logic. Suppose we think of a database initially as a set of formulas in first-order predicate logic. ..."

mitchtbaum6y ago

It seems these rules refine the data outside the database itself. But they're so tightly integrated between the database and the application that the lines separating them become blurred.

1 more reply

woolcap6y ago

> Relational databases get their name from the fact that relationships can be defined between tables.

Relational databases get their name from the mathematical concept of a relation, used by the Relational Model, "an approach to managing data using a structure and language consistent with first-order predicate logic, first described in 1969 by English computer scientist Edgar F. Codd, where all data is represented in terms of tuples, grouped into relations." [1][2] (emphasis added)

Recommended Reading: Database In Depth, by Chris Date.

[1] https://en.wikipedia.org/wiki/Relational_model

[2] https://en.wikipedia.org/wiki/Relation_(database)

reilly30006y ago

I did a MOOC on relational algebra that made me much more productive in SQL and better appreciate the gravity of what RDBMS really offer. Understanding relational algebra helps demystify the magic or query planners and grok why they both add and reduce latency based on use cases.

edmundsauto6y ago

Mind sharing the course? Sounds useful.

reilly30006y ago

https://www.coursera.org/learn/data-manipulation

tacotime6y ago

here is one: https://lagunita.stanford.edu/courses/DB/RA/SelfPaced/about

pjungwir6y ago

Codd's 1979 paper "Extending the Relational Model" [1] is really interesting, especially the second half. The first half is about nulls and outer joins, and I think that steals everyone's attention. But the second half basically gives a way to turn your RDBMS into a graph database by (among other things) letting you query the system catalog and dynamically construct your query based on the results. This would never work with today's parse-plan-optimize-execute pipelines, but it's a really cool idea, and I've certainly often wished for something like it. I'd love to know if anyone has followed up on these ideas, either in scholarship or in built tools.

[1] https://gertjans.home.xs4all.nl/usenet/microsoft.public.sqls...

einpoklum6y ago

The document completely overlooks Columnar databases, which are focused on analytics and are much faster than (most, not all) general-purpose DBMSes. See:

https://en.wikipedia.org/wiki/Column-oriented_DBMS

and

https://www.slideshare.net/arangodb/introduction-to-column-o...

or get:

http://www.nowpublishers.com/article/Details/DBS-024

Examples:

* MonetDB

* SAP Hana

* Actian Vector (formerly Vectorwise)

* Oracle In-Memory

ralusek6y ago

A database being columnar is more of an implementation detail regarding the underlying storage mechanism than it is a type of database. It has to do with how the data is physically stored, i.e. by putting values across rows of a single column adjacent to one another on disc/in memory.

kthejoker26y ago

Also Druid, HBase, Vertipaq (engine behind PowerBI), Redshift, Azure SQL DW, etc

Columnar compression is a really interesting engineering problem

nicoburns6y ago

Also notable is the Postgres column store extension.

Not as fancy / performant as the dedicated columns store databases, but it allows you mix and match row-tables with column-tables which is pretty nice.

einpoklum6y ago

Yes it is, and I've written academically about this. For example: https://is.gd/9wwjjf

shrumm6y ago

ClickHouse is another favourite

einpoklum6y ago

Clickhouse is a columnar system, yes, but is not a full-fledged DBMS. Specifically, I don't think it can join tables.

NotSammyHagar6y ago

memsql (which was in there for new sql).

einpoklum6y ago

Eh... not quite. It's in-memory representation is row-based. It seems it uses columnar secondary storage. At least - that's what it says here: https://en.wikipedia.org/wiki/MemSQL

takeda6y ago

They are much faster for analytical queries, not for transactional ones. If they were overall faster they would be the default relational databases.

einpoklum6y ago

Yes, I did say "focused on Analytics" but I didn't clarify that they're only/mostly fast for analytic workloads.

cptnapalm6y ago

I know very, very little about any of this, but would this be akin to entity component systems in video games? Forgive me if I'm way off.

einpoklum6y ago

Sorry, I don't know what "entity component systems" are. If you mean saving the same component for all entities, rather than saving a bag of components for each of the entities, then sort of.

cptnapalm6y ago

Close enough to be helpful. Thank you.

muydeemer6y ago

The article reminds of the work of Stonebraker and Hellerstein - What Goes Around Comes Around, which gives a description of how the database world goes in cycles (can be found here: https://people.cs.umass.edu/~yanlei/courses/CS691LL-f06/pape...)

gibsonf16y ago

The discussion of graph dbs completely misses the semantic rdf graph approach and how that differs greatly from the property graph (which is discussed). So important is not having to have a custom schema for each application that does not communicate with any other app as opposed to using standard ontologies with relationships and classes that are known and allow interoperability between systems (Linked Data Platform - Solid)

planck016y ago

Do you know of any successfully semantic RDF graph databases, I guess with OWL support? Because I personally don't. If not, it probably is rightfully too much an academic niche to be discussed in the article.

kthejoker26y ago

Stardog, MarkLogic, Virtuoso, AllegroGraph, and RDF4J all have commercial applications, but yeah in general semantic RDF is dying on the vine.

gibsonf16y ago

We are using Blazegraph and Neptune in production as well as Allegrograph. With Neptune, we tested scale by putting the entire dbpedia on one 4 core machine with 16G of ram. It handled 2.7 billion statements without any issues (we ran out of time with the test - sure it can handle more)

hmottestad6y ago

Stardog is quite successful.

dehrmann6y ago

Does Top Quadrant have anything that does this?

AlphaWeaver6y ago

This article comes from the team at Prisma, who are doing some really cool work building "database schema management and design with code" tools. They're working on a new version of their library right now (Prisma 2) and are regularly giving updates to the community and providing test versions.

Most everything they make is open source and really well designed. Would recommend checking it out!

nikolasburk6y ago

Nikolas from the Prisma team here! Thanks a lot for the endorsement, we're indeed super excited about the current database space and the tooling we see emerging.

For anyone that wants to check out what we're up to, you can find everything you need in this repo: https://github.com/prisma/prisma2

616c6y ago

I am curious about Prisma2 because I tried to build a server side API with v1 as a novice to graph systems and it became an unwieldy nightmare. Partially my fault for wanting to do it without the SaaS they provide but trying to build with something complicated and Apollo on the frontend with a skilled FE dev got me so confused I put it off.

nikolasburk6y ago

Prisma 1 indeed has a couple of quirks that we're currently ironing out with Prisma 2 (or the "Prisma Framework" as we now call it). Would love to hear from your whether the new version actually solves your pain points!

Feel free to reach out to me: burk@prisma.io or @nikolasburk on the Prisma Slack https://slack.prisma.io

SPascareli136y ago

I was looking into Prisma + Apollo, can you expand more on your problems with it? To be it seemed really magical at first, but I don't know how it really works in production.

muydeemer6y ago

Just a quick remark on graph dbs. Titan which is mentioned in the article as an example of a graph db is dead. Its successor is the Janus graph (https://github.com/JanusGraph/janusgraph).

planck016y ago

I am surprised Dgraph isn't mentioned as an example. It is the most starred graph db on Github, and I think it is the best one in terms of performance and scalability.

mdaniel6y ago

Strange that they have to have such a non-standard license, when they go out of their way to mention Apache 2 several times: https://github.com/dgraph-io/dgraph/blob/master/LICENSE.md

Contrast that with Orient, who also have an Enterprise version, and they just straight-up say "Apache 2, no drama" https://github.com/orientechnologies/orientdb/blob/develop/l...

We had an absolutely miserable experience trying to get Janus to behave rationally, and thus far have had zero drama with Orient; we skipped dgraph because it does not appear to work with Gremlin, meaning one must use vendor-specific APIs to use dgraph.

Their client reminds me of the days before ORM: write a big string literal and send it to the server: https://github.com/dgraph-io/dgraph4j#running-a-query

mrjn6y ago

> write a big string literal and send it to the server

To be honest, that describes almost all query languages, including SQL and Cypher. Dgraph's QL is based on GraphQL[1]. When starting Dgraph, we had a choice to go with Gremlin or Cypher. But, we didn't like either choices -- Gremlin tends to be slower to execute, because of how they treat the queries, i.e. graph walks, with iteration over results to generate the next set of queries; something we explicitly avoid [2]. And both of them returned lists of things.

GraphQL, on the other hand, allows a subgraph to be returned. One can go from a subgraph to a list of things (lossy), but can't go the other way around, because you lose relationships (knowledge about how they are connected).

That's what led us to the path of GraphQL. Over time we modified it to what we have now, GraphQL+-. Meanwhile, over these last 3 years, GraphQL has become so much more popular than either Gremlin or Cypher, that now Neo4j is a founding part of the GraphQL Foundation, just like Dgraph is.

Now, Gremlin is still relevant in the graph space. So, we're planning to build Gremlin support. But, before that, we're working on building a native support for official GraphQL spec. That should let one integrate into the fast growing GraphQL ecosystem.

[1]: https://graphql.org/ [2]: https://blog.dgraph.io/post/why-google-needed-graph-serving-...

campoy6y ago

Hi there,

Head of Product at Dgraph here. Our license is Apache v2 for most of our source code.

Only our enterprise features are licensed with the Dgraph Community License [DCL](https://github.com/dgraph-io/dgraph/blob/master/licenses/DCL...).

We decided to make all of our code available on GitHub regardless of the license (unlike OrientDB if I understand correctly) so anyone can see what the Enterprise Edition really does.

Let us know if this is confusing. If you don't want the proprietary code, you can always only use the open-source version of Dgraph with `make oss`.

I'm also really interested in seeing where GQL (gql.today) is going, adopting a community-wide standard would be a definite yes from us.

In any case, thanks for your feedback. I'd love to hear more on how we can make our product better, feel free to email me at francesc@dgraph.io.

FridgeSeal6y ago

I believe support for additional query languages is in the roadmap.

mrjn6y ago

(Dgraph author) Particularly embarrassing because I actually know the founders of Prisma ;-). Amazing folks!

They even included YugaByte, with only 2.8K GitHub stars. Dgraph crossed 11K GitHub stars and is in the top 10 Graph DBs on DB Engine now -- what would it take for us to be in the article, Søren?

Just joking. Nice article! Keep up the good work, guys!

imchairmanm6y ago

Hey there, I'm responsible for this omission! I've gotten quite a few comments internally and externally so I'll make sure to give Dgraph a mention when I incorporate some of the feedback I've received from this thread. Thanks for the heads up!

1 more reply

fitzoh6y ago

There's also Datastax Enterprise Graph (commercial) from the team behind Titan after they were acquired by Datastax.

https://www.datastax.com/products/datastax-graph

https://venturebeat.com/2015/02/03/datastax-acquires-aureliu...

imchairmanm6y ago

Thanks for the heads up. I'll try to incorporate that into the article soon.

tabtab6y ago

I'd like to see "dynamic relational" implemented. It's conceptually very similar to existing RDBMS and can use SQL (with some minor variations for comparing more explicitly). You don't have to throw away your RDBMS experience and start over.

And you can incrementally "lock it down" so that you get RDBMS-like protections when projects mature. For example, you may add required-field constraints (non-blank) and type constraints (must be parsable as a number, for instance). Thus, it's good for prototyping and gradually migrating to production. It may not be as fast as an RDBMS for large datasets, though. But that's often the price for dynamicness. (A fancy version could allow migrating or integrating tables to/with a static system, but let's walk before we run.)

https://stackoverflow.com/questions/66385/dynamic-database-s...

Some smaller university out there can make a name for themselves by implementing it. I've been kicking around doing it myself, but I'd have to retire first.

asah6y ago

I've done this (commercially!) with PostgreSQL - just start with a single table, with one JSON field, and as you want performance, integrity, etc, add expression indexes, break out frequently used expressions into columns etc. On large tables, obviously there's a cost for this reorganization but you can partition the data first, and only reorg the most recent data (e.g. range partitioning by time).

https://www.google.com/search?q=expression+index+postgres

https://www.postgresql.org/docs/10/ddl-partitioning.html

zawerf6y ago

I am trying this out and I am still on the edge of whether I like it or not.

Create a table with a json column:

  CREATE TABLE Doc (
    id UUID PRIMARY KEY,
    val JSONB NOT NULL
  );

Then later it turns out all documents have user_ids so you add a check constraint and an index:

  ALTER TABLE Doc ADD CONSTRAINT check_doc_val CHECK (
    jsonb_typeof(val)='object' AND
    val ? 'user_id' AND
    jsonb_typeof(val->'user_id')='string'
  );
  CREATE INDEX doc_user_id ON Doc ((val->>'user_id'));

I think the postgres syntax for this is pretty ugly. And if you also want foreign key constraints you still have to move that part of the json out as a real column (or duplicate it as a column on Doc). I am not sure it's even worth it to have postgres check these constraints (vs just checking them in code).

I am also a little worried about performance (maybe prematurely). If that document is large, you will be rewriting the entire json blob each time you modify anything in it. A properly normalized schema can get away with a lot less rewriting?

jayd166y ago

At this point, I would just bite the bullet, break out a new user_id column and run a migration to populate the column.

takeda6y ago

Just a note about using uuid as a primary key. Typically you will use a b-tree index, which likes to keep things sorted. So something like a serial number works best, because it is already sorted and will be appended at the end. Otherwise inserting a new column will cause traversal the b-tree all over the place which will hurt performance it you do a lot of inserts.

If you really want to use uuid and care about performance you might prefix it with something that's increasing like a date, or perhaps (did not try it) use hash index (need to be PG 10+).

1 more reply

tabtab6y ago

It seems to treat "direct" columns different than JSON-ized columns. That bifurcation creates two ways of doing things. Ideally, it should be seamless.

ucarion6y ago

Do you have any resources online for how well this approach works in practice? I've been thinking about doing this in replacing a MongoDB database.

cagmz6y ago

Anecdotally, it's worked out well. Before storing a JSON object, we validate it at the app-level using a JSON schema (which has types, required keys, etc). This lets us write without worrying to much.

Once we felt that the schema wasn't changing as much, had too many concerns, etc, we made tables and wrote to them (instead of the JSON column).

pas6y ago

Could you please elaborate on why do you want to replace MongoDB?

thunderbong6y ago

Have you seen RedBean ORM? https://www.redbeanphp.com/index.php

tabtab6y ago

I don't really want an automatically generated static RDBMS, I'd like a dynamic RDBMS. For one, one can use it with any programming language, not just PHP.

pas6y ago

You can create indices in MongoDB on JSON, works pretty well.

And since 4.x they have transactions over shards too!

tabtab6y ago

But the "interface" to the JSON data is different than regular columns, and arguably awkward. The dynamic relational assumption is that one can use regular SQL just like with regular "static" tables (with some minor differences to make compare-type more explicit).

pas6y ago

I find SQL much more awkward than proper language-idiomatic APIs.

Yes, SQL shines when you can whip up a big multi-table join, with nice group by-s and wheres, havings and whatnots, but that's rare. Though, arguably, ORMs solved most of the crazy string concatenation query crafting craziness.

bryanlarsen6y ago

The description of flat-file database seems too restrictive. In my experience, flat files with fixed record lengths and no delimiters were far more common than variable-length delimited formats like CSV. File sizes were often much larger than computer memory size, so random read & write was necessary.

kps6y ago

Yes. The origin of the flat-file database is fixed-format unit record equipment¹, predating computers. COBOL is essentially a language designed for processing fixed-format files.

¹ https://en.wikipedia.org/wiki/Unit_record_equipment

imchairmanm6y ago

That's definitely a good point. I'll try to update the article to reflect that soon. Thanks for the feedback!

rainyMammoth6y ago

What about time series databases that are fairly common nowadays ?

manigandham6y ago

Time-series is more about a specific use-case about data that has a primary time component (like sensor metrics). You can store it in any database, although the common ones are usually some sort of key/value or relational with specific features for time-based queries.

Hbase/Bigtable/DynamoDB/Cassandra are key/value. InfluxDB is key/value. Timescale is an extension to Postgres.

jnordwick6y ago

The big time TS databases (Sybase, KDB, Informix Datawarehouse) are column-based, not key value or traditional relational row-oriented. The ones you list are all lower-tier trying to shoehorn a time field on another model.

manigandham6y ago

Those are still relational databases, just with column-oriented/column-store tables. I don't see how the storage layer changes the database type. For example, MemSQL has both rowstores and columnstores. Postgres 12 has pluggable storage with column-store (zedstore).

1 more reply

shakkhar6y ago

If you could store timeseries data "in any database", kdb wouldn't be a thing. Just go and ask a quant trader replace his kdb instance with postgres. (Be prepared to be laughed out of the room.)

manigandham6y ago

I'm not sure what your point is. Time-series describes the data, not the database.

You can store timeseries data in Postgres if you want to (and optionally adding extensions like Timescale). You can store it in key/value like Redis or Cassandra. You can store it in bigtable. You can store it in MongoDB. Obviously different scenarios require different solutions.

KDB is a relational database with row and columnstore with features for time series and advanced numerical analytics along with a programming language. KDB is a thing because of those abilities, whether you use it for time-series data or not.

1 more reply

neop1x6y ago

Good point, I think it should be mentioned because those DBs are special in that they are optimized for analyzing data changes over time. They store timestamps effectively and often allow filtering and aggregation on tags and in time intervals. It is not common to query RDBM for a chart of 10-minute averages of latency, histograms, quantiles, etc and do things like downscaling from 1-second intervals to 1minute intervals Great examples: Prometheus, Graphite, InfluxDB

nudpiedo6y ago

All of them are in fact graph databases, they just didn't realize about it and got lost giving the implementation the category of design for many reasons specific to the context in which they were created. I think we should think more often as mathematicians and a little bit less as "hackers"

TheMiller6y ago

I think this is a mischaracterization. The relational model which motivated relational DMBSs is based on predicate logic. Mappings to graphs are obvious, but are not the organizing principle. This was one of the strengths of the relational model, encouraging a more flexible view of the data than graph databases had previously offered. In a complex relational schema, you can discover and work with all kinds of implicit graphs that were not originally intended by the schema design.

edmundsauto6y ago

If they are all described as graph databases, we lose the usefulness of understanding the differences between them. I think understanding these differences are at least interesting, and possibly useful.

_Understated_6y ago

I'm curious... what did the author mean by this:

> Legacy database types represent milestones on the path to modern databases. These may still find a foothold in certain specialized environments, but have mostly been replaced by more robust alternatives for production environments.

I didn't notice anything that went into any detail about legacy database types.

Any idea what the author means by a "Legacy Database"?

AlphaWeaver6y ago

The article gives some examples of these, it seems they're mostly referring to things we wouldn't consider "databases" like flat files.

tempguy99996y ago

He actually tells you in the article, straight after (flat file, hierarchical...)

_Understated_6y ago

Aww man. I am a dumbass... I never equated that section of things like Network databases and such as legacy.

Dunno how I missed it :(

*Must read slower...

tempguy99996y ago

We've all done it :) No worries!

AtlasBarfed6y ago

Mainframe days

honkycat6y ago

There is a great chapter in "Designing Data Intensive Applications" about this very subject

bryanrasmussen6y ago

Again a renaming that makes what the article is actually about less clear.

victor1066y ago

Can anyone here point to a resource that gives a comprehensive treatment (use cases, pros and cons etc) of all the types (Nosql, NewSQL, relational, timeseries) of databases being used today?

thekhatribharat6y ago

[Shameless Plug] A summary of the evergrowing NoSQL and NewSQL market: https://medium.com/open-factory/nosql-newsql-a-smorgasboard-...

minitoar6y ago

How would you categorize something like ClickHouse or Interana or Druid? Columnar I guess, but then the description of Column-family in the article doesn't match up with my experience of how those work.

imchairmanm6y ago

Hello, author here. That's a good question and something I had a hard time sorting out as I worked on this.

I think those fall into a different category confusingly sometimes called column-oriented databases. They're primarily used for analytic-focused tasks and get their name from storing data by column instead of by row (all data in a single column is stored to disk together).

I didn't include those as a separate category here because they're basically relational databases with a different underlying storage strategy to allow for easier column-based aggregation and so forth.

My colleague shared this article [1] with me, which definitely helped inform how I distinguished between the two in my head.

[1] http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-m...

minitoar6y ago

That makes sense, they really are just relational databases optimized for certain tasks, with corresponding limitations e.g. they don't support arbitrary joins.

barrkel6y ago

There's nothing intrinsic about not supporting joins, in a columnar store; it's just that you lose a huge amount of the linear scanning performance if you have to do joins for each value. Most columnar stores I've used (primarily Impala, SparkSQL and Clickhouse) all support joins, but they materialize one side of the join as an in-memory hash table, which limits the allowable size of the join, and is a cost multiplier for a distributed query. I believe per the docs that MemSQL can mix and match row-based with columnar more easily, but joins are always going to be really slow compared to the speed you can get from scanning the minimum number of columns to answer your question.

2 more replies

intellix6y ago

Skipped through it looking for an answer but didn't see it: where are unions in Prisma?! Was looking for some big reveal about an underlying choice that enables what everyone is begging for

matthewmueller6y ago

I've been mapping out union types at Prisma.

Are you in our Slack? I'm @mattmueller at https://prisma.slack.com.

I'd love to chat with you to better understand your use cases, so we can make sure we're designing it for you.

dehrmann6y ago

The column-family databases mentioned (Cassandra, HBase) are both just fancy key-value stores that add semantics for separate tables and cell-level data so you're not rolling it yourself.

neop1x6y ago

No Elastic among the examples while highly popular and nice :( Great article overall, though!

CMCDragonkai6y ago

There's also column-oriented or array databases like MonetDB and Rasdaman.

zbentley6y ago

And the elephant in the room: Cassandra.

sourcepath6y ago

What happened with Prisma being all about graphql?

nikolasburk6y ago

GraphQL is a really important use case for Prisma. That is using Prisma as the "data layer" on top of your database when implementing a GraphQL server (e.g. using Apollo Server). However, it's not the only use case since you can effectively use it with any application that needs to access a database (e.g. REST or gRPC APIs). We actually wrote a blog post exactly on this topic: https://www.prisma.io/blog/prisma-and-graphql-mfl5y2r7t49c/

You can also find examples for the various use cases here: https://github.com/prisma/prisma-examples/tree/prisma2

Please let me know if that clarifies it or if you have more questions! :)

galaxyLogic6y ago

What happened to Object-Oriented Databases?

AtlasBarfed6y ago

Document databases kind of killed them I would specualte. Since JSON serializes with objects so much better than (ugh) XML, the Relational impedence is gone (well, a lot of it).

chrisweekly6y ago

I didn't RTFA, but based on titles alone, isn't "Object Database" missing from the list?

marcosdumay6y ago

Aren't those a special case of hierarchical databases?

marknadal6y ago

What a lovely article!

It should be emphasized that graph databases can do all other types of databases (relational, document, key/value, etc.) as you can see demonstrated in this article (https://gun.eco/docs/Graph-Guide).

This makes graphs a superior data structure.

If you think about the math, any document is a trie, and tables are a matrix. Both trees and matrices can be represented as graphs. But not all graphs can be represented as a tree or graph.

This gets even more fun when you get into hypergraphs and bigraphs, which are totally possible with property graph databases where nodes have type!

jt21906y ago

> This makes graphs a superior data structure.

I'll read this generously and assume you meant to say that graphs are an essential data structure, i.e. we can use a graph to represent the more specific data structures used by various types of databases (e.g. A b-tree is a type of graph)

Whether a graph data store or a more specialized tool (e.g. a relational database, etc.) is superior depends (as I'm sure you agree) on context.

danenania6y ago

“It should be emphasized that graph databases can do all other types of databases (relational, document, key/value, etc.)”

Not to knock graph dbs, but isn’t the reverse also true?

mumblemumble6y ago

Yes. And it may even be the best way to do it. For example, here's a paper where the authors come up with a schema and transpiler for doing a Gremlin-queryable graph DB in PostgreSQL, and find that it outperforms Neo4j and Titan:

https://static.googleusercontent.com/media/research.google.c...

gbear06y ago

That's interesting, but kinda makes sense since it would be optimizing specific access patterns by translating it to relational models rather than using the normal graph walking algs to find relations.

As an anecdote for one project, while trying to speed up some neo4j queries myself, I decided to model a binary tree structure in the nodes (child/parent relations) and then compared the query times for using the simple cypher queries vs cypher queries with some embedded lib functions that would walk the tree exactly the way I wanted. The times were much faster for something that I hadn't even optimized much code wise.

The test got me thinking that if I could have a way of declaring more info about how the relationships are related, then maybe we could automatically have the db use more appropriate algs for a more appropriate data structure for certain node types. I think that's similar to what's happening here, it's automatically mapping out the simple graph relations to structured relational db tables. I hope in the future we'll be able to provide more input to that as well (or at least I haven't seen something like this yet). Let me annotate my schema to specify my parent/child relationship as a tree, or my word map nodes form a trie but the leaves should be some other type. Why can't we think of a db as having multiple datatypes beyond just a kv-store, or table, or graph?

1 more reply

AndrewBowman6y ago

That compares against Neo4j 1.9.4, released in 2013. All technologies in question have improved much since then, especially graph db technology, efficiency, and speed, so I don't think that paper has as much relevance anymore. Would love to see a more updated comparison.

lmkg6y ago

One of the key features of graph databases is to select one node, and then recursively 'chase' edges until you find another node matching some criteria. Other database models can have trouble representing chasing an unbounded number of edges. E.g. in the relational model, following an edge to another node is usually represented as a Join operation, and SQL doesn't let you parameterize the number of iterated joins. This is especially true if the thing you want to query is actually the path length.

In an HN thread from a few days ago, someone made the claim that the graph model could be represented by SQL + recursion, and recursive SQL is an extension offered by some databases. But the relational model itself cannot fully represent the graph model.

Without digging too deep, I suspect other database models run into similar problems. E.g. a document store could very easily represent a Directed Acyclic Graph as a document, but when you get into general graphs your document needs to end on a value that is the key to another graph.

This is not agree with the claim that graph databases are generally superior. I like them, and they're fun, and I think more developers should be aware of them for cases where they apply, but I also don't think they have advantages over relational or document stores when the data is natively table-shaped or DAG-shaped.

henryfjordan6y ago

You absolutely can use recursive SQL to build a graph in a relational database, I've done it. You make a table with a primary key and some data and then make another table that represents edges between objects in the first table. Then, like you describe, you can use recursive queries (built into Postgres) to query your graph. You end up with an adjacency list which might not be the most efficient way to represent your graph, but it works well enough up to a certain size.

I don't quite get what you mean by "the relational model itself cannot fully represent the graph model". If you can store all the edges of a graph and query them, what can't be done?

Graph Databases, depending on the underlying storage model, can have performance increases over relational databases when the JOINs are what you are interested in. This is because JOINs in SQL are often O(n) or O(log n) (if you index the join) where N is the number of rows in the target table, but following a relationship in a graph might be O(1) if you store an object with pointers to other objects right there. Writes might be more expensive though.

It all comes down to use case

danenania6y ago

It seem similar to the SQL vs. NoSQL dichotomy in that it's a question of general ad hoc queries vs. predetermined queries. SQL lets you query anything with or without an index, but you pay for this flexibility with limits on performance and scalability.

In the case of graphs, if you know the exact shape of your graph and its query patterns ahead of time, you can design the optimal structure in a KV/NoSQL store. But if you need/want the flexibility to do ad hoc queries of the graph in a reasonably performant way, that's where the graph db shines.

ithkuil6y ago

A rather long write-up but gives some context about why it's hard to build a graph data model as a layer on top of commodity non-graph databases: https://blog.dgraph.io/post/why-google-needed-graph-serving-...

(Obviously, the underlying storage layer of a graph db will use some sort of simpler storage layer, usually some kind of key value store)

namelosw6y ago

Some databases design like extremely simple key-value databases cannot efficiently express joining relation unless load all the data in memory. The same could be said for column based databases etc. I guess that's quite a difference.

danenania6y ago

"key-value databases cannot efficiently express joining relation unless load all the data in memory"

With the right design can't you store the graph data across multiple keys and then load pieces of it selectively into memory? I get that a graph db specializes in this pattern and makes it more efficient, easier to query, etc., but that doesn't mean it's the only type of db that can model a graph.

1 more reply

cbtx6y ago

There's a difference between being able to do something and doing something well.

kristoff_it6y ago

> To store data, you provide a key and the blob of data you wish to save, for example a JSON object, an image, or plain text. To retrieve data, you provide the key and will then be given the blob of data back. The database does not evaluate the data it is storing and allows limited ways of interacting with it.

Definitely not a good description of Redis, even though they cite it as the first example of a Key-Value DB.

nailer6y ago

> Relational vs. Document

Tabular vs Document. Having relations is orthogonal to the shape of your data. There are document databases with relations - RethinkDB was pretty popular. Mongo sadly doesn't have them but will probably eventually get them too.

takeda6y ago

The adjective relational in a relational database comes from mathematical relations, tuples i.e. data in tables.

It's common misconception that it is from foreign keys.

nailer6y ago

That's interesting - it seems to both be backed by and conflict with a lot of https://en.wikipedia.org/wiki/Relational_model but maybe that's wrong.

I'd still avoid the word 'relational' though - obvious many people will assume 'relational' is related to DB relations rather than tuples (assuming you're right about 'relations' meaning tuples, a lot of the wikipedia contributors are included).

Nican6y ago

I am really tired of articles that talk about the different types of databases. People can make a graph databases act like relational databases, and vice-versa. Computers, in the end, are just a Turing machine. Just pay attention that the query that you are executing is actually doing the optimal solution.

I wish more time would be spent talking about the underlying algorithms that the different query languages use to accomplish the tasks. It is important for developers to understand the execution complexity of queries, and how data is distributed across a cluster.

For example, I am usually surprised when people talk about "web-scale", but they do not understand the difference between a "merge-join" and a "hash-join". Or when people do not realize that a sort requires the whole result set to be materialized and sorted.

scottlamb6y ago

> For example, I am usually surprised when people talk about "web-scale", but they do not understand the difference between a "merge-join" and a "hash-join".

In fairness, I think "web-scale" generally means the serving path of a website with (say) hundreds of millions of active users. In other words, a heavy OLTP workload. The total query volume is too high for a single-machine DBMS but each operation executed is probably simple. They may not be doing joins at all; many of these websites have gotten away with key/value databases. Where they are joining, most likely at least one side of the join is a small amount of data directly owned by calling end user. (In social products, the other side might be say the users table, to map userids to names, email addresses, profile photos, etc.)

Big joins are more likely to happen in offline paths but likely via something like MapReduce rather than in the database server, and that batch processing framework may use different terminology for similar needs.

In that context, I think it's relatively understandable why someone would be fuzzy on merge-join vs hash-join. There are other skills they might need that are specific to key-value or "NewSQL" databases like Bigtable or Spanner. I wouldn't expect someone who doesn't work on a "web-scale" system to know much about this. These skills aren't simply additive, and "web-scale" isn't necessarily harder, just different.

And then of course there's people who think they have a web-scale website when it's not popular enough that you need to give up on single-machine DBMSs. There's just no hard problem there: not expense of single operations, not overall expense.

jorblumesea6y ago

> Computers, in the end, are just a Turing machine.

This isn't a helpful abstraction. Electricity is just electrons after all, why think about how you're going to wire your house when it's all just atoms? Read a bunch of books on quantum physics, then become an electrician /s

> Just pay attention that the query that you are executing is actually doing the optimal solution.

Isn't that the entire reason for articles like this? eg: If you have data with large amounts of relations a no-sql database probably isn't the right approach.

> I wish more time would be spent talking about the underlying algorithms that the different query languages use to accomplish the tasks.

Absolutely, if just for knowledge's sake, but these are largely not needed if you understand the high level use cases for any given DB. It's cool if you understand B-Trees but not strictly needed to use sql. In fact, many would this this not helpful.

threeseed6y ago

This is non-sensical.

Take a highly nested JSON document and try and implement it in a relational database. You would have an O(1) lookup in MongoDB and a O(n) lookup in MySQL. Or good luck traversing a graph structure in MongoDB when you should've used Neo4J. So in order to have a performant database you do need to "pay attention" and ensure that your access patterns suit the database type.

Also Web Scale is about any approach used at the bigger web companies. And the type of people who use the term are not the same people who would be sitting there optimising SQL queries. So I wouldn't conflate the two.

judofyr6y ago

> This is non-sensical.

Your examples are kinda proving yourself wrong though.

> Take a highly nested JSON document and try and implement it in a relational database. You would have an O(1) lookup in MongoDB and a O(n) lookup in MySQL.

In PostgreSQL (which is not MySQL, but it's a relational database) you can create an index for a column which contains JSON and the lookup for a nested field becomes O(1).

> Or good luck traversing a graph structure in MongoDB when you should've used Neo4J.

Facebook implemented the social graph on top of MySQL: https://www.facebook.com/notes/facebook-engineering/tao-the-....

> So in order to have a performant database you do need to "pay attention" and ensure that your access patterns suit the database type.

I think the point of the parent poster was that you don't need to ensure that your access pattern suite the database type, but that it fits well with the features the database provides. If you're looking at the database landscape today talking about a "database type" isn't really saying much at all.

threeseed6y ago

That's because you took my examples and distorted them.

I specifically mentioned MongoDB and MySQL. Having a JSON type is basically a mini document database since it uses a different/enhanced query language and is often not relational with the rest of the schema.

I also specifically mentioned MongoDB and Neo4J. Yes you could implement a graph in MySQL but as your paper lists you will need to front it with a Memcache in order to cache the graph traversal lookups. With a graph database you generally don't need such a cache.

Nican6y ago

That is exactly it. Some SQL databases (CockroachDB and Postgres come to mind) can have the option of having O(1) JSON lookup. It just has to implement the correct algorithm, and index the data the correct way.

It is not about the database type. It is about the underlying algorithms and data structures.

doh6y ago

Well, not really. It's about how you store the data. For instance all graph data can be stored as hexastore [0] which can be stored literally in any database and you will get similar-ish performance out of them.

Some databases are optimized for certain workflows, so for instance they will store only "pointers" to certain data rather than copying them. In addition their clients obfuscate and abstract the storage layer from user.

But you can really store any data in any database, you just might have to do some extra pre/post-processing and multiple the storage requirements.

[0] http://karras.rutgers.edu/hexastore.pdf

zawerf6y ago

Technical note: big oh isn't a useful measure here. Most databases use b-trees (yes, even mongo) so lookups are at best O(log(n)). That goes for you and the people replying too.

The constant factors are way more important here. It's a 1000x factor difference depending on how durable you need your data to be (whether you need to write to disk or a quorum of network nodes in multiple regions). That is basically the only thing that mattered in the recent mongo vs postgres benchmarks.

alexchamberlain6y ago

When you say O(1)/O(n), what are you measuring the scale in relation to? Number of documents? Number of keys in the document you're looking for?

roywiggins6y ago

Working in a relational-ish database built on top of a poorly-abstracted B-tree object store is among the top ten worst development hells I've personally endured. It was awful.

Yes, you can sort of make any database behave like any other database. But you don't want to. You want a power tool that has query planners built in that can efficiently solve the sort of problems you're dealing with.

manigandham6y ago

The type of database and data model is critical. People trying to make one act like another has led to countless problems.

Everything you said about algorithms and query complexity is part of that foundation, not a replacement for it.

bradly6y ago

> I am really tired of articles that talk about the different types of databases.

I read this article and learned things I didn't know.

What would you like to have done differently? The author not write the post? Or someone not post it to HN? I don't understand your complaint. If you'd like a different article, write a different article.

paulddraper6y ago

> I am really tired of articles that talk about the different types of databases. People can make a graph databases act like relational databases, and vice-versa.

> I wish more time would be spent talking about the underlying algorithms

That's precisely the difference. How they store data is fundamental to what kinds of operations are fast. A row-based and columnar database can look rather similar (PostgreSQL/Redshift). But the performance characteristics are far different.

Did the article come up short?

Nican6y ago

I think part of the problem is that databases can have a mix-and-match of features, and it is hard to classify one database into a single category. I think the article does little in actually helping a developer make an informed decision about what underlying data structures actually fits best with their scenario.

To give some examples, CockroachDB both has column-families and NoSQL characteristics. A column can be specified to be a JSON column, and it can have an inverted index.

Or MemSQL has both row-based and column-based tables, and they use an unorthodox index called "skip lists".

CockroachDB and MemSQL both have different applications and characteristis, but they are just cluttered under "NewSQL", as if it was just was some kind of "SQL but better".

ska6y ago

> Computers, in the end, are just a Turing machine.

Famous last words :)

Detail, in the end, really matter.

tempWinHater6y ago

Especially when you consider that computers are approximations of Turing machines.

Edit: typo

okram6y ago

Check out mm-ADT: http://rredux.com/mm-adt/. The goal of the project is to develop a distributed virtual machine and bytecode specification to make it easier to integrate processors, storage systems, and query languages.

I recently gave a talk on mm-ADT at ApacheCon: https://www.slideshare.net/slidarko/mmadt-a-multimodel-abstr...

shostack6y ago

Do you have any articles you'd suggest that in your opinion cover the appropriate underlying guts of them?

Nican6y ago

Unfortunately, this has something that has been on my backburner for a while. I do not know of any articles that fill exactly the need that I am talking about. :(

j / k navigate · click thread line to collapse

170 comments

taffer6y ago

> Relational databases get their name from the fact that relationships can be defined between tables.

The basic paper by Codd[2] is really worth reading and describes, among other things, the problems of hierarchical and network databases that the relational model is meant to solve.

[1] https://en.wikipedia.org/wiki/Finitary_relation

[2] https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf

tlarkworthy6y ago

What is the relation? The table? (I.e. the tuples describing the rows). Or is it the joins? (In which case the article is correct). The columns?

woolcap6y ago

[1] Chris Date, Database in Depth, page 46

talaketu6y ago

We started with some data for a relation corresponding with a concept, and were able to operate on the data to produce a new relation - data corresponding to a new concept.

baddox6y ago

taffer6y ago

The table is the relation. A join is just an operation that combines two relations to a superrelation.

Pheebau6y ago

A join in an intersection in relational algebra

3 more replies

skybrian6y ago

It's essentially a table-shaped value. Conceptually it's immutable, and relational algebra is about calculating new values from old ones. A select statement does this too.

dsego6y ago

In a sense it reminds me of sentence structure. The table/relations is like the predicate and the rows contains all the subjects/objects that the predicate applies to.

ashelmire6y ago

A relation is a tuple (H, B) with H, the header, and B, the body, a set of tuples that all have the domain H.

jrapdx36y ago

That's certainly not a rigorous definition but helped me keep my head straight about what I was doing.

nikolasburk6y ago

Thanks for the hint, we'll update the article! :)

triska6y ago

Related to the logical view, it would also be great to include deductive databases:

https://en.wikipedia.org/wiki/Deductive_database

Deductive databases derive logical consequences based on facts and rules.

Datalog and its superset Prolog are notable instances of this idea, and they make the connection between the relational model and predicate logic particularly evident.

mitchtbaum6y ago

It seems these rules refine the data outside the database itself. But they're so tightly integrated between the database and the application that the lines separating them become blurred.

1 more reply

woolcap6y ago

> Relational databases get their name from the fact that relationships can be defined between tables.

Recommended Reading: Database In Depth, by Chris Date.

[1] https://en.wikipedia.org/wiki/Relational_model

[2] https://en.wikipedia.org/wiki/Relation_(database)

reilly30006y ago

edmundsauto6y ago

Mind sharing the course? Sounds useful.

reilly30006y ago

https://www.coursera.org/learn/data-manipulation

tacotime6y ago

here is one: https://lagunita.stanford.edu/courses/DB/RA/SelfPaced/about

pjungwir6y ago

[1] https://gertjans.home.xs4all.nl/usenet/microsoft.public.sqls...

einpoklum6y ago

The document completely overlooks Columnar databases, which are focused on analytics and are much faster than (most, not all) general-purpose DBMSes. See:

https://en.wikipedia.org/wiki/Column-oriented_DBMS

and

https://www.slideshare.net/arangodb/introduction-to-column-o...

or get:

http://www.nowpublishers.com/article/Details/DBS-024

Examples:

* MonetDB

* SAP Hana

* Actian Vector (formerly Vectorwise)

* Oracle In-Memory

ralusek6y ago

kthejoker26y ago

Also Druid, HBase, Vertipaq (engine behind PowerBI), Redshift, Azure SQL DW, etc

Columnar compression is a really interesting engineering problem

nicoburns6y ago

Also notable is the Postgres column store extension.

Not as fancy / performant as the dedicated columns store databases, but it allows you mix and match row-tables with column-tables which is pretty nice.

einpoklum6y ago

Yes it is, and I've written academically about this. For example: https://is.gd/9wwjjf

shrumm6y ago

ClickHouse is another favourite

einpoklum6y ago

Clickhouse is a columnar system, yes, but is not a full-fledged DBMS. Specifically, I don't think it can join tables.

NotSammyHagar6y ago

memsql (which was in there for new sql).

einpoklum6y ago

Eh... not quite. It's in-memory representation is row-based. It seems it uses columnar secondary storage. At least - that's what it says here: https://en.wikipedia.org/wiki/MemSQL

takeda6y ago

They are much faster for analytical queries, not for transactional ones. If they were overall faster they would be the default relational databases.

einpoklum6y ago

Yes, I did say "focused on Analytics" but I didn't clarify that they're only/mostly fast for analytic workloads.

cptnapalm6y ago

I know very, very little about any of this, but would this be akin to entity component systems in video games? Forgive me if I'm way off.

einpoklum6y ago

Sorry, I don't know what "entity component systems" are. If you mean saving the same component for all entities, rather than saving a bag of components for each of the entities, then sort of.

cptnapalm6y ago

Close enough to be helpful. Thank you.

muydeemer6y ago

gibsonf16y ago

planck016y ago

kthejoker26y ago

Stardog, MarkLogic, Virtuoso, AllegroGraph, and RDF4J all have commercial applications, but yeah in general semantic RDF is dying on the vine.

gibsonf16y ago

hmottestad6y ago

Stardog is quite successful.

dehrmann6y ago

Does Top Quadrant have anything that does this?

AlphaWeaver6y ago

Most everything they make is open source and really well designed. Would recommend checking it out!

nikolasburk6y ago

Nikolas from the Prisma team here! Thanks a lot for the endorsement, we're indeed super excited about the current database space and the tooling we see emerging.

For anyone that wants to check out what we're up to, you can find everything you need in this repo: https://github.com/prisma/prisma2

616c6y ago

nikolasburk6y ago

Feel free to reach out to me: burk@prisma.io or @nikolasburk on the Prisma Slack https://slack.prisma.io

SPascareli136y ago

I was looking into Prisma + Apollo, can you expand more on your problems with it? To be it seemed really magical at first, but I don't know how it really works in production.

muydeemer6y ago

Just a quick remark on graph dbs. Titan which is mentioned in the article as an example of a graph db is dead. Its successor is the Janus graph (https://github.com/JanusGraph/janusgraph).

planck016y ago

I am surprised Dgraph isn't mentioned as an example. It is the most starred graph db on Github, and I think it is the best one in terms of performance and scalability.

mdaniel6y ago

Strange that they have to have such a non-standard license, when they go out of their way to mention Apache 2 several times: https://github.com/dgraph-io/dgraph/blob/master/LICENSE.md

Contrast that with Orient, who also have an Enterprise version, and they just straight-up say "Apache 2, no drama" https://github.com/orientechnologies/orientdb/blob/develop/l...

Their client reminds me of the days before ORM: write a big string literal and send it to the server: https://github.com/dgraph-io/dgraph4j#running-a-query

mrjn6y ago

> write a big string literal and send it to the server

[1]: https://graphql.org/ [2]: https://blog.dgraph.io/post/why-google-needed-graph-serving-...

campoy6y ago

Hi there,

Head of Product at Dgraph here. Our license is Apache v2 for most of our source code.

Only our enterprise features are licensed with the Dgraph Community License [DCL](https://github.com/dgraph-io/dgraph/blob/master/licenses/DCL...).

We decided to make all of our code available on GitHub regardless of the license (unlike OrientDB if I understand correctly) so anyone can see what the Enterprise Edition really does.

Let us know if this is confusing. If you don't want the proprietary code, you can always only use the open-source version of Dgraph with `make oss`.

I'm also really interested in seeing where GQL (gql.today) is going, adopting a community-wide standard would be a definite yes from us.

In any case, thanks for your feedback. I'd love to hear more on how we can make our product better, feel free to email me at francesc@dgraph.io.

FridgeSeal6y ago

I believe support for additional query languages is in the roadmap.

mrjn6y ago

(Dgraph author) Particularly embarrassing because I actually know the founders of Prisma ;-). Amazing folks!

They even included YugaByte, with only 2.8K GitHub stars. Dgraph crossed 11K GitHub stars and is in the top 10 Graph DBs on DB Engine now -- what would it take for us to be in the article, Søren?

Just joking. Nice article! Keep up the good work, guys!

imchairmanm6y ago

1 more reply

fitzoh6y ago

There's also Datastax Enterprise Graph (commercial) from the team behind Titan after they were acquired by Datastax.

https://www.datastax.com/products/datastax-graph

https://venturebeat.com/2015/02/03/datastax-acquires-aureliu...

imchairmanm6y ago

Thanks for the heads up. I'll try to incorporate that into the article soon.

tabtab6y ago

https://stackoverflow.com/questions/66385/dynamic-database-s...

Some smaller university out there can make a name for themselves by implementing it. I've been kicking around doing it myself, but I'd have to retire first.

asah6y ago

https://www.google.com/search?q=expression+index+postgres

https://www.postgresql.org/docs/10/ddl-partitioning.html

zawerf6y ago

I am trying this out and I am still on the edge of whether I like it or not.

Create a table with a json column:

  CREATE TABLE Doc (
    id UUID PRIMARY KEY,
    val JSONB NOT NULL
  );

Then later it turns out all documents have user_ids so you add a check constraint and an index:

  ALTER TABLE Doc ADD CONSTRAINT check_doc_val CHECK (
    jsonb_typeof(val)='object' AND
    val ? 'user_id' AND
    jsonb_typeof(val->'user_id')='string'
  );
  CREATE INDEX doc_user_id ON Doc ((val->>'user_id'));

jayd166y ago

At this point, I would just bite the bullet, break out a new user_id column and run a migration to populate the column.

takeda6y ago

If you really want to use uuid and care about performance you might prefix it with something that's increasing like a date, or perhaps (did not try it) use hash index (need to be PG 10+).

1 more reply

tabtab6y ago

It seems to treat "direct" columns different than JSON-ized columns. That bifurcation creates two ways of doing things. Ideally, it should be seamless.

ucarion6y ago

Do you have any resources online for how well this approach works in practice? I've been thinking about doing this in replacing a MongoDB database.

cagmz6y ago

Once we felt that the schema wasn't changing as much, had too many concerns, etc, we made tables and wrote to them (instead of the JSON column).

pas6y ago

Could you please elaborate on why do you want to replace MongoDB?

thunderbong6y ago

Have you seen RedBean ORM? https://www.redbeanphp.com/index.php

tabtab6y ago

I don't really want an automatically generated static RDBMS, I'd like a dynamic RDBMS. For one, one can use it with any programming language, not just PHP.

pas6y ago

You can create indices in MongoDB on JSON, works pretty well.

And since 4.x they have transactions over shards too!

tabtab6y ago

pas6y ago

I find SQL much more awkward than proper language-idiomatic APIs.

bryanlarsen6y ago

kps6y ago

Yes. The origin of the flat-file database is fixed-format unit record equipment¹, predating computers. COBOL is essentially a language designed for processing fixed-format files.

¹ https://en.wikipedia.org/wiki/Unit_record_equipment

imchairmanm6y ago

That's definitely a good point. I'll try to update the article to reflect that soon. Thanks for the feedback!

rainyMammoth6y ago

What about time series databases that are fairly common nowadays ?

manigandham6y ago

Hbase/Bigtable/DynamoDB/Cassandra are key/value. InfluxDB is key/value. Timescale is an extension to Postgres.

jnordwick6y ago

manigandham6y ago

1 more reply

shakkhar6y ago

If you could store timeseries data "in any database", kdb wouldn't be a thing. Just go and ask a quant trader replace his kdb instance with postgres. (Be prepared to be laughed out of the room.)

manigandham6y ago

I'm not sure what your point is. Time-series describes the data, not the database.

1 more reply

neop1x6y ago

nudpiedo6y ago

TheMiller6y ago

edmundsauto6y ago

_Understated_6y ago

I'm curious... what did the author mean by this:

I didn't notice anything that went into any detail about legacy database types.

Any idea what the author means by a "Legacy Database"?

AlphaWeaver6y ago

The article gives some examples of these, it seems they're mostly referring to things we wouldn't consider "databases" like flat files.

tempguy99996y ago

He actually tells you in the article, straight after (flat file, hierarchical...)

_Understated_6y ago

Aww man. I am a dumbass... I never equated that section of things like Network databases and such as legacy.

Dunno how I missed it :(

*Must read slower...

tempguy99996y ago

We've all done it :) No worries!

AtlasBarfed6y ago

Mainframe days

honkycat6y ago

There is a great chapter in "Designing Data Intensive Applications" about this very subject

bryanrasmussen6y ago

Again a renaming that makes what the article is actually about less clear.

victor1066y ago

Can anyone here point to a resource that gives a comprehensive treatment (use cases, pros and cons etc) of all the types (Nosql, NewSQL, relational, timeseries) of databases being used today?

thekhatribharat6y ago

[Shameless Plug] A summary of the evergrowing NoSQL and NewSQL market: https://medium.com/open-factory/nosql-newsql-a-smorgasboard-...

minitoar6y ago

imchairmanm6y ago

Hello, author here. That's a good question and something I had a hard time sorting out as I worked on this.

My colleague shared this article [1] with me, which definitely helped inform how I distinguished between the two in my head.

[1] http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-m...

minitoar6y ago

That makes sense, they really are just relational databases optimized for certain tasks, with corresponding limitations e.g. they don't support arbitrary joins.

barrkel6y ago

2 more replies

intellix6y ago

Skipped through it looking for an answer but didn't see it: where are unions in Prisma?! Was looking for some big reveal about an underlying choice that enables what everyone is begging for

matthewmueller6y ago

I've been mapping out union types at Prisma.

Are you in our Slack? I'm @mattmueller at https://prisma.slack.com.

I'd love to chat with you to better understand your use cases, so we can make sure we're designing it for you.

dehrmann6y ago

The column-family databases mentioned (Cassandra, HBase) are both just fancy key-value stores that add semantics for separate tables and cell-level data so you're not rolling it yourself.

neop1x6y ago

No Elastic among the examples while highly popular and nice :( Great article overall, though!

CMCDragonkai6y ago

There's also column-oriented or array databases like MonetDB and Rasdaman.

zbentley6y ago

And the elephant in the room: Cassandra.

sourcepath6y ago

What happened with Prisma being all about graphql?

nikolasburk6y ago

You can also find examples for the various use cases here: https://github.com/prisma/prisma-examples/tree/prisma2

Please let me know if that clarifies it or if you have more questions! :)

galaxyLogic6y ago

What happened to Object-Oriented Databases?

AtlasBarfed6y ago

Document databases kind of killed them I would specualte. Since JSON serializes with objects so much better than (ugh) XML, the Relational impedence is gone (well, a lot of it).

chrisweekly6y ago

I didn't RTFA, but based on titles alone, isn't "Object Database" missing from the list?

marcosdumay6y ago

Aren't those a special case of hierarchical databases?

marknadal6y ago

What a lovely article!

This makes graphs a superior data structure.

If you think about the math, any document is a trie, and tables are a matrix. Both trees and matrices can be represented as graphs. But not all graphs can be represented as a tree or graph.

This gets even more fun when you get into hypergraphs and bigraphs, which are totally possible with property graph databases where nodes have type!

jt21906y ago

> This makes graphs a superior data structure.

Whether a graph data store or a more specialized tool (e.g. a relational database, etc.) is superior depends (as I'm sure you agree) on context.

danenania6y ago

“It should be emphasized that graph databases can do all other types of databases (relational, document, key/value, etc.)”

Not to knock graph dbs, but isn’t the reverse also true?

mumblemumble6y ago

https://static.googleusercontent.com/media/research.google.c...

gbear06y ago

1 more reply

AndrewBowman6y ago

lmkg6y ago

henryfjordan6y ago

I don't quite get what you mean by "the relational model itself cannot fully represent the graph model". If you can store all the edges of a graph and query them, what can't be done?

It all comes down to use case

danenania6y ago

ithkuil6y ago

(Obviously, the underlying storage layer of a graph db will use some sort of simpler storage layer, usually some kind of key value store)

namelosw6y ago

danenania6y ago

"key-value databases cannot efficiently express joining relation unless load all the data in memory"

1 more reply

cbtx6y ago

There's a difference between being able to do something and doing something well.

kristoff_it6y ago

Definitely not a good description of Redis, even though they cite it as the first example of a Key-Value DB.

nailer6y ago

> Relational vs. Document

takeda6y ago

The adjective relational in a relational database comes from mathematical relations, tuples i.e. data in tables.

It's common misconception that it is from foreign keys.

nailer6y ago

That's interesting - it seems to both be backed by and conflict with a lot of https://en.wikipedia.org/wiki/Relational_model but maybe that's wrong.

Nican6y ago

scottlamb6y ago

> For example, I am usually surprised when people talk about "web-scale", but they do not understand the difference between a "merge-join" and a "hash-join".

jorblumesea6y ago

> Computers, in the end, are just a Turing machine.

> Just pay attention that the query that you are executing is actually doing the optimal solution.

Isn't that the entire reason for articles like this? eg: If you have data with large amounts of relations a no-sql database probably isn't the right approach.

> I wish more time would be spent talking about the underlying algorithms that the different query languages use to accomplish the tasks.

threeseed6y ago

This is non-sensical.

judofyr6y ago

> This is non-sensical.

Your examples are kinda proving yourself wrong though.

> Take a highly nested JSON document and try and implement it in a relational database. You would have an O(1) lookup in MongoDB and a O(n) lookup in MySQL.

In PostgreSQL (which is not MySQL, but it's a relational database) you can create an index for a column which contains JSON and the lookup for a nested field becomes O(1).

> Or good luck traversing a graph structure in MongoDB when you should've used Neo4J.

Facebook implemented the social graph on top of MySQL: https://www.facebook.com/notes/facebook-engineering/tao-the-....

> So in order to have a performant database you do need to "pay attention" and ensure that your access patterns suit the database type.

threeseed6y ago

That's because you took my examples and distorted them.

Nican6y ago

It is not about the database type. It is about the underlying algorithms and data structures.

doh6y ago

But you can really store any data in any database, you just might have to do some extra pre/post-processing and multiple the storage requirements.

[0] http://karras.rutgers.edu/hexastore.pdf

zawerf6y ago

Technical note: big oh isn't a useful measure here. Most databases use b-trees (yes, even mongo) so lookups are at best O(log(n)). That goes for you and the people replying too.

alexchamberlain6y ago

When you say O(1)/O(n), what are you measuring the scale in relation to? Number of documents? Number of keys in the document you're looking for?

roywiggins6y ago

Working in a relational-ish database built on top of a poorly-abstracted B-tree object store is among the top ten worst development hells I've personally endured. It was awful.

manigandham6y ago

The type of database and data model is critical. People trying to make one act like another has led to countless problems.

Everything you said about algorithms and query complexity is part of that foundation, not a replacement for it.

bradly6y ago

> I am really tired of articles that talk about the different types of databases.

I read this article and learned things I didn't know.

paulddraper6y ago

> I am really tired of articles that talk about the different types of databases. People can make a graph databases act like relational databases, and vice-versa.

> I wish more time would be spent talking about the underlying algorithms

Did the article come up short?

Nican6y ago

To give some examples, CockroachDB both has column-families and NoSQL characteristics. A column can be specified to be a JSON column, and it can have an inverted index.

Or MemSQL has both row-based and column-based tables, and they use an unorthodox index called "skip lists".

CockroachDB and MemSQL both have different applications and characteristis, but they are just cluttered under "NewSQL", as if it was just was some kind of "SQL but better".

ska6y ago

> Computers, in the end, are just a Turing machine.

Famous last words :)

Detail, in the end, really matter.

tempWinHater6y ago

Especially when you consider that computers are approximations of Turing machines.

Edit: typo

okram6y ago

I recently gave a talk on mm-ADT at ApacheCon: https://www.slideshare.net/slidarko/mmadt-a-multimodel-abstr...

shostack6y ago

Do you have any articles you'd suggest that in your opinion cover the appropriate underlying guts of them?

Nican6y ago

Unfortunately, this has something that has been on my backburner for a while. I do not know of any articles that fill exactly the need that I am talking about. :(

j / k navigate · click thread line to collapse