Really, the big thing that I see here is the transaction retry logic, which (if the writes in your transaction depend only on the reads and not side effects, which you'd need to declare) would be a nice feature for any DB engine, and (if the side effects are packaged together in a safe way with the relevant reads and writes, which can be more general) would also be a nice feature for a high-level DB client library, independent of backend. But that's a pretty thin reed on which to hang a dependency on a proprietary SaaS solution over, say, Postgres.
Fair enough. The goal of the post wasn't necessarily to comprehensively describe the product, so the specifics of what we provide are found elsewhere in our docs, etc.
> Really, the big thing that I see here is the transaction retry logic,
These transactions run as embedded functions inside Convex in a runtime where it is impossible to have side effects.
There's no way to express data-structure knowledge in SQL, only relations and keys. Discoverability can be quite lacking. And every time I want to join table A to B I have to re-define everything, because SQL doesn't store that.

ORMs help. They help because they encode relationships in meaningful ways. A has many Bs, so A.B works, and I don't need to repeat this join logic every damn time. But ORMs have downsides too. Sometimes they generate queries that are really sub-optimal.
I think my favorite ORM usage was with Hibernate. I KNOW!!! THAT DEVIL! But honestly we wrote SQL in hibernate, and then invoked it to populate our data. Yes it was a bit more work than RoR's "order.items.where("price > 40")" but when complex things happened, it was always easier.
I've always looked at nosql as "absolutely, sounds great, how do you represent relationships?"
Relations, keys, and (though you forgot them) constraints express…quite a lot of data structure knowledge.
> And every time I want to join table A to B I have to re-define everything, because SQL doesn’t store that.
SQL stores that if you tell it to; the usual way being CREATE VIEW.
For example, you have a many-to-one relationship between posts and users. Instead of this:

    select *
    from user as u
    join post as p
      on p.user_id = u.user_id;

you could do:

    alter table post
      add constraint fk_user_id foreign key (user_id)
      references user (user_id);

    select *
    from user as u
    left referent join post as p;
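The CREATE VIEW route mentioned above can be sketched with Python's built-in sqlite3 module; the table, column, and view names here are illustrative, not from the thread. The point is that the join condition is written once, in the schema, and every caller reuses it:

```python
import sqlite3

# In-memory database; schema names are made up for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE post (post_id INTEGER PRIMARY KEY,
                       user_id INTEGER REFERENCES user(user_id),
                       title TEXT);
    -- The join is defined once and stored in the schema.
    CREATE VIEW user_posts AS
        SELECT u.name, p.title
        FROM user AS u
        JOIN post AS p ON p.user_id = u.user_id;
""")
conn.execute("INSERT INTO user VALUES (1, 'alice')")
conn.execute("INSERT INTO post VALUES (10, 1, 'hello')")

# Callers query the view; nobody re-types the join condition.
rows = conn.execute("SELECT * FROM user_posts").fetchall()
print(rows)  # [('alice', 'hello')]
```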
Any good SQL auto-completer will also look up the foreign key information and auto-generate the ON clause for you as well. Redgate SQL Prompt (MSSQL only) is one of the best tools out there for this reason.

https://www.postgresql.org/docs/current/queries-table-expres....
No, “using” doesn’t automatically join by foreign key, it joins by explicitly-provided column name which must be identical between the two tables.
> If you didn’t need to specify the column then adding a new foreign key between the tables would make existing queries ambiguous and break backwards compatibility.
Not if you specified with constraint name rather than the column name (which also works for multi-column foreign keys without having to reiterate all the columns, being much more DRY than current SQL USING.)
Better, with that approach you could also allow fully implicit joins, using the foreign key constraint name as if it were a row-valued field in the referencing table.
E.g.:
    CREATE TABLE employees (
      id BIGSERIAL PRIMARY KEY,
      name VARCHAR,
      manager_id BIGINT CONSTRAINT manager REFERENCES employees
    );

would let you do:

    SELECT name as employee_name, manager.name as manager_name
    FROM employees

as syntax sugar for:

    SELECT ee.name as employee_name, mgr.name as manager_name
    FROM employees ee
    LEFT JOIN employees mgr ON (
      ee.manager_id = mgr.id
    )

Meanwhile, the obvious natural way to make joins isn't available.
A richer type system would help there, but to my knowledge this isn't offered by any major relational DBMS.
Postgres allows you to define custom types at least; making some enums or composites might give you some measure of sanity?
The Ada programming language is known for encouraging use of specialised numerical types, where the range can be expressed and automatically checked at runtime, and the compiler can help protect against nonsensical comparisons or arithmetic. At a glance I think Postgres range types offer something similar. Presumably strong typing could also be achieved with Postgres composite types (of just one member).
It would be useful to have such checks so that a nonsensical comparison like WHERE person1.height = person2.net_worth would result in an error. Presumably it could also be helpful for autocompletion.
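A rough, database-free analogue of those Ada-style distinct numeric types can be sketched in Python; the `Height` and `NetWorth` names are hypothetical, mirroring the person1.height = person2.net_worth example. Each wrapper only compares against its own kind, so a nonsensical comparison raises instead of silently evaluating:

```python
# Ada-style distinct numeric types, sketched at runtime: comparing a
# Height to a NetWorth is an error, not a meaningless False.

class TypedValue:
    def __init__(self, value: float):
        self.value = float(value)

    def __eq__(self, other):
        if type(other) is not type(self):
            raise TypeError(
                f"cannot compare {type(self).__name__} "
                f"with {type(other).__name__}"
            )
        return self.value == other.value

class Height(TypedValue): pass
class NetWorth(TypedValue): pass

assert Height(180) == Height(180)   # same kind: fine
try:
    Height(180) == NetWorth(180)    # mixed kinds: rejected
except TypeError as e:
    print(e)  # cannot compare Height with NetWorth
```

In Postgres the closest built-in tools would be domains or one-member composite types, as the comment suggests.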
Wow.
And now both of us are back on non-SQL datastores: Convex for him, and Nomad's combination of Raft+MemDB for me. While Convex sounds influenced by transactional memory (define your critical section and let the platform keep trying to apply it until it succeeds), Nomad opted for Raft's serialization of mutations. On the read side we both opted for explicit index use, so it seems fair to assume both of us feel like SQL's decoupling of indexes from queries does more harm than good. Query planners are great until they're not.
I would love a SQL (well SELECT) interface to Nomad's datastore (MemDB). SELECT is like the C ABI for data instead of code: imperfect, but a lowest common denominator that if implemented unlocks a ton of workflows.
I wonder if jamwt feels the same and both projects will eventually grow a SQL adapter (or at least a SQL-like DSL... Nomad is close with bexpr filtering on a number of APIs, but JOINs would be really useful).
SQL is the C ABI of querying for sure. BI tools will never adapt to use Convex directly, and nor should they.
So... yes, Convex actually had a prototype SQL adapter for the read side of things back in the early few months when we were figuring things out. Convex's read semantics are very compatible with SQL.
We've kept this adapter on ice in part because of point #3 in the article -- we don't want to overpromise things which are a bad idea.
Meaning, if we exposed SQL on the thing as-is, this would presumably be for more analytical type queries involving patterns normal Convex queries can't express. Right now that would be a Bad Idea because your website would slow down just like every other database system allows you to.
So the current recommended practice is to use our Airbyte Egress connector (https://airbyte.com/connectors/convex), get yourself into an offline ClickHouse/MySQL/Snowflake/whatever, and jam SQL over there to your heart's content. That's basically what we do.
We may one day abstract this away by embedding some sort of mirrored column store sql thing (maybe DuckDB based? who knows) so you can do your analytical work without impact on your OLTP or integrating 3 more systems. But the team hasn't invested in that yet and probably won't for some time.
> Convex actually had a prototype SQL adapter
Ha, this sounds like the heaps of scripts we Nomad developers have lying around for converting ad hoc bits of Nomad state to sqlite for debugging particularly stateful problems.
> So the current recommended practice is use our Airbyte Egress connector
Ah this is where we would like to end up with Nomad's event stream API: https://developer.hashicorp.com/nomad/api-docs/events#event-...
Sadly we've been unable to identify that First Sink to implement and open source. It's a lot of work to build and maintain production quality connectors, so it would be nice to do just one like your Airbyte Egress. This is an area where Nomad's onprem nature makes it hard to pick one solution that will work for a wide number of users. Someday!
Haskell: IO -> STM -> (pure)
Convex: (The browser / Convex Actions / other effectful environments) -> Mutations -> queries
All the same benefits re: retry and memoization.
Steal, steal from Haskell; so many great ideas there.
It may actually be fine to just require everything to be typescript, but the idea that you'd require your application be written in the same language as your data store, and thus implement a different data store for each language you might want to write an app in (and not share between languages)... would formerly be thought of as pretty ridiculous?
But maybe it's not anymore?
But that seems to be the requirement for "the application and database types are automatically equivalent because the entire data pipeline from frontend to database uses the same exact types and definitions," yes?
In general, most of the purported benefits of convex seem to effectively require the db in the same language as the app, to make it all so seamless.
I guess we'll see if Convex can get popular enough to spawn lots of "what I wish I knew about convex a year ago" and "Goodbye Convex, hello Postgresql" posts...
> It may actually be fine to just require everything to be typescript, but the idea that you'd require your application be written in the same language as your data store, and thus implement a different data store for each language you might want to write an app in (and not share between languages)... would formerly be thought of as pretty ridiculous?
The vision is definitely aspirational, and reflects the fact that JS/TS is becoming overwhelmingly the most popular language for projects. With things like React Native, for many teams even mobile apps are taken care of in one language.
There will obviously always be apps written in other languages, and in those cases, a bit more type mapping will be necessary, because (for now) Convex functions are always side-effect free, automatically cached JS/TS running in a very particular runtime. But we'll work on making that translation as smooth as possible in our client libraries.
We have a Python client library out now ( https://pypi.org/project/convex/ ) and a Rust one coming soon, just to keep us honest in this respect.
> I guess we'll see if Convex can get popular enough to spawn lots of "what I wish I knew about convex a year ago" and "Goodbye Convex, hello Postgresql" posts...
Well, that's the dream, isn't it?
What about "all write operations on the database are done in embedded functions"? For the python and rust client libraries, do you write these embedded functions in python or rust, or typescript/JS? Do you effectively have to write a specialized update procedure in typescript for every update you want to do? This starts seeming harder to spin as an advantage if you aren't typescript in the first place. ("A custom stored procedure required for every update" is actually not something anyone's been looking for!)
The OP arguments seem to me to be written assuming you are writing typescript in your app, it's unclear to me how the advantages you claim translate if you are not -- such that I figured the product was actually only for typescript apps! It would be interesting to see a rewrite of it that explicitly targets those writing in, say, python or rust.
But no, not necessarily exclusively TypeScript. For example, the existence of the Python client library is due to developer demand. Some users had ML jobs that are triggered by Convex applications or reported outcomes into Convex.
The embedded functions are always TypeScript. The experience is pretty smooth in other languages as well, but the degree of this largely depends on how easily the type systems map from e.g. Python to TypeScript, and on how well our client library can infer things in a way that feels native or requires little of your involvement.
Underlying vision or assumption that we’ll all be using JavaScript/TypeScript in any case: Freaking really??
My limited experience with LINQ to SQL was very positive in these regards - but one is not writing a type safe SQL at all, instead a language that is type safe and is ultimately interpreted as SQL.
Same experience with some libraries in Scala - where it's possible to get compile-time type safety for SQL - but required onerous setup or synchronization/code generation tools to achieve. (Sort of comes with the territory.)
Have other implementations done anything for making queries easier (that don't involve requiring an IDE or anything, so just query language innovations)?
Edit: Not query language innovations but ClickHouse and Snowflake's "standard library" of builtin functions just keep getting better and better. Once I saw `parseDateTimeBestEffort` in ClickHouse I wanted that everywhere...
Another interesting bit of work in this space is LINQ on the Microsoft side of things. And even list comprehensions end up having a scanning/summing-language kind of feel to them that would be interesting to see translated into a new database query language.
This is a false statement. Both INSERT and UPDATE support JOINs and subqueries / CTEs. At least according to the standard - not every engine implementing them is another story.
Which don't? I'd have assumed anything inside of `SELECT`'s `FROM` would be allowed inside of `INSERT` and `UPDATE`.
Or maybe you're not saying you know there are implementations that have these restrictions just that any random implementation might not be there (yet).
Redshift (a PostgreSQL derivative, more or less) can run an INSERT with a join, but cannot do an UPDATE with a join.
Also, regarding the transaction functions, are those run locally, or are they serialized and run remotely on the database server? Both have their caveats.
Number 1 (and that approach to retries) already works just fine with Postgres, you just need a proper library (which wraps your function and retries on transaction isolation errors). But you also have to keep in mind that if you interface with 3rd party services during your transactions, those calls will need to be idempotent as well. This is actually the proper way to do transactions in application code with a SQL database, esp. if you're running with serializable transaction isolation.
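A minimal sketch of that wrap-and-retry pattern, independent of any particular driver: `SerializationError` here stands in for the driver's real serialization-failure exception (Postgres reports these as SQLSTATE 40001), and all names are hypothetical.

```python
import random
import time

class SerializationError(Exception):
    """Stand-in for a driver's serialization-failure error (SQLSTATE 40001)."""

def run_in_transaction(txn_body, attempts=5, base_delay=0.01):
    """Re-run txn_body until it commits or attempts are exhausted.

    txn_body must be free of external side effects (or make them
    idempotent), since it may execute more than once.
    """
    for attempt in range(attempts):
        try:
            return txn_body()
        except SerializationError:
            if attempt == attempts - 1:
                raise
            # Exponential backoff with jitter before retrying.
            time.sleep(base_delay * (2 ** attempt) * random.random())

# Demo: a body that conflicts twice before succeeding.
failures = iter([True, True, False])
def body():
    if next(failures):
        raise SerializationError()
    return "committed"

print(run_in_transaction(body))  # committed
```

A real library would open and roll back the database transaction around each attempt; the retry loop is the part that carries over unchanged.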
Number 2 is very limiting, as you can't have all those third party services and libraries used between different operations in a transaction, which is often (I'd even argue - usually) very useful.
Since you're citing stored procedures as a viable alternative to convex's take, while not listing the above number 1, it sounds like you're doing 2.
Anyhow, good luck!
SELECT preceding FROM is such a thorn in the side. :(
    SELECT FROM sources ... WHERE / ORDER / ETC ... DATA fields ...
What if DATA was a reserved keyword for column identifiers and other data fields that would normally be immediately after SELECT so they could appear anywhere in the query syntax?
Can you imagine how much time and effort that would save people?
Edit: You do need commas in ORDER BYs, that slipped my mind when typing out this pet peeve of mine.
    select
      1
      ,field1
      ,field2
      ,field3
    from ...

You've hit peak SQL-engineer comma-frustration workaround.

> Can you imagine how much time and effort that would save people?
I can imagine how much extra time I'd spend typing "AS" in queries, which would be necessary for column aliases to be distinct from new select items. And how much more time I'd spend reading unreadable queries in contexts like logs where they aren't pretty-printed. Net, it seems to be a big loss.
> We don’t need commas in between joins.
That’s because each join is introduced by a string of one or more reserved words (including JOIN).
If each select field was prefaced with FIELD you wouldn't need commas, but...
Also, I seem to recall that very old versions of Oracle did require commas between joins. (The join conditions had to be stated in the WHERE clause back then, instead of supporting the ON clause, iirc).
There are indeed times when a simple key-value store is appropriate, but the encoding of hierarchy and the relational calculus are useful for so many more use cases in business applications. SQL as a language is certainly flawed - it would be nice to be able to algebraically/programmatically join select clauses and conditions, for instance, or to perform multi-inserts - but the semantics of the storage and engine are based on some pretty mathematically sound foundations.
You may find the seminal paper by Codd as illuminating as I did if you read it.
I think any SaaS APIs that reach any degree of widespread adoption eventually get OSS implementations (eg S3's API is widely supported by FOSS and proprietary implementations). This evolutionary path is awfully similar to SQL's for better and/or worse.
Agree this is a necessary part of the change. Convex is working on our OSS strategy this year. Thanks for the feedback!
It's not new "in the world", but often new to our users, who understand it less intuitively than pessimistic locking.
Unless they're using Haskell or probably other very exotic environments with strict understanding of mutability and so on, they know Mutexes but not STM or other OCC patterns. So we lean heavily on explaining it. Not because it is original, but because it is unfamiliar.
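For readers in the mutex camp, the optimistic pattern can be sketched without any database at all (every name below is hypothetical): instead of locking, read a version, do the work, and commit only if nobody else advanced the version in the meantime; otherwise re-run from the top.

```python
class VersionedCell:
    """A single value guarded by a version counter instead of a lock."""
    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        return self.value, self.version

    def try_commit(self, new_value, expected_version):
        # In a real store this compare-and-set would be atomic.
        if self.version != expected_version:
            return False  # conflict: someone else committed first
        self.value = new_value
        self.version += 1
        return True

def occ_update(cell, fn):
    # Optimistic loop: compute off a snapshot, retry on conflict.
    while True:
        value, version = cell.read()
        if cell.try_commit(fn(value), version):
            return cell.value

cell = VersionedCell(10)
print(occ_update(cell, lambda v: v + 1))  # 11
```

No waiting on a lock in the common case; contention shows up as retries instead of blocking.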
The elephant in the room is, I suppose, that the modern internet literally would not be possible without NoSQL. It may be possible without SQL; that seems likely to me. Part of that is because NoSQL is a big umbrella, and covers extremely critical databases like Redis or even databases like Cloudflare's proprietary edge cache. But, even document stores are extremely critical to enterprise scaling; during Prime Day 2022, DynamoDB peaked at 150M rps. There's no SQL setup on the planet that could handle volume like that while still maintaining all the things that Make It SQL; you could start throwing read replicas and removing joins and cross-table lookups and sharding data and ope, you just re-invented a document store.
Here are a couple of conclusions I've started operating by:

1. Document stores are, today, a strong choice at both the low and high ends of the systems-scaling spectrum. They're great at low scales because you can avoid thinking about them too much. They're great at high scales because once you have the space to think about it you can attain substantially higher efficiency (performance+cost).
2. Making a document store operate more like SQL, for the things SQL is good at (joins, validation, etc) is a lot easier than making a SQL database operate like a document store for the things document stores are good at (Planetscale? there's a couple players in this game).
3. SQL-the-language sucks. There I said it; I'll die on that hill. The language was invented at a time +/- 2 years of Scheme, ML, Prolog, and Smalltalk. Our industry has rejected all of those. We haven't rejected SQL (yet). SQL is demonstrably, as a syntax, just as bad as those languages, evidenced by all the ORMs, injection attacks, etc. Databases tend to have a longer shelf life than programming languages, but SQL-the-language will die.
4. But, duh, SQL is fine. Go ahead and use it. In some situations it makes sense. In others it doesn't. It's a tool; one that has had 60 years to form around solving as many problems as possible.
[1] https://aws.amazon.com/blogs/aws/amazon-prime-day-2022-aws-f...
For a lot of NoSQL document stores, the simple act of having multiple indexes was a bridge too far. Couchbase comes to mind here.
I can't think of many use cases for a document store over PostgreSQL. Maybe as a cache layer... but I'd just use a materialized view. Where some of the data structure was going to be unknown or user defined, such as a system monitoring tool? Server Density was one of the first big MongoDB adopters for this reason. Now we can store that in a JSONB column, though.
> 3. SQL-the-language sucks.
You're entitled to your opinion. I've been doing this 20 years and basic SQL knowledge has been by far the most valuable and portable skill of my career. I'm by no means an expert, but ask developers who have been in the field for 5+ years what a HAVING clause does and they have no idea. There's a huge gap in basic SQL knowledge, stuff that can be learned in < 1 week. I taught somebody who came out of a code boot camp basic SQL for a couple of weeks during an internship. Her first job made her acting DBA because she was the only person who knew SQL.
The hill I'll die on is that ORMs and frameworks are creating an astounding lack of basic SQL knowledge in today's developers, which leads to a lot of completely unwarranted griping about SQL. (not directed at you, just in general on this topic)
It's the same argument, just at a grander scale, as the one the generation before us made about the COBOL mainframe programmers of old; database technologies just tend to move slower because data is very sticky.
You might not see the use-cases; but Apple does. Their usage of Cassandra sustains in the millions of QPS across over 300,000 instances. That's a scale SQL can't even begin to dream of. Discord does; they moved to Scylla [2]. Amazon does [3]. Blackrock Aladdin depends on Cassandra, and manages over $21T in assets [4].
Which kind of circles around to the fact that, while you say "a lack of SQL knowledge is what causes unwarranted griping about SQL", sure, I can accept that, but I'd conversely argue that a lack of Solving Real Problems is what causes unwarranted griping about NoSQL. Schematization isn't a real problem; we have a dozen ways of solving that at different layers. Imperfect ACID compliance isn't a real problem; it's a pseudo-academic expression of the guarantees that SQL offers, which then gets unfairly transplanted onto NoSQL databases as if those issues can't or wouldn't be addressed elsewhere. The same goes for original-article statements like data being inherently relational; data isn't anything, it's just data. It only appears relational because that's how we've been trained to think about data after 60 years of SQL monoculture. If you start thinking about data end-to-end, how it's displayed to users, on pages, components, screens, feeds, it really doesn't look relational at all. Life isn't a graph; the graph is an expression of human pattern-matching on things that could reasonably be related but in reality aren't. Life is a series of snapshots: this moment, this page, this feed, this component, in this moment, for this request, for this session. That is the end-to-end view of data; documents, not a web.
But SQL is fine at that! In fact, it's the best option at medium scales, the middle of the bell curve where 80% of people live. It may not always be, but it's there today. It's that bottom 10% and top 10% where the solutions that work for most don't work as well; and I think NoSQL gets a lot of Majority Hate despite being an extremely strong solution at those scales. Know your tools, and be willing to replace them when the circumstances change.
[1] https://twitter.com/erickramirezau/status/157806381149547724...
[2] https://www.scylladb.com/press-release/discord-chooses-scyll...
[3] https://aws.amazon.com/blogs/aws/amazon-prime-day-2022-aws-f...
Let's talk real use cases for a second.
All of the performance benefits of NoSQL come from key-value retrieval and the ability to shard data due to the built-in lack of joins.

There's nothing stopping anyone from doing that in a relational database. It's a common pattern. CitusDB even made a PostgreSQL extension that makes this model even easier to use, with joins and all the other relational goodies. If you want to set up PostgreSQL with an id field and a JSONB field, you've got the entire NoSQL scaling use case covered without all of the limitations of using a NoSQL-only database.
The schema at different layers thing doesn't _really_ work either. If you put the schema in your application code, then you mandate an API layer in addition to your database layer. With schema enforced in the database, multiple code bases can connect to the same database. With the schema in the application layer, now everything has to connect to that application. So not only are you working with the NoSQL database AND defining the schema in the application but now you also have to define an API if you have a reason to split out some portion of application logic to a more specialized languages.
The layers of problems introduced because NoSQL is used on a project are excessive. There's no question that storing document data makes a lot of sense in some circumstances, but you're always better off just storing the portion you need in an appropriate column type (JSONB) rather than forcing your entire system into the issues that come with NoSQL only.
Given the JSON capabilities of modern SQL databases there's almost no reason to start any project with a NoSQL only option.
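A minimal sketch of that id-plus-JSON pattern, using sqlite3's JSON functions as a stand-in for Postgres JSONB (this assumes the bundled SQLite ships with the JSON1 functions, as modern builds do; the table and fields are made up for the demo). You get key-value-style storage, but the document fields stay queryable with plain SQL:

```python
import sqlite3

# One id column, one JSON document column: the "NoSQL" shape,
# inside a relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute(
    "INSERT INTO docs VALUES (1, ?)", ('{"kind": "order", "total": 99.5}',)
)
conn.execute(
    "INSERT INTO docs VALUES (2, ?)", ('{"kind": "refund", "total": 10.0}',)
)

# Filter and project on fields inside the document.
rows = conn.execute(
    "SELECT id, json_extract(body, '$.total') FROM docs "
    "WHERE json_extract(body, '$.kind') = 'order'"
).fetchall()
print(rows)  # [(1, 99.5)]
```

In Postgres the equivalent would be a JSONB column with `->>` operators and, unlike this sketch, an expression index on the extracted field.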
In the prime day article for example, they're singing the praises of their own SQL offerings too.
> Amazon Aurora – On Prime Day, 5,326 database instances running the PostgreSQL-compatible and MySQL-compatible editions of Amazon Aurora processed 288 billion transactions, stored 1,849 terabytes of data, and transferred 749 terabytes of data.
> Amazon DynamoDB – DynamoDB powers multiple high-traffic Amazon properties and systems including Alexa, the Amazon.com sites, and all Amazon fulfillment centers. Over the course of Prime Day, these sources made trillions of calls to the DynamoDB API. DynamoDB maintained high availability while delivering single-digit millisecond responses and peaking at 105.2 million requests per second.
They don't give an apples-to-apples numbers comparison here. We don't have a cost comparison of infrastructure for DynamoDB vs Aurora. We don't get to see whether the "288 billion transactions" are database writes, while for DynamoDB we only see requests. We don't get to see whether there have to be so many more requests to Dynamo because you can't get the data as easily in a single query. We don't get to see if it's being used as a cache layer or for session management, which are both great use cases for it.
I mean, Instagram scaled just fine with only PostgreSQL. SQL scales just fine. Some of the bevy of options and tools that you get to work with the data don't scale as well as others, but that's not a case for getting rid of those tools...it's just a case for caching when needed.
It's amazing how it all works. But only when it works. I've found it so flakey I don't use it for production but if the bugs were ironed out it would be amazing.
> We paused just long enough to take a sip of our Spicy Maya Mocha from Coupa Cafe.
To be honest, I think almost all drinks at Coupa Cafe are pretty bad.

If we are begging DB engineers for things, can we get a graph layer for accessing SQL tables please?
    BEGIN;
    SELECT post_count, ... FROM users WHERE ... FOR UPDATE;
    INSERT INTO posts VALUES(...);
    UPDATE users SET post_count = new_post_count WHERE ...;
    COMMIT;
For one, it's unlikely the number of posts per user is so important and so often requested that it needs to be cached in the database itself.

Secondly, why would that value need to be stored with each new insert? Simply insert posts as they come, and calculate metadata about posts at a later time as a batch process, or when some part of the application actually requests them.
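The derive-it-on-demand alternative is a single GROUP BY, sketched here with sqlite3 (table and column names are illustrative); no FOR UPDATE, no denormalized counter to keep in sync:

```python
import sqlite3

# Don't maintain post_count on every insert; derive it on demand
# (or in a batch job) with an aggregate query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER)")
conn.executemany(
    "INSERT INTO posts (user_id) VALUES (?)", [(1,), (1,), (2,)]
)

counts = conn.execute(
    "SELECT user_id, COUNT(*) FROM posts GROUP BY user_id ORDER BY user_id"
).fetchall()
print(counts)  # [(1, 2), (2, 1)]
```

If the count really is hot, a periodically refreshed materialized view (in engines that have one) gives the cached value without touching the insert path.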