We initially wrote the extension as an excuse to play with pgx, the rust framework for writing postgres extensions. That let us lean on existing rust libs for validation (jsonschema), so the extension's implementation is only 10 lines of code :)
https://github.com/supabase/pg_jsonschema/blob/fb7ab09bf6050...
happy to answer any questions!
I'd strongly prefer to have the application layer do generic json-schema validation since you can spawn arbitrary containers to spread the load. Obviously some things are unavoidable if you want to maintain foreign-key constraints or db-level check constraints/etc but people frown on check constraints sometimes as well. Semantic validity should be checked before it gets to the DB.
I was exploring a project with JSON generation views inside the database for coupling the DB directly to SOLR for direct data import, and while it worked fine (and performed fine with toy problems) that was just always my concern... even there where it's not holding write locks/etc, how much harder are you hitting the DB for stuff that, ultimately, can. be done slower but more scalably in an application container?
YAGNI, I know, cross the bridge when it comes, butjust as a blanket architectural concern that's not really where it belongs imo.
In my case at least, probably it's something that could be pushed off to followers in a leader-follower cluster as a kind of read replica, but I dunno if that's how it's implemented or not. "Read replicas" are something that are a lot more fleshed out in Citus, Enterprise, and the other commercial offerings built on raw Postgres iirc.
By centralizing data validation, it removes many potential failure and inconsistency scenarios (w.r.t different services validating things differently).
Worrying about CPU, without seeing if it's a real problem for your use case, is a premature optimization. Similar to worrying about foreign key constraint checks being too expensive. This is rarely the case, but if it winds up being a problem, you can relax it later and move the check it elsewhere in your stack (or remove it entirely).
Data storage and querying. Validation should be done in the application layer for security and scalability.
"Premature optimization" is the rallying cry of people standing in the corner of a room with a painted floor holding a paint brush.
If CHECKs are causing perf issues, then okay maybe put the problematic ones somewhere else.
TIL about https://github.com/mulesoft-labs/json-ld-schema: "JSON Schema/SHACL based validation of JSON-LD documents". Perhaps we could see if that's something that should be done in pgx or in the application versions deployed alongside the database versions?
Toon Koppelaars shows that in a few experiments done with Oracle ( https://www.youtube.com/watch?v=8jiJDflpw4Y&ab_channel=Oracl... )
Also very often doing the work in the DB means a simpler solution, fewer race-condition issues, and less work overall.
When talking about this particular case, the CHECK constraint, it is probably matter of measuring the impact and understanding at which point this might be relevant, and considering the cost of the alternative implementation at the application layer
It's good to be aware of additional strain you put on your DB, but I've worked with lots of applications where scaling simply wasn't a problem, and there I'd prefer validations closer to the data.
Adding a schema to it seems natural to me (as someone who prefers strong-typing), and the JSONSchema implementation seems pretty lightweight and sensible.
I'd go even further down the mining XML's legacy route and add better XPATH2.0 + XQuery - style support. XQuery was a great tool, so expressive. Though it did spawn XQueryX [1] which I'm still unsure whether it was a joke or not.
[1] https://www.oreilly.com/library/view/xquery/0596006349/ch22s...
https://www.postgresql.org/docs/current/datatype-json.html#D...
What is the use case for this versus normal column definitions, if you’re looking to enforce schemas?
Developers sometimes really just want to dump the data as a JSON. For them this means not having to write a lot of boiler plate ORM or SQL templates, and shipping trivial features quickly.
Example, user UI preferences are a good candidate for something like this. You probably don't want to add a new column just to remember the status of checkbox, when the user last logged in.
As a DBA, you probably still want to define a schema for this data, so as to not cause unexpected web app crashes. It ensures the some level of data consistency without increasing the maintenance overhead.
Obviously you wouldn't use it for business critical data, in my opinion.
Honestly, it seems like a middleware layer would suffice in order to stick with more "standard" postgres features. (not to indict this feature, but simply because its more likely a maintainer would understand and have experience)
Imagine you’re writing an editing interface of some sort with images, text blocks etc. Each is conceptually a “node” but have different attributes. A single node table with a column per attribute means redundant columns. A table per node type introduces other problems because you need to query all your nodes together. Basically however you model this you end up with tradeoffs and real apps can obviously get much more complex - next thing you have 50 tables and a complex query just to get the data out.
Contrast to the other extreme - storing the whole thing in a single hierarchical json document. There’s no restriction on the data shape, and you can just pass the whole thing around as json. Versioning becomes much simpler because you’re versioning a single document. Your export format is just the json.
There’s tradeoffs of course, and often a middle ground is the right approach - but json columns with validation definitely have their place.
The schema on that end is pretty intricate but to prevent hitting two services for certain types of data, we just dump it to a json column.
Furthermore, for a personal project of mine to help me with productiving / daily schedules, i'm using a json column for a to-do list in the schema of
{[some_todo_item]: boolean,}
which can't traditionally be represented in pg columns as the to do items are variable.
Totally get why you would want to just save data in whatever format you send it in, that's how I prototype as well. But a regular column has the advantage of familiarity with other devs, not to mention better syntax for querying.
I remember when kicking the tires on postgrest/postgraphile that I found validation and error handling to be one of the less intuitive areas. Not the actual field-level constraints, but how to adapt it to fit a fast-fail vs slow-fail model.
When I had attempted before, the only ergonomic option was fast-fail (the first check constraint violated would bubble the error upward) rather than slow-fail (collect all invalid fields and return the collection of errors, which IME is more common on average web forms or api requests).
Looking at the single code file and tests, I see only singular field errors. Has a more ergonomic approach to validation-error collection been developed other than writing a large function to iterate the new record fieldwise against the schema?
... > Has a more ergonomic approach to validation-error collection been developed other than writing a large function to iterate the new record fieldwise against the schema?
Thats an interesting problem that makes sense in the context of form validation.
re: have the ergonomics improved -> no
but you could hack that behavior together like this:
```
create or replace function handle_errors(errors text[])
returns bool
language plpgsql
as $$begin
if errors = array[]::text[] then
return true;
end if;
raise exception using errcode='22000', message=(select string_agg(x, ', ') from unnest(errors) v(x));
end;$$;
create table account(
id int primary key,
email text,
first_name text
check (
handle_errors(
array[]::text[]
|| case when email is null then 'email must not be empty' else null end
|| case when first_name is null then 'first_name must not be empty' else null end
)
)
);insert into account(id, email, first_name)
values (1, null, null);
-- ERROR: email must not be empty, first_name must not be empty
```
not that I'd recommend it :)
In windmill, https://github.com/windmill-labs/windmill (self-hostable AWS Lambda, OSS AGPLv3) we infer the jsonschema of your script by doing static analysis but so far we were not doing validation of the payload itself, if your script failed because of incorrect payload that was your problem. Now without any additional effort I will be able to add validation and great error reporting "for free".
Supabase is actually a great comparable, complementary and inspiration. Ideally, windmill allow tech teams and startups to focus on their business logic of their internal tools, not the infra to run it.
The postgres-json-schema alternative that's mentioned in the repo also ships with what appears to be a conformance test suite; does this carry the same, or was the focus more on speed?
pg_jsonschema is a (very) thin wrapper around https://docs.rs/jsonschema/latest/jsonschema/ it supports versions 4, 6, and 7
> appears to be a conformance test suite; does this carry the same, or was the focus more on speed?
it was mostly an excuse to play with pgx that went surprisingly well. The performance is very good but safety was our main interest.
There are 9 drafts, at least six of which seem to have support in implementations. (Plus there is a different variation [an “extended subset” of Draft 5] defined and used within OpenAPI 3.0.)
Supported extensions on AWS RDS: https://docs.aws.amazon.com/AmazonRDS/latest/PostgreSQLRelea...
Supported extensions on Google Cloud SQL: https://cloud.google.com/sql/docs/postgres/extensions
Supported extensions on Azure Database: https://docs.microsoft.com/en-us/azure/postgresql/single-ser...
Give it a quick try on any Kubernetes cluster, like k3s on your laptop (one command install), and install any extension from the Web Console or a 1-line in the SGCluster yaml.
Supabase team keep doing what you're doing!
good luck building, and make sure to share the critical feedback with us too. it really helps to improve the product
Will do.
Some suggestion for the next roadmap:
- a Dockerfile ( The dockerfile helps me a lot in trying out new technologies )
- info about the compatibility with new PG15
- CI/CD
I mean I don't think it's just 1 line
- this is just a rust part: https://github.com/rust-lang/docker-rust/blob/master/Dockerf...
- plus "postgresql-server-dev-14"
- ???
Some prior art:
- https://github.com/gavinwahl/postgres-json-schema (mentioned in the repo)
- https://github.com/furstenheim/is_jsonb_valid
pgx[0] is going to be pretty revolutionary for the postgres ecosystem I think -- there is so much functionality that would benefit from happening inside the database and I can't think of a language I want to use at the DB level more than Rust.
I must admit that the whole extension ecosystem in Postgres is amazing, even without any additional language layer.
I don’t write C very often but writing Postgres extensions was so easy to approach, it blew my mind!
First of all, this is an exciting use case, I didn't even anticipate it when started `jsonschema` (it was my excuse to play with Rust). I am extremely pleased to see such a Postgres extension :)
At the moment it supports Drafts 4, 6, and 7 + partially supports Draft 2019-09 and 2020-12. It would be really cool if we can collaborate on finishing support for these partially supported drafts! What do you think?
If you'll have any bug reports on the validation part, feel free to report them to our issue tracker - https://github.com/Stranger6667/jsonschema-rs/issues.
Re: performance - there are a couple of tricks I've been working on, so if anybody is interested in speeding this up, feel free to join here - https://github.com/Stranger6667/jsonschema-rs/pull/373
P.S. As for the "Prior Art" section, I think that https://github.com/jefbarn/pgx_json_schema should be mentioned there, as it is also based on `pgx` and `jsonschema`.
I didn't know about the pgx_json_schema (or is_jsonb_valid) and will add both of them to the README today
We are already running a sync process between firestore and postgres. So we can do aggregations on JSON data. At this point it's only a matter of time before we move to superbase
my day to day to do list varies in the number of tasks, but the completion will always be in boolean
[
{
"task": "do Foo",
"completed": False,
},
{
"task": "do Bar",
"completed": False,
},
{
"task": "do Baz",
"completed": False,
},
...
]
Also, what is the issue of schema validation before inserting into the json column, as this is what I'm doing with a small microservice with Redis.