How do you enforce tenant isolation with that method, or prevent unbounded table reads?
We do something similar for our backoffice - just with the difference that it's Claude that has full freedom to write queries.
What are other limitations and mitigations folks have used or encountered to support stability and security? Things like
- Query timeouts to prevent noisy neighbors
- connection pooling (e.g. pgbouncer) also for noisy neighbors
- client schema compatibility (e.g. some applications running older versions have certain assumptions about the schema that may change over time)

You can also limit it by creating read-only replicas and making SELECTs happen on the replica. We don't usually bother; since 99% of our users are employees of ours, we can teach them not to be stupid. Since their usage doesn't change much over time, we can usually just hand them a SQL query and say: here, run this instead.
Most of our employees don't even know they have SQL access, it's not like we force people to learn SQL to get their job done. Because of RLS and views, the ones that do SQL don't have to know much SQL, even if they do happen to use it. SELECT * from employees; gets them access to basically all the employee info they could want, but only to the employees they have access to. If you are a manager with 10 people, your select returns only your 10 people.
The payroll staff runs the same query and gets all of the employees they handle payroll for. Since our payroll is done inside of PostgreSQL (thanks plpython[1]), we can do some crazy access control stuff that most systems would never even dream about. Whenever new auditors come in and see that our payroll staff is limited to seeing only the info they need to do payroll, and only for the subset of employees they actually pay, they are awestruck.
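Roughly what that looks like as a sketch (table, column, and role names are invented for illustration, not our real schema):

```sql
-- Reporting lines and payroll assignments live in their own tables so the
-- policies don't have to query employees recursively.
ALTER TABLE employees ENABLE ROW LEVEL SECURITY;

-- Managers only see their direct reports.
CREATE POLICY manager_sees_reports ON employees
  FOR SELECT TO manager_role
  USING (id IN (SELECT report_id FROM org_chart
                WHERE manager_login = current_user));

-- Payroll staff only see the employees they are assigned to pay.
CREATE POLICY payroll_sees_assigned ON employees
  FOR SELECT TO payroll_role
  USING (id IN (SELECT employee_id FROM payroll_assignments
                WHERE payroll_login = current_user));
```

With policies like these, the same `SELECT * FROM employees;` returns a different slice of rows depending on who runs it.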
The random vendors that can't be taught usually get handed a nightly SQLite dump instead, i.e. we let them pay the CPU cost of their crappy SQL.
Around client schema compatibility: this happens with other models too (API, etc). It's not unique to PG or SQL databases. You have to plan for it. Since almost all of our users interact with views and not with the actual underlying tables, it's not usually that big of a deal. In the extreme cases, where we can't just keep a view around for them, we have to help them along (sometimes kicking and screaming) into a new version.
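As a sketch of how a view absorbs a schema change for older clients (hypothetical names, not the actual schema):

```sql
-- The base table renamed phone to phone_number; the view keeps the old
-- contract alive so legacy clients don't notice.
CREATE OR REPLACE VIEW employees_v1 AS
SELECT id,
       full_name,
       phone_number AS phone   -- old clients still see "phone"
FROM employees;

GRANT SELECT ON employees_v1 TO legacy_app_role;
```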
0: https://www.postgresql.org/docs/current/runtime-config-clien...
Every user gets their own role in PG, so the rest of the PG access control system is also used.
We have your normal SSO system (Azure), and if a Tootie employee doesn't need access to Asset Control, they don't get any access to the asset schema, for instance.
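The role setup is just ordinary PostgreSQL DDL/DCL; a simplified sketch with invented names (per-role settings are also where statement_timeout covers the query-timeout concern upthread):

```sql
-- One login role per person, provisioned from the SSO directory.
CREATE ROLE jdoe LOGIN;
GRANT manager_role TO jdoe;     -- group membership drives what they can see

-- Access is granted per schema; no grant on the asset schema means no access at all.
GRANT USAGE ON SCHEMA hr TO manager_role;
GRANT SELECT ON ALL TABLES IN SCHEMA hr TO manager_role;

-- Per-role settings handle the noisy-neighbor / timeout concerns.
ALTER ROLE jdoe SET statement_timeout = '30s';
```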
What would be your method?
You would have some app that your dev team runs that handles access control, so your app gets unrestricted access to the DB. Now your app is the single boundary, and it forces everyone to go through it. How is that better? It also complicates your queries with a ton of extra WHERE conditions.
A bunch of bespoke access control code you hope is reliable, or a feature of the database that's well tested and has been around for a long time? pgtap[0] is amazing for ensuring our access control (and the rest of the DB) works.
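A rough idea of what a pgTAP check for the RLS behaviour can look like (assumed names, not our actual test suite):

```sql
BEGIN;
SELECT plan(2);

-- Impersonate a manager and check the policy only returns their reports.
SET LOCAL ROLE jdoe;
SELECT is(
  (SELECT count(*) FROM employees)::int,
  10,
  'manager only sees their own 10 direct reports'
);

-- Writes should be rejected outright for this role.
SELECT throws_ok(
  'DELETE FROM employees',
  '42501',
  'permission denied for table employees',
  'manager role cannot delete employee rows'
);

SELECT * FROM finish();
ROLLBACK;
```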
If some random utility wants to access data, you either have to do something special access-wise, or have them also go through your app (let's hope you have an API and it allows for whatever that special thing is). For us, that random utility gets SQL access just like everyone else. They get RLS applied, etc. They can be naive and assume they have total control, because when they do select * from employees; they get access to only the employee columns and rows we want that utility to have.
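One way to get that "naive select * just works" behaviour is a per-tool schema with a narrowed view that shadows the table name (illustrative sketch; the security_invoker option needs PostgreSQL 15+ so the base table's RLS applies to the tool's role):

```sql
CREATE SCHEMA util;

CREATE VIEW util.employees
  WITH (security_invoker = true)        -- base-table RLS still filters the rows
  AS SELECT id, full_name, department   -- only the columns this tool should see
     FROM public.employees;

GRANT USAGE ON SCHEMA util TO util_role;
GRANT SELECT ON util.employees TO util_role;
ALTER ROLE util_role SET search_path = util, public;
-- The tool's "select * from employees;" now resolves to the narrowed view.
```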
We have accumulated a bunch of tools over the decades that need access to various bits of our data for reason(s). Rather than make them all do wacky stuff with specialized APIs, they just get bog-standard PG SQL. We don't have to train vendor Tito how to deal with our stuff; we just hand them their PG auth info and they can go to town. When people want Excel spreadsheets, they just launch Excel, do a data query, and their data shows up magically, all from within Excel, using the standard Excel data query tools, no SQL needed.
We trust that Amazon or Google or Microsoft are successful in protecting customer data, for example. We trust that when you log into your bank account, the money you see is yours, and when you deposit it, we trust that the money goes into your account. But it's all just mostly logical separation.
I didn't use the in-browser WASM, but I did expose an API endpoint that passed data exploration queries directly to the backend, like a knock-off of what New Relic does. I also use that same endpoint for all the graphs and metrics in the UI. I just filtered out the write/delete statements in a rudimentary way.
DuckDB is phenomenal tech and I love to use it with data ponds instead of data lakes, although it is very capable of large sets as well.
And "data pond"? Glad I am not alone using this term! Somewhere between a data lake and warehouse - still unstructured but not _everything_ in one place. For instance, if I have a multi-tenant app I might choose to have a duckdb setup for each customer with pre-filtered data living alongside some global unstructured data.
Maybe there's already a term that covers this but I like the imagery of the metaphor... "smaller, multiple data but same idea as the big one".
We're about to introduce alerts where users can write their own TRQL queries and then define alerts from them, which requires evaluating them regularly, so effectively the data needs to be continuously up to date.
Quadrillions? Yeah, go find yourself a Trino/Spark pipeline.
> Why call it DuckDB?
> Ducks are amazing animals. They can fly, walk and swim. They can also live off pretty much everything. They are quite resilient to environmental challenges. A duck's song will bring people back from the dead and inspires database research. They are thus the perfect mascot for a versatile and resilient data management system.
Just to clarify, the data is prepared when the user (agent) analytics session starts. Right now it takes 5-10s, which means it's typically ready well before the agent has actually determined it needs to run any queries. I think for larger volumes, pg_duckdb would allow this to scale to tens of millions of rows pretty efficiently.
Reason 4 is probably an improvement, but it could probably be done with CH functions.
The problem with custom DSLs like this is that they trade off a massive ecosystem for very little benefit.
Agreed with the ecosystem cons getting much heavier as you move outside the product surface area.
First, I need to learn a new (even if easy & familiar) language; second, I need to be aware of what's proprietary & locks me into the vendor platform. I'd suspect they see the second as a benefit they get IF they can convince people to accept the first.
The main advantages of a DSL are that you can expose a nicer interface to users (table names, columns, virtual columns, automatic joins, query optimization).
We very intentionally kept the syntax as close to regular ClickHouse as possible but added some functions.
This sounds solvable with clickhouse views?
> automatic joins
Is this also not solvable with views? Also, clickhouse heavily discourages joins so I wonder how often this winds up being beneficial? For us, we only ever join against tenant metadata (i.e. resolving ID to name)
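For that ID-to-name case specifically, a plain view (or a dictionary lookup) seems to cover it; a sketch with made-up table names:

```sql
-- Hypothetical tables: events(tenant_id, ts, event) and tenants(id, name).
CREATE VIEW events_with_tenant AS
SELECT
    e.tenant_id,
    e.ts,
    e.event,
    t.name AS tenant_name
FROM events AS e
LEFT JOIN tenants AS t ON t.id = e.tenant_id;

-- The more ClickHouse-y alternative is a dictionary lookup instead of a join:
--   dictGet('tenants_dict', 'name', tenant_id) AS tenant_name
```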
> query optimization
This sounds potentially interesting - clickhouse's query optimizer is not great IME, but it's definitely getting better
Row-level access control, resource quotas, scheduling policies, session settings, etc. could all have been used in concert to achieve a very similar outcome with a dozen or so DDL/DCL statements.
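Roughly the kind of statements meant here, using stock ClickHouse DDL/DCL (names invented for illustration):

```sql
CREATE ROLE tenant_42_role;

-- Row-level access control: this role only ever sees tenant 42's rows.
CREATE ROW POLICY tenant_42_only ON analytics.events
    FOR SELECT USING tenant_id = 42 TO tenant_42_role;

-- Resource quotas: cap what the role can burn per hour.
CREATE QUOTA tenant_42_quota
    FOR INTERVAL 1 hour MAX queries = 1000, read_rows = 1000000000
    TO tenant_42_role;

-- Session settings: timeouts, memory limits, and read-only enforcement.
CREATE SETTINGS PROFILE tenant_42_limits
    SETTINGS max_execution_time = 30, max_memory_usage = 10000000000, readonly = 1
    TO tenant_42_role;

GRANT SELECT ON analytics.events TO tenant_42_role;
```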
The DSL approach has other advantages too, like rewriting queries so they don't expose the underlying tables, or doing automatic performance optimizations…
It's a SQL dialect that compiles to the real database's SQL based on configuration.
For query operations I would try to find a way to solve this with tools like S3 and SQLite. There are a few VFS implementations for S3 and other CDNs.
We (https://prequel.co) recently started offering this as a white-labeled capability so anyone can offer it without building it themselves. It's a newer capability of our export product: instead of sending the data to the tenant's data warehouse, we enable you to provision an S3/GCS/ABS/etc. bucket with the data formatted. Credential management, analytics, etc. are all batteries-included, so you don't have to build that either. The initial interest from our customers was around BI integrations, but agent use is starting to pick up, which is kinda interesting to see.
We use it (I'm the author of the article) so users can search every run they do and graph all sorts of metrics.