In my last startup, we were building a chargeback management tool for fintechs, and essentially whenever a chargeback came in, we'd only get the transaction ID, so we'd have to query the fintech's database to search for that transaction ID and pull additional data points like name of seller, email of seller, etc.
Now, when we did this, it was more of a scrappy workaround to the fintech not being able to dedicate engineering resources to send data to an API endpoint of ours. However, after a while, I thought to myself, there's nothing wrong with querying the fintech's DB as long as we're treating infosec seriously, and so now I'm curious to hear how common this practice is.
Maybe it's just me, but it feels like integrating with internal DBs could allow a startup to onboard and go live with customers much more quickly. For some reason though, I haven't heard much about this approach, so to be completely honest, I'm thinking of building a tool to make this process easier / standardised.
Before that though, I'm trying to wrap my head around the use cases for this, so it'd be amazing to hear about your story of integrating directly with a customer's DB -- what was it for, and how was the experience?
This is an unfortunate attitude.
From a software architecture point of view, you're hard-coupling your software to that database, which creates brittleness: you're at the mercy of the original database schema, which may not be optimal, carries its own tech debt, may require translations to your own implementation, and invites misinterpretation in anything at scale. Even worse, you're at the customer's mercy if they want to change anything, which immediately breaks your code (likely without warning) and now you have to fix it (likely under duress).
Letting an outside org get direct access to a database is an unnecessary security risk, even if you are "treating infosec seriously". I'd argue no organization that treats infosec seriously would want to request access to or grant access to anything more than the minimal information they need. https://en.wikipedia.org/wiki/Principle_of_least_privilege
Modern software design addresses those issues by putting up abstractions and interfaces, which is why direct DB access isn't particularly common. In fact, one fintech I worked with did this and ruined a team's holiday weekend by making an unfortunate and unsignaled change late on a Friday.
SQL has views. As long as you can represent the underlying data through a view you should be fine.
> and may cause mistakes in interpretation in anything of scale
All APIs have this concern.
> direct access to a database is an unnecessary security risk
You can constrain access to be read-only and to only access the particular view or views necessary.
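As a sketch of that view-plus-read-only pattern (using SQLite for illustration; all table and view names here are made up, and in Postgres or MySQL you'd pair `CREATE VIEW` with `GRANT SELECT` on a dedicated role instead of a read-only connection):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "fintech.db")

# Vendor side: the "real" table, free to evolve and rename columns.
db = sqlite3.connect(path)
db.execute("CREATE TABLE txn_v2 (txn_id TEXT, seller_nm TEXT, seller_email TEXT)")
db.execute("INSERT INTO txn_v2 VALUES ('T1', 'Acme', 'ops@acme.example')")
# The integration contract: a view with stable, documented column names.
db.execute("""
    CREATE VIEW partner_transactions AS
    SELECT txn_id, seller_nm AS seller_name, seller_email FROM txn_v2
""")
db.commit()
db.close()

# Partner side: open the database read-only and touch only the view.
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
row = ro.execute(
    "SELECT seller_name, seller_email FROM partner_transactions WHERE txn_id = ?",
    ("T1",),
).fetchone()
print(row)  # ('Acme', 'ops@acme.example')

# Any write attempt fails at the connection level, not by convention.
write_blocked = False
try:
    ro.execute("DELETE FROM txn_v2")
except sqlite3.OperationalError:
    write_blocked = True
print(write_blocked)  # True
```

The point is that the vendor can rename `seller_nm` or split the table tomorrow, as long as they keep the view's columns intact.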
> has found putting up abstractions and interfaces to address those issues
It could just be a crutch to avoid deep understanding of the underlying tools themselves. It's much easier to greenfield a new API. It's much easier to forget and accidentally abandon these APIs.
This, and it doesn't have to be SQL; a number of other databases have similar concepts.
If you have a separate schema with views (in SQL terms), and the rest is either entirely walled off or available on a "you have been warned" basis, then this is simply an API with a clear contract, just a flexible one (not entirely unlike a GraphQL endpoint) and available over a less common (for APIs) protocol.
It can be risky for the API provider, though, unless it's a proper multi-tenant database. Given that those are less common, and, I guess, potentially harder to manage, it's probably why database access APIs are rare.
Who cares anyway? It's fintech. These projects are meant to become legacy and be replaced every 5-10 years. Those that don't are a PITA to work with regardless of how much the data layer is decoupled from the business layer.
And even if you do decouple them, these *** projects WILL find a way to break on a friday night anyway. Because of course they will. For the most part, they're written by juniors on tight deadlines and there's like only two people in the entire org who even have a faint clue of what the entire codebase does. You can use all the design principles and patterns you want, they're not going to save you.
To argue against both the points you made though, there are reverse ETL platforms like Hightouch and Census today which 1. have specific architectures to make data movement from a central warehouse to third-party SaaS platforms seamless and easy, and 2. also access the company's data stores directly. What's the difference between what they're doing, and this hypothetical third-party unifying data integrations?
You might go live quicker. But the integration will break every time your customer makes a change or needs to upgrade. Pray there's documentation. You'll spend unending hours debugging your customers' weird data models. Issues will appear when logic changes but the data model doesn't.
I won't tell you not to do this because you've been so light on details of what you're actually doing (maybe it is easier this way!), but it would not fly on my watch.
There is no situation in which I'd let a customer write their own queries against the main shared production database, for performance reasons alone. It's a complete non-starter.
(Although like you say, if they want to pay for their own dedicated private read replica, and deal with breaking schema changes whenever they arise, let them go at it. That's sure not something I'd want as a customer though.)
I agree - this is a very good approach, with the most control over how often you need fresh data, how to trigger it, etc.
Although, not every integration is customer-facing, and not every one has to be infinitely scalable. Scalability = cost, often a heavy cost. If you query the DB once a day, the customer can come back to you and say: our db's are a little hot at 3AM and you do your queries then - can you add some indexes to the view or run a DESCRIBE query and optimize this a bit?
And let's agree that the above point does not change whether you use an API layer or query directly.
You're either putting that cost on the client or paying to replicate their DB yourself. If that's too expensive, you shouldn't be using this approach in the first place! Running arbitrary queries on your customer's database primary is simply never safe, nor can it be made safe.
> If you query the DB once a day, the customer can come back to you and say: our db's are a little hot at 3AM and you do your queries then - can you add some indexes to the view or run a DESCRIBE query and optimize this a bit?
It's not a question of the DB running a little hot. It's accidentally triggering one or more full table scans on a table with millions of records. You have no idea what time your customer experiences load, or when their own batch jobs run. Or whether the index you built against with EXPLAIN six months ago got dropped because there are no queries in your customer's codebase that could possibly reference it.
And it's not just about their system staying up. What if you trigger so many IOPS that your customer has a massive bill? Or if you run a query that needs more space to prepare the results, which scales up the instance automatically, resulting in unexpected costs? You're gonna pay for that, right?
You also might hit a performance cliff: the query planner might choose a bad plan for a query that's otherwise satisfiable with an index only, because the stats suggest (for instance) that the visibility map is stale. There's nothing you can do here to fix this short of tuning settings on the DB itself.
> And let's agree that the above point does not change whether you use an API layer or query directly.
It does! Your customer (hopefully) knows their own DB. You don't. Have your customer build the API and they'll support your integration. You can't possibly have a better coupling with the data in your customer's database than your customer does.
I'd hope to never work at a company like that again as an SRE; my life was a nightmare.
This avoids a situation where we get an email out of the blue explaining that there is a database change and we need to dedicate engineering resources to make sure we're compatible by a certain date, or even worse and more likely, an urgent email explaining that the integration has broken and we need it back up and running last week.
-> Permission control, making sure that the user of API can not see data they are not supposed to.
-> Auditability. Verify that the API is being used correctly.
-> Performance. Do not overload the endpoint. (Read from a read replica? And maybe you are not running hour-long analytics queries on the database)
-> Consistency. Are you using transactions to read your data? Could you be causing contention?
-> API versioning. How do you upgrade the underlying tables without breaking users?
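The checklist above can be sketched as a thin read facade. Everything here is hypothetical (the `transactions_v1` table, the `get_transaction` helper, the field allow-list), and a real service would use actual logging and authn/authz rather than in-memory stand-ins, but it shows why a choke point is useful: permission control, auditability, and versioning all live in one place.

```python
import sqlite3
import time

ALLOWED_FIELDS = {"txn_id", "seller_name"}  # permission control: no seller_email
AUDIT_LOG = []                              # auditability: stand-in for real logging

def get_transaction(conn, caller, txn_id, fields):
    # Reject any field outside the contract before touching the DB.
    denied = set(fields) - ALLOWED_FIELDS
    if denied:
        raise PermissionError(f"{caller} may not read: {sorted(denied)}")
    AUDIT_LOG.append((time.time(), caller, txn_id, tuple(fields)))
    cols = ", ".join(fields)
    # "_v1" in the table name is the versioning seam: the vendor can point
    # this facade at a new table without breaking callers.
    return conn.execute(
        f"SELECT {cols} FROM transactions_v1 WHERE txn_id = ?", (txn_id,)
    ).fetchone()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions_v1 (txn_id TEXT, seller_name TEXT, seller_email TEXT)")
conn.execute("INSERT INTO transactions_v1 VALUES ('T1', 'Acme', 'ops@acme.example')")

row = get_transaction(conn, "partner-a", "T1", ["txn_id", "seller_name"])
print(row)  # ('T1', 'Acme')

# seller_email exists in the table but is not exposed through the facade.
denied_ok = False
try:
    get_transaction(conn, "partner-a", "T1", ["seller_email"])
except PermissionError:
    denied_ok = True
print(denied_ok, len(AUDIT_LOG))  # True 1 -- only the permitted call was logged and run
```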
For some of our outgoing files at $employer, there's a notice period of... I think I remember it being 30 days for additional code values and 90 days for layout changes. That sort of planning ahead becomes much harder if every single schema change is immediately visible to outsiders.
The problems are in the realm of non-functional requirements. Quotas, security, protocol support, asynchronous batch and bulk data access, granular access control, throttling…you can try to do this with your db, and many have, but the square peg will eventually not fit anymore.
I've worked on systems that use a SQL DB as a communication layer. A main advantage is that every language comes with good SQL read/write libraries, so things just work right away.
In terms of Pros:
- It was quick to set up and _seems_ to work, mostly

For Cons:
- We have no guarantees that this query will continue to work
- We have no understanding of the model behind this data (why are these records in this state, and those in another? No idea, and the vendor doesn't seem to understand the question)
- We need to poll this data source. How often? No idea!
Culturally I think it suggests an immature tech organisation in general, so while not a red flag I'd suggest it is an orange flag.
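The "SQL DB as a communication layer" pattern mentioned above usually boils down to polling a table as a queue. A minimal single-consumer sketch (table and column names invented; with multiple consumers you'd need row locking, e.g. Postgres `SELECT ... FOR UPDATE SKIP LOCKED`):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, status TEXT DEFAULT 'new')"
)
db.executemany(
    "INSERT INTO outbox (payload) VALUES (?)",
    [("chargeback T1",), ("chargeback T2",)],
)

def poll_once(conn):
    # Claim the oldest unprocessed row, mark it done, hand back the payload.
    row = conn.execute(
        "SELECT id, payload FROM outbox WHERE status = 'new' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None  # queue drained
    conn.execute("UPDATE outbox SET status = 'done' WHERE id = ?", (row[0],))
    conn.commit()
    return row[1]

first = poll_once(db)
second = poll_once(db)
third = poll_once(db)
print(first, second, third)  # chargeback T1 chargeback T2 None
```

This also illustrates the open question in the cons list above: the polling interval is a guess unless the vendor documents it.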
Option 1 seems like a lot of work for the customer ... probably not viable.
Option 2: you can securely connect to their data source using OAuth 2.0. In fact, I know MongoDB offers this (https://www.mongodb.com/docs/atlas/security-oidc/) for this exact use case, I believe. I recall using Vercel to integrate with my MongoDB.
MongoDB: https://www.mongodb.com/docs/atlas/security-oidc/
Supabase: https://supabase.com/docs/guides/auth
Though I haven't done it directly, I think integrating with the DB through a secure protocol like OAuth is the way to go. DB services offer this, and it must be for this exact use case.
If the SaaS target customers are smaller companies / startups, then this may be viable
I'd love to hear more about your idea though! would you mind sharing a little bit about it?
I’m your target audience.
The #1 issue I have with this concept is, as a customer, I would have to trust that you understand the performance limits of my production database and that you won’t run (even accidentally) a burst of inefficient queries that takes down or impairs production.
One way around this would be to query against a replica / secondary, but even then you risk overloading a secondary/replica, which could have negative effects in a failover scenario.
The idea is a simple marketing tool to help businesses reach out to customers that didn’t convert fully through the funnel. So for example, user created a profile, but didn’t end up purchasing.
If interested, I can share the link (once it’s up) and would love to get your feedback.
Unfortunately, the longer into the project you are, the more brittle that connection becomes. You’re tightly coupling a process to the state of the database at time of build. When you do that, you create a situation where your customers can kill your code.
So fundamentally, when you integrate directly, you create a situation where your customers’ engineers control how reliable your software can be. They likely won’t fess up when they break your tool so the problems will come back to you.
I know that sounds a lot like typing ‘npm install’ or ‘building an API’ but in this case, you’re handing the keys over to your customers. At best, it will cause reliability problems. At worst, it will add friction to a sales process. As companies scale and sales gains power, that can become a career issue for you. As a rule, it’s better to have difficult technical decisions when you’re employed than difficult financial decisions when you’re not.
So good writing and excellent analysis. But if you choose to go down that path, at some point in the future, I feel like that integration will become someone’s headache.
If you have to do it, it would be worth documenting that the customer in this case won’t provide proper access through an API. That’s a good reason to do something like this because that’s quite unreasonable of the customer. If they’re unwilling to provide API access, I feel like they’re just as unlikely to provide a read replica so aside from the integration’s brittleness, your queries have to be rock solid or you could slow down prod.
It’s all risky and you have a big decision to make. Good luck and have fun.
It works surprisingly well, and is pretty resilient since it essentially acts as a message queue too. Then again, this is all low traffic stuff in a SME / B2B setting, with zero multitenancy involved.
I’ll still take a proper API or message queue any day though.
Edit: I suppose our biggest benefit is that our customer can actually change the interface with us fairly quickly. They have database experts in-house, but no devs who could do APIs. So the collaboration has been mostly smooth on account of that, and that is a huge advantage compared to them having to outsource any API work. Technically speaking I'm not a fan, but the non-technical results have been useful.
If you're doing writes, this is potentially harmful to the customer, too, not just data exfiltration but also potentially software breaking and that could be lawsuit territory.
Even doing this with your own services' data stores is bad practice. Direct reads/writes to a service's data store without going through its defined interfaces means unexpected, often unmonitored changes happening. I strongly advise against this pattern of "secret APIs," as opposed to overt ones.
Also, please note that this applies to any data store and not just a DB.
- Customer buys a service from a vendor
- Customer creates a Vnet peering in Azure between their subscriptions
- Customer queries the vendor's DB's directly
I will not name names for the vendors or customers, but you have heard of all of them.
The OP already said that their customer didn't have the resources to even connect to an API, so what on earth makes you think they're gonna be able to create some highly-abstracted API layer or some other "bEsT PrAcTiCe" way of exposing the data?
And even if they did create a view or API for the database instead of giving direct access, what makes you think they'd keep that view or API up-to-date? It's just as likely to break as a changing database schema is.
There's an extremely real chance that this fintech giving you direct DB access was breaking the law by doing so.
Of course, no one will dedicate time or money to that API until something breaks. Then you pull out all the emails, risk/assumption statements etc to highlight your due-diligence on informing the client. Then fix the problem, pushing the whole issue down the road again until the next time.
But seriously, expect to see this anywhere there are budget and time constraints. As soon as the fintech is big enough to get more worried about risk than growth, they should start taking this kind of thing seriously.
But in reality, any big corp assumes if you have the access, you’ve used it. So it should be pretty common
Microsoft has Power BI, and Google has some tools included in G Suite
Tableau has a free tier I think
At Scramjet we've built a Platform+Agent engine and put it to use, and in many cases we do integrate with DBs directly, without APIs, through long-lived integration functions.
Let me address some of the comments below:
- hard coupling to the DB is bad - I really don't see how adding additional middleware lessens this; you're just hard-coupling to something else. It might help if the database structure is changed on a whim, but from my experience, most schema changes reflect changed requirements.
- hard coupling is bad in general - is it though? As a platform engineer I fell into the pitfall of making everything reusable, only to learn that 90% of the stack never changed and the XML-based protocol projects developed in 2004 are still being developed without any change to the protocol at all. A reactive approach with a separated client/data model is much better in my opinion, and quite frankly it's mostly impossible to write a system these days without one.
- schema changes are unavoidable - but hey, as someone pointed out, there are views and materialised views. The RDBs do offer a good framework for keeping legacy schemas during the development cycle and this framework will save you the most money as it affects the least amount of moving parts in your system. It's there since the 90s at least, so why not use it?
- the security mumbo jumbo - this seems to be the most uninformed opinion - all leading RDBMSs have an extremely robust and granular way to grant access to data. They have been battle-tested for nearly half a century. They are well known and easily tested. And, unlike the REST frameworks, HTTP servers, relays, caches and whatnot, they are contained in a single piece of software, limiting the attack vectors.
If you can deploy the client to the DB remotely, near the database, then I'd say you're doing this right, and not only that, you're limiting the amount of code you need to deploy, maintain and depend on. If you separate the data, the access code and the logic from each other, you have achieved decoupling. But that does not mean that REST APIs are the "one correct way".
Such an approach was the underlying idea of our Scramjet Platform, and we've taken it not only in integrating with DBs, but also file storage, direct protocols and even GPIOs in some cases. And though we have a trick up our sleeve of being able to split code across distributed environments, even in general terms I'd say: it works, it's maintainable, and it's way less expensive.
The technical debt you'd accrue from this would be MASSIVE, and screw your entire architecture down to its roots. Hard pass from me.