This is not a simple problem to solve efficiently using traditional RDBMS query APIs because they're all rooted in 1980s thinking of: "The network is fast, and this is used by human staff doing manual data entry into a GUI form."
Let's say you're writing an "app" that's given a list of, say, 10K numbers to check. You have a database table in your RDBMS of choice with a column of "banned phone numbers". Let's say it is 100 million numbers, so too expensive to download in bulk.
How would you do this lookup?
Most programmers would say, it's an easy problem to solve: Make sure there is a unique index on that column in the database, and then for each row in the input run a lookup such as:
SELECT 1 FROM BadNumbers WHERE PhoneNumber = @numbertocheck
So simple. So fast!Okay, that's 10K round-trips on the network, almost certainly crossing a firewall or two in the process. Now it'll take minimum of 1 millisecond per call, more like 2ms[1], so that's at least 20 seconds of wait time for the user to process mere kilobytes of data.
Isn't that just sad? A chunk of a minute per 100KB of data.
Like I'm saying, nobody has internalised just how thoroughly Wrong everything is top-to-bottom. The whole concept of "send a query row-by-row and sit there and wait" is outdated, but it's the default. It's the default in every programming language. In every database client. In every ORM. In every utility, and script, sample, and tutorial. It's woven throughout the collective consciousness of the IT world.
The "correct" solution would be for SQL to default to streaming in tables from the client, and every such lookup should be a streaming join. So then the 100KB would take about 5 milliseconds to send, join, and come back, with results coming back before the last row is even sent.
PS: You can approximate this using table-valued parameters in some RDBMS systems, but they generally won't start streaming back results until all of the input has arrived. Similarly, you can encode your table as JSON and decode it on the other end, but that's even slower and... disgusting. The Microsoft .NET Framework has a SqlBulkCopy class but it has all sorts of limitations and is fiddly to use. But that's my point. What should be default case is being treated as the special case because decades ago it was.
[1] If you're lucky. But luck is not a strategy. What happens to your "20 seconds is not too slow app" when the database fails over the paired cloud region? 1-2 ms is now 15 ms and so those 100K round trips will cost two and a half minutes.
While I agree that databases could absolutely be improved to make streaming query results as described better, that isn't a limiting factor here IMO. I'd tackle that problem by batching my queries to the database into some logical batch size and send them as table valued parameters. If I had 10k phone numbers to check and minimum latency matters, why not batch into queries of 500-1000 values per-query? That cuts down time to first response, while reducing the network roundtrips.
The issue with taking this out of the database, is you lose consistency. I don't know about your industry, but I don't think mine would be terribly happy if I was using stale data to validate my Do Not Call/Email list. Now there are some situations where you can just update your list of numbers nightly/weekly/monthly, etc. If you don't need any concurrency or other guarentees, might as well save the time/resources on your DB server.
That's just not the world I have worked in.
If I was to rebuild that python script application today.
to try and match 100,000 records against 10 million. If I were to do it in a database driven micro architecture solution. I’m not sure if I could come up with that returns results faster even using up probably a million times more clock cycles.
I absolutely agree that a DB (even an extremely efficient one) is going to use many more clock cycles to return results there than a local data structure + application. No questions asked.
But how is that list kept up to date? If a user wants to be placed on the list, do you do that in real time to your local data structure? Do you wait for a batch process to sync that data structure with your source of truth?
I'm just saying that a simple program like that will be faster because it lacks a lot of the features that people would consider necessary in today's world.
The database is an amazing general tool that can be used to tackle whole classes of problems that used to require specialized solutions.
A couple years before this project I thought SQL sounded hard. So I wrote my own database engine.