undefined | Better HN

0 pointsTostino4y ago0 comments

That's a quite simple query to just about any database.

0 comments

5 comments · 2 top-level

jiggawatts4y ago· 3 in thread

Is it though? This goes back to my point of architects and developers having internalised thoroughly outdated rules of thumb that are now wrong by factors of tens of thousands or more.

This is not a simple problem to solve efficiently using traditional RDBMS query APIs because they're all rooted in 1980s thinking of: "The network is fast, and this is used by human staff doing manual data entry into a GUI form."

Let's say you're writing an "app" that's given a list of, say, 10K numbers to check. You have a database table in your RDBMS of choice with a column of "banned phone numbers". Let's say it is 100 million numbers, so too expensive to download in bulk.

How would you do this lookup?

Most programmers would say, it's an easy problem to solve: Make sure there is a unique index on that column in the database, and then for each row in the input run a lookup such as:

    SELECT 1 FROM BadNumbers WHERE PhoneNumber = @numbertocheck

So simple. So fast!

Okay, that's 10K round-trips on the network, almost certainly crossing a firewall or two in the process. Now it'll take minimum of 1 millisecond per call, more like 2ms[1], so that's at least 20 seconds of wait time for the user to process mere kilobytes of data.

Isn't that just sad? A chunk of a minute per 100KB of data.

Like I'm saying, nobody has internalised just how thoroughly Wrong everything is top-to-bottom. The whole concept of "send a query row-by-row and sit there and wait" is outdated, but it's the default. It's the default in every programming language. In every database client. In every ORM. In every utility, and script, sample, and tutorial. It's woven throughout the collective consciousness of the IT world.

The "correct" solution would be for SQL to default to streaming in tables from the client, and every such lookup should be a streaming join. So then the 100KB would take about 5 milliseconds to send, join, and come back, with results coming back before the last row is even sent.

PS: You can approximate this using table-valued parameters in some RDBMS systems, but they generally won't start streaming back results until all of the input has arrived. Similarly, you can encode your table as JSON and decode it on the other end, but that's even slower and... disgusting. The Microsoft .NET Framework has a SqlBulkCopy class but it has all sorts of limitations and is fiddly to use. But that's my point. What should be default case is being treated as the special case because decades ago it was.

[1] If you're lucky. But luck is not a strategy. What happens to your "20 seconds is not too slow app" when the database fails over the paired cloud region? 1-2 ms is now 15 ms and so those 100K round trips will cost two and a half minutes.

TostinoOP4y ago

Yes, it is.

While I agree that databases could absolutely be improved to make streaming query results as described better, that isn't a limiting factor here IMO. I'd tackle that problem by batching my queries to the database into some logical batch size and send them as table valued parameters. If I had 10k phone numbers to check and minimum latency matters, why not batch into queries of 500-1000 values per-query? That cuts down time to first response, while reducing the network roundtrips.

The issue with taking this out of the database, is you lose consistency. I don't know about your industry, but I don't think mine would be terribly happy if I was using stale data to validate my Do Not Call/Email list. Now there are some situations where you can just update your list of numbers nightly/weekly/monthly, etc. If you don't need any concurrency or other guarentees, might as well save the time/resources on your DB server.

That's just not the world I have worked in.

treeman794y ago

Brilliant explanation.

If I was to rebuild that python script application today.

to try and match 100,000 records against 10 million. If I were to do it in a database driven micro architecture solution. I’m not sure if I could come up with that returns results faster even using up probably a million times more clock cycles.

TostinoOP4y ago

Yes, that is a good explainer on the horrors of single value lookups to a database. It isn't the only way to do that though, as explained in my other post.

I absolutely agree that a DB (even an extremely efficient one) is going to use many more clock cycles to return results there than a local data structure + application. No questions asked.

But how is that list kept up to date? If a user wants to be placed on the list, do you do that in real time to your local data structure? Do you wait for a batch process to sync that data structure with your source of truth?

I'm just saying that a simple program like that will be faster because it lacks a lot of the features that people would consider necessary in today's world.

The database is an amazing general tool that can be used to tackle whole classes of problems that used to require specialized solutions.

treeman794y ago

Keep in mind this was 20 years ago.

A couple years before this project I thought SQL sounded hard. So I wrote my own database engine.

j / k navigate · click thread line to collapse

0 comments

5 comments · 2 top-level

jiggawatts4y ago· 3 in thread

Is it though? This goes back to my point of architects and developers having internalised thoroughly outdated rules of thumb that are now wrong by factors of tens of thousands or more.

How would you do this lookup?

Most programmers would say, it's an easy problem to solve: Make sure there is a unique index on that column in the database, and then for each row in the input run a lookup such as:

    SELECT 1 FROM BadNumbers WHERE PhoneNumber = @numbertocheck

So simple. So fast!

Isn't that just sad? A chunk of a minute per 100KB of data.

TostinoOP4y ago

Yes, it is.

That's just not the world I have worked in.

treeman794y ago

Brilliant explanation.

If I was to rebuild that python script application today.

TostinoOP4y ago

Yes, that is a good explainer on the horrors of single value lookups to a database. It isn't the only way to do that though, as explained in my other post.

I absolutely agree that a DB (even an extremely efficient one) is going to use many more clock cycles to return results there than a local data structure + application. No questions asked.

I'm just saying that a simple program like that will be faster because it lacks a lot of the features that people would consider necessary in today's world.

The database is an amazing general tool that can be used to tackle whole classes of problems that used to require specialized solutions.

treeman794y ago

Keep in mind this was 20 years ago.

A couple years before this project I thought SQL sounded hard. So I wrote my own database engine.

j / k navigate · click thread line to collapse