[1] - https://adamdrake.com/command-line-tools-can-be-235x-faster-...
b) How large are the 1% of feeds, and what is the size of the total joined datasets? Because ultimately that is what you build platforms for, not the simple use cases.
Hot damn, we collectively spent so much time mitigating our misuse & abuse of DynamoDB.
I'm assuming it's better now?
The CRAN packages are all high quality: if a maintainer stops responding to emails for 2 months, the package is automatically removed. Most packages come from university professors who have been doing this their whole careers.
With a database it is difficult to run a query, look at the result, and then run a query on that result. To me, that is what is missing when replacing pandas/dplyr/polars with DuckDB.
The role of a database is not just to deliver query performance. It needs to fit into the ecosystem, serve the overall role on multiple facets, deliver on a wide range of expectations - tech and non-tech.
While the useful dataset itself may not outpace the hardware advancements, the ecosystem complexity will definitely outpace any hardware or AI advancements. Overall adaptation to the ecosystem will dictate the database choice, not query performance. Technologies will not operate in isolation.
Back in 2012 we were just recovering from the everything-is-xml craze and in the middle of the no-sql craze and everything was web-scale and distribute-first micro-services etc.
And now, after all that mess, we have learned to love what came before: namely, please please please just give me sql! :D
NoSQL (e.g. Cassandra, MongoDB) and microservices were invented to solve real-world problems, which is why they are still so heavily used today. And the criticism of them is exactly the same as what was levelled at SQL back in the day.
It's all just tools at the end of the day and there isn't one that works for all use cases.
I like the gist of the article, but the conclusion sounds like 20/20 hindsight.
All the elements were there, and the author nails it, but maybe the right incentive structure wasn't in place to create the conditions for it to happen.
Between 2010 and 2015, there was a genuine feeling across almost all of industry that we would converge on massive amounts of data, because until then the industry had never seen such an abundance of data capture, with sensors so easy to place everywhere.
The natural step in this scenario is, most of the time, not "let's find efficient ways to do it with the same capacity" but instead "let's invest to be able to process this in a distributed manner, independent of whatever volume we might have."
It's the same thing between OpenAI/ChatGPT and DeepSeek, where one can say that the math was always there, but the first runner was OpenAI with something less efficient but with a different set of incentive structures.
It's only after being burned many, many times that the need for simplicity arises.
It's the same with NoSQL. Only after suffering through it do you appreciate going back.
i.e.: tools like this circle back only after the pain of a bubble. It can't be done inside one.
Investors really wanted to hear about your scaling capabilities, even when it didn't make sense. But the burn rate at places that didn't let a spreadsheet determine scale was insane.
Years working on microservices, and now I start planning/discovery with "why isn't this running on a box in the closet" and only accept numerical explanations. Putting a dollar value on excess capacity and labeling it "ad spend" changes perspectives.
https://www.usenix.org/system/files/conference/hotos15/hotos...
Anyway, the old laptop is about par with the 'big' VMs that I use for work to analyse really big BQ datasets. My current flow is to do the kind of 0.001% queries that don't fit on a box on BigQuery and massage things with just enough prepping to make the intermediate result fit on a box. Then I extract that to parquet stored on the VM and do the analysis on the VM using DuckDB from python notebooks.
DuckDB has revolutionised not what I can do but how I can do it. All the ingredients were around before, but DuckDB brings it together and makes the ergonomics completely different. Life is so much easier with joins and things than trying to do the same in, say, pandas.
The screen started to delaminate at the edges, and the screen of its follow-up (an MBP with the touch bar) is completely broken (probably just the connector cable).
I don't have a use for it, but it feels wasteful just to throw it away.
What puzzled me was that a client would want others to execute its queries, but not want to load all the data and make queries for the others. And how to prevent conflicting update queries sent to different seeds.
I also thought that Crockford's distributed web idea (where every page is hosted torrent-style) was a good one, even though I didn't think deeply about that one.
Until I saw the discussion on web3, where someone pointed out that uploading any data to one server would make a lot of hosts do the job of hosting a part of it, and every small change would cause tremendous amounts of work for the entire web.
I would definitely not trade that for a pre-computed analytics approach. The freedom to explore in real time is enlightening and freeing.
I think you have restricted yourself to precomputed, fixed analytics, but real-time interactive analytics is also an interesting area.
This isn't really saying much. It is a bit like saying the 1-in-1,000-year storm levee is overbuilt for 99.9% of storms. They aren't the storms the levee was built for, y'know; it wasn't designed with them anywhere near the top of mind. The database might do 1,000 queries in a day.
The design focus should really be on the queries that live out on the tail: can they be done on a smaller database? How much value do they add? What capabilities does the database need to handle them? Etc. That is what should justify a Redshift database. Or you can provision one to hold your 1 TB of data because red things go fast and we all know it :/
On the contrary, it's saying a lot about sheer data size, and that's all it claims. The things you mention may be crucial to why Redshift and co. were chosen (or not: in my org Redshift was the standard, so even small datasets were put into it because management wanted to standardize, for better or worse), but the fact remains that if you deal with smaller datasets all of the time, you may want to reconsider the solutions you use.
I'm throwing a bottle into the ocean: if anyone has spare compute with good specs they could lend me for a non-commercial project it would help me a lot.
My email is in my profile. Thank you.
Also - SQLite would have been totally fine for these queries a decade ago or more (just slower) - I messed with 10 GB+ datasets with it more than 10 years ago.
* you have a small dataset (total, not just what a single user is scanning)
* no real-time updates, just a static dataset that you can analyze at leisure
* only a few users, and only one doing any writes
* several seconds is an OK response time; it gets worse if you have to load your scanned segment into the DuckDB node
* generally read-only workloads
So yeah, not convinced we lost a decade.
I'm really skeptical of arguments that say it's OK to be slow. Even in the modern-laptop example, queries still take up to 47 seconds.
Granted, I'm not looking at the queries, but the fact is that there are a lot of applications where users need results back in less than a second.[0] If the results are feeding automated processes like page rendering, they need them back in tens of milliseconds at most. That takes hardware to accomplish consistently, especially if the datasets are large.
The small data argument becomes even weaker when you consider that analytic databases don't just do queries on static datasets. Large datasets got that way by absorbing a lot of data very quickly. They therefore do ingest, compaction, and transformations. These require resources, especially if they run in parallel with query on the same data. Scaling them independently requires distributed systems. There isn't another solution.
[0] SIEM, log management, trace management, monitoring dashboards, ... All potentially large datasets where people sift through data very quickly and repeatedly. Nobody wants to wait more than a couple seconds for results to come back.
It's a lot easier to monetize data analytics solutions if users' code & data are captive in your hosted infra/cloud environment than it is to sell people a binary they can run on their own kit...
All the better if it's an entire ecosystem of... stuff... living in "the cloud", leaving end users writing checks to 6 different portfolio companies.
Remember, from 2020-2023 we had an entire movement pushing a thing called the "Modern Data Stack" (MDS), with big actors like a16z lecturing the market about it [1].
I am originally from data. I have never worked on anything outside of data: DS, MLE, DE, MLOps, and so on. One thing I envy in other developer careers is having bosses/leaders with battle-tested knowledge of delivering things using pragmatic technologies.
Most of the "AI/Data leaders" have at most 15-17 years of career dealing with those tools (and I am talking about some dinosaurs, in the good sense, who saw the DWH or data mining era).
After 2018 we had an explosion of people working on PoCs or small projects at best, trying to mimic whatever the latest blog post from some big tech company pushed.
A lot of those people are the bosses/leaders today. Worse, they were formed during a 0% interest environment, with tons of hype around the technology, little to no scrutiny or business need for impact, upper management that did not really understand what they were doing, and a space that wasn't easy for people from other parts of tech (e.g., SRE, backend, design, front-end, systems engineering) to join and call out.
In other words, it's quite simple to sell complexity or obscure technology for most of these people, and the current moment in tech is great because we have more people from other disciplines chiming in and sharing their knowledge on how to assess and implement technology.
[1] - https://a16z.com/emerging-architectures-for-modern-data-infr...
OK now you need PortCo1's company analytics platform, PortCo2's orchestration platform, PortCo3's SRE platform, PortCo4's Auth platform, PortCo5's IaC platform, PortCo6's Secrets Mgmt Platform, PortCo7's infosec platform, etc.
I am sure I forgot another 10 things. Even if some of these things were open source or "open source", there was the upsell to the managed/supported/business license/etc version for many of these tools.
As an SRE/SysEng/Devops/SysAdmin (depending on the company that hires me): most people in the same job as me could easily call it out.
You don't have to be such a big nerd to know that you can fit 6 TB of memory in a single (physical) server. That's been true for a few years. Heck, AWS has had 1 TB+ memory instances for a few years now.
The thing is... upper management wanted "big data" and the marketing people wanted to put the fancy buzzword on the company website and on LinkedIn. The data people wanted to put the fancy buzzword on their CVs (and on their LinkedIn profiles, and command higher salaries because of it; can you blame them?).
> In other words, it's quite simple to sell complexity or obscure technology for most of these people
The unspoken secret is that this kind of BS wasn't/isn't only going on in the data fields (in my opinion).
There is some circular reasoning embedded here. I've seen many, many cases of people finding ways to cut up their workloads into small chunks because the performance and efficiency of these platforms is far from optimal if you actually tried to run your workload at its native scale. To some extent, these "small reads" reflect the inadequacy of the platform, not the desire of a user to run a particular workload.
A better interpretation may be that the existing distributed architectures for data analytics don't scale well except for relatively trivial workloads. There has been an awareness of this for over a decade but a dearth of platform architectures that address it.
Data size is a red herring in the conversation.
Around 0.5 to 50 GB is such an annoying area, because Excel starts falling over on the lower end and even nicer computers will start seriously struggling on the larger end if you're not being extremely efficient.
Me, circa 2007, when a lovely Hadoop guy was trying to tell me how good it was for 10m rows of data... It was much slower than the free version of Oracle, as I recall.
Why do they use the geometric mean to average execution times?
If half your requests are 2x as long and half are 2x as fast, you don’t take the same wall time to run — you take longer.
Let’s say you have 20 requests, 10 of type A and 10 of type B. They originally both take 10 seconds, for 200 seconds total. You halve A and double B. Now it takes 50 + 200 = 250 seconds, or 12.5 on average.
This is a case where geometric mean deceives you - because the two really are asymmetric and “twice as fast” is worth less than “twice as slow”.
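The worked example above can be checked in a few lines of plain Python (numbers taken straight from the comment): the arithmetic mean of the new per-request times reflects the longer wall time, while the geometric mean stays flat, hiding the slowdown.

```python
import math

# 10 type-A requests halved (10 s -> 5 s), 10 type-B requests doubled (10 s -> 20 s)
times = [5.0] * 10 + [20.0] * 10

# Arithmetic mean: total wall time divided by request count
arith = sum(times) / len(times)

# Geometric mean: exp of the mean of the logs
geo = math.exp(sum(math.log(t) for t in times) / len(times))

print(arith)  # 12.5 -- matches 250 s / 20 requests from the example above
print(geo)    # 10.0 -- identical to before the change
```

The geometric mean treats "2x faster" and "2x slower" as exactly cancelling, which is the asymmetry the comment is pointing at.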
I just did a quick Google, and the first real result was this blog post with a good explanation and some good illustrations: https://jlmc.medium.com/understanding-three-simple-statistic...
It's the very first illustration at the top of that blog post that 'clicks' for me. Hope it helps!
The inverse is also good: mean squared error is a good way to compare how similar two datasets (e.g. two images) are.
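A minimal pure-Python sketch of that idea (the pixel values are toy numbers for illustration): MSE averages the squared per-element differences, so identical data scores 0 and larger deviations are penalized quadratically.

```python
# Mean squared error between two equal-length pixel sequences;
# 0 means identical, larger values mean less similar.
def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

img1 = [0, 10, 20, 30]   # toy flattened "image"
img2 = [0, 12, 18, 30]   # same image with two pixels nudged by 2

print(mse(img1, img2))  # (0 + 4 + 4 + 0) / 4 = 2.0
print(mse(img1, img1))  # 0.0
```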
I mostly see ClickHouse, Postgres, etc.