Presto can be pointed at a lot of data stores, but few external data providers offer ODBC-like interfaces. It seems to be either APIs or static file dumps for the most part, so Presto isn't going to be able to pull from these datasets on its own.
In terms of security and maintenance, products like Redshift are easier to train traditional data warehouse people up on. The service is relatively cheap and has a nice UI for scaling.
The data world is extremely fragmented. Once firms have something in place, changing it is going to be a struggle. Existing staff often gatekeep and defend whatever technology they've staked their careers on. Once there are a lot of reports set up on any one data source, migrating it could end up becoming a prolonged project, which can be hard to sell.
It was reported that Snowflake had a $1M / day budget for sales and marketing. I'm not aware of any Presto consultancy spending that sort of money. Amazon does have Athena, but they have countless other offerings, which muddies the water.
Thanks for your insights. Great points on inertia and lack of big sales budgets. Agreed also on HDFS/data lake use cases with PQ files. However, regarding querying RDBMS, are you saying that Presto requires ODBC/JDBC connectivity? Does Presto have the ability to connect with "native DB" drivers?
If looking at Presto (now Trino), the main thing to keep in mind is that you inherit the limitations of the underlying data store.
It's best when the underlying store (+ the DB adapter implementation) lets you parallelize work, keep each node busy, and avoid processing data unnecessarily. Hive/S3 columnar format data works great for this (IIRC this was a major early use case). Other sources like RDBMS will have natural limitations. Kafka has its own issues, since each query generally means re-scanning a topic, etc.
I see the data bridges as most useful as a way to bring data into the native/optimal format. Then do the heavy lifting in Presto.
In the case of an RDBMS, can you get performance gains if you try to parallelize a query from many clients? It will depend on the DB adapter and the query. In the general case, if you slice a query into N shards it's not necessarily going to go faster. It's still the same DB underneath, bound by the same HW performance limits.
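To make the sharding point concrete, here is a minimal Python sketch of slicing one range scan into N shard queries. This is not how any Presto connector is actually implemented; the table name `orders`, the key column `id`, and both helper functions are hypothetical, and the key is assumed to be an integer. Every generated query still lands on the same database, which is why N shards does not imply an N-times speedup.

```python
from __future__ import annotations


def shard_ranges(lo: int, hi: int, n: int) -> list[tuple[int, int]]:
    """Split the half-open key range [lo, hi) into at most n contiguous shards."""
    if n <= 0:
        raise ValueError("n must be positive")
    total = hi - lo
    if total <= 0:
        return []
    n = min(n, total)  # never create empty shards
    base, extra = divmod(total, n)
    ranges = []
    start = lo
    for i in range(n):
        # The first `extra` shards take one additional key each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def shard_queries(table: str, key: str, lo: int, hi: int, n: int) -> list[str]:
    """Build one range-restricted SELECT per shard (hypothetical helper).

    Identifiers are interpolated directly, so this is for illustration only.
    """
    return [
        f"SELECT * FROM {table} WHERE {key} >= {a} AND {key} < {b}"
        for a, b in shard_ranges(lo, hi, n)
    ]


if __name__ == "__main__":
    for q in shard_queries("orders", "id", 0, 10, 3):
        print(q)
```

Whether running these in parallel helps depends on whether the database can serve the ranges independently (index on the key, spare I/O and CPU); otherwise the shards just compete for the same hardware.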
* Two different Prestos, prestodb and prestosql, for maximum confusion. (I think one was renamed.)
* Making Controller highly available by default is hard
* Autoscaling workers is not simple
* Code is very dependent on its own web framework that tries to do everything and lacks docs.
* Resource planner for multiple queries is lacking
* Worker configuration takes a lot of skill
All of these could be solved, but in most cases you can find other solutions where you get a simpler set of problems.
* It's definitely confusing, but it's pretty common in open source projects to see the original creators split off when corporate oversight interferes with the OS governance model (https://www.computerworld.com/article/2746627/hudson-devs-vo...). This is especially true when, as the OP mentioned, it's pretty cool tech and there is a lot of interest in it. Now that the names are different, it is clearing up a bit. We're hoping in a few years there will be one project standing so that you won't have to choose. I don't have to tell you which one I think it is.
* Active-active HA is not really necessary IMO, as Trino is designed for low-latency interactive queries in general. It can handle longer-running batch queries, but it gives up fault tolerance in order to fail fast, and you just resubmit the query. Predecessors like Hive, Spark, etc. handle ETL and long-running batch processes efficiently, but checkpointing the work adds complexity to the query. I could see the need for an active-passive HA setup to have a standby on deck during a failure. Setting up your own active-passive HA is as simple as putting two coordinators behind a proxy and pointing your workers to the proxy address. The proxy then runs health checks and flips over in the event of an outage. Here's the issue to track native HA, though: https://github.com/trinodb/trino/issues/391.
* I'm not sure why autoscaling is said to be difficult. I think this is why you have Kubernetes and Docker to manage this type of workload.
* The only reason this is a pain to me is that engineers wanting to join our community and commit face a bit of a learning curve and depend heavily on us mentoring and guiding them on how the REST API works, which we don't mind. However, I agree with this choice from a design perspective for the user. If you want to use Trino, it's better not to be exposed to this implementation detail or to mess with how it works. It will likely cause you more pain.
* This has improved over the last two years since we branched from PrestoDB: see 2019 (https://trino.io/blog/2020/01/01/2019-summary.html) and 2020 (https://trino.io/blog/2021/01/08/2020-review.html).
* Agreed, we are working on what's the better model here: https://github.com/trinodb/trino/discussions/6573
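
For the active-passive setup mentioned above (two coordinators behind a proxy that runs health checks and flips over), the failover decision can be sketched roughly as follows in Python. This is only an illustration of the idea, not Trino's or any proxy's actual implementation: `pick_active` and `http_health_check` are hypothetical helpers, the hostnames are made up, and probing the coordinator's `/v1/info` endpoint for an HTTP 200 is an assumption about what counts as "healthy". In practice you would configure this in the proxy itself rather than write it by hand.

```python
from __future__ import annotations

import urllib.error
import urllib.request
from typing import Callable, Optional, Sequence


def http_health_check(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the coordinator answers with HTTP 200.

    Assumption: a 200 from <base_url>/v1/info is a good enough liveness signal.
    """
    try:
        with urllib.request.urlopen(base_url + "/v1/info", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, non-2xx response, etc.
        return False


def pick_active(
    coordinators: Sequence[str],
    is_healthy: Callable[[str], bool] = http_health_check,
) -> Optional[str]:
    """Return the first healthy coordinator in priority order, or None.

    The first entry is the primary; later entries are passive standbys.
    """
    for url in coordinators:
        if is_healthy(url):
            return url
    return None


if __name__ == "__main__":
    # Hypothetical addresses; workers would point at the proxy, not at these.
    target = pick_active(["http://coordinator-a:8080", "http://coordinator-b:8080"])
    print("route traffic to:", target)
```

Since a failed coordinator loses its in-flight queries either way, the flip only needs to restore availability; clients resubmit, as described in the HA bullet above.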