Why hasn't Presto become industry standard?

14 pointsvizually5y ago14 comments

if Presto SQL can query every database under the Sun (from RDBMS, Hadoop HDFS, Kafka and ElasticSearch) why it has not become industry standard yet? What's the catch? What are its limitation?

14 comments

13 comments · 5 top-level

marklit5y ago· 3 in thread

I've been using it for years with clients. It tends to sit in between source data and a final destination. Most data platforms are trying to take data from 10s if not 100s of sources and unify them. The variety of formats and sources is endless and I've often had to resort to using Python-based Airflow DAGs to collect data based on dates and store a cleaned up version of what was collected as a timestamped PQ file for Presto. Presto is then great for large scale transformations across the data and ad-hoc exploring.

Presto can be pointed at a lot of data stores but few external data providers offer ODBC-like interfaces. It seems to be either APIs or static file dumps for the most part. So Presto isn't going to be able to pull from these datasets alone.

In terms of security and maintenance, products like Redshift are easier to train traditional data warehouse people up on. The service is relatively cheap and has a nice UI for scaling.

The data world is extremely fragmented. Once firms have something in place changing it is going to be a struggle. Existing staff often gate keep and defend whatever technology they've staked their careers on. Once there are a lot of reports setup with any on data source migrating it could end up becoming a prolonged project which can be hard to sell.

It was quoted Snowflake had a $1M / day budget for sales and marketing. I'm not aware of any Presto consultancy spending that sort of money. Amazon does have Athena but they have countless other offerings which muddies the water.

vizuallyOP5y ago

@marklit,

Thanks for your insights. Great points on inertia and lack of big sales budgets. Agreed also on HDFC/data lake use cases with PQ files. However, regarding querying RDBMS, are you saying that Presto requires in ODBC/JDBC connectivity? Does Presto have an ability to connect with "native DB" drivers?

marklit5y ago

I can't comment on native drivers. When I said ODBC-like interfaces I was trying to use a catch-all phrase for sources that look like a typical server-based data store (i.e. PostgreSQL, Hive, Kafka).

vizuallyOP5y ago

also, looked at Snowflake's most recent quarterly results. They spent over $1.7 million/day in sales & marketing.

gwittel5y ago· 2 in thread

Lots of general reasons, inertia, etc. Often companies stick to 1-2 preferred technologies and adding Presto isn't seen as a gain (even though it helped Facebook quite a bit). I also suspect Amazon re-packing it as Athena reduced adoption to some extent.

If looking at Presto (now Trino), the main thing to keep in mind is that you inherit the limitations of the underlying data store.

Its best when the underlying store (+ the Db adaptor implementation) lets you parallelize work and keep each node busy, and avoid processing data unnecessarily. Hive/S3 columnar format data works great for this (IIRC this was a major early use case). Other sources like RDBMS will have natural limitations. Kafka has its own issues since each query generally means re-scanning a topic, etc.

I see the data bridges as most useful as a way to bring data into the native/optimal format. Then do the heavy lift work in Presto.

vizuallyOP5y ago

@gwittel, appreciate you sharing your insights. Will you be able to elaborate on "RDBMS will have natural limitations"? Can you provide a specific example?

gwittel5y ago

Presto gets most of its speed from parallelizing work and taking advantage of columnar formats when it can.

In the case of a RDBMS can you get performance gains if you try to parallelize a query from many clients? It will depend on the DB adapter and query. In a random case, if you slice a query into N shards it’s not necessarily going to go faster. It’s still the same DB underneath bound by the same HW performance boundaries.

1 more reply

tqh5y ago· 2 in thread

This might be outdated info:

  * Two different Prestos, prestodb and prestosql for maximum confusion. (I think one renamed)
  * Making Controller highly available by default is hard
  * Autoscaling workers is not simple
  * Code very dependent on its own webframework that tries to do everything and lacks docs.
  * Resource planner for multiple queries is lacking
  * Worker configuration takes a lot of skill

All of these could be solved, but in most cases you can find other solutions where you get a simpler set of problems.

vizuallyOP5y ago

tqh, Sounds like you hit the nail on the head with your answer. Thank you very much for your insights.

bitsondatadev5y ago

Hey I'm a contributor to prestosql (the one that renamed to Trino). I'll provide a few of my opinions into some of these from the vantage point of our project.

* It's definitely confusing but pretty common in open source projects to see the original creators split off when corporate oversight interferes with the OS governance model. (https://www.computerworld.com/article/2746627/hudson-devs-vo...). This is especially true when, as the OP mentioned, it's a pretty cool tech and a lot of interest in it. Now that the names are different, it is clearing up a bit. We're hoping in a few years there will be one project standing so that you won't have to choose. I don't have to tell you which one I think it is.

* Active-active HA is not really necessary IMO as Trino is designed for low latency interactive queries in general. It can handle longer running batch queries but it gives up fault tolerance to fail fast and you just resubmit the query vs predecessors like Hive, Spark, etc... that handle ETL and long running batch processes efficiently but this adds complexity to the query to checkpoint the work. I could see the need for an active-passive HA to have on deck during a failure. Setting up your own active-passive HA is as simple as putting two coordinators behind a proxy and pointing your workers to the proxy address. Then you basically have the proxy run health checks and flip over in the event of an outage. Here's the issue to track native HA though https://github.com/trinodb/trino/issues/391.

* I'm not sure why autoscaling is said to be difficult. I think this is why you have kubernetes and docker to manage this type of workload.

* The only reason this is a pain to me is that engineers wanting to join our community and commit have a bit of a learning curve and depends heavily on us mentoring and guiding them on how the REST API works, which we don't mind. However, I agree with this choice from a design perspective for the user. If you want to use Trino, it's better not to be exposed to this implementation detail or mess with how this works. It will likely cause you more pain.

* This has improved in the last two years since we branched from PrestoDB 2019 (https://trino.io/blog/2020/01/01/2019-summary.html) and 2020 (https://trino.io/blog/2021/01/08/2020-review.html).

* Agreed, we are working on what's the better model here: https://github.com/trinodb/trino/discussions/6573

legg0myegg05y ago· 1 in thread

Dremio also appears to take a similar approach, but with more advanced caching features / query pushdown. Plus it has Apache Arrow at its heart. I think that would be my choice of solution in this space

vizuallyOP5y ago

@legg0myegg0 thanks for bringing up Dremio. Can Dremio connect with different databases RDBMS, Kafka, DataLake PRC file formats? Is the use case limited to certain data stores? Is the use-case primarily a united query engine (so that the code remains consistent across DB engines) or is the use case query acceleration?

curmudgeon225y ago

It adds complexity in infrastructure and becomes a separate translation layer. I haven’t run a presto deployment, but have worked adjacent to it, I do think it brings advantages. But a lot of client side ‘whatevers’ connect to a lot of different database/storage technologies already, adding another complex layer to the mix might not be worth it.

j / k navigate · click thread line to collapse

14 comments

13 comments · 5 top-level

marklit5y ago· 3 in thread

In terms of security and maintenance, products like Redshift are easier to train traditional data warehouse people up on. The service is relatively cheap and has a nice UI for scaling.

vizuallyOP5y ago

@marklit,

marklit5y ago

I can't comment on native drivers. When I said ODBC-like interfaces I was trying to use a catch-all phrase for sources that look like a typical server-based data store (i.e. PostgreSQL, Hive, Kafka).

vizuallyOP5y ago

also, looked at Snowflake's most recent quarterly results. They spent over $1.7 million/day in sales & marketing.

gwittel5y ago· 2 in thread

If looking at Presto (now Trino), the main thing to keep in mind is that you inherit the limitations of the underlying data store.

I see the data bridges as most useful as a way to bring data into the native/optimal format. Then do the heavy lift work in Presto.

vizuallyOP5y ago

@gwittel, appreciate you sharing your insights. Will you be able to elaborate on "RDBMS will have natural limitations"? Can you provide a specific example?

gwittel5y ago

Presto gets most of its speed from parallelizing work and taking advantage of columnar formats when it can.

1 more reply

tqh5y ago· 2 in thread

This might be outdated info:

  * Two different Prestos, prestodb and prestosql for maximum confusion. (I think one renamed)
  * Making Controller highly available by default is hard
  * Autoscaling workers is not simple
  * Code very dependent on its own webframework that tries to do everything and lacks docs.
  * Resource planner for multiple queries is lacking
  * Worker configuration takes a lot of skill

All of these could be solved, but in most cases you can find other solutions where you get a simpler set of problems.

vizuallyOP5y ago

tqh, Sounds like you hit the nail on the head with your answer. Thank you very much for your insights.

bitsondatadev5y ago

Hey I'm a contributor to prestosql (the one that renamed to Trino). I'll provide a few of my opinions into some of these from the vantage point of our project.

* I'm not sure why autoscaling is said to be difficult. I think this is why you have kubernetes and docker to manage this type of workload.

* This has improved in the last two years since we branched from PrestoDB 2019 (https://trino.io/blog/2020/01/01/2019-summary.html) and 2020 (https://trino.io/blog/2021/01/08/2020-review.html).

* Agreed, we are working on what's the better model here: https://github.com/trinodb/trino/discussions/6573

legg0myegg05y ago· 1 in thread

vizuallyOP5y ago

curmudgeon225y ago

j / k navigate · click thread line to collapse