Show HN: MetricFlow – open-source metric framework (opens in new tab)

(github.com)

98 pointsnicholashandel4y ago26 comments

Hi HN community, I’m Nick, co-founder/CEO of Transform.co. I’m thrilled to share MetricFlow, an open-source metric creation framework: https://github.com/transform-data/metricflow

MetricFlow strives to make what has historically been an extremely repetitive process, writing SQL queries on core normalized data models, much more DRY. MetricFlow consolidates the definitions for joins, aggregations, filters, etc., and programmatically generates SQL to construct data marts. You can think of it like LookML, but more powerful and ergonomic (and open source!). The project has three components:

1. MetricFlow Spec: The specification encapsulates metric logic in a more reusable set of abstractions: data_sources, measures, dimensions, identifiers, metrics, and materializations.

2. DataFlow Planner: The Query Planner is a generalized SQL constructor. We take in data sources (ideally normalized data models) and generate a graph of data transformations (a flow, if you will) – joins, aggregations, filters, etc. We take that graph and render it down to db-specific SQL while optimizing it for performance and legibility.

3. MetricFlow Interfaces: The CLI and Python SDK rely on the flexibility of the Spec and Planner to build just about any query you could ask for on top of your data warehouse.

These components enable novel features that other semantic layers struggle to support today:

- MetricFlow enables the user to traverse the entire graph of a company’s data warehouse without confining their analysis to pre-built data models (dbt), Explores (in Looker), or Cubes (in lots of tools).

- The Metric abstraction allows the construction of complex metrics that traverse the graph described above to rely on multiple data sources. We support several common metric types today, and adding more is a critical part of the open-source roadmap.

- The Materialization abstraction allows users to define and then programmatically generate data marts that rely on a single DRY expression of the metrics and dimensions.

MetricFlow is open source(https://github.com/transform-data/metricflow) and distributed through pypi (`pip install metricflow`). You can set up (`mf setup`) a set of sample configs and try out a tutorial (`mf tutorial). The docs are all here(https://docs.transform.co/docs/overview/metricflow-overview). We’d love contributions on GitHub. We’re adding new Issues to share our roadmap in the coming days, but feel free to open your own.

We’re also opening up a Slack community(https://community.transform.co/metricflow-signup) to talk about the project and, more generally, metric tooling.

Let us know what you think – we’ll be here answering any questions!

Show HN: MetricFlow – open-source metric framework

(github.com)

98 pointsnicholashandel4y ago26 comments

Hi HN community, I’m Nick, co-founder/CEO of Transform.co. I’m thrilled to share MetricFlow, an open-source metric creation framework: https://github.com/transform-data/metricflow

1. MetricFlow Spec: The specification encapsulates metric logic in a more reusable set of abstractions: data_sources, measures, dimensions, identifiers, metrics, and materializations.

3. MetricFlow Interfaces: The CLI and Python SDK rely on the flexibility of the Spec and Planner to build just about any query you could ask for on top of your data warehouse.

These components enable novel features that other semantic layers struggle to support today:

- The Materialization abstraction allows users to define and then programmatically generate data marts that rely on a single DRY expression of the metrics and dimensions.

We’re also opening up a Slack community(https://community.transform.co/metricflow-signup) to talk about the project and, more generally, metric tooling.

Let us know what you think – we’ll be here answering any questions!

26 comments

22 comments · 8 top-level

pstoll4y ago· 4 in thread

Cool. Like open source Looker.

We adopted Looker at $previous_job. Then they got bought by Google, which was great for us as we were becoming a big GCP customer. I strongly encouraged google / looker team to at least open source their LookMl (looker modeling language - equivalent to MQL). They couldn’t figure it out.

This type of metric definition is so empowering for businesses. Not enough engineers grok why this is useful.

hintymad4y ago

Interested in why it's useful too. If the same SQL is used in multiple places and cause confusion, isn't it an organizational problem instead of a technical problem? For instance, what if two teams create two different and conflicting metric definitions to answer the same question? It's like turtles all the way down and how could we prevent diversion of query definitions even if we have a perfect metric definition system?

tlento4y ago

I agree this is an organizational problem, but a lot of the tools we use every day - from bug trackers to distributed version control to collaborative document authoring tools - are technical solutions to organizational problems. One key organizational problem data people routinely deal with is exactly this issue of multiple metric definitions.

Imagine a company that wants to know how much time its users spend on its application, and wants that broken down by country in order to attract investors or something. They probably don’t have one metric for time spent on the application. They have like 12, all of them are different, and although all of them are wrong they each work sort of ok for a given context. To make matters worse, those 12 metrics are defined in different places, computed in different ways, and generated from different data sources. Maybe 3 of them are documented somewhere in the company wiki while the details of the rest live in the heads of 9 current or former employees who are off doing.... something.

That is a mess. Investors don’t care about all that nuance on use cases for time spent measures. Neither does anybody talking to the investors. What they don’t want to hear, when they ask for this simple metric, is “well..... what are you planning to use this for?” On the other hand, things get even worse if somebody grabs the wrong time spent metric and presents that publicly without realizing one of the other 11 was the one they really needed.

It would be better if our hypothetical company had its people define and document a set of consistent metrics and ensure they’re always computed the same way. By centralizing those definitions into a repository (ideally managed by source control) everybody can share the understanding of what a given metric represents. More importantly, they know where to look for pre-existing metric definitions. If this hypothetical company really needs 12 different time spent metrics (they don’t, but bear with me here) then each of those metrics can be named and defined in a way that explains what it’s good for, and they can simply point the makers of their investment pitch at the “baseline_user_time_spent” metric and have done with it.

For any of that to even be possible you need, at a minimum, a place to store metric definitions and a metric layer like MetricFlow that understands those metric definitions, translates them into the relevant queries, and executes them with the required filters and grouping attributes.

1 more reply

fortysixdegrees4y ago

Could you elaborate on why it's useful. I don't get it but I'm intrigued

nicholashandelOP4y ago

Traditionally, querying business and product metrics for data analysis has required lots of ad-hoc sql queries. Often they are encoded in dashboards, in ETL pipelines, and others are copied directly. Semantic layers act as a single source of truth that encapsulates all of that logic. Having a single layer responsible for querying these datasets enables powerful workflows:

- It enables self-serve analytics experiences because it creates objects that business people can interact within pivot-table-like forms. Hundreds of lines of SQL are distilled to simple queries.

- The logic is DRYer / easier to govern than all the repetitive SQL rollups that are required to answer business questions.

- Analysts become more productive write the same stuff less and ask more questions faster.

The idea of limiting duplicated logic is very well understood in the software engineering community and desired in the analytics community but we’re still in the early days. In practice, this is really hard in SQL and the tools we have are too limited.

More specifically, the reason why I get excited about MetricFlow

- We basically built a generalized SQL constructor. It will be able to build performant and legible SQL for complicated requests (things that data engineers describe in hundreds of lines) through simple and consistent query interfaces.

- The way we encapsulate logic requires much fewer lines of yaml/code than most other frameworks and we can do much more with those lines. LookML and previous versions we worked on at Airbnb became quite unruly because of the choices in the abstractions.

- The metric abstraction is flexible and allows us to calculate complicated metrics with only a few lines of yaml. That means we can define metrics like conversion metrics that might take joining two data sources together, deduplicating, filtering to a conversion window, etc. in a single way with a few parameters that reference existing objects.

1 more reply

mjirv4y ago· 2 in thread

This is awesome, though I would love some more detail in the documentation.

What’s the quick pitch for why I should use this instead of Cube or dbt’s metrics layer?

nicholashandelOP4y ago

I think it’s probably best to talk about this comparison in three areas:

Semantics - The MetricFlow spec allows the construction of a much broader range of metrics with much less expression of logic or duplication of that logic than dbt or Cube.

Performance - MetricFlow generates queries that rivals the optimizations of a skilled Data Engineer and builds pre-aggregated tables similar to Cube while dbt builds a static query from a jinja macro.

Interfaces - Cube has some great interfaces for frontend developers, dbt just generates SQL at this point, and MetricFlow has a Python and CLI . The hosted version, Transform, comes with a SQL and GraphQL Interface but that is beyond the scope of the OSS project.

nicholashandelOP4y ago

If you’re interested, the longer version:

Semantics

MetricFlow has a less configuration relative to these other frameworks. We accomplish this by choosing abstractions that allow us to handle more on our side at query time through the DataFlow Plan builder. Working with the SQL constructions as a dataflow enables extensions such as non-dw data sources, or using other languages(Python) for some transformations.

The dbt spec is relatively new and requires a few extremely unDRY expressions. The most obvious is the lack of support for joins which means you simply won’t be able to answer most questions unless you build huge tables. There are a few other issues with the abstractions. For example, dimensions are defined multiple times across metrics. A few folks posted more about these challenges in their Github Issue but they’re sticking to their spec. I’m skeptical it will work at any scale.

The Cube concept is similar to Explores in Looker. They’re limiting because you end up with a bunch of representations of small domains within the warehouse and the moment you hit the edge of that domain you need to add a new Cube/Explore. This is not DRY and it’s frustrating. There is also no first-class object for Metrics which means you’re limited to to relatively simple metric types.

Performance

MetricFlow has the flexibility of the DataFlow Plan Builder and builds quite efficient queries. The Materialization feature allows you to build roll up tables programmatically to the data warehouse which could then be used as a low-latency serving layer.

dbt is a jinja macro and generates a static query per metric requested: [https://github.com/dbt-labs/dbt_metrics/blob/main/macros/get.... This macro will be quite hard to optimize for more complicated metric types. We struggled a ton with this before refactoring our framework to allow the manipulation and optimizations of these DataFlow Plans.

Cube is pretty slick on caching, but I know less about their query optimizations. They have some awesome pre-aggregation and caching features. I think this comes from their background in serving frontend interfaces.

Interfaces

MetricFlow supports a Python SDK and our CLI, today. Transform has a few more interfaces (SQL over JDBC, GraphQL, React) that sit outside the scope of this OSS project.

dbt only builds a query in the dbt context today. TBD what the dbt server does but I imagine it will expose a JDBC for paying customers.

Cube seems more focused on building custom data applications but has recently pivoted to the analytics front. I haven’t seen those interfaces in action but I’m curious to learn more there.

seektable4y ago· 2 in thread

It looks like MetricFlow shines in constructing SQL queries on-demand, which means that it should be directly used by a BI tool, am I right with this?.. Generation of the static SQL (with CLI) for each report doesn't seem very usable on practice.

In other words, BI tools needs to have a special connector that automatically utilizes MetricFlow Python API (or CLI). What BI tools already can use MetricFlow in this way (open-source part of the project)?

Actually I'm asking about that as a BI tool vendor. We have added an ability to use custom connectors (web API) so potentially this kind of connector can use MetricFlow for SQL generation.

mitchelllc4y ago

Looks like it supports GraphQL APIs[1], and downstream BI applications should be able to consume metric results from MetricFlow through GraphQL.

[1] https://github.com/transform-data/metricflow#features

nicholashandelOP4y ago

That's right! That's a potential option for an integration.

The other options are: - Transform's JDBC can be used to connect to tools that have SQL interfaces like Mode, Hex, Deepnote, etc.: https://docs.transform.co/docs/api/sql/sql-overview - Materializations can be exported as constructed data marts to tools like Tableau / Looker that take in constructed data sources: https://docs.transform.co/docs/metricflow/reference/material...

XCSme4y ago· 2 in thread

It's a bit unclear how would I go with integrating MetricFlow with https://uxwizz.com for example that uses a MySQL database to store analytics data. From the docs, I don't really understand how it actually "understands" the underlying SQL database and how to retrieve the data I need. It feels like I have to write the query to get the data I want, but in a different syntax. Is there any point to use MetricFlow if you only have one data source?

tlento4y ago

Three things:

First, MetricFlow does not currently support MySQL. We launched with support for BigQuery, Redshift, and Snowflake. I have opened an issue to add support for MySQL (and similar issues for other SQL engines are coming): https://github.com/transform-data/metricflow/issues/27

Second, what we call a data source is more similar to a table in a database, rather than the underlying database service itself. Metricflow itself is useful when you're using a single SQL engine - indeed, that's all we support today - but it is most useful when you're in a world where joins are a thing. That said, if you have one big data table you might still find it useful to have declarative metric definitions defined in Metricflow. Suppose, for example, you had a big NoSQL style table filled with JSON objects. You might define a few data sources that normalize those JSON objects into top level elements (identifiers, dimensions, aggregated measures) using the sql_query data source config attribute, and then that'd allow you to support structured queries on the data consumption end while pushing unstructured blobs from your application layer. This will be slow at query time, and only as reliable as the level of discipline exerted in your application development workflow, but it's possible.

Third, if we did support MySQL you'd basically connect to it via standard connection parameters - we have a config file where you can store the required information and then we'll manage the connections for you. However, I'm not familiar with uxwizz, and a quick perusal of their documentation did not turn up how one goes about connecting to the underlying DB. It's likely I just missed this, but at any rate I don't know how it is done. If they don't support standard MySQL client connections you'd need to write an adapter of some kind against whatever DB connection APIs they provide, in which case you'd likely need to roll a custom implementation of MetricFlow's SqlClient interface and initialize the MetricFlowEngine with that.

XCSme4y ago

Thanks for the response!

With data sources, I mean if you have multiple services/products delivering data (e.g. an analytics platform, a CRM, all hosted on different servers).

> uxwizz, and a quick perusal of their documentation did not turn up how one goes about connecting to the underlying DB

UXWizz is self-hosted, so you just have a basic MySQL/MariaDB database that you have full control over (so you can connect remotely with the host/db/username/password you create). This is the database structure: https://docs.uxwizz.com/guides/database-querying

> you have one big data table you might still find it useful to have declarative metric definitions defined in Metricflow

Oh, so this is like saving queries? Why would I write the MetricFlow config instead of saving the SQL query directly? I had look over https://docs.transform.co/docs/metricflow/guides/introductio... but I found the concepts and config file a bit hard to understand (but maybe I'm not in the target audience).

1 more reply

moltar4y ago· 2 in thread

How does it compare and plans to compete with dbt metrics?

pstoll4y ago

Quick scan of dbt metrics - looks like it lets you define a flat, single view of a metric with a predefined set of “filters” (Ie sql predicates). Also it only allows a metric to come from a single table; “joins” are a TODO.

It is missing the runtime engine to “interpret” the semantic model and so arbitrary but correct joins among tables and generate the appropriate SQl to realize those arbitrary joins.

If anyone know dbt better and also knows semantic models, please correct.

nicholashandelOP4y ago

I think those points are on the mark. I added some notes to the similar question that asked for a comparison of dbt, cube and MetricFlow!

I'll just say Joins are hard, but if solved elegantly they allow you to do some really interesting stuff. As an example in MetricFlow you can traverse the data graph in your queries. For example if you go `pip install metricflow` and try the tutorial: `mf_tutorial`

You can see what I mean by running queries that traverse the multiple joins to get to other dimensions:

1. Ask for transactions by day by customer `mf query --metrics transactions --dimensions ds,customer --order ds`

2. Ask for transactions by day by customer country `mf query --metrics transactions --dimensions ds,customer__country --order ds`

3. Ask for transactions by day by customer region where we traverse through a country to region mapping `mf query --metrics transactions --dimensions ds,customer__country__region --order ds`

theboat4y ago· 1 in thread

Thank you for open sourcing this. More competition in the budding metrics ecosystem is good for end users.

It seems like you think MetricFlow should be the data mart layer and not just the metrics layer. If that's true...why? Why would I join my fact and dimension tables in metricflow instead of in dbt? One of the value adds of dbt is that it centralizes business logic in a single place. Joins are business logic. The industry seems to be moving towards creating very wide data mart tables in dbt and surfacing them to the semantic layer 1:1, or building the metrics layer on top of them.

tlento4y ago

I'd say we think MetricFlow should be able to provide consistent, correct answers to reasonable queries end users of the metric model might ask. To do this across the various data warehouse layouts our users are likely to encounter we must necessarily provide support for dimensional joins. This doesn't mean MetricFlow should displace data mart services - to the contrary, I contend MetricFlow works best when layered on top of a warehouse built on centralized logic for managing its data layout. As an example, we generally push our customers to rely on the sql_table data source definition and push any sql_query constructs down to whatever warehouse management layers they have in place.

That said, you need to support joins, at least in some limited scope, in the semantic metric layer for it to be broadly useful. Consider this scenario - you have your dbt models producing wide tables for reasonable measure/dimension queries, and you have MetricFlow configs for the metric and dimension sets available in your data mart. Now imagine you've also got your finance team hooked up to a Google Sheets connector, and they're looking at revenue and unique customers by sales region. Cool, your wide table has that built in, no joins needed.

But what if they want something new? Let's say they want to know how they're doing against the target addressable market in each country. Should they have to submit a ticket to the data engineering team to add customer.country.market_size to your revenue table? Or should they be able to do "select revenue by customer__country__market_size" and get the report they need?

Our position is that we want to facilitate the latter - people getting what they need and knowing, as long as it's been defined properly in the model, that it's going to produce reasonable results. If your particular organization wants all of those joins run through a data mart ticket queue and surfaced as fully denormalized underlying tables that's fine by us, but most likely that's not what you want. You'd rather have some visibility into the types of joins people are requesting and then build out your data mart to more efficiently serve the requests people have on the ground, while still allowing them to ask new questions of the data without a long development feedback loop.

mritchie7124y ago· 1 in thread

Is MetricFlow what Transform (the product) uses under the hood?

nicholashandelOP4y ago

Yes! MetricFlow is core to everything we do so there will be tons of active development from our team.

awinter-py4y ago

love this as an area of innovation

my wishlist item for 'standard metrics definitions' is for libraries + servers to ship with a spec of what they export

so that if I'm using, for example, a plugin for a reverse proxy, or a twilio verification library, it can install its own metrics and alerts in my dashboard system

j / k navigate · click thread line to collapse

26 comments

22 comments · 8 top-level

pstoll4y ago· 4 in thread

Cool. Like open source Looker.

This type of metric definition is so empowering for businesses. Not enough engineers grok why this is useful.

hintymad4y ago

tlento4y ago

1 more reply

fortysixdegrees4y ago

Could you elaborate on why it's useful. I don't get it but I'm intrigued

nicholashandelOP4y ago

- It enables self-serve analytics experiences because it creates objects that business people can interact within pivot-table-like forms. Hundreds of lines of SQL are distilled to simple queries.

- The logic is DRYer / easier to govern than all the repetitive SQL rollups that are required to answer business questions.

- Analysts become more productive write the same stuff less and ask more questions faster.

More specifically, the reason why I get excited about MetricFlow

1 more reply

mjirv4y ago· 2 in thread

This is awesome, though I would love some more detail in the documentation.

What’s the quick pitch for why I should use this instead of Cube or dbt’s metrics layer?

nicholashandelOP4y ago

I think it’s probably best to talk about this comparison in three areas:

Semantics - The MetricFlow spec allows the construction of a much broader range of metrics with much less expression of logic or duplication of that logic than dbt or Cube.

nicholashandelOP4y ago

If you’re interested, the longer version:

Semantics

Performance

Interfaces

MetricFlow supports a Python SDK and our CLI, today. Transform has a few more interfaces (SQL over JDBC, GraphQL, React) that sit outside the scope of this OSS project.

dbt only builds a query in the dbt context today. TBD what the dbt server does but I imagine it will expose a JDBC for paying customers.

Cube seems more focused on building custom data applications but has recently pivoted to the analytics front. I haven’t seen those interfaces in action but I’m curious to learn more there.

seektable4y ago· 2 in thread

Actually I'm asking about that as a BI tool vendor. We have added an ability to use custom connectors (web API) so potentially this kind of connector can use MetricFlow for SQL generation.

mitchelllc4y ago

Looks like it supports GraphQL APIs[1], and downstream BI applications should be able to consume metric results from MetricFlow through GraphQL.

[1] https://github.com/transform-data/metricflow#features

nicholashandelOP4y ago

That's right! That's a potential option for an integration.

XCSme4y ago· 2 in thread

tlento4y ago

Three things:

XCSme4y ago

Thanks for the response!

With data sources, I mean if you have multiple services/products delivering data (e.g. an analytics platform, a CRM, all hosted on different servers).

> uxwizz, and a quick perusal of their documentation did not turn up how one goes about connecting to the underlying DB

> you have one big data table you might still find it useful to have declarative metric definitions defined in Metricflow

1 more reply

moltar4y ago· 2 in thread

How does it compare and plans to compete with dbt metrics?

pstoll4y ago

It is missing the runtime engine to “interpret” the semantic model and so arbitrary but correct joins among tables and generate the appropriate SQl to realize those arbitrary joins.

If anyone know dbt better and also knows semantic models, please correct.

nicholashandelOP4y ago

I think those points are on the mark. I added some notes to the similar question that asked for a comparison of dbt, cube and MetricFlow!

You can see what I mean by running queries that traverse the multiple joins to get to other dimensions:

1. Ask for transactions by day by customer `mf query --metrics transactions --dimensions ds,customer --order ds`

2. Ask for transactions by day by customer country `mf query --metrics transactions --dimensions ds,customer__country --order ds`

3. Ask for transactions by day by customer region where we traverse through a country to region mapping `mf query --metrics transactions --dimensions ds,customer__country__region --order ds`

theboat4y ago· 1 in thread

Thank you for open sourcing this. More competition in the budding metrics ecosystem is good for end users.

tlento4y ago

mritchie7124y ago· 1 in thread

Is MetricFlow what Transform (the product) uses under the hood?

nicholashandelOP4y ago

Yes! MetricFlow is core to everything we do so there will be tons of active development from our team.

awinter-py4y ago

love this as an area of innovation

my wishlist item for 'standard metrics definitions' is for libraries + servers to ship with a spec of what they export

so that if I'm using, for example, a plugin for a reverse proxy, or a twilio verification library, it can install its own metrics and alerts in my dashboard system

j / k navigate · click thread line to collapse