We’re really excited to release this feature after months of R&D. Many of our customers want to understand the causal impact of their products, but are unable to iterate quickly enough by running A/B tests. Rather than taking the easy path and serving correlation-based insights, we took the harder approach of automating causal inference through what's known as an observational study, which can simulate A/B experiments on historical data and eliminate spurious effects. This involved a mix of linear regression, PCA, and large-scale custom Spark infra. Happy to share more about what we did behind the scenes!
This is 100% overselling. Observational studies can be suggestive, but cannot replace experiments. Unobserved variables cannot be accounted for.
- Do you use a causal graph? Would it make sense?
- Spark seems overkill for what you yourself describe as regression: is there something more intensive here that we could be missing?
So what we've done instead is create a SparkML task that reads in a feature matrix, then trains and scores the causal analytics model. The causal lift estimates for each user are then written out to BigQuery so that in our frontend a customer can filter for, say, users aged 18-35, and within seconds we'll return the causal lift of viewing page X for this segment.
Congratulations! Just remember to patent it :)
And yes, the timing of the blog post isn't a coincidence, we actually filed a patent last week :)
I’m asking because we are building our own implementation of mSPRT. There are some variants, but I didn’t expect there to be enough to patent. We are having internal debates about this, and I’d rather have actual examples than the ageing memory of my law class.
From the article, this looks like an ordinary regression to me. It would be interesting to know what makes it causal (or at least better) compared to OLS. PCA has been used for a long time to select the features to use in regression. Would it be accurate to say that the innovation is in how the regression is calculated rather than in the statistical methodology?
Either way, it would be interesting to test this approach against an A/B test and check how much the observational estimates differ from the A/B estimates, and how sensitive this approach is to including (or not including) a given set of features. It would also be interesting to compare it to other quasi-experimental methodologies, such as propensity score matching.
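For anyone comparing the two, here is a minimal sketch of principal component regression with a treatment flag, which is one way "linear regression plus PCA" could be combined. It is not the article's implementation: the function name `pcr_treatment_effect`, the synthetic data, and the choice of three components are all assumptions made for illustration. Note that the estimate is still an ordinary OLS coefficient, so it is only as causal as the observed covariates allow.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def pcr_treatment_effect(X, t, y, n_components=3):
    """OLS of y on the treatment flag plus the top principal components of X."""
    # Compress the covariates; the treatment column is kept out of the PCA
    # so that its coefficient stays directly interpretable.
    pcs = PCA(n_components=n_components).fit_transform(X)
    design = np.column_stack([t, pcs])
    return LinearRegression().fit(design, y).coef_[0]

# Hypothetical synthetic data: one latent trait z drives 20 observed features,
# the chance of being treated, and the outcome. The true effect is 2.0.
rng = np.random.default_rng(0)
n, p = 5_000, 20
z = rng.normal(size=n)
X = np.outer(z, rng.normal(size=p)) + 0.1 * rng.normal(size=(n, p))
t = rng.binomial(1, 1 / (1 + np.exp(-z)))
y = 2.0 * t + 3.0 * z + rng.normal(size=n)

naive = y[t == 1].mean() - y[t == 0].mean()
pcr = pcr_treatment_effect(X, t, y)
print(f"naive difference in means: {naive:.2f}, PCR estimate: {pcr:.2f}")
```

In this toy setup the confounder is fully reflected in the observed features, which is exactly the condition that cannot be verified on real data.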
Is there a more extended document explaining the approach?
Good luck!
Yes, we actually explored other approaches such as PSM. The main reason we did not initially go with PSM was the compute power required: you would need to train a model for each treatment variable. However, we're in the midst of developing a way to train a model for each treatment variable efficiently, which will allow us to add techniques such as inverse propensity weighting (or explore other approaches such as PSM).
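To make the "one model per treatment variable" cost concrete, here is a minimal sketch of inverse propensity weighting for a single binary treatment. It assumes a logistic regression propensity model and normalized weights; the function name `ipw_ate`, the clipping threshold, and the synthetic data are illustrative assumptions, not the product's method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(X, t, y, clip=0.01):
    """Average treatment effect via inverse propensity weighting."""
    # Propensity model: probability of treatment given observed covariates.
    # A separate model like this is needed for every treatment variable.
    model = LogisticRegression(max_iter=1000).fit(X, t)
    p = np.clip(model.predict_proba(X)[:, 1], clip, 1 - clip)
    # Normalized weights keep the estimate stable when propensities are extreme.
    w1 = t / p
    w0 = (1 - t) / (1 - p)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

# Hypothetical synthetic data: one confounder drives both treatment and outcome.
# The true effect is 2.0.
rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=(n, 1))
t = rng.binomial(1, 1 / (1 + np.exp(-x[:, 0])))
y = 2.0 * t + x[:, 0] + rng.normal(size=n)

naive = y[t == 1].mean() - y[t == 0].mean()
adjusted = ipw_ate(x, t, y)
print(f"naive: {naive:.2f}, IPW: {adjusted:.2f}")
```

With many candidate treatment variables, the `fit` call above is repeated once per variable, which is where the compute cost comes from.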
Did you all consider using Double Selection [1] or Double Machine Learning [2]?
The reason I ask is that your approach is very reminiscent of a Lasso-style regression where you first run Lasso for feature selection, then re-run a normal OLS with only those controls included (Post-Lasso). This is somewhat problematic because Lasso has a tendency to drop too many controls if they are too correlated with one another, introducing omitted variable bias. Compounding the issue, some of those variables may be correlated with the treatment variable, which increases the chance they will be dropped.
The proposed solution is to run two separate Lasso regressions, one with the original dependent variable and another with the treatment variable as the dependent variable, recover two sets of potential controls, and then use the union of those sets as the final set of controls. This is explained in simple language at [3].
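The three steps above can be sketched as follows. This is a rough illustration under stated assumptions: it uses scikit-learn's `LassoCV` with a cross-validated penalty for convenience, which is not necessarily the penalty choice the cited papers prescribe, and the function name `double_selection`, the continuous treatment, and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

def double_selection(X, d, y):
    """Return the treatment coefficient and the indices of the selected controls."""
    # Step 1: controls that predict the outcome.
    sel_y = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)
    # Step 2: controls that predict the treatment.
    sel_d = np.flatnonzero(LassoCV(cv=5).fit(X, d).coef_)
    # Step 3: plain OLS of the outcome on the treatment plus the union of both sets.
    keep = np.union1d(sel_y, sel_d)
    design = np.column_stack([d, X[:, keep]])
    ols = LinearRegression().fit(design, y)
    return ols.coef_[0], keep

# Hypothetical synthetic data: column 0 confounds treatment and outcome.
# The true effect of d on y is 1.0.
rng = np.random.default_rng(0)
n, p = 2_000, 50
X = rng.normal(size=(n, p))
d = X[:, 0] + rng.normal(size=n)
y = 1.0 * d + 2.0 * X[:, 0] + X[:, 1] + rng.normal(size=n)

effect, keep = double_selection(X, d, y)
print(f"estimated effect: {effect:.2f}, controls kept: {len(keep)}")
```

Taking the union means a control is kept if it matters for either the outcome or the treatment, so a confounder is less likely to be dropped just because it is highly correlated with the treatment.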
Now, you all are using PCA, not Lasso, so I don't know if these concerns apply or not. My sense is that you still may be omitting variables if the right variables are not included at the start, which is not a problem that any particular methodology can completely avoid. Would love to hear your thoughts.
Also, you don't show any examples or performance testing of your method. For example, you could take a situation where you "know" the "true" causal effect (via an A/B test, perhaps) and demonstrate that your method recovers a similar point estimate. As presented, how do we (or you) know that this is generating reasonable results?
[1] http://home.uchicago.edu/ourminsky/Variable_Selection.pdf
[2] https://arxiv.org/abs/1608.00060
[3] https://medium.com/teconomics-blog/using-ml-to-resolve-exper...
Regardless, we're never going to completely remove omitted variable bias, as we're never going to capture 100% of relevant variables. One way we monitor our model's bias is by comparing the error distributions of users in the treatment and control groups. If these aren't similar, there's too much bias in our estimate of the treatment effect, so we wouldn't want to serve an estimate of the treatment effect for this variable to our customers.
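As a rough illustration of that kind of check, here is a sketch that compares model errors between the two groups with a two-sample Kolmogorov-Smirnov test. The comment above does not say how the distributions are actually compared, so the KS test, the function name `residual_gap`, and the synthetic predictions are all assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

def residual_gap(y, y_pred, treated):
    """Kolmogorov-Smirnov comparison of model errors in treatment vs control."""
    resid = y - y_pred
    result = ks_2samp(resid[treated], resid[~treated])
    return result.statistic, result.pvalue

# Hypothetical synthetic predictions and a boolean treatment flag.
rng = np.random.default_rng(0)
n = 10_000
treated = rng.random(n) < 0.5
y_pred = rng.normal(size=n)

# Case 1: errors look the same in both groups.
y_ok = y_pred + rng.normal(size=n)
# Case 2: the model systematically under-predicts for treated users.
y_bad = y_pred + rng.normal(size=n) + 0.5 * treated

stat_ok, p_ok = residual_gap(y_ok, y_pred, treated)
stat_bad, p_bad = residual_gap(y_bad, y_pred, treated)
print(f"similar errors: KS={stat_ok:.3f}; shifted errors: KS={stat_bad:.3f}")
```

One caveat: if the model is an OLS fit with an intercept and a treatment indicator, its in-sample residuals average to zero in both groups by construction, so a check like this is more informative on held-out data or on the shape of the distributions.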
The current product is in beta and we're working with some of our current customers to try to re-create our results with A/B tests. I'm hoping that by our GA release in the fall we'll have some case studies with specific examples!
[1] http://www.win-vector.com/blog/2016/05/pcr_part2_yaware/
Granger causality for estimating Granger cause