For instance, take the go-to classification model: Logistic Regression. Many people think they can draw insight by looking at the coefficients on the variables. If it’s 2.0 for variable A and 1.0 for variable B, then A must move the needle twice as much.
But not so fast. B, for instance, might be correlated with A. In this case, the coefficients are also correlated and interpretability becomes much more nuanced. And this isn’t the exception, it’s the rule. If you have a lot of features, chances are many of them are correlated.
In addition, your variables likely operate at different scales, so you’ll need to normalize and scale everything, which adds another layer of abstraction between you and interpretation. This becomes even more complicated when you consider encoded categorical variables. Are you trying to interpret each category independently, or assess their importance as a group? It’s not obvious how to make these aggregations. The story only gets more complicated for e.g. Random Forests.
I think it’s best to accept that you can’t interpret these models very well in general. At least in the case of some models (like neural nets), they approximate a Bayesian posterior, which has some nice properties.
(1) In a logistic regression the coefficients are on a log-odds scale, so the ratio between exp(2) and exp(1) is actually about 2.7x, not 2x. The bigger issue is that you have to compare the strength of the association to how easy it is to move that lever, e.g. men might like our marketing message more than women, but it's not like we're going to get people to change gender.
(2) moderately correlated predictors do not bias or otherwise complicate the interpretation of regression parameters, it's only unmeasured confounders that do, that is, correlations between variables where one of the variables is not included in the model.
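The arithmetic in point (1) is easy to check. A minimal sketch, taking the 2.0 and 1.0 coefficients from the example upthread as hypothetical values on the log-odds scale:

```python
import math

# Hypothetical coefficients from the example above (log-odds scale).
coef_a = 2.0
coef_b = 1.0

# Exponentiating a coefficient gives the odds ratio per one-unit increase.
odds_ratio_a = math.exp(coef_a)  # about 7.39
odds_ratio_b = math.exp(coef_b)  # about 2.72

# The ratio of the two odds ratios is exp(2 - 1) = e, not 2.
ratio = odds_ratio_a / odds_ratio_b
print(f"OR(A) = {odds_ratio_a:.2f}, OR(B) = {odds_ratio_b:.2f}, ratio = {ratio:.2f}")
```

Note this only compares effects on the odds. It says nothing about how easy either lever is to move, which is the bigger issue raised in (1).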
Or what if A always caused B, but the impact was slightly less than if B occurred without A? In that case, the sign on A might be negative, but its presence would actually tend to increase the probability of class=1, it's just that the positive impact has already been counted by variable B.
Maybe you try to avoid this situation by adding in an explicit interaction term of A*B, but then how do you interpret the impact of A since you now have more than one coefficient?
If you feel confident making assertive statements about what has been learned by looking at an equation fit on multi-correlated data, then your mathematical intuition is much stronger than mine!
This is a big thing most people miss when trying to interpret logistic regression the way they do linear regression. Logistic regression estimates are _conditional_ on the model spec in a way linear regression estimates are not.
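One way to see the "conditional on the model spec" point is the non-collapsibility of the odds ratio. The sketch below is only an illustration under made-up assumptions: a hypothetical true model with coefficients `b` and `c`, and a binary covariate `z` that is a fair coin flip independent of `x`, so it is not a confounder. Leaving `z` out still changes the odds ratio for `x`, which would not happen to the slope in a linear model.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def odds(p):
    return p / (1.0 - p)

# Hypothetical true model: logit P(y=1) = b * x + c * z,
# with z a fair coin flip that is independent of x (so z is not a confounder).
b, c = 1.0, 3.0

def p_y_given_x(x):
    # Average over z to get what a model that omits z would be fitting.
    return 0.5 * sigmoid(b * x) + 0.5 * sigmoid(b * x + c)

# Odds ratio for x when z is in the model.
conditional_or = math.exp(b)
# Odds ratio for x when z is left out of the model.
marginal_or = odds(p_y_given_x(1)) / odds(p_y_given_x(0))

print(f"conditional OR = {conditional_or:.3f}, marginal OR = {marginal_or:.3f}")
```

The marginal odds ratio comes out closer to 1 than the conditional one, even though nothing about the relationship between `x` and the outcome has changed.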
Exactly. Rendering a model explainable is an active field of research at the moment. One way to tackle this is to use Shapley values from game theory.
Here is a compilation of techniques to render a model explainable: https://christophm.github.io/interpretable-ml-book/shapley.h..., and they have an elegant way to talk about these values:
> The Shapley value is the average marginal contribution of a feature value over all possible coalitions.
Now, it doesn't identify correlated variables (which you can do with other techniques), but will balance the influence in a robust manner.
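To make the quoted definition concrete, here is a minimal sketch of an exact Shapley computation on a toy game. The value function `toy_value` and the feature names are hypothetical; real tooling approximates this, since enumerating all orderings blows up quickly.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

# Hypothetical toy "model": A and B are redundant (either one alone is worth 10),
# while C adds 5 on its own.
def toy_value(coalition):
    v = 0.0
    if "A" in coalition or "B" in coalition:
        v += 10.0
    if "C" in coalition:
        v += 5.0
    return v

phi = shapley_values(["A", "B", "C"], toy_value)
print(phi)
```

In this toy case the two redundant features split the credit evenly, which is the "balancing" behaviour mentioned above. The output alone does not tell you that A and B are redundant.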
----
> The story only gets more complicated for e.g. Random Forests.
Funny you should say this. Recently there was a challenge for explainable machine learning: http://explainable.ml
My proposal to the challenge was based on Random Forests. Here is the code: https://github.com/benoitparis/explainable-challenge, here is the paper: https://github.com/benoitparis/explainable-challenge/raw/mas...
I have a neat visualization for exploratory data analysis (available in the paper and in a notebook) that I'm proud of.
----
By the way, if anyone is hiring here, I'm available to talk about your problem and see if my take on explainable machine learning can help you.
On the major point, while I agree with you, it's much nicer to be able to show the "top" variables from a model, which is doable from logreg and forests, but is much, much, much more difficult from a neural net perspective.
Additionally, as they tend to take longer to train, it's harder to iterate with them, and as they fit so very many parameters, I'm generally pretty sceptical as to their generalisability. That being said, in some tests I've run I've been pleasantly surprised at their performance.
Re: comment on showing "top" variables from a model, I agree this could have utility. But I would add that the devil's in the details, and there are multiple ways to calculate importance values, each of which has its own nuances and pros/cons.
For instance, how do you compare the importance of a categorical feature to a float feature? Do you one hot encode and then add their individual importances, take the average, or something else? Although sampling from the columns is meant to help deal with feature correlation, under what conditions is this effective and how do you know if your feature importances are safe? Moreover, how does this column sampling work in the context of one-hot-encoded categorical features?
This is all a way of saying that while you can devise methods for coming up with metrics, and then assign them handy titles like "Feature Importance", the reality is that these things are pretty nuanced and limited, and upper level management might be fooling themselves by thinking they're "interpreting a model" if they don't recognize the limitations and nuances involved. Or to put it another way, to say "I better understand this model because you gave me a feature importance list and a partial dependence plot," is a dangerous over-simplification.
Not sure what GP had in mind, but if a feature x appears in a dataset n times, pn times with a positive label and (1-p)n times with a negative one, and your classifier is f(x), trained with the "cross-entropy" cost, then the ideal value that minimizes the cost should be f(x) = p. In this sense, f(x) is the probability of a positive label given the feature.
Whether neural nets really realize this and how reliable that is, is another question. But that's the intention of the cross entropy cost.
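The claim that cross-entropy is minimized at f(x) = p can be checked numerically. A minimal sketch, with `p = 0.3` as an arbitrary hypothetical value and a plain grid search standing in for training:

```python
import math

def cross_entropy(f, p):
    """Average cross-entropy when a fraction p of labels is positive and the model outputs f."""
    return -(p * math.log(f) + (1.0 - p) * math.log(1.0 - f))

p = 0.3  # hypothetical fraction of positive labels for this feature value

# Search a grid of candidate outputs strictly inside (0, 1).
candidates = [i / 1000 for i in range(1, 1000)]
best_f = min(candidates, key=lambda f: cross_entropy(f, p))
print(best_f)
```

This only shows where the ideal minimizer sits. Whether a trained network actually lands there is the separate question raised above.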
The mental gymnastics are that:
- The objective function of the neural net was a likelihood.
- The prior was improper.

In which case the net is a MAP estimate.
A MAP estimate will not give you good uncertainty quantification. Given the application to risk modelling, this seems unlikely to be a trivial departure from a fully Bayesian method.
Modern software often adds very useful layers of abstraction onto existing processes and patterns. This is especially the case in the realm of machine learning software. Libraries like scikit-learn, Keras, and many others are outstanding pieces of work, and make it very easy to rapidly build and deploy ML models. However, this ease-of-use can actually be a detriment, especially to ML newcomers.
In particular, it is so easy with these types of ML libraries to do something like `from sklearn.linear_model import LinearRegression; model = LinearRegression(); model.fit(Xtrain, ytrain)`. This is great if you're trying to scalably test many different algorithms and configurations to see what predicts best. This is not so great if you're looking to test and validate some of the statistical assumptions of your model, especially with linear/logistic regression models. As an example, Python's StatsModels library will automatically warn you if certain assumptions of a linear/logistic regression model are violated/close to being violated, which could lead to inappropriate conclusions/inference from the model. scikit-learn does not do this. If you have massive multicollinearity in your model (a phenomenon which can affect the reliability of individual-coefficient t statistics and the signs, positive or negative, associated with the coefficients), scikit-learn won't tell you that, and it will be on you to recognize the potential for multicollinearity occurring and remedy the issue.
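Since the check is on you, here is a rough sketch of one common diagnostic, the variance inflation factor (VIF), written in plain Python for the special case of exactly two predictors. The data is simulated and the helper names are made up for illustration; values above roughly 5 to 10 are a frequently quoted rule of thumb for trouble, not a hard threshold.

```python
import math
import random

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def vif_two_predictors(xs, ys):
    # With exactly two predictors, the R^2 from regressing one on the other is r^2.
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r * r)

rng = random.Random(0)  # fixed seed so the simulated data is reproducible
x1 = [rng.gauss(0, 1) for _ in range(2000)]
x2_collinear = [x + 0.1 * rng.gauss(0, 1) for x in x1]  # nearly a copy of x1
x2_independent = [rng.gauss(0, 1) for _ in range(2000)]

print(vif_two_predictors(x1, x2_collinear))    # large
print(vif_two_predictors(x1, x2_independent))  # close to 1
```

With more than two predictors you would regress each predictor on all the others instead; StatsModels ships a `variance_inflation_factor` helper for that.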
Not to pick on scikit-learn, but their linear_model regression classes also don't provide p-values and standard errors associated with each predictor, common things that basic statistical modeling packages usually provide. But note that scikit-learn's goal is to provide an easy interface with which to do machine learning - not traditional statistical modeling. The ML community is known for placing emphasis on raw predictive performance of models and forgetting about validating the statistical assumptions associated with those models.
Classic regressions allow you to explain the marginal effects very well. It doesn't matter much how variables are correlated. If you saturate the model with interaction effects, you can get an accurate (in the sense of the model) prediction of the marginal effect of any variable as a function of others. This is very interpretable.
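As a small illustration of "marginal effect as a function of others": in a linear model with an A*B interaction, the effect of A is the A coefficient plus the interaction coefficient times B. The coefficient values below are hypothetical, not fitted to anything.

```python
def marginal_effect_of_a(b_a, b_ab, b_value):
    """d(prediction)/dA for prediction = b0 + b_a*A + b_b*B + b_ab*A*B."""
    return b_a + b_ab * b_value

# Hypothetical fitted coefficients.
b_a, b_ab = 2.0, -0.5

for b_value in (0.0, 2.0, 6.0):
    print(b_value, marginal_effect_of_a(b_a, b_ab, b_value))
```

So there is no single "effect of A" once the interaction is in the model, only an effect at a given value of B. In a logistic regression the same expression applies on the log-odds scale.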
Furthermore, nowadays a lot of techniques are about estimating the causal effects based on assumptions in your data. You could use things like DID or synthetic control, use natural experiments, and so forth, to get a good idea of the causal "treatment effect" of your variable of interest, and you can even do this in a semi-parametric or non-parametric approach.
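For readers who haven't met DID (difference-in-differences), the basic two-group, two-period estimate is just arithmetic on four group means. A minimal sketch with made-up numbers; it leans entirely on the assumption that both groups would have followed parallel trends without the treatment.

```python
def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    return (treated_post - treated_pre) - (control_post - control_pre)

# Hypothetical group means of the outcome, before and after the treatment.
effect = diff_in_diff(treated_pre=10.0, treated_post=15.0,
                      control_pre=9.0, control_post=11.0)
print(effect)  # 3.0
```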
Often, estimating the linear approximation of the marginal on a conditional expectation is "good enough" to learn how things are connected within your data.
And in the end, getting this sort of causal effect of a variable is what we are really after in environments where the data-generating process (DGP) isn't simple. In that sense, this type of research is very compelling.
Scaling and other issues are of course important, but taking them into account is rather simple...
To be sure, a lot of work (for example in econometrics and elsewhere) is proceeding on causal inference of deep learning models, but it is probably also fair to say that right now, classical models are far easier to interpret, especially if you are interested in answering qualitative questions.
It's also unclear to me what you mean when you say that NNs uniquely approximate a Bayesian posterior, or why that's a good thing without knowing more about what posterior you're talking about. You could do a Bayesian logistic regression and get an actual posterior, and it would not remove the interpretation challenges you raise.
But then, the classical question: How do you debug the models? How do you know they are actually predicting what you think they are predicting?
Aren't we supposed to remove correlated predictors from models?
Or we can use techniques like PCA? But then, once again, we fall into the realm of unexplainable.
The problem is it's still difficult to determine direction of causality. That's why controlled experiments are so important.
And typically, you would demean, and divide by the standard deviation.
I guess the coefficients are harder to interpret in this second case than if you did not transform them, but they're still interpretable.
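A minimal sketch of that transformation, using only the standard library and a made-up feature. After it, a coefficient reads as "effect per one standard deviation of the feature" rather than per raw unit.

```python
import statistics

def standardize(values):
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(v - mean) / sd for v in values]

incomes = [30_000.0, 45_000.0, 60_000.0, 120_000.0]  # hypothetical raw feature
z = standardize(incomes)
print(z)
```

To get back to raw units, divide the standardized coefficient by the feature's standard deviation (the mean shift only moves the intercept).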
Why would we put a NN in charge of anything important if we can't explain how a particular model works?
Would you want your car or an aircraft you're on piloted by a neural net whose actions can't be explained?
What if it encounters an unforeseen event that causes a flash crash or, worse, an actual crash that kills people?
Do you want to trust something built from incomplete data and simulated annealing with your life and livelihood?
Computer models are much younger, and we know they tend to have weird pathological corners, but unlike humans and their weird pathological corners, we have a much less firm grasp on what they are.
In many cases, humans have skin in the game, too. No computer model is yet sophisticated enough to be able to say that about.
There is also some irrationality in having someone to blame, etc. Certainly. But it's not the only part of the story.
More importantly, humans share the same brain architecture and work roughly the same way - you and me included. This makes it easier for us to understand other people (also having a part of the brain trying to directly model other people's minds helps). The way those NN-based black boxes work is completely alien to us (and each being a special snowflake made of hacks does not help).
Almost all cars and aircraft today are operated by neural nets, the actions of which can't be explained.
With humans, we generally know the bounds for unexpected behavior. We understand tiredness, confusion, fear, distraction, suicidal thoughts and other factors. We also know how to screen people to minimize those bounds.
With ML stacks, we have no good grasp on bounds. They usually work, for some definition of working, up until they don't - and when they fail, it's in some absurd (therefore hard to predict) way.
There's a big list here: https://t.co/OqoYN8MvMN
But it's stuff like:
- Evolved algorithm for landing aircraft exploited overflow errors in the physics simulator by creating large forces that were estimated to be zero, resulting in a perfect score
- A cooperative GAN architecture for converting images from one genre to another (eg horses<->zebras) has a loss function that rewards accurate reconstruction of images from its transformed version; CycleGAN turns out to partially solve the task by, in addition to the cross-domain analogies it learns, steganographically hiding autoencoder-style data about the original image invisibly inside the transformed image to assist the reconstruction of details.
- Simulated pancake making robot learned to throw the pancake as high in the air as possible in order to maximize time away from the ground
- Robot hand pretending to grasp an object by moving between the camera and the object
- Self-driving car rewarded for speed learns to spin in circles
All of which leads me to think that if you can't at some level explain how/what/why it's reaching a certain conclusion, it may be reaching a radically different end than you're anticipating.
In the process they broke it so badly I now prefer DDG. (Who also gives me results I didn't ask for.)
One way to not lose is to just pause the game. Minor spoiler: learnfun figures that one out :-)
If you can't explain the model, it means you don't know the assumptions that went into the model's output, which means you won't see it coming when the model doesn't work anymore. And you don't want to look like a moron saying "oh but the model said..." (or get sued for mismanaging investors' money).
Honestly, it's probably the investors asking questions that led them to this decision, but nonetheless, this is reason talking.
This is true, but there are many, many kinds of models that have basically zero explanatory power but have higher predictive capabilities than models that are easier to explain. They have been around a long time and are used for many different practical applications.
Unfortunately, the draw of that seemingly infallible super-high-predictive capability will almost certainly be heavily involved in financial markets before long. I have no problem if some people want to risk a bunch of money in a hedge fund that uses neural net models or whatever else, but having enough money controlled by these models could pose a serious systemic risk.
This is the part that worries me. For a decade before the 2008 financial collapse, people were quietly saying, "Gosh, there's a lot of activity in derivatives and we don't really know where the risk is going."
One of many factors there was the way rating agencies gave very generous ratings to mortgage securities. Critics note that it was in their short-term financial interest to do that. If people can screw up that badly with models they supposedly understand, it seems to me to be even more risky when working with models where people have just given up understanding and put their faith in the AI oracle. As long as they get the answers that maximize their end-of-year bonus checks, they have a strong incentive not to dig deeper.
Frankly, I think the problem here is "too new" or "too many variables". I doubt any of these managers understand Black-Scholes, yet they would sign off on using it because it's established, even though applying it also lost lots of people lots of money.
"It outperformed" seems to me to be reason enough; in fact, not using something that outperforms could be construed as "mismanagement" just as well.
Most option traders definitely understand Black-Scholes, but that’s not really the point, because there are more complex models that they would use to trade without knowing the details of.
The point is that there are quants who you need to trust with the models. And they’re most likely the ones who said: “this seems to work, but we don’t really know why, so we probably shouldn’t use it”. The fact that the top dogs agree with that is a sign of maturity.
Machine learning models are really good at things like prediction, but if it's valuable to do inference about the phenomenon (e.g., is there evidence that X is positively associated with the odds of Y, given Z,Q,R), careful study design and appropriate statistical models are a better choice. These come with theoretical underpinnings - whether that's the coverage guarantees of frequentist methods or the decision-theoretic foundations of Bayesian inference.
I'm not sure whether or not that means this choice was good on the part of BlackRock, however.
Indeed, because if the manager doesn't understand the model well enough to either mitigate its weaknesses or reserve sufficiently against them, they'll probably get fired at some point down the line.
Your deliverable should be an interpretable model. You can (and probably should) make neural network models interpretable. If upper management does not trust your performance evaluation enough to bet on it, either the evaluation was weak (and no model should be deployed, however simple and interpretable) or upper management doesn't know enough about modern ML to be making these decisions.
I have sympathy for the manager in charge of making a decision on a complex model (while all they ever knew was simple survival models and basic statistical models). But you've got to move with the times. Your competitors will use the most powerful models available (and some may go under due to improper risk management). Your employees don't want to build logistic regression models until eternity.
I have no evidence of the scale and diversification of both these, so evidence would be helpful in refuting the above!
premise 1: financial crisis hits, requiring some firms to accept immediate loans (or off books loans aka qe) to maintain solvency (classic 2008 scenario)
premise 2: firms will not have equivalent exposure, so some firms fail worse than others, but as the risk is viewed as "systemic" all get the bailout
If some firms have AI that find risks hidden in investments that traditional (explainable) models ignore, then those firms will sit out of markets that will in the meantime be profitable for the firms that are unaware of the actual risk. Metaphorically, why ruin the 70s with an accurate HIV test.
If the same models could be used to identify and securitize (and make a market in) the invisible risk, it's possible that the market price of the risk would similarly lead many firms to sit out of otherwise profitable markets, as the yields of many of the traditional investments would (after the cost of hedging) be poor.
All this would result in a shrinking of the pie without an analytical explanation. "What do you mean the pie is smaller than we thought it was and we have to grow at a slower rate than we thought?", the CEO might ask.
In most scenarios where quantitative approaches give better insight into the future, the firm to develop the approach makes a fortune until others can catch up.
But what we have today is a financial system where keeping the overall system running hot is government policy, and so all participants have the incentive to ignore information that would lead to rational reallocation of investments.
Once the system's normal is leveraged/hot enough, the system becomes resistant to certain kinds of true information.
Philosophers' stones in the early twenty-first century: Correlation, partial correlation, cross lagged correlation, principal components, factor analysis, OLS, GLS, PLS, IISLS, IIISLS, IVLS, LIML, SEM, HLM, HMM, GMM, ANOVA, MANOVA, Meta-analysis, logits, probits, ridits, tobits, RESET, DFITS, AIC, BIC, MAXNET, MDL, VAR, AR, ARIMA, ARFIMA, ARCH, GARCH, LISREL[...]...
The modeler's response: We know all this. Nothing is perfect. Linearity has to be a good first approximation. Log linearity has to be a good second approximation. The assumptions are reasonable. The assumptions don't matter. The assumptions are conservative. You can't prove the assumptions are wrong. The biases will cancel. We can model the biases. We're only doing what everybody else does. Now we use more sophisticated techniques. If we don't do it, someone else will. What would you do? The decision-maker has to be better off with us than without us. We all have mental models. Not using a model is still a model. The models aren't totally useless. You have to do the best you can with the data. You have to make assumptions in order to make progress. You have to give the models the benefit of the doubt. Where's the harm?
AI models totally fail to do what classical (and parsimonious, explainable, cheap...) methods/algos/models achieve quite easily (BS, Hawkes, RFSV, uncertainty zones, Almgren-Chriss/Cartea-Jaimungal... etc.). Actually, I'm tempted to say that AIs don't work at all.
What I've seen so far is funds leveraging "big data" with AIs (e.g. real-time processing of satellite imagery, cameras, (more) news...) to get more/better information (than the others), and then calibrating and using these (parsimonious) models with it. Nothing else (of interest).
Do not get fooled. Lots of banks announced that they use AIs to ride the hype, because today if you don't do AIs, you're not in, and because today everyone is a Data Scientist. That's all.