What is Interpretable AI?
My understanding of Interpretable AI
Introduction to Interpretable AI
Source: Interpretable Machine Learning by Christoph Molnar
Interpretable AI refers to AI systems whose predictions humans can understand, that is, we can follow the reasoning behind why a particular model made a prediction. There is no real consensus on what interpretability means in AI or how to measure it; however, various approaches have been proposed to classify and categorize interpretable AI/ML methods.
Interpretable models can be categorized in many ways. I have summarized some of these categorizations below. They are largely adapted from a great online (and free) book on interpretable AI by Christoph Molnar: “Interpretable Machine Learning: A Guide for Making Black Box Models Explainable”.
Intrinsic vs. Post-hoc interpretability
There are two types of models under this category:
Models that are intrinsically interpretable because of their simple structure, such as linear regression and short decision trees.
Models that are interpreted post hoc, i.e., after they have been trained. For instance, permutation feature importance measures the drop in a model’s performance score when the values of a single feature are randomly shuffled. Shuffling a feature breaks its relationship with the output, so the size of the drop indicates how important that feature is to the model’s performance (see the sketch below).
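Here is a minimal sketch of permutation feature importance using scikit-learn’s permutation_importance helper; the synthetic dataset and random-forest model are illustrative assumptions, not something from the original text.

```python
# Minimal sketch: permutation feature importance (post-hoc interpretation).
# The dataset and model below are illustrative assumptions.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, n_informative=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each feature several times and record the average drop in the score.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
for i, importance in enumerate(result.importances_mean):
    print(f"feature {i}: mean score drop = {importance:.3f}")
```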
Model Specific vs. Model Agnostic
Model-specific interpretation tools are tied to a particular model or class of models. For example, the interpretation of weights in linear or logistic regression is specific to those models; it does not make much sense in the context of decision trees or neural networks.
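As a concrete illustration of model-specific interpretation, this small sketch reads the learned weights of a linear regression model; the synthetic data and feature names are assumptions made purely for the example.

```python
# Minimal sketch: interpreting the weights of a linear model (model-specific).
# The data and feature names are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Assumed true relationship: y depends mostly on the first two features.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

model = LinearRegression().fit(X, y)

# Each weight reads as: "holding the other features fixed, a one-unit
# increase in this feature changes the prediction by this amount."
for name, weight in zip(["x0", "x1", "x2"], model.coef_):
    print(f"{name}: {weight:+.2f}")
print(f"intercept: {model.intercept_:+.2f}")
```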
Model-agnostic interpretation tools work with any machine learning model and are typically applied after the model is trained. They usually do not have access to the model’s internal structure and workings, such as its weights. Examples include Partial Dependence Plots (PDP), LIME, and Shapley values.
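As one example of a model-agnostic tool, the sketch below draws a partial dependence plot with scikit-learn; any fitted estimator could be plugged in, and the gradient-boosting model and synthetic data here are assumptions for illustration.

```python
# Minimal sketch: a partial dependence plot, a model-agnostic method.
# The model and data are illustrative assumptions; any fitted estimator works.
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_regression(n_samples=500, n_features=4, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Average the model's predictions over the data while sweeping features 0 and 1.
PartialDependenceDisplay.from_estimator(model, X, features=[0, 1])
plt.show()
```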
Global vs. Local Interpretability
Global interpretability refers to explanations that apply to the model as a whole rather than to a single prediction. It is “big-picture” interpretability: making sense of how the model makes decisions overall, based on its parameters and structure. Global interpretation sheds light on how the outcomes are distributed as a function of the features, and it is hard to achieve in practice, especially for large, complex models with many parameters.
Local interpretability tools explain a single prediction or a small group of predictions. Explaining a single prediction in isolation can be easier than analyzing the whole model, because locally the prediction may depend only linearly or monotonically on a few features rather than on the full relationship between all features and outcomes. For this reason, local interpretations are often more accurate than global ones. Predictions for a group of instances can also be explained locally by treating the group as its own dataset: global methods applied to this subset describe the model at a group level, and local methods can still be applied to the individual predictions within it.
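To make the local view concrete, here is a small sketch that explains a single prediction with Shapley values via the third-party shap package; the tree model, data, and shap usage are assumptions for illustration, and other local tools such as LIME would work similarly.

```python
# Minimal sketch: explaining one prediction locally with Shapley values.
# Requires the third-party `shap` package; model and data are illustrative assumptions.
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=4, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree-based models.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:1])  # explain only the first prediction

# Each value is the contribution of one feature to this single prediction,
# relative to the explainer's expected (baseline) prediction.
print("baseline prediction:", explainer.expected_value)
print("per-feature contributions:", shap_values[0])
```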
How do we test how good interpretable models are?
Although there is no clear consensus on how to measure interpretability, three main types of tasks have been proposed as a starting point for evaluation. These categories of evaluation were proposed by Doshi-Velez and Kim (2017).
Testing interpretable models on real tasks (real humans, real tasks): This involves adding the interpretability tools into a real application and then testing how good the tool is when used by the end-user (specifically a domain expert). A key baseline for this type of task is how good a human would be at explaining the same decision.
Testing interpretable models on simplified tasks (real humans, simplified tasks): This involves testing the interpretation on lay people instead of domain experts to gauge how well an ordinary human can understand the interpretations of a model. This approach is suitable for testing the general quality of explanations using simplified experiments that preserve the essence of the target application. Moreover, this type of testing is usually quicker and cheaper, as you don’t need to find domain experts and can test with almost anyone.
Testing using proxy tasks (no humans, proxy tasks): These experiments use some previously established formal definition of interpretability as a proxy for the quality of explanation. They are suitable for classes of models that have already been validated, typically through human evaluation, and they may be a good option when a method is still too new or when human-subject experiments are inaccessible (due to time and cost constraints) or unethical.
What is a good explanation?
An explanation usually answers a “why” question. Christoph Molnar provides an in-depth take on what good explanations are in the context of AI and ML. The key ideas from his book are:
Good explanations are contrastive and allow humans/users to compare differences between whatever instance/data point they choose as a reference and another instance.
Good explanations tend to be selective: they provide a few key reasons for a prediction from a variety of possibilities, telling a compelling story that picks the key signals out of the noise.
Good explanations are embedded in the social context of their application. An explanation is usually strongly tied to the domain where it is used and should cater to that domain.
Good explanations account for abnormal/unexpected relationships between the input features and the outcomes. For example, if a rare category of a categorical feature is influencing the output, it should be included in the explanation.
Good explanations should be truthful (high fidelity). This is in tension with selectivity (point #2), since selectivity often omits part of the truth. However, selectivity seems to matter more than presenting the whole truth: there may be a million truthful reasons influencing a specific outcome, but humans rarely want an exhaustive list of what caused an event. They usually want a few key reasons from a variety of possibilities.
Good explanations are consistent with the user’s prior beliefs, since people tend to ignore information that does not conform to what they already believe (confirmation bias). However, this may be hard to enforce in ML models. Sometimes a model provides explanations that contradict a user’s beliefs; in that case, it is important to inspect why the model does so (perhaps bias in how the data was collected) and to tell a compelling (and truthful) story that addresses the user’s contradictory viewpoint.
Good explanations are general and probable. A good explanation can typically explain many events, unless the event is influenced by abnormal causes (rare scenarios, point #4).
References
Molnar, C. (2022). Interpretable Machine Learning: A Guide for Making Black Box Models Explainable (2nd ed.). christophm.github.io/interpretable-ml-book/