Interpretability methods
This tutorial focuses on interpretability methods for understanding the behavior of machine learning models. There has been a tremendous amount of research on interpretable machine learning (IML), and researchers have proposed several methods to explain how ML models work. Different taxonomies of IML methods can be found in the literature, but they are not entirely consistent with one another. For simplicity, this tutorial therefore summarizes IML methods in two broad categories. The first involves designing ML models that are intrinsically interpretable; these are simple models, such as decision trees, that are easy to interpret in themselves. The second involves attempting to understand what a pre-trained model has learned from the underlying data in order to produce a particular outcome or decision. This is called post-hoc analysis: it takes a pre-trained model that is often black-box in nature, for example a deep neural network, and explains its behavior after the fact.
Towards designing interpretable models
In this approach, researchers aim to solve a given problem using ML models that do not require any post-hoc analysis once the model is trained; instead, the focus is on building models that are easy to interpret in themselves. Although these methods offer a good degree of explainability, which is encoded into the model itself, they often suffer in terms of performance because the underlying simplicity of the model architecture can fail to capture a complex data distribution. This, of course, depends on and varies across problem domains. Nonetheless, such models are easy to understand, which is key in many safety-critical application domains, for example finance and medicine. During training, these models are conditioned to satisfy certain criteria in order to maintain interpretability. These conditions (for example, sparsity constraints) may take different forms depending on the nature of the problem. Such models are often referred to as white boxes, intrinsically explainable models, or transparent boxes. To understand how they work, one can inspect different model components directly, for example by following the nodes visited from the root to a leaf node in a decision tree. Such analysis provides enough insight into why and how the model made a certain decision.
Approach 1: Rule-based models
The first category of methods applies a predefined set of rules, which are often mutually exclusive or dependent, while training the model. One well-known example of this model class is the decision tree, which comprises a set of if-else rules. Because of the simplicity of if-else rules, it becomes very easy to see how the model arrives at a particular prediction, as the sketch below illustrates. Researchers have also proposed an extension of decision trees called decision lists, which consist of an ordered set of if-then-else statements; such a model makes a decision as soon as a particular rule holds true.
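As a minimal sketch of this idea (using scikit-learn and the Iris dataset purely for illustration; neither is prescribed here), a small decision tree can be trained and its learned if-else rules printed directly, so that every root-to-leaf path reads as an explanation:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Train a small, depth-limited tree so the rule set stays human-readable.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# Each printed root-to-leaf path is one chain of if-else rules
# explaining a decision region of the model.
print(export_text(tree, feature_names=list(iris.feature_names)))
```

Limiting the depth is a deliberate design choice: a shallow tree keeps the printed rule set short enough for a human to read end to end.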
Approach 2: Case-based reasoning and prototype selection
In this approach, prototype selection and case-based reasoning are applied towards designing interpretable ML models. What constitutes a prototype differs from one application to another and is therefore application specific. For example, the average of N training examples from a particular class in the training dataset can be regarded as a prototype. Once trained, such a model performs inference (or prediction) by computing the similarity of a test example to every element in the prototype set. Researchers have also performed unsupervised clustering followed by prototype and subspace learning to build an interpretable Bayesian case model, where each subspace is defined as the subset of features characterizing a prototype. Learning such prototypes and low-dimensional subspaces helps promote interpretability and supports generating explanations from the learned model.
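The following is a toy sketch of one possible prototype-based model, in which the prototype of each class is simply the mean of its training examples (one of many possible application-specific definitions); it is not any specific published method:

```python
import numpy as np

class NearestPrototypeClassifier:
    """Toy prototype model: each class prototype is the mean of its
    training examples; prediction picks the most similar prototype."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.prototypes_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Euclidean distance of each test example to every class prototype.
        dists = np.linalg.norm(X[:, None, :] - self.prototypes_[None, :, :], axis=2)
        return self.classes_[dists.argmin(axis=1)]

# Usage on random data: the prototypes themselves can be inspected as the
# model's explanation, and each prediction cites its nearest prototype.
X = np.random.randn(100, 4)
y = np.random.randint(0, 3, size=100)
model = NearestPrototypeClassifier().fit(X, y)
print(model.prototypes_)     # one interpretable prototype per class
print(model.predict(X[:5]))
```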
Approach 3: Towards building inherently interpretable models
In this approach, researchers aim at developing training algorithms, and often dedicated model architectures, that bring interpretability to black-box machine learning models (especially deep learning models). In that direction, one common and quite popular way to promote interpretability is to use attention mechanisms during model training. An attention mechanism encodes some degree of explainability in the training process itself: it provides a way to weigh the feature components of the input (which can later be visualized), showing which parts of the input the model relies on most heavily, relative to the other components, when forming a particular prediction. Along similar lines, researchers have also encapsulated a special layer within a deep neural network (DNN) architecture in order to train the model in an interpretable way for different machine learning tasks. The output of such a layer, which captures different information (for example, different parts of the input), can later be used at inference time to explain or understand the prediction for each class category.
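As a rough illustrative sketch (not a reproduction of any particular published architecture), the toy PyTorch model below scores each position of a sequential input with a small attention network and returns the attention weights alongside the prediction, so that they can be inspected or visualized after training. All layer sizes and names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AttentionClassifier(nn.Module):
    """Toy classifier with an additive attention layer over a sequence of
    feature vectors; the attention weights are returned for inspection."""

    def __init__(self, n_features, n_hidden, n_classes):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(n_features, n_hidden), nn.Tanh(),
                                   nn.Linear(n_hidden, 1))
        self.classifier = nn.Linear(n_features, n_classes)

    def forward(self, x):                                   # x: (batch, seq_len, n_features)
        weights = torch.softmax(self.score(x).squeeze(-1), dim=1)   # one weight per position
        context = (weights.unsqueeze(-1) * x).sum(dim=1)            # attention-weighted summary
        return self.classifier(context), weights

model = AttentionClassifier(n_features=16, n_hidden=32, n_classes=2)
logits, attn = model(torch.randn(8, 10, 16))
print(attn[0])   # attention weights for the first example, one per input position
```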
Furthermore, training techniques such as network regularization have also been used in the literature to make convolutional neural network models more interpretable. Such regularization guides the training algorithm towards learning disentangled representations of the input, which in turn helps the model learn weights (i.e., filters) that capture more meaningful features. Another line of work proposes self-explainable DNNs, whose architecture comprises an encoder module, a parameterizer module, and an aggregation function, as sketched below.
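Below is a minimal, hypothetical PyTorch sketch of such a self-explainable architecture: an encoder produces concept activations, a parameterizer produces per-class relevance scores, and an aggregation step combines the two into class logits. The module sizes and names are illustrative assumptions, not the original authors' implementation:

```python
import torch
import torch.nn as nn

class SelfExplainingModel(nn.Module):
    """Sketch of a self-explaining architecture: encoder -> concepts h(x),
    parameterizer -> relevances theta(x), aggregation -> class logits."""

    def __init__(self, n_features, n_concepts, n_classes):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_concepts), nn.Tanh())
        self.parameterizer = nn.Linear(n_features, n_concepts * n_classes)
        self.n_concepts, self.n_classes = n_concepts, n_classes

    def forward(self, x):
        concepts = self.encoder(x)                                        # (batch, n_concepts)
        theta = self.parameterizer(x).view(-1, self.n_classes, self.n_concepts)
        # Aggregation: each class score is a relevance-weighted sum of concepts.
        logits = torch.bmm(theta, concepts.unsqueeze(-1)).squeeze(-1)     # (batch, n_classes)
        return logits, concepts, theta

model = SelfExplainingModel(n_features=20, n_concepts=5, n_classes=3)
logits, concepts, theta = model(torch.randn(4, 20))
# 'concepts' and 'theta' can be inspected to explain each individual prediction.
```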
It should be noted, however, that designing interpretable models is not favorable in every situation. While such models provide inherent explainability by virtue of their design, the approach has limitations. One challenge concerns the input features themselves: what if the input features are hard for humans to understand in the first place? For example, Mel-frequency cepstral coefficients (MFCCs) are among the state-of-the-art features used in automatic speech recognition systems, yet they are not easily interpretable. In such cases, the explanations obtained from a trained interpretable model would still lack interpretability because of the choice of input features. Moreover, as highlighted earlier, there is always a trade-off between model complexity and model interpretability: the lower the model complexity, the higher the interpretability but the lower the model performance tends to be, whereas higher model complexity lowers interpretability but generally offers better performance on a test dataset. In almost every application domain (audio, video, text, images, etc.), the models that achieve high accuracy are complex in nature. It is hard to reach state-of-the-art performance on a given task using a simplistic interpretable model, for example a linear regression model, because its simplicity prevents it from capturing the complex data distribution in the training dataset, and it therefore performs poorly on a test set. Thus, post-hoc methods have evolved and been explored by researchers across many domains to understand what complex machine learning models capture from the input data when making predictions. The next section provides a brief introduction to post-hoc interpretability methods.
Post-hoc interpretability methods
This class of interpretability methods works on a pre-trained machine learning model. Post-hoc interpretability methods aim to investigate the behavior of a pre-trained model using specially devised algorithms for explainability analysis. In other words, these methods impose no interpretability-related conditions during model training. The models investigated with post-hoc approaches are therefore usually complex deep learning models that are black-box in nature. These methods are broadly grouped into two parts.
The first class of methods aims at understanding the global or overall behavior of machine learning models (deep learning models in particular). The second class focuses on understanding the local behavior of the model, for example producing explanations that show which features (among the N features in the input) contributed most to a particular prediction. It should also be noted that post-hoc methods can be applicable to any machine learning model (so-called model-agnostic methods) or designed specifically for a particular class of machine learning models (so-called model-specific methods). A brief model-agnostic sketch is given below; the next tutorial will discuss post-hoc methods of model interpretability in more detail.
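As a minimal illustration of the model-agnostic, global flavor of post-hoc analysis (the dataset, model, and hyperparameters below are arbitrary choices for the sketch, not prescribed by this tutorial), permutation feature importance trains a black-box model with no interpretability constraints and only afterwards measures how much each feature matters to its overall performance:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The "black-box" model is trained with no interpretability conditions at all ...
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# ... and explained post hoc: shuffling a feature and measuring the drop in
# test performance gives a model-agnostic, global importance score per feature.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```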
