SHapley Additive exPlanations (SHAP)

 

A Unified Approach to Interpreting Model Predictions

Introduction

  • Explanation model Viewing any explanation of a model’s prediction as a model itself.
  • Application of game theory Game theory results guaranteeing a unique solution apply to the entire class of additive feature attribution methods and propose SHAP values as a unified measure of feature importance that various methods approximate.

Addidtive Feature Attribution Methods

  • Definition 1
    Additive feature attribution methods have an explanation model that is a linear function of binary variables:

    \[g(z') = \phi_0 + \sum_{i=1}^M\phi_iz'_i\]

    where $z’\in{0, 1}^M$, $M$ is the number of simplified input features, and $\phi_i\in \mathbb{R}$.
    Noticed that methods with explanation models matching aforementioned definition will attribute an effect $\phi_i$ to each feature, and summing the effects of all feature attributions approximates the output $f(x)$ of the original model.

  • LIME
    LIME interprets individual model predictions based on locally approximating the model around a given prediction, and it refers to simplified inputs $x’$ as “interpretable inputs”, and the mapping $x=h_x(x’)$ converts a binary vector of interpretable inputs into the original input space. To find $\phi$, LIME will minimize the following objective function:

    \[\xi = \argmin_{g\in\mathcal{G}} L(f, g, \pi_{x'}) + \Omega(g)\]

    Faithfulness of the explanation model $g(z’)$ to the original model $f(h_x(z’))$ is enforced through the loss $L$ over a set of samples in the simplified input space weighted by the local kernel $\pi_{x’}$, and the $\Omega$ is used to penalize the complexity of $g$.

  • DeepLIFT
  • Layer-Wise Relevance Propagation
  • Classic Shapley Value Estimation
    • Shapley regression values
      • Definition:
        Feature importances for linear models in the presence of multicollinearity.
      • Algorithm:
        • An importance value will be assigned to each feature to represent the effect on the model prediction of including that feature.
          1. Train two models with and without feature $i$ respectively, $f_S$ and $f_{S\cup{i}}$.
          2. Compare the prediction of two models using current input

            \[f_{S\cup\{i\}}(x_{S\cup\{i\}}) - f_S(x_S)\]

            where $x_S$ represents the values of the input features in the set $S$ and $x_{S\cup{i}}$ represents the values of the input features in the set $S\cup{i}$.

          3. Calculate the Shapley values $\phi_i$, which is a weighted average of all possible differences.

            \[\phi_i = \sum_{S\subseteq F \setminus \{i\}} \frac{\lvert S\lvert!(\lvert F\lvert - \lvert S\lvert - 1)!}{\lvert F\lvert!} [f_{S\cup\{i\}}(x_{S\cup\{i\}}) - f_S(x_S)]\]
    • Shapley sampling values
      1. Applying sampling approximations to aforementioned $\phi_i$ formula.
      2. Approximating the effect of removing a variable from the model by integrating over samples from the training dataset.

Simple Properties Uniquely Determine Additive Feature Attributions

  • Property 1 (Local accuracy)

    \[f(x) = g(x') = \phi_0 + \sum_{i=1}^M\phi_ix_i'\]

    The explanation model $g(x’)$ matches the original model $f(x)$ when $x=h_x(x’)$, where $\phi_0 = f(h_x(0))$ represents the model output with all simplified inputs toggled off (i.e. missing)

  • Property 2 (Missingness)

    \[x'_i = 0 \Rightarrow \phi_i=0\]

    Missingness constrains features where $x’_i = 0$ to have no attributed impact.

  • Property 3 (Consistency)

    Let $f_x(z’) = f(h_x(z’))$ and $z’\setminus i$ denote setting $z’_i = 0$. For any two models $f$ and $f’$, if

    \[f'_x(z') - f_x'(z'\setminus i) \geq f_x(z') - f_x(z'\setminus i)\]

    for all inputs $z’ \in {0,1}^M$, then $\phi_i(f’,x) \geq \phi_i(f,x)$.

  • Theorem 1
    Only one possible explanation model $g$ follows three properties and aforementioned definition 1.

    \[\phi_i(f,x) = \sum_{z'\subseteq x'} \frac{\lvert z'\lvert!(M-\lvert z'\lvert-1)!}{M!}[f_x(z') - f_x(z'\setminus i)]\]

    where $\lvert z’\lvert$ is the number of non-zero entries in $z’$, and $z’\subseteq x’$ represents all $z’$ vectors where the non-zero entries are a subset of the non-zero entries in x’.