XGBoost

 

XGBoost: A Scalable Tree Boosting System

Introduction

  • Designs that make XGBoost scalable:
    • A novel sparsity-aware tree learning algorithm for handling sparse data.
    • A theoretically justified weighted quantile sketch procedure that enables handling instance weights in approximate tree learning.

Tree Boosting In a Nutshell

  • Regularized Learning Objective
    • For a given data set with $n$ examples and $m$ features, $D = \{(x_i, y_i)\}$ $(\lvert D \rvert = n,\, x_i \in \mathbb{R}^m,\, y_i\in \mathbb{R})$, a tree ensemble model uses $K$ additive functions to predict the output:

      \[\hat{y}_i = \phi(x_i) = \sum_{k=1}^K f_k(x_i), \quad f_k\in\mathcal{F}\]

      where $\mathcal{F} = \{f(x) = w_{q(x)}\}\ (q: \mathbb{R}^m \rightarrow T,\, w\in\mathbb{R}^T)$ is the space of regression trees (CART); here $q$ maps an example to its leaf index and $T$ is the number of leaves in the tree.

    • Minimize the regularized objective

      \[\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k)\]

      where $\Omega(f) = \gamma T + \frac{1}{2}\lambda\lVert w \rVert^2$.
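
      As a concrete illustration, here is a minimal Python sketch of this objective for a fixed ensemble, assuming squared-error loss and a toy tree representation (a leaf-assignment function $q$ plus a weight vector $w$); all helper names are hypothetical and only mirror the notation above.

      ```python
      import numpy as np

      def predict(trees, X):
          """y_hat_i = phi(x_i) = sum_k f_k(x_i), with f_k(x) = w_{q(x)}."""
          return sum(w[q(X)] for q, w in trees)

      def omega(trees, gamma=1.0, lam=1.0):
          """Omega(f) = gamma * T + 0.5 * lambda * ||w||^2, summed over all trees."""
          return sum(gamma * len(w) + 0.5 * lam * np.sum(w ** 2) for _, w in trees)

      def objective(trees, X, y, gamma=1.0, lam=1.0):
          """L(phi) = sum_i l(y_hat_i, y_i) + sum_k Omega(f_k), here with squared-error loss."""
          y_hat = predict(trees, X)
          return np.sum((y - y_hat) ** 2) + omega(trees, gamma, lam)

      # Toy usage: one "tree" with T = 2 leaves that sends x to leaf 0 if x[0] < 0, else leaf 1.
      X = np.array([[-1.0], [2.0]])
      y = np.array([0.5, 1.5])
      trees = [(lambda X: (X[:, 0] >= 0).astype(int), np.array([0.4, 1.6]))]
      print(objective(trees, X, y))
      ```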

  • Gradient Tree Boosting

    • Since the gradient tree boosting model is trained in an additive manner, at iteration $t$ we add $f_t$ to minimize the objective function

      \[\mathcal{L}^{(t)} = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t)\]

      To accelerate the optimization, the objective can be approximated by a second-order Taylor expansion:

      \[\mathcal{L}^{(t)} \simeq \sum_{i=1}^n\left[l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t)\]

      where $g_i = \partial_{\hat{y}_i^{(t-1)}}l(y_i, \hat{y}_i^{(t-1)})$ and $h_i = \partial^2_{\hat{y}_i^{(t-1)}}l(y_i, \hat{y}_i^{(t-1)})$ are the first- and second-order gradient statistics of the loss function. Since $l(y_i, \hat{y}_i^{(t-1)})$ is constant with respect to $f_t$, it can be dropped, giving the simplified objective at step $t$:

      \[\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^n \left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t)\]

      Define $I_j = \{i \mid q(x_i) = j\}$ as the instance set of leaf $j$. Expanding $\Omega$ and regrouping the sum over instances into a sum over leaves, we get

      \[\begin{aligned} \tilde{\mathcal{L}}^{(t)} &= \sum_{i=1}^n\left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2 \\ &= \sum_{j=1}^T\Big[\big(\sum_{i\in I_j} g_i\big)w_j\Big] + \frac{1}{2}\sum_{j=1}^T\Big[\sum_{i\in I_j} h_i\Big]w_j^2 + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2 + \gamma T\\ &= \sum_{j=1}^T\Big[\big(\sum_{i\in I_j} g_i\big)w_j + \frac{1}{2}\big(\sum_{i\in I_j}h_i + \lambda\big)w_j^2\Big] + \gamma T \end{aligned}\]

      To see this, notice that $f_t(x_i)$ is determined by $w$: as mentioned in the paper, the prediction $\hat{y}_i = \sum_{k=1}^K f_k(x_i)$ is obtained by summing up the scores of the corresponding leaves (given by $w$). In other words, $f_t(x_i) = w_{q(x_i)} = w_j$ for every $i \in I_j$, which is what allows the sum over instances to be regrouped into a sum over leaves.

      Given a fixed tree structure $q(x)$, the optimal weight $w_j^*$ of leaf $j$ is obtained by setting $\frac{\partial \tilde{\mathcal{L}}^{(t)}}{\partial w_j} = 0$ and solving for $w_j$:

      \[w_j^* = -\frac{\sum_{i\in I_j} g_i}{\sum_{i\in I_j}h_i + \lambda}\]

      Substituting $w_j^*$ back gives the corresponding optimal value, which can be used to measure the quality of a tree structure $q$:

      \[\tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2}\sum_{j=1}^T\frac{(\sum_{i\in I_j} g_i)^2}{\sum_{i\in I_j} h_i+\lambda} + \gamma T\]
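
      A minimal sketch of these quantities, assuming squared-error loss $l(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$ (so $g_i = \hat{y}_i^{(t-1)} - y_i$ and $h_i = 1$) and a grouping of instances into leaves that is already given; the function names are illustrative, not a library API.

      ```python
      import numpy as np

      def grad_stats(y, y_hat_prev):
          """Gradient statistics for squared-error loss l = 0.5 * (y - y_hat)^2:
          g_i = y_hat_i - y_i, h_i = 1."""
          return y_hat_prev - y, np.ones_like(y)

      def leaf_weight(g, h, lam=1.0):
          """Optimal leaf weight w_j* = -sum(g_i) / (sum(h_i) + lambda) over instances in leaf j."""
          return -np.sum(g) / (np.sum(h) + lam)

      def structure_score(leaves, lam=1.0, gamma=1.0):
          """L~(q) = -1/2 * sum_j (sum g_i)^2 / (sum h_i + lambda) + gamma * T,
          where `leaves` is a list of (g, h) arrays, one entry per leaf (T = len(leaves))."""
          quality = sum(np.sum(g) ** 2 / (np.sum(h) + lam) for g, h in leaves)
          return -0.5 * quality + gamma * len(leaves)

      # Toy usage: three instances split into two leaves.
      g, h = grad_stats(np.array([1.0, 2.0, 0.0]), np.array([0.5, 1.0, 0.5]))
      leaves = [(g[:2], h[:2]), (g[2:], h[2:])]
      print([leaf_weight(gj, hj) for gj, hj in leaves], structure_score(leaves))
      ```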
    • Greedy algorithm to obtain the tree structure.

      Core idea: start from a single leaf and iteratively add branches to the tree.

      Assume that $I_L$ and $I_R$ are the instance sets of left and right nodes after the split. Letting $I = I_L \cup I_R$, then the loss reduction after the split is given by

      \[\mathcal{L}_{split} = \frac{1}{2}[\frac{(\sum_{i\in I_L}g_i)^2}{\sum_{i\in I_L}h_i + \lambda} + \frac{(\sum_{i\in I_R}g_i)^2}{\sum_{i\in I_R}h_i + \lambda} - \frac{(\sum_{i\in I}g_i)^2}{\sum_{i\in I}h_i + \lambda}] -\gamma\]
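
      The gain formula translates directly into code. A short sketch (the `split_gain` helper is hypothetical; gradient statistics are passed in as arrays):

      ```python
      import numpy as np

      def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=1.0):
          """L_split = 1/2 * [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma."""
          GL, HL = np.sum(g_left), np.sum(h_left)
          GR, HR = np.sum(g_right), np.sum(h_right)
          return 0.5 * (GL**2 / (HL + lam)
                        + GR**2 / (HR + lam)
                        - (GL + GR)**2 / (HL + HR + lam)) - gamma

      # A candidate split is only worth taking if its gain is positive.
      print(split_gain([-1.0, -2.0], [1.0, 1.0], [3.0], [1.0]))
      ```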

Split finding algorithm

  • Basic Exact Greedy Algorithm
    • Algorithm 1: Exact Greedy Algorithm for Split Finding
      Input: $I$, instance set of current node
      Input: $m$, feature dimension
      score $\leftarrow 0$
      $G\leftarrow \sum_{i\in I}g_i$, $H\leftarrow \sum_{i\in I}h_i$
      for $k=1$ to $m$ do
      $\quad G_L\leftarrow 0$, $H_L\leftarrow 0$
      $\quad$for $j\,\,in\,\,sorted(I, \,\, by\,x_{jk})$ do
      $\qquad G_L\leftarrow G_L + g_j$, $H_L\leftarrow H_L + h_j$
      $\qquad G_R\leftarrow G - G_L$, $H_R\leftarrow H - H_L$
      $\qquad score\leftarrow max(score, \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda})$
      $\quad$end
      end
      Output: Split with max score
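
      A runnable Python sketch of Algorithm 1, assuming the gradient statistics $g$ and $h$ of the current node are given as NumPy arrays; ties between equal feature values are ignored for simplicity, and $\gamma$ is omitted since it only shifts every score by a constant.

      ```python
      import numpy as np

      def exact_greedy_split(X, g, h, lam=1.0):
          """For each feature, walk the instances in sorted order, accumulate (G_L, H_L),
          and keep the split with the best score (the bracketed term of L_split)."""
          G, H = g.sum(), h.sum()
          best_score, best_feature, best_threshold = 0.0, None, None
          for k in range(X.shape[1]):
              order = np.argsort(X[:, k])          # enumerate instances by ascending x_jk
              GL = HL = 0.0
              for j in order[:-1]:                 # skip the last position: right child would be empty
                  GL += g[j]
                  HL += h[j]
                  GR, HR = G - GL, H - HL
                  score = GL**2 / (HL + lam) + GR**2 / (HR + lam) - G**2 / (H + lam)
                  if score > best_score:
                      best_score, best_feature, best_threshold = score, k, X[j, k]
          return best_score, best_feature, best_threshold

      # Toy usage: one feature that cleanly separates negative and positive gradients.
      X = np.array([[1.0], [2.0], [3.0], [4.0]])
      g = np.array([-1.0, -1.0, 1.0, 1.0])
      h = np.ones(4)
      print(exact_greedy_split(X, g, h))       # best threshold is x = 2.0
      ```
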
  • Approximate Algorithm
    • Algorithm 2: Approximate Algorithm for Split Finding
      for $k=1$ to $m$ do
      $\quad$Propose $S_k=\{s_{k1}, s_{k2}, \ldots, s_{kl}\}$ by percentiles on feature $k$.
      $\quad$Proposal can be done per tree (global), or per split (local).
      end
      for $k=1$ to $m$ do
      $\quad G_{kv}\leftarrow \sum_{j\in \{j \mid s_{k,v}\geq x_{jk} > s_{k,v-1}\}}g_j$
      $\quad H_{kv}\leftarrow \sum_{j\in \{j \mid s_{k,v} \geq x_{jk} > s_{k,v-1}\}}h_j$
      end
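
      A minimal sketch of the proposal and bucketing steps for a single feature, using plain (unweighted) percentiles in place of the paper's weighted quantile sketch; split finding then scans the $G_{kv}, H_{kv}$ buckets instead of individual instances. The helper names are illustrative only.

      ```python
      import numpy as np

      def propose_candidates(x, num_buckets=4):
          """Global proposal: candidate split points s_k1 < ... < s_kl from percentiles of feature k."""
          percentiles = np.linspace(0, 100, num_buckets + 1)
          return np.unique(np.percentile(x, percentiles))

      def bucket_stats(x, g, h, candidates):
          """Accumulate G_kv and H_kv over buckets; bucket v roughly collects s_{k,v-1} < x <= s_{k,v}
          (values at the extremes fall into the first/last bucket)."""
          v = np.digitize(x, candidates[1:-1], right=True)
          num_buckets = len(candidates) - 1
          G = np.bincount(v, weights=g, minlength=num_buckets)
          H = np.bincount(v, weights=h, minlength=num_buckets)
          return G, H

      # Toy usage on one feature.
      x = np.array([0.1, 0.4, 0.5, 0.9, 1.3, 2.0])
      g = np.array([-1.0, -0.5, 0.2, 0.4, 0.7, 1.0])
      h = np.ones_like(g)
      cand = propose_candidates(x)
      print(cand, *bucket_stats(x, g, h, cand))
      ```
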
  • Sparsity-aware Split Finding
    • Algorithm 3: Sparsity-aware Split Finding
      Input: $I$, instance set of current node
      Input: $I_k = \{i\in I \mid x_{ik} \neq \text{missing}\}$
      Input: $m$, feature dimension
      // Also applies to the approximate setting: only collect statistics of non-missing entries into buckets
      score $\leftarrow 0$
      $G \leftarrow \sum_{i\in I}g_i,\, H\leftarrow \sum_{i\in I}h_i$
      for $k=1$ to $m$ do
      $\quad$// enumerate missing value goto right
      $\quad G_L\leftarrow 0, H_L\leftarrow 0$
      $\quad$for $j\,\,in\,\,sorted(I_k,\,\,ascending\,\,order\,\,by\,\,x_{jk})$ do
      $\qquad G_L\leftarrow G_L+g_j, \,H_L\leftarrow H_L + h_j$
      $\qquad G_R\leftarrow G-G_L, \,H_R\leftarrow H-H_L$
      $\qquad score\leftarrow max(score, \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda})$
      $\quad$end
      $\quad$// enumerate missing value goto left
      $\quad G_R\leftarrow 0, H_R\leftarrow 0$
      $\quad$for $j\,\,in\,\,sorted(I_k,\,\,descending\,\,order\,\,by\,\,x_{jk})$ do
      $\qquad G_R\leftarrow G_R+g_j, \,H_R\leftarrow H_R + h_j$
      $\qquad G_L\leftarrow G-G_R, \,H_L\leftarrow H-H_R$
      $\qquad score\leftarrow max(score, \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda})$
      $\quad$end
      end
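
      A runnable sketch of the same idea on a single feature, with missing values encoded as NaN: statistics are collected only over the non-missing entries, both default directions are tried, and the better one is kept. The function name and return convention are hypothetical.

      ```python
      import numpy as np

      def sparsity_aware_split(x, g, h, lam=1.0):
          """Enumerate splits on one feature; G and H include the missing instances,
          so the side not being accumulated implicitly receives the missing values."""
          present = ~np.isnan(x)
          G, H = g.sum(), h.sum()                       # totals over ALL instances, missing included
          xs, gs, hs = x[present], g[present], h[present]
          order = np.argsort(xs)
          best_score, best_threshold, best_default = 0.0, None, None

          # Missing values go right: walk non-missing instances in ascending order.
          GL = HL = 0.0
          for j in order:
              GL += gs[j]
              HL += hs[j]
              GR, HR = G - GL, H - HL
              score = GL**2 / (HL + lam) + GR**2 / (HR + lam) - G**2 / (H + lam)
              if score > best_score:
                  best_score, best_threshold, best_default = score, xs[j], "right"

          # Missing values go left: walk non-missing instances in descending order.
          GR = HR = 0.0
          for j in order[::-1]:
              GR += gs[j]
              HR += hs[j]
              GL, HL = G - GR, H - HR
              score = GL**2 / (HL + lam) + GR**2 / (HR + lam) - G**2 / (H + lam)
              if score > best_score:
                  best_score, best_threshold, best_default = score, xs[j], "left"

          return best_score, best_threshold, best_default

      # Toy usage: the missing instance has a positive gradient, so the default direction
      # that groups it with the other positive gradients (right) wins.
      x = np.array([1.0, 2.0, np.nan, 3.0, 4.0])
      g = np.array([-1.0, -1.0, 1.0, 1.0, 1.0])
      h = np.ones(5)
      print(sparsity_aware_split(x, g, h))
      ```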