Ensemble Method
Ensemble
- Definition: A set of learned models whose individual decisions are combined in some way to make predictions for new instances.
- When does an ensemble do a better job? When the errors made by the individual predictors are (somewhat) uncorrelated and each predictor's error rate is better than guessing (< 0.5 for a 2-class problem); see the formula below.
- Diverse classifiers
Encourage diversity in their errors by
- choosing a variety of learning algorithms
- choosing a variety of settings (e.g. # hidden units in neural nets) for the learning algorithm
- choosing different subsamples of the training set (bagging)
- using different probability distributions over the training instances (boosting, skewing)
- choosing different features and subsamples (random forests)
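Why both conditions matter: under the idealized assumption that the $T$ (odd) classifiers err independently, each with the same error rate $\epsilon$, a plurality (majority) vote errs only when more than half of them err:

$$P(\text{ensemble errs}) = \sum_{k=\lceil T/2 \rceil}^{T} \binom{T}{k}\,\epsilon^{k}(1-\epsilon)^{T-k}$$

which shrinks toward 0 as $T$ grows whenever $\epsilon < 0.5$. If the individual errors are highly correlated, this benefit largely disappears, which is why diversity is encouraged.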
Bagging
(Bootstrap Aggregation)
- Algorithm:
- Learning:
- Given: Learner $L$, training set $D = \{(x_1,y_1), \ldots, (x_m,y_m)\}$
- Action:
for $i \leftarrow 1$ to $T$:
$\quad\quad$ $D^{(i)} \leftarrow m$ instances randomly drawn with replacement from $D$.
$\quad\quad$ $h_i \leftarrow$ model learned using $L$ on $D^{(i)}$
- Prediction:
- Classification:
- Given: Test instance $x$
- Action: predict $y \leftarrow$ plurality_vote($h_1(x), \ldots h_T(x)$)
- Regression:
- Given: Test instance $x$
- Action: predict $y \leftarrow$ mean($h_1(x), \ldots h_T(x)$)
- Comments (a Python sketch of bagging follows these notes)
- each sampled training set is a bootstrap replicate
- contains m instances (the same as the original training set)
- on average it includes 63.2% of the original training set, since each instance is omitted from a replicate with probability $(1-1/m)^m \approx e^{-1} \approx 0.368$
- some instances appear multiple times
- can be used with any base learner
- works best with unstable learning methods: those for which small changes in D result in relatively large changes in learned models, i.e., those that tend to overfit training data.
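A minimal Python sketch of bagging, assuming NumPy arrays `X`, `y` and a scikit-learn-style base learner with `fit`/`predict`; the function names `bagging_fit`/`bagging_predict` and the decision-tree default are illustrative, not part of the original notes:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # any unstable base learner works well

def bagging_fit(X, y, T=25, base_learner=DecisionTreeClassifier, seed=None):
    """Learn T models, each on a bootstrap replicate D^(i) of (X, y)."""
    rng = np.random.default_rng(seed)
    m = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)                  # m draws *with* replacement
        models.append(base_learner().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Classification: plurality vote over the T models' predictions."""
    votes = np.stack([h.predict(X) for h in models])      # shape (T, n_test)
    def plurality(col):
        values, counts = np.unique(col, return_counts=True)
        return values[counts.argmax()]
    return np.array([plurality(col) for col in votes.T])
```

For regression, replace the plurality vote with `np.mean` over the models' predictions.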
Boosting
- Intuition
- Boosting came out of the PAC learning community.
- A weak PAC learning algorithm is one that cannot PAC learn for arbitrary ε and δ, but it can for some: its hypotheses are at least slightly better than random guessing.
- Suppose we have a weak PAC learning algorithm L for a concept class C. Can we use L as a subroutine to create a (strong) PAC learner for C?
- Yes, by boosting! [Schapire, Machine Learning 1990]
- The original boosting algorithm was of theoretical interest, but assumed an unbounded source of training instances.
- A later boosting algorithm, AdaBoost, has had notable practical success.
- AdaBoost algorithm (a Python sketch follows the comments below)
- Given: Learner $L$, number of boosting stages (iterations) $T$, training set $D = \{(x_1,y_1), \ldots, (x_m,y_m)\}$
- Pseudocode: ($i$ denotes the $i^{th}$ instance)
for all $i: w_1(i) \leftarrow \frac{1}{m}$ $\quad$// initialize instance weights
for $t\leftarrow 1$ to $T$ do
$\quad\quad$for all $i: p_t(i) \leftarrow \frac{w_t(i)}{\sum_jw_t(j)} \quad$// normalize weights
$\quad\quad h_t \leftarrow$ model learned using $L$ on $D$ and $p_t$
$\quad\quad$ $\epsilon_t \leftarrow \sum_ip_t(i)(1-\delta(h_t(x_i), y_i))$ $\quad$// calculate weighted error
$\quad\quad$if $\epsilon_t>0.5$ then
$\quad\quad \quad\quad T\leftarrow t-1$
$\quad\quad \quad\quad$break
$\quad\quad \beta_t \leftarrow \frac{\epsilon_t}{1-\epsilon_t} \quad$ // lower error, smaller $\beta_t$
$\quad\quad$for all $i$ where $h_t(x_i) = y_i \quad$ // downweight correct examples
$\quad\quad$$\quad\quad w_{t+1}(i) \leftarrow w_t(i)\beta_t$
return:
$\quad\quad$ $h(x) = \arg\max_y\sum_{t=1}^T(\log{\frac{1}{\beta_t}})\delta(h_t(x), y)$
- Comments:
- $\delta(a, b)$ is the indicator (Kronecker delta) function: 1 if $a = b$, 0 otherwise, so $1-\delta(h_t(x_i), y_i)$ is 1 exactly when $h_t$ misclassifies $x_i$. (AdaBoost can also be viewed as greedily minimizing the exponential loss $e^{-y_ih(x_i)}$ for labels $y_i \in \{-1,+1\}$.)
- Implementing weighted instances with AdaBoost
- AdaBoost calls the base learner $L$ with probability distribution $p_t$ specified by weights on the instances.
- There are two ways to handle this
- Adapt $L$ to learn from weighted instances; straightforward for decision trees and naïve Bayes, among others
- Sample a large ($\gg m$) unweighted set of instances according to $p_t$; run $L$ in the ordinary manner
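A minimal Python sketch of the AdaBoost pseudocode above, using the first strategy (a base learner that accepts instance weights; here a depth-1 scikit-learn decision tree serves as the weak learner). Function names and the choice of stump are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Return the learned models h_t and their beta_t values."""
    m = len(X)
    w = np.full(m, 1.0 / m)                      # w_1(i) = 1/m
    models, betas = [], []
    for _ in range(T):
        p = w / w.sum()                          # normalize weights
        h = DecisionTreeClassifier(max_depth=1)  # weak learner (decision stump)
        h.fit(X, y, sample_weight=p)
        correct = (h.predict(X) == y)
        eps = p[~correct].sum()                  # weighted error epsilon_t
        if eps > 0.5:                            # no better than guessing: stop
            break
        eps = max(eps, 1e-10)                    # guard against a perfect stump
        beta = eps / (1.0 - eps)                 # lower error -> smaller beta_t
        w[correct] *= beta                       # downweight correct examples
        models.append(h)
        betas.append(beta)
    return models, betas

def adaboost_predict(models, betas, X, classes):
    """h(x) = argmax_y sum_t log(1/beta_t) * [h_t(x) == y]."""
    scores = np.zeros((len(X), len(classes)))
    for h, beta in zip(models, betas):
        pred = h.predict(X)
        for j, c in enumerate(classes):
            scores[:, j] += np.log(1.0 / beta) * (pred == c)
    return np.asarray(classes)[scores.argmax(axis=1)]
```

The second strategy (resampling) could be sketched by drawing a large index sample with `rng.choice(m, size=K, p=p)` and fitting the learner on that unweighted sample instead of passing `sample_weight`.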
Bagging vs. Boosting
- Bagging is almost always better than a single decision tree or neural net
- Boosting can be much better than bagging
- Boosting can sometimes reduce accuracy (too much emphasis on outliers?)
Random forests
- Algorithm:
- Learning:
- Given: Candidate feature splits $F$, training set $D = \{(x_1,y_1), \ldots, (x_m,y_m)\}$
- Action:
for $i \leftarrow 1$ to $T$:
$\quad\quad D^{(i)} \leftarrow m$ instances randomly drawn with replacement from $D$.
$\quad\quad h_i \leftarrow$ randomized decision tree learned with $F, D^{(i)}$
- Randomized decision tree learning:
- To select a split at a node
$\quad\quad R \leftarrow$ randomly select (without replacement) $f$ feature splits from $F$ (where $f \approx \sqrt{|F|}$)
$\quad\quad$ choose the best feature split in $R$
- Do not prune trees.
- Prediction:
- Classification/Regression:
As in bagging (a Python sketch of a random forest follows below)
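A minimal Python sketch of the random-forest learner above. Here the per-node feature subsampling ($f \approx \sqrt{|F|}$) is delegated to scikit-learn's `max_features='sqrt'`, and trees are left unpruned (the default); the function name is illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, T=100, seed=None):
    """Learn T unpruned, randomized trees, each on a bootstrap replicate of (X, y)."""
    rng = np.random.default_rng(seed)
    m = len(X)
    trees = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)                    # bootstrap replicate
        tree = DecisionTreeClassifier(max_features="sqrt")  # ~sqrt(|F|) candidate splits per node
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

# Prediction is exactly as in the bagging sketch: plurality vote for
# classification, mean of the trees' outputs for regression.
```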