Regularization

Overview

Definition

  • In general: Any method to prevent overfitting or help the optimization.
  • Specifically: Additional terms in the training optimization objective to prevent overfitting or help the optimization.

Overfitting

  • Key: the empirical loss and the expected loss are different.
  • The smaller the data set, the larger the difference between the empirical loss and the expected loss.
  • The larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the two types of loss,
    • thus having small training error but large test error (overfitting).
  • A larger data set helps.
  • Throwing away useless hypotheses also helps (regularization).

Different views of regularization

Regularization as hard constraint

  • When the constraint set $\Omega$ is specified by some quantity $R$, the hard-constrained objective is

    \[\begin{aligned} \min_\theta &\hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^nl(\theta,x_i,y_i) \\ &s.t. \quad R(\theta) \leq r \end{aligned}\]
  • For example, with $l_2$ regularization $R(\theta) = \lVert\theta\rVert_2$, and the constraint $R(\theta) \leq r$ can be expressed as $\lVert\theta\rVert^2_2 \leq r^2$.
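As an illustration, here is a minimal NumPy sketch of enforcing such a hard constraint by projected gradient descent on a toy least-squares problem (the data, the radius $r$, the step size, and the iteration count are made-up values for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                        # toy design matrix
y = X @ np.array([2.0, -3.0, 0.5, 0.0, 1.0]) + 0.1 * rng.normal(size=100)

r = 1.0        # radius of the feasible set ||theta||_2 <= r (illustrative)
eta = 0.05     # step size (illustrative)
theta = np.zeros(5)

for _ in range(1000):
    grad = X.T @ (X @ theta - y) / len(y)    # gradient of the empirical loss
    theta = theta - eta * grad               # unconstrained gradient step
    norm = np.linalg.norm(theta)
    if norm > r:                             # project back onto {theta : ||theta||_2 <= r}
        theta *= r / norm

print(theta, np.linalg.norm(theta))          # the final iterate satisfies the constraint
```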

Regularization as soft constraint

  • The hard-constraint optimization is equivalent to a soft-constraint version

    \[\begin{aligned} \min_\theta \hat{L}_R(\theta) = \frac{1}{n}\sum_{i=1}^nl(\theta,x_i,y_i) + \lambda^* R(\theta) \end{aligned}\]

    for some $\lambda^* > 0$.

  • Lagrangian multiplier method:

    \[\mathcal{L}(\theta, \lambda) := \hat{L}(\theta) + \lambda[R(\theta)-r]\]
  • Suppose $\theta^* \,$ is optimal for the hard-constraint optimization. Then

    \[\theta^* = \arg\min_\theta \max_{\lambda\geq0}\mathcal{L}(\theta, \lambda) = \arg\min_\theta \max_{\lambda\geq0} \hat{L}(\theta) + \lambda[R(\theta)-r]\]
  • Suppose $\lambda^* \,$ is the corresponding optimal value of $\lambda$ in the inner maximization. Fixing $\lambda = \lambda^*$, $\theta^*$ also minimizes

    \[\theta^* = \arg\min_\theta \mathcal{L}(\theta, \lambda^* ) = \arg\min_\theta \hat{L}(\theta) + \lambda^* [R(\theta)-r]\]

    and since the constant $-\lambda^* r$ does not affect the minimizer, this is exactly the soft-constraint objective above.

Regularization from the Bayesian view

  • Bayesian View: Everything will be regarded as a distribution.
  • Prior over the hypotheses: $p(\theta)$
  • Posterior over the hypotheses: $p(\theta\lvert \{x_i, y_i\})$
  • Likelihood: $p(\{x_i, y_i\}\lvert\theta)$
  • Bayesian Rule:

    \[p(\theta\lvert\{x_i, y_i\}) = \frac{p(\theta)p(\{x_i, y_i\}\lvert\theta)}{p(\{x_i, y_i\})}\]
  • Maximum A Posteriori (MAP):

    \[\max_\theta\log p(\theta\lvert\{x_i, y_i\}) = \max_\theta\log p(\theta) + \log p(\{x_i, y_i\}\lvert\theta)\]

    Note that $\log p(\theta)$ plays the role of the regularization term and $\log p(\{x_i, y_i\}\lvert\theta)$ is the MLE (data-fit) term.

  • Example: $l_2$ loss with $l_2$ regularization corresponds to a Gaussian likelihood $p(x,y\lvert\theta)$ and a Gaussian prior $p(\theta)$.
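For concreteness, a short derivation of this example, under the assumptions (introduced here for illustration) that the prior is $\theta \sim N(0, \sigma_0^2 I)$ and the likelihood is $y_i \lvert x_i, \theta \sim N(f_\theta(x_i), \sigma^2)$; dropping additive constants that do not depend on $\theta$:

\[\begin{aligned} \arg\max_\theta\, \log p(\theta) + \log p(\{x_i, y_i\}\lvert\theta) &= \arg\max_\theta\, -\frac{\lVert\theta\rVert_2^2}{2\sigma_0^2} - \sum_{i=1}^n\frac{(y_i - f_\theta(x_i))^2}{2\sigma^2} \\ &= \arg\min_\theta\, \sum_{i=1}^n (y_i - f_\theta(x_i))^2 + \frac{\sigma^2}{\sigma_0^2}\lVert\theta\rVert_2^2 \end{aligned}\]

so the $l_2$ regularization strength corresponds to the ratio of the noise variance to the prior variance.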

Three Views

  1. Typical choice for optimization: soft-constraint.
  2. Hard-constraint preferred if
    • The explicit bound $R(\theta) \leq r$ is known.
    • The soft-constraint version gets trapped in bad local minima, while projecting back onto the feasible set is more stable.
  3. Bayesian view preferred if
    • Domain knowledge is easy to represent as a prior.

Examples of Regularization

Classical Regularization
Overview

  • Norm penalty
    • $l_2$ regularization
    • $l_1$ regularization
  • Robustness to noise
    • Noise to the input
    • Noise to the weights

$l_2$ Regularization

\[\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \frac{\alpha}{2}\lVert\theta\rVert^2_2\]
  • Effect on (stochastic) gradient descent
    • Gradient of regularized objective

      \[\nabla \hat{L_R}(\theta) = \nabla \hat{L}(\theta) + \alpha\theta\]
    • Gradient descent update

      \[\begin{aligned} \theta &= \theta - \eta\nabla\hat{L_R}(\theta) \\ &= \theta - \eta\nabla\hat{L}(\theta) - \eta\alpha\theta \\ &= (1-\eta\alpha)\theta - \eta\nabla\hat{L}(\theta) \end{aligned}\]
    • Terminology: weight decay (a numeric check of this update form appears after this list)

  • Effect on the optimal solution
    • Consider a quadratic approximation ($f(x) \approx f(x_0) + f'(x_0)(x-x_0) + \frac{f''(x_0)}{2}(x-x_0)^2$ for $x\approx x_0$) around $\theta^* \,$. Here $H$ is the Hessian matrix $[\frac{\partial^2\hat{L}}{\partial \theta_i \partial \theta_j}]$.

      \[\hat{L}(\theta) \approx \hat{L}(\theta^* ) + (\theta - \theta^* )^T \nabla \hat{L}(\theta^* ) + \frac{1}{2}(\theta - \theta^* )^TH(\theta - \theta^* )\]
    • Since $\theta^* \,$ is optimal, $\nabla\hat{L}(\theta^* )=0$, we have,

      \[\begin{aligned} \hat{L}(\theta) &\approx \hat{L}(\theta^* ) + \frac{1}{2}(\theta - \theta^* )^TH(\theta - \theta^* ) \\ \nabla\hat{L}(\theta) &\approx H(\theta - \theta^* ) \end{aligned}\]
    • Gradient of the regularized objective (the $\alpha\theta$ term comes from the derivative of the $l_2$ term $\frac{\alpha}{2}\lVert\theta\rVert^2_2$)

      \[\nabla\hat{L_R}(\theta) \approx H(\theta - \theta^* ) + \alpha\theta\]
    • On the optimal $\theta_R^* \,$

      \[\begin{aligned} 0 = \nabla\hat{L_R}(\theta_R^* ) &\approx H(\theta_R^* - \theta^* ) + \alpha\theta_R^* \\ \theta_R^* &\approx (H+\alpha I)^{-1}H\theta^* \end{aligned}\]
    • Suppose $H$ has the eigendecomposition $H = Q\Lambda Q^T$. Then

      \[\theta_R^* \approx (H+\alpha I)^{-1}H\theta^* = Q(\Lambda + \alpha I)^{-1}\Lambda Q^T\theta^*\]
    • Effect: rescale $\theta^*$ along the eigenvectors of $H$; the component along the $i$-th eigenvector is scaled by $\frac{\lambda_i}{\lambda_i+\alpha}$, so directions with small eigenvalues are shrunk the most
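As referenced above, a minimal NumPy check that one gradient step on the $l_2$-regularized objective is the same as shrinking the weights by $(1-\eta\alpha)$ and then taking a plain gradient step (toy least-squares data; $\eta$ and $\alpha$ are illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

def grad_L(theta):
    # gradient of the unregularized empirical loss (least squares)
    return X.T @ (X @ theta - y) / len(y)

eta, alpha = 0.1, 0.5        # step size and l2 strength (illustrative)
theta = rng.normal(size=3)

# one step on the regularized objective ...
step_regularized = theta - eta * (grad_L(theta) + alpha * theta)
# ... equals weight decay followed by a plain gradient step
step_decay = (1 - eta * alpha) * theta - eta * grad_L(theta)

print(np.allclose(step_regularized, step_decay))   # True
```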


$l_1$ Regularization

\[\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \alpha\lVert\theta\rVert_1\]
  • Effect on (stochastic) gradient descent
    • Gradient of the regularized objective, where the sign is applied elementwise to $\theta$

      \[\nabla \hat{L_R}(\theta) = \nabla \hat{L}(\theta) + \alpha\cdot sign(\theta)\]
    • Gradient descent update

      \[\begin{aligned} \theta &= \theta - \eta\nabla\hat{L_R}(\theta) \\ &= \theta - \eta\nabla\hat{L}(\theta) - \eta\alpha\cdot sign(\theta) \end{aligned}\]
  • Effect on the optimal solution
    • Consider a quadratic approximation around $\theta^* \,$. As before, $H$ is the Hessian matrix $[\frac{\partial^2\hat{L}}{\partial \theta_i \partial \theta_j}]$

      \[\hat{L}(\theta) \approx \hat{L}(\theta^* ) + (\theta - \theta^* )^T \nabla \hat{L}(\theta^* ) + \frac{1}{2}(\theta - \theta^* )^TH(\theta - \theta^* )\]
    • Since $\theta^* \,$ is optimal, $\nabla\hat{L}(\theta^* )=0$, we have,

      \[\hat{L}(\theta) \approx \hat{L}(\theta^* ) + \frac{1}{2}(\theta - \theta^* )^TH(\theta - \theta^* )\]
    • Further assume that $H$ is diagonal and positive ($H_{ii} > 0, \forall_i$)
    • The regularized objective is (ignoring constants)

      \[\hat{L_R}(\theta) \approx \sum_i\Big[\frac{1}{2}H_{ii}(\theta_i - \theta_i^* )^2 + \alpha\lvert\theta_i\rvert\Big]\]
    • The optimal $\theta_R^* \,$ is approximately

      \[(\theta_R^* )_ i \approx \begin{cases} \max\{\theta_i^* - \frac{\alpha}{H_{ii}},\, 0\} & \text{if}\,\,\theta_i^* \geq 0 \\ \min\{\theta_i^* + \frac{\alpha}{H_{ii}},\, 0\} & \text{if}\,\,\theta_i^* \leq 0 \end{cases}\]
    • Effect: induces sparsity; coordinates with $\lvert\theta_i^* \rvert \leq \frac{\alpha}{H_{ii}}$ are set exactly to zero
    • Under the same diagonal assumption, a compact expression for the optimal $\theta_R^* \,$ is

      \[(\theta_R^* )_ i \approx \mathrm{sign}(\theta_i^* )\max\Big\{\lvert\theta_i^* \rvert - \frac{\alpha}{H_{ii}},\,0\Big\}\]
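A small NumPy sketch of this compact expression, i.e. elementwise soft-thresholding, showing how coordinates with $\lvert\theta_i^*\rvert \leq \alpha/H_{ii}$ are driven exactly to zero (the values of $\theta^*$, the Hessian diagonal, and $\alpha$ are made up for illustration):

```python
import numpy as np

theta_star = np.array([1.5, -0.3, 0.05, -2.0, 0.2])   # unregularized optimum (made up)
H_diag     = np.array([1.0,  0.5, 2.0,   1.0, 0.4])   # diagonal of the Hessian (made up)
alpha = 0.3

# (theta_R)_i = sign(theta_i*) * max(|theta_i*| - alpha / H_ii, 0)
theta_R = np.sign(theta_star) * np.maximum(np.abs(theta_star) - alpha / H_diag, 0.0)

print(theta_R)   # the small coordinates come out exactly zero -> sparsity
```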

Bayesian View
  • Gaussian likelihood → squared error.
  • Gaussian prior → $l_2$ regularization.
  • Laplace prior → $l_1$ regularization.
  • Laplace likelihood → absolute error.
  • Sigmoid likelihood → logistic loss.

Add noise to the input

  • Equivalent to weight decay ($l_2$ regularization). Suppose the hypothesis is $f(x) = w^Tx$ and the noise added to the input is $\epsilon \sim N(0,\lambda I)$
  • After adding noise, the loss is

    \[\begin{aligned} L(f) &= \mathbb{E}_{x,y,\epsilon}[f(x+\epsilon)-y]^2 = \mathbb{E}_{x,y,\epsilon}[f(x) + w^T\epsilon -y]^2\\ &= \mathbb{E}_{x,y,\epsilon}[f(x)-y]^2 + 2\,\mathbb{E}_{x,y,\epsilon}[w^T\epsilon\,(f(x)-y)] + \mathbb{E}_{x,y,\epsilon}[w^T\epsilon]^2\\ &= \mathbb{E}_{x,y}[f(x)-y]^2 + \lambda\lVert w\rVert^2 \end{aligned}\]

    The cross term vanishes because $\epsilon$ is independent of $(x,y)$ with $\mathbb{E}[\epsilon]=0$, and the last term equals $w^T(\lambda I)w = \lambda\lVert w\rVert^2$.
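A quick Monte Carlo check of this identity for a linear hypothesis (all data, the weight vector, and $\lambda$ below are made-up values for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 4
lam = 0.1                                   # input-noise variance (= regularization strength)
w = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# left-hand side: loss with Gaussian noise added to the inputs, averaged over noise draws
noisy = 0.0
num_draws = 200
for _ in range(num_draws):
    eps = rng.normal(scale=np.sqrt(lam), size=(n, d))
    noisy += np.mean(((X + eps) @ w - y) ** 2)
noisy /= num_draws

# right-hand side: plain loss plus lambda * ||w||^2
plain = np.mean((X @ w - y) ** 2) + lam * np.dot(w, w)

print(noisy, plain)   # the two numbers should be close
```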

Add noise to the weights

  • For the loss on each data point, add a noise term to the weights before computing the prediction

    \[\epsilon\sim N(0, \eta I),\,\,w' = w + \epsilon\]
  • Prediction: $f_{w'}(x)$ instead of $f_w(x)$
  • Loss becomes

    \[L(f) = \mathbb{E}_{x,y,\epsilon}[f_{w+\epsilon}(x)-y]^2\]
  • Use a first-order Taylor expansion (in the weights) to approximate the perturbed prediction.

    \[f_{w+\epsilon}(x) \approx f_w(x) + \epsilon^T\nabla f_w(x)\]
  • Plug this into the loss function.

    \[L(f) \approx \mathbb{E}_{x,y,\epsilon}[f_{w}(x)-y]^2 + 2\,\mathbb{E}_{x,y,\epsilon}[(f_{w}(x)-y)\epsilon^T\nabla f_w(x)] + \eta\,\mathbb{E}_{x,y}\lVert\nabla f_w(x)\rVert^2\]
    • Regularization term: $\eta\,\mathbb{E}\lVert\nabla f_w(x)\rVert^2$, which penalizes large gradients of the prediction with respect to the weights
    • The cross term vanishes because $\mathbb{E}[\epsilon] = 0$

Other types of regularization

Data augmentation

  • Adding noise to the input: a special kind of augmentation
  • Be careful that the transformation applied does not change the label:
    • Example: when classifying ‘b’ and ‘d’, horizontal flips change the label
    • Example: when classifying ‘6’ and ‘9’, 180° rotations change the label

Early stopping

  • Idea: do not train the network until the training error is too small
    • Recall overfitting: the larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the two types of loss
    • Prevent overfitting: do not push the hypothesis too much; use validation error to decide when to stop
  • When training, also output validation error
  • Every time validation error improved, store a copy of the weights
  • When validation error not improved for some time, stop
  • Return the copy of the weights stored (a schematic of this loop is sketched after this list)
  • Hyperparameter selection: the number of training steps is the hyperparameter being selected
  • Advantage:
    • Efficient: along with training; only store an extra copy of weights
    • Simple: no change to the model/algo
  • Disadvantage: need validation data
  • Strategy to heuristically mitigate the disadvantage
    • After early stopping of the first run, train a second run and reuse validation data
  • How to heuristically reuse validation data
    1. Start fresh, train with both training data and validation data up to the previous number of epochs
    2. Start from the weights in the first run, train with both training data and validation data until the validation loss is smaller than the training loss at the early stopping point
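A schematic of the basic early-stopping loop described above. The `model`, `train_step`, and `validation_error` callables and the `patience` value are placeholders supplied by the caller, not from the notes:

```python
import copy

def train_with_early_stopping(model, train_step, validation_error, max_epochs=100, patience=10):
    """Generic early-stopping loop: keep the weights with the best validation error.

    `train_step(model)` runs one epoch of training; `validation_error(model)` returns
    the current validation error; `patience` is how many epochs without improvement
    we tolerate before stopping.
    """
    best_error = float("inf")
    best_weights = None
    epochs_since_improvement = 0

    for epoch in range(max_epochs):
        train_step(model)                      # one epoch of training
        err = validation_error(model)          # also output validation error
        if err < best_error:                   # improved: store a copy of the weights
            best_error = err
            best_weights = copy.deepcopy(model)
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
        if epochs_since_improvement >= patience:   # not improved for some time: stop
            break

    return best_weights, best_error               # return the stored copy
```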

Dropout

  • Randomly select weights to update
  • More precisely, in each update step
    • Dropout probability p, or present probability 1-p
    • Randomly sample a binary mask over all the input and hidden units
    • Multiply the mask bits with the units and do the update as usual
  • During test time: all units are present; multiply the weights by 1-p (see the sketch after this list)
  • Typical dropout probability: 0.2 for input and 0.5 for hidden units
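A minimal NumPy sketch of the masking described above for a single layer of activations. This is standard (non-inverted) dropout, so at test time the activations are scaled by 1-p, which for a linear next layer is equivalent to scaling its weights by 1-p; the shapes and probabilities are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                        # dropout probability for hidden units

def dropout_train(h):
    # sample a binary mask with present probability 1 - p and multiply it in
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * mask

def dropout_test(h):
    # all units present; scale by the present probability 1 - p
    return h * (1 - p)

h = rng.normal(size=(8,))      # some hidden activations
print(dropout_train(h))        # a random subset zeroed out
print(dropout_test(h))         # deterministic rescaling at test time
```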

Batch Normalization
Algorithm
Input:

  • Values of $x$ over a mini-batch: $\mathcal{B} = \{x_1,\dots,x_m\}$;
  • Parameters to be learned: $\gamma$, $\beta$;

Output:
$\{y_i = BN_{\gamma, \beta}(x_i)\}$
$\mu_\mathcal{B} \leftarrow \frac{1}{m}\sum_{i=1}^m x_i$ $\quad$//mini-batch mean
$\sigma_\mathcal{B}^2 \leftarrow \frac{1}{m}\sum_{i=1}^m (x_i-\mu_\mathcal{B})^2$ $\quad$//mini-batch variance
$\hat{x_i}\leftarrow \frac{x_i-\mu_{\mathcal{B}}}{\sqrt{\sigma_\mathcal{B}^2+\epsilon}}$ $\quad$//normalize
$y_i\leftarrow \gamma\hat{x_i} + \beta \equiv BN_{\gamma, \beta}(x_i)$ $\quad$//scale and shift
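A direct NumPy transcription of the four steps, with per-feature statistics computed over the mini-batch (the batch contents, $\gamma$, and $\beta$ below are illustrative):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: mini-batch of shape (m, num_features)
    mu = x.mean(axis=0)                      # mini-batch mean
    var = x.var(axis=0)                      # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta              # scale and shift

rng = np.random.default_rng(0)
xb = rng.normal(loc=3.0, scale=2.0, size=(16, 4))
y = batch_norm_forward(xb, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.std(axis=0))         # approximately 0 and 1 per feature
```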

Comments

  • The first three steps are just like standardization of the input data, but computed only over the mini-batch. All steps are differentiable, so the learning of the last-step parameters $\gamma$, $\beta$ can be incorporated into backpropagation.
  • Note that the last step can completely undo the previous three steps (e.g. with $\gamma = \sqrt{\sigma_\mathcal{B}^2+\epsilon}$ and $\beta = \mu_\mathcal{B}$).
  • But if so, this undoing is driven by the later layers, not the earlier layers; the later layers get to “choose” whether they want standard-normal inputs or not.