Chapter 1 Statistical Models, Goals, And Performance Criteria
1.3 The Decision Theoretic Framework
- Basic elements of a decision problem
- Estimation
Estimating a real parameter $\theta\in\Theta$ using data $X$ with conditional distribution $P_\theta$.
- Testing
\[\begin{aligned} &H_0: P_\theta \in P_0 \\ &H_1: P_\theta \notin P_0 \end{aligned}\]
Given data $X\sim P_\theta$, choose between the two hypotheses (decide whether to accept or reject $H_0$).
- Ranking
Rank a collection of items from best to worst. Examples:
- Products evaluated by a consumer interest group
- Sports betting (horse race, team tournament)
- Prediction
Predict response variable $Y$ given explanatory variables $Z=(Z_1, Z_2, \ldots, Z_d)$
- If the joint distribution of $(Z,Y)$ is known, use $\mu(Z) = E[Y\lvert Z]$
- With data $\{(z_i,y_i), i=1,2,\ldots, n\}$, estimate $\mu(Z)$. If $\mu(Z) = g(\beta, Z)$, then use $\hat{\mu}(Z) = g(\hat{\beta}, Z)$.
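The plug-in predictor $\hat{\mu}(Z) = g(\hat{\beta}, Z)$ can be sketched in code. This is a minimal illustration assuming a linear form $g(\beta, Z) = \beta_0 + \beta_1 Z$ with one explanatory variable; the simulated data and coefficient values are hypothetical.

```python
import numpy as np

# Hypothetical data: n observations of (z_i, y_i), one explanatory variable.
rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
y = 2.0 + 3.0 * z + rng.normal(scale=0.5, size=n)  # true beta = (2, 3)

# Assume mu(Z) = g(beta, Z) = beta_0 + beta_1 * Z; estimate beta by least squares.
Z_design = np.column_stack([np.ones(n), z])
beta_hat, *_ = np.linalg.lstsq(Z_design, y, rcond=None)

def mu_hat(z_new):
    """Plug-in predictor: mu_hat(Z) = g(beta_hat, Z)."""
    return beta_hat[0] + beta_hat[1] * z_new

print(beta_hat)     # close to (2, 3)
print(mu_hat(1.0))  # close to E[Y | Z = 1] = 5
```

With the joint distribution unknown, $\hat{\mu}$ substitutes the estimated $\hat{\beta}$ for $\beta$ in the assumed regression function.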
- $\Theta = \{\theta\}$: The “State Space”
$\theta =$ state of nature (the unknown, uncertain element of the problem)
- $\mathcal{A} = \{a\}$: The “Action Space”
$a =$ action taken by the statistician
- $\mathcal{L}(\theta, a)$: The “Loss Function”
- $\mathcal{L}(\theta, a) =$ loss incurred when the state is $\theta$ and action $a$ is taken.
- $\mathcal{L}: \Theta \times\mathcal{A}\rightarrow\mathcal{R}$
- Additional Elements of a Statistical Decision Problem
- $X \sim P_\theta$: Random Variable (Statistical Observation)
- Conditional distribution of $X$ given $\theta$.
- Sample space $\mathcal{X} = \{x\}$
- Density/pmf of the conditional distribution:
\[f(x\lvert\theta) \quad \text{or}\quad f_X(x\lvert\theta)\]
- $\delta(X)$: A “Decision Procedure”
- Observe data $X = x$ and take action $a\in\mathcal{A}$
- $\delta(\cdot):\mathcal{X}\rightarrow \mathcal{A}$
- $D$: Decision Space (class of decision procedures)
- $D = \{\text{decision procedures } \delta: \mathcal{X}\rightarrow\mathcal{A}\}$
- $R(\theta, \delta)$: Risk Function (performance measure of $\delta(\cdot)$ given $\theta$)
- $R(\theta, \delta) = E_X[\mathcal{L}(\theta, \delta(X))\lvert\theta]$
- Expectation of loss incurred by decision procedure $\delta(X)$ when $\theta$ is true.
- For a no-data problem (no $X$), $R(\theta, a) = \mathcal{L}(\theta, a)$
- Additional Elements of a Bayesian Decision Problem
- $\theta\sim\pi$: Prior Distribution for parameter $\theta\in\Theta$
- $r(\pi, \delta)$: Bayes Risk of $\delta$ given prior distribution $\pi$
- $r(\pi, \delta) = E_{\theta^*}[R(\theta^*, \delta)]$, taking the expectation with respect to $\theta^*\sim\pi$
- Bayes rule $\delta^*$: Decision procedure that minimizes the Bayes risk: $r(\pi, \delta^*) = \min_{\delta\in D} r(\pi, \delta)$
Example of Statistical Decision Problems
- Statistical Estimation Problem
- Given:
- $X\sim P_\theta = N(\theta,1), \quad -\infty<\theta<\infty$
- $\mathcal{A} = \Theta = \mathcal{R}$
- Squared-error Loss:
\[\mathcal{L}(\theta, a) = (a-\theta)^2\]
- Decision procedure: for a finite constant $c: 0<c\leq 1$,
\[\delta_c(X) = cX\]
- Risk function:
\[\begin{aligned} R(\theta, \delta_c) &= E_X[(\delta_c(X) - \theta)^2 \lvert \theta] \\ &= Var(\delta_c(X)\lvert\theta) + [E_X[\delta_c(X)\lvert \theta] - \theta]^2 \\ &= c^2 +(c-1)^2\theta^2 \end{aligned}\]
- Special cases:
- $\delta_1(X) = X:\quad R(\theta, \delta_1) =1$ (independent of $\theta$)
- $\delta_0(X) \equiv 0:\quad R(\theta, \delta_0) = \theta^2$ (zero at $\theta = 0$, unbounded in $\theta$)
- $\delta_{0.5}(X) = \frac{X}{2}: \quad R(\theta, \delta_{0.5}) = \frac{1}{4}(1+\theta^2)$
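The closed-form risk $R(\theta, \delta_c) = c^2 + (c-1)^2\theta^2$ can be checked by simulation. A minimal sketch, assuming the setup above ($X\sim N(\theta,1)$, squared-error loss); the particular values of $\theta$ and $c$ are illustrative.

```python
import numpy as np

def risk_closed_form(theta, c):
    """R(theta, delta_c) = c^2 + (c - 1)^2 * theta^2 for X ~ N(theta, 1)."""
    return c**2 + (c - 1) ** 2 * theta**2

def risk_monte_carlo(theta, c, n_sim=200_000, seed=0):
    """Approximate E[(cX - theta)^2 | theta] by simulation."""
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=theta, scale=1.0, size=n_sim)
    return np.mean((c * x - theta) ** 2)

theta, c = 2.0, 0.5
print(risk_closed_form(theta, c))   # 0.25 + 0.25 * 4 = 1.25
print(risk_monte_carlo(theta, c))   # close to 1.25
```

Comparing the curves over $\theta$ reproduces the special cases: $\delta_1$ has constant risk $1$, while shrinkage ($c<1$) trades lower risk near $\theta=0$ for unbounded risk as $\lvert\theta\rvert$ grows.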
- Mean-Squared Error: Estimation Risk (Squared-Error Loss)
- $X\sim P_\theta, \theta\in\Theta$
- Parameter of interest: $v(\theta)$ (some function of $\theta$)
- Action Space: $\mathcal{A} = \{ v=v(\theta), \theta\in\Theta \}$
- Decision procedure/estimator: $\hat{v}(X):\mathcal{X}\rightarrow\mathcal{A}$
- Squared Error Loss: $\mathcal{L}(\theta, a) = [a - v(\theta)]^2$
- Risk equals Mean-Squared Error:
\[\begin{aligned} R(\theta, \hat{v}(X)) &= E[\mathcal{L}(\theta, \hat{v}(X)) \lvert \theta] \\ &= E[(\hat{v}(X) - v(\theta))^2\lvert \theta] = MSE(\hat{v}) \end{aligned}\]
- Proposition 1.3.1 For an estimator $\hat{v}(X)$ of $v(\theta)$, the mean-squared error is
\[MSE(\hat{v}) = Var[\hat{v}(X)\lvert \theta] + [Bias(\hat{v} \lvert \theta)]^2\]
where $Bias(\hat{v}\lvert\theta) = E[\hat{v}(X)\lvert \theta] - v(\theta)$.
Definition: $\hat{v}$ is Unbiased if $Bias(\hat{v} \lvert \theta) = 0$ for all $\theta\in\Theta$
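Proposition 1.3.1 can be verified numerically. A sketch under assumed values: a hypothetical shrinkage estimator $\hat{v} = c\bar{X}$ of $v(\theta) = \theta$ for i.i.d. $N(\theta, \sigma^2)$ data, for which $Bias = (c-1)\theta$ and $Var = c^2\sigma^2/n$.

```python
import numpy as np

# Monte Carlo check of MSE(v_hat) = Var + Bias^2 for the shrinkage estimator
# v_hat = c * Xbar of v(theta) = theta, with X_1..X_n i.i.d. N(theta, sigma^2).
rng = np.random.default_rng(1)
theta, sigma, n, c = 1.5, 2.0, 10, 0.8
n_sim = 100_000

x = rng.normal(theta, sigma, size=(n_sim, n))
v_hat = c * x.mean(axis=1)

mse = np.mean((v_hat - theta) ** 2)
var = np.var(v_hat)                 # simulated Var[v_hat | theta]
bias = np.mean(v_hat) - theta       # simulated Bias(v_hat | theta)

print(mse)            # matches var + bias**2
print(var + bias**2)  # theory: c^2 sigma^2/n + (c-1)^2 theta^2 = 0.256 + 0.09
```

The estimator is biased for $c \neq 1$, so the squared-bias term contributes to the MSE alongside the variance.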
- Statistical Testing Problem (Two-Sample Problem)
- Given:
- $X_1, \ldots, X_m$ i.i.d. $\mathcal{N}(\mu, \sigma^2)$, response under control treatment
- $Y_1, \ldots, Y_n$ i.i.d. $\mathcal{N}(\mu + \Delta, \sigma^2)$, response under test treatment
- $\mu \in R, \sigma^2\in R_+$ unknown
- $\Delta\in R$ is the unknown treatment effect
- Let $P(X,Y\lvert\mu, \Delta, \sigma^2)$ denote the joint distribution of $X = (X_1, \ldots, X_m)$ and $Y=(Y_1, \ldots, Y_n)$
- Define two hypotheses:
- $H_0: P\in\{ P:\Delta=0 \} = \{P_\theta, \theta \in \Theta_0 \}$
- $H_1: P\in \{P:\Delta\neq0\} = \{ P_\theta, \theta \notin \Theta_0 \}$
- $\mathcal{A} = \{0,1\}$ with $0$ corresponding to accepting $H_0$ and $1$ to rejecting $H_0$.
- Construct a decision rule that rejects $H_0$ if the estimate of $\Delta$ is significantly different from zero, e.g.,
\[\delta(X,Y) = \begin{cases} 0\quad \text{if} \quad\left\lvert\frac{\hat{\Delta}}{\hat{\sigma}}\right\rvert <c \quad (\text{critical value}) \\ 1\quad \text{if} \quad\left\lvert\frac{\hat{\Delta}}{\hat{\sigma}}\right\rvert \geq c \end{cases}\]
- $\hat{\Delta} = \bar{Y} - \bar{X}$ (difference in sample means)
- $\hat{\sigma}$: an estimate of $\sigma$
- Zero-one Loss function
\[L(\theta, a) = \begin{cases} 0\quad \text{if} \quad \theta\in\Theta_a\quad (\text{correct action}) \\ 1\quad \text{if} \quad \theta\notin\Theta_a\quad (\text{wrong action}) \end{cases}\]
- Risk function: taken as the measure of performance of the decision rule $\delta(X,Y)$.
\[\begin{aligned} R(\theta, \delta) &= E[L(\theta, \delta(X,Y))\lvert\theta] \\ &= L(\theta, 0)P_\theta(\delta(X,Y) = 0) + L(\theta, 1)P_\theta(\delta(X,Y) = 1) \\ &= P_\theta(\delta(X,Y) = 1), \quad \text{if}\quad \theta\in\Theta_0 \\ &= P_\theta(\delta(X,Y) = 0), \quad \text{if}\quad \theta\notin\Theta_0 \end{aligned}\]
- Terminology of Statistical Testing
- Critical Region of a test $\delta(\cdot)$:
\[C = \{x: \delta(x) = 1\}\]
- Type I Error
$\delta(X)$ rejects $H_0$ when $H_0$ is true.
- Type II Error
$\delta(X)$ accepts $H_0$ when $H_0$ is false.
- Neyman-Pearson framework
Constrained optimization of risks:
Minimize: P(Type II Error) Subject to: P(Type I Error) $\leq \alpha$ (“significance level”)
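The Type I error probability of the rule above can be estimated by simulation. A minimal sketch under assumed settings: equal sample sizes $m = n = 50$, a pooled variance estimate for $\hat{\sigma}$, and critical value $c = 1.96$ (roughly $\alpha = 0.05$); all numeric choices are illustrative.

```python
import numpy as np

# Monte Carlo estimate of P(Type I error) for the rule
# delta(X, Y) = 1 iff |Delta_hat / se_hat| >= c, under H0: Delta = 0.
rng = np.random.default_rng(2)
m = n = 50
mu, sigma, c = 0.0, 1.0, 1.96
n_sim = 50_000

x = rng.normal(mu, sigma, size=(n_sim, m))
y = rng.normal(mu, sigma, size=(n_sim, n))  # Delta = 0, so H0 is true

delta_hat = y.mean(axis=1) - x.mean(axis=1)
# Pooled variance estimate and the standard error of Delta_hat.
s2 = ((m - 1) * x.var(axis=1, ddof=1) + (n - 1) * y.var(axis=1, ddof=1)) / (m + n - 2)
se = np.sqrt(s2 * (1 / m + 1 / n))

reject = np.abs(delta_hat / se) >= c
print(reject.mean())  # near 0.05, the target significance level
```

Repeating the simulation with $\Delta \neq 0$ would estimate the Type II error probability, the quantity minimized in the Neyman-Pearson framework.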
- Interval Estimation and Confidence Bounds
- Value-at-Risk (VAR)
- Let $X_1, X_2, \ldots$ be the change in value of an asset over independent fixed holding periods and suppose they are i.i.d. $X \sim P_\theta$ for some fixed $\theta\in\Theta$.
- For $\alpha = 0.05$, say, define $VAR_{\alpha}$ (the level-$\alpha$ Value-at-Risk) by $P_\theta(X \leq -VAR_{\alpha}) = \alpha$
- Consider estimating the VAR of $X_{n+1}$ given $X=(X_1, \ldots, X_n)$: determine an estimator $\hat{VAR}(X)$ such that
\[P_{\theta}(X_{n+1} \leq -\hat{VAR}(X)) \leq \alpha\]
for all $\theta\in\Theta$.
- The outcome $X_{n+1}$ exceeds $VAR_{\alpha}$ to the downside with probability no greater than $\alpha (= 0.05)$
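One simple candidate (not the only one) is the nonparametric plug-in: take $\hat{VAR}(X)$ as the negated empirical $\alpha$-quantile of the observed changes in value. A sketch assuming i.i.d. data; the normal distribution used to simulate the sample is purely illustrative.

```python
import numpy as np

# Nonparametric VaR estimate: the empirical alpha-quantile of past
# changes in value, negated, so that P(X <= -var_hat) is about alpha.
rng = np.random.default_rng(3)
alpha = 0.05
n = 5_000
x = rng.normal(loc=0.0, scale=1.0, size=n)  # simulated changes in asset value

var_hat = -np.quantile(x, alpha)

print(var_hat)  # near 1.645 for N(0, 1), since P(X <= -1.645) = 0.05
```

Guaranteeing $P_\theta(X_{n+1} \leq -\hat{VAR}(X)) \leq \alpha$ for all $\theta$ (rather than approximately) is exactly the lower-bound estimation problem treated next.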
- Lower-Bound Estimation
- $X\sim P_\theta, \theta \in \Theta$
- Parameter of interest: $v(\theta)$
- Action Space: $\mathcal{A} = \{ v=v(\theta), \theta\in\Theta \}$
- Estimator: $\hat{v}(X): \mathcal{X}\rightarrow\mathcal{A}$
- Objective: bounding $v(\theta)$ from below
- Lower-Bound Estimator: $\hat{v}(X)$ is good if
- $P_\theta(\hat{v}(X) \leq v(\theta))$ is high
- $P_\theta(\hat{v}(X) > v(\theta))$ is low $\Rightarrow$ Define the loss function:
- $L(\theta, a) = 1$ if $a>v(\theta)$; zero otherwise.
- Risk function under zero-one loss $L(\theta, a)$:
\[R(\theta, \hat{v}(X)) = E[L(\theta, \hat{v}(X))\lvert \theta] = P_\theta(\hat{v}(X) > v(\theta))\]
- The Lower-Bound Estimator $\hat{v}(X)$ has Confidence Level $(1-\alpha)$ if
\[P_\theta(\hat{v}(X) \leq v(\theta)) \geq 1-\alpha,\]for all $\theta \in \Theta$.
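A standard instance is the lower confidence bound for a normal mean with known variance, $\hat{v}(X) = \bar{X} - z_{1-\alpha}\,\sigma/\sqrt{n}$. The sketch below checks its coverage by simulation; the parameter values and the hard-coded $z_{0.95} = 1.645$ are assumptions for illustration.

```python
import numpy as np

# Coverage check for the lower confidence bound
# v_hat(X) = Xbar - z_{1-alpha} * sigma / sqrt(n) for v(theta) = theta,
# with X_1..X_n i.i.d. N(theta, sigma^2), sigma known, alpha = 0.05.
rng = np.random.default_rng(4)
theta, sigma, n, z = 0.7, 1.0, 25, 1.645
n_sim = 100_000

# Simulate Xbar directly: Xbar ~ N(theta, sigma^2 / n).
xbar = rng.normal(theta, sigma / np.sqrt(n), size=n_sim)
v_hat = xbar - z * sigma / np.sqrt(n)

coverage = np.mean(v_hat <= theta)  # estimates P_theta(v_hat <= v(theta))
print(coverage)  # near 0.95, so the risk P(v_hat > theta) is near alpha
```

Here the zero-one risk $P_\theta(\hat{v}(X) > v(\theta))$ equals $\alpha$ exactly for every $\theta$, so the confidence level $(1-\alpha)$ holds uniformly over $\Theta$.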
- Interval (Lower and Upper Bound) Estimation
- $X \sim P_\theta, \theta\in \Theta$
- Parameter of interest: $v(\theta)$
- Define $\mathcal{V} = \{v = v(\theta), \theta\in\Theta\}$
- Objective: Interval estimation of $v(\theta)$
- Action Space: $\mathcal{A} = \{a = [a_{lower}, a_{upper}]: a_{lower}, a_{upper}\in\mathcal{V}\}$
- Estimator: \(\begin{aligned} &\hat{v}(X): \mathcal{X}\rightarrow\mathcal{A} \\ &\hat{v}(X) = [\hat{v}_{LOWER}(X), \hat{v}_{UPPER}(X)] \end{aligned}\)
- Interval Estimator: $\hat{v}(X)$ is good if
- $P_\theta(\hat{v}_{LOWER}(X)\leq v(\theta) \leq \hat{v}_{UPPER}(X))$ is high
- $P_\theta(\hat{v}_{LOWER}(X) > v(\theta) \,\text{or}\, \hat{v}_{UPPER}(X) < v(\theta))$ is low
Note that $\theta$ is non-random here: these probabilities refer to the randomness of $X$. Treating $\theta$ itself as random requires a Bayesian model.
- Define the loss function \(\begin{aligned} L(\theta, (a_{lower}, a_{upper})) &= 1, \;\text{if}\; a_{lower}>v(\theta) \;\text{or}\; a_{upper} < v(\theta) \\ &= 0, \;\text{otherwise} \end{aligned}\)
- Risk function under zero-one loss $L(\theta, a)$: \(\begin{aligned} R(\theta, \hat{v}(X)) &= E[L(\theta, \hat{v}(X)) \lvert \theta] \\ &= P_\theta(\hat{v}_{LOWER}(X) > v(\theta) \;\text{or}\; \hat{v}_{UPPER}(X) < v(\theta)) \\ &= 1 - P_{\theta}(\hat{v}_{LOWER}(X) \leq v(\theta) \leq \hat{v}_{UPPER}(X)) \end{aligned}\)
- The Interval Estimator $\hat{v}(X)$ has Confidence Level $(1-\alpha)$ if \(P_\theta(\hat{v}_{LOWER}(X) \leq v(\theta) \leq \hat{v}_{UPPER}(X)) \geq 1-\alpha \;\text{for all}\;\theta\in\Theta.\) Equivalently: \(R(\theta, \hat{v}(X)) \leq \alpha \;\text{for all}\;\theta\in\Theta.\)
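The two-sided analogue of the previous construction is the familiar interval $\bar{X} \pm z_{1-\alpha/2}\,\sigma/\sqrt{n}$ for a normal mean with known variance. A simulation sketch of its coverage, with illustrative parameter values and $z_{0.975} = 1.96$ assumed:

```python
import numpy as np

# Coverage check for [Xbar - z*sigma/sqrt(n), Xbar + z*sigma/sqrt(n)]
# as an interval estimator of theta, sigma known, alpha = 0.05.
rng = np.random.default_rng(5)
theta, sigma, n, z = -0.3, 2.0, 16, 1.96
n_sim = 100_000

xbar = rng.normal(theta, sigma / np.sqrt(n), size=n_sim)  # Xbar ~ N(theta, sigma^2/n)
half = z * sigma / np.sqrt(n)
lower, upper = xbar - half, xbar + half

covered = (lower <= theta) & (theta <= upper)
print(covered.mean())      # near 1 - alpha = 0.95
print(1 - covered.mean())  # zero-one risk R(theta, v_hat), near alpha
```

The non-coverage rate is exactly the zero-one risk, matching the equivalence $R(\theta, \hat{v}(X)) \leq \alpha$ above.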
- Choosing Among Decision Procedures
- Admissible/Inadmissible: A decision procedure $\delta(\cdot)$ is inadmissible if $\exists\,\delta'$ such that $R(\theta, \delta') \leq R(\theta, \delta)$ for all $\theta \in\Theta$, with strict inequality for some $\theta$.
- Objectives:
- Restrict $\mathcal{D}$ to exclude inadmissible decision procedures.
- Characterize “Complete Class” (all admissible procedures).
- Formalize ‘best’ choice amongst all admissible procedures.
- Approaches to Decision Selection
- Two risk functions based on global criteria
- Bayes Risk
- Basic Elements of Decision Problem
- $X\sim P_\theta$: R.V.
- $\delta(X)$: Decision Procedure
- $\mathcal{D}$: Decision Space
- $R(\theta, \delta)$: Risk Function
- Additional Elements of Bayesian Decision Problem
- $\theta \sim \pi$: Prior Distribution for parameter $\theta\in\Theta$
- $r(\pi, \delta)$: Bayes Risk of $\delta$ given prior distribution $\pi$
- Bayes rule $\delta^*$: Decision procedure that minimizes the Bayes risk
- Computation
- Discrete priors \(r(\pi, \delta) = \sum_\theta \pi(\theta)R(\theta, \delta)\)
- Continuous priors \(r(\pi, \delta) = \int_\Theta \pi(\theta)R(\theta, \delta)d\theta\)
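For a discrete prior the Bayes risk is a finite weighted sum, which makes the comparison of procedures concrete. A sketch reusing the earlier estimation example $\delta_c(X) = cX$ with $R(\theta, \delta_c) = c^2 + (c-1)^2\theta^2$; the two-point prior on $\{0, 2\}$ is hypothetical.

```python
import numpy as np

# Bayes risk r(pi, delta_c) = sum_theta pi(theta) R(theta, delta_c)
# for delta_c(X) = cX, X ~ N(theta, 1), squared-error loss.
thetas = np.array([0.0, 2.0])  # support of the (hypothetical) prior
pi = np.array([0.5, 0.5])      # prior probabilities

def bayes_risk(c):
    risks = c**2 + (c - 1) ** 2 * thetas**2  # R(theta, delta_c) at each theta
    return np.sum(pi * risks)                # r(pi, delta_c)

# Within this family, r(pi, delta_c) = c^2 + (c-1)^2 E[theta^2] is minimized
# at c* = E[theta^2] / (1 + E[theta^2]).
e_th2 = np.sum(pi * thetas**2)   # = 2
c_star = e_th2 / (1 + e_th2)     # = 2/3
print(bayes_risk(1.0))    # 1.0
print(bayes_risk(c_star)) # 2/3, smaller than for any other c in the family
```

Note that $c^*$ minimizes only over the linear family $\{\delta_c\}$; the full Bayes rule comes from the posterior analysis mentioned below.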
- Identifying Bayes Procedures
- Posterior analysis specifies Bayes rules directly
- Apply the Posterior Distribution of $\theta$ given $X$ to minimize the posterior risk.
- Maximum Risk (minimax approach)
- Minimax Criterion
- Prefer $\delta$ to $\delta’$ if \(\sup_{\theta\in\Theta}R(\theta, \delta) < \sup_{\theta\in\Theta} R(\theta, \delta')\)
- A procedure $\delta^*$ is called minimax if \(\sup_{\theta\in\Theta}R(\theta, \delta^*) = \inf_{\delta\in\mathcal{D}}\sup_{\theta\in\Theta}R(\theta, \delta)\)
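The minimax criterion can be illustrated within the family $\delta_c(X) = cX$ from the estimation example, where $R(\theta, \delta_c) = c^2 + (c-1)^2\theta^2$. Over an unbounded $\Theta$ the sup is infinite for every $c \neq 1$, so the sketch below uses a hypothetical bounded parameter set $\Theta = [-3, 3]$.

```python
import numpy as np

# Worst-case (sup over theta) risk of delta_c(X) = cX on Theta = [-3, 3],
# using R(theta, delta_c) = c^2 + (c - 1)^2 * theta^2.
thetas = np.linspace(-3, 3, 601)  # grid over the bounded parameter set

def sup_risk(c):
    return np.max(c**2 + (c - 1) ** 2 * thetas**2)

for c in [1.0, 0.9, 0.5]:
    print(c, sup_risk(c))
# c = 1.0: sup risk 1.0 (constant risk)
# c = 0.9: sup risk 0.81 + 0.01 * 9 = 0.90  (beats delta_1 on this Theta)
# c = 0.5: sup risk 0.25 + 0.25 * 9 = 2.50
```

On this bounded set, minimizing $c^2 + (1-c)^2\cdot 9$ over $c$ gives $c = 9/10$, so within this family the minimax choice shrinks slightly toward zero; on $\Theta = \mathcal{R}$ the unbiased $\delta_1$ is the minimax member of the family.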