Chapter 1 Statistical Models, Goals, And Performance Criteria

1.1 Data, Models, Parameters and Statistics

Definition of Statistical Model

  • Random experiment with sample space $\Omega$.
  • Random vector $X=(X_1, \ldots, X_n)$ defined on $\Omega$.
    • $\omega \in \Omega$: outcome of experiment
    • $X(\omega)$: data observations
  • Probability distribution of $X$
    • $\mathcal{X}$: Sample space $= \{\text{outcomes } x\}$
    • $\mathcal{F}_X$: Sigma-field of measurable events.
    • $P(\cdot)$ defined on $(\mathcal{X}, \mathcal{F}_X)$
  • Statistical Model
    • $\mathcal{P} = \{\text{family of distributions}\}$
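
For example, in a coin-tossing experiment with $n$ independent tosses, $X$ could be the number of heads, so that $\mathcal{X} = \{0, 1, \ldots, n\}$ and $\mathcal{P} = \{\mathrm{Binomial}(n, \theta): \theta \in [0, 1]\}$.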

Review of STAT609

  • Sigma-Field $\mathcal{F}$: A collection $\mathcal{F}$ of subsets of $\Omega$ is called a $\sigma$-field if it satisfies the following conditions:
    • $(\emptyset = \bar{\Omega}) \in \mathcal{F}$.
    • If $A_1 \in \mathcal{F}, A_2\in\mathcal{F}, \ldots$, then $\cup_{i=1}^\infty A_i\in\mathcal{F}$; that is, $\mathcal{F}$ is closed under countable unions.
    • If $A\in\mathcal{F}$, then $\bar{A}\in\mathcal{F}$.
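
For example, with $\Omega = \{1, 2, 3, 4\}$, the collection $\mathcal{F} = \{\emptyset, \{1,2\}, \{3,4\}, \Omega\}$ satisfies all three conditions and is therefore a $\sigma$-field.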

Definition of Parameters/ Parametrization

  • Parameter $\theta$ identifies/specifies distribution in $\mathcal{P}$.
  • $\mathcal{P} = \{P_\theta: \theta\in\Theta\}$
  • $\Theta=\{\theta\}$, the parameter space
  • The parametrization should be identifiable, that is, $\theta_1\neq\theta_2 \Rightarrow P_{\theta_1} \neq P_{\theta_2}$ (definition of identifiability).
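
For example, if $X \sim N(\theta_1 + \theta_2, 1)$ with parameter $\theta = (\theta_1, \theta_2)$, the parametrization is not identifiable, since $(\theta_1 + c, \theta_2 - c)$ gives the same distribution for every $c$; reparametrizing by $\eta = \theta_1 + \theta_2$ restores identifiability.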

Example: One-Sample Model
Given conditions:

  • $X_1, \ldots, X_n$ i.i.d. with distribution function $F(\cdot)$
  • Probability Model: $\mathcal{P} = \{\text{distribution functions } F(\cdot)\}$
  • Measurement Error Model: $X_i = \mu + \epsilon_i, \quad i=1,2,\ldots, n$
    • $\mu$ is a constant parameter (e.g., real-valued, positive)
    • $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$ are i.i.d. with distribution function $G(\cdot)$; note that $G$ does not depend on $\mu$.

It follows that $X_1, \ldots, X_n$ are i.i.d. with distribution function $F(x) = G(x-\mu)$, and that $\mathcal{P} = \{(\mu, G): \mu\in \mathbb{R}, G\in\mathcal{G}\}$. Depending on the class $\mathcal{G}$, this one-sample model falls into the following cases:

  • Parametric Model: Gaussian measurement errors. $\{\epsilon_j\}$ are i.i.d. $N(0, \sigma^2)$ with $\sigma^2>0$, where the exact value of $\sigma^2$ is unknown.
  • Semi-Parametric Model: Symmetric measurement-error distributions with mean $\mu$.
    $\{\epsilon_j\}$ are i.i.d. with distribution function $G(\cdot)$, where $G\in\mathcal{G}$, the class of symmetric distributions with mean 0.
  • Non-Parametric Model: $X_1, \ldots, X_n$ are i.i.d. with distribution function $G(\cdot)$, where $G\in\mathcal{G}$, the class of all distributions on the sample space $\mathcal{X}$ (with center $\mu$).
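
As a concrete illustration, the sketch below simulates the parametric (Gaussian) case of the measurement error model; the particular values $\mu = 5$, $\sigma = 2$, and $n = 100$ are illustrative assumptions, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parametric one-sample model: X_i = mu + eps_i with eps_i i.i.d. N(0, sigma^2)
mu, sigma, n = 5.0, 2.0, 100           # illustrative values, not from the notes
eps = rng.normal(loc=0.0, scale=sigma, size=n)
x = mu + eps                           # observed data X_1, ..., X_n

# Under this model F(x) = G(x - mu), and the sample mean is a natural estimate of mu
print("sample mean:", x.mean())
```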

Example: Two-Sample Model
Given conditions:

  • $X_1, \ldots, X_n$ i.i.d. with distribution function $F(\cdot)$
  • $Y_1, \ldots, Y_m$ i.i.d. with distribution function $G(\cdot)$
  • Probability Model: $\mathcal{P} = \{(F,G): F\in\mathcal{F}, G\in\mathcal{G}\}$. Specific cases relate $\mathcal{F}$ and $\mathcal{G}$.
  • Shift Model with parameter $\delta$
    • $\{X_i\}$ i.i.d., $X\sim F(\cdot)$: response under Treatment A.
    • $\{Y_i\}$ i.i.d., $Y\sim G(\cdot)$: response under Treatment B.
    • $Y=X+\delta$, i.e., $G(v) = F(v-\delta)$
    • $\delta$ is the change in response from receiving Treatment B instead of Treatment A, and $\delta$ does not depend on $X$ (the response under Treatment A).
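
As a minimal sketch of the shift model, assume (purely for illustration) that $F$ is the standard normal distribution and $\delta = 1.5$; the difference of sample means then estimates the shift.

```python
import numpy as np

rng = np.random.default_rng(1)

# Shift model: Y = X + delta, i.e., G(v) = F(v - delta)
delta, n, m = 1.5, 200, 150                  # illustrative values
x = rng.normal(0.0, 1.0, size=n)             # responses under Treatment A, X ~ F
y = rng.normal(0.0, 1.0, size=m) + delta     # responses under Treatment B, Y ~ G

# A natural estimate of the shift parameter delta
print("estimated shift:", y.mean() - x.mean())
```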

Modeling Issues

  • Non-uniqueness of parametrization
  • Varying complexity of equivalent parametrizations
  • Possible non-identifiability of parameters
  • Parameters “of interest” vs. “nuisance” parameters
  • A vector parametrization that is unidentifiable may have identifiable components.
  • Data-based model selection: How does using the data to select among models affect statistical inference?
  • Data-based sampling procedures: How does the protocol for collecting data observations affect statistical inference?

Regular Models

  • Notations:
    • $\theta$: A parameter specifying a probability distribution $P_\theta$.
    • $F(\cdot \lvert\theta)$: Distribution function of $P_\theta$
    • $E_\theta[\cdot]$: Expectation under the assumption $X\sim P_\theta$. For a measurable function $g(X)$, \(E_\theta[g(X)] = \int_\mathcal{X}g(x)dF(x\lvert\theta)\)
    • $p(x\lvert\theta) = p(x; \theta)$: density or probability-mass function of $X$
  • Assumptions:
    • Either all of the $P_\theta$ are continuous with densities $p(x\lvert\theta)$, or all of the $P_\theta$ are discrete with pmfs $p(x\lvert\theta)$.
    • The set $\{x: p(x\lvert\theta) > 0\}$ is the same for all $\theta \in \Theta$; that is, the support does not depend on $\theta$.
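
For example, the family $\{\mathrm{Uniform}(0, \theta): \theta > 0\}$ violates the second assumption: the set $\{x: p(x\lvert\theta) > 0\} = (0, \theta)$ changes with $\theta$, so that model is not regular in this sense.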

Regression Models
Given: $n$ cases $i=1,2,\ldots, n$

  • 1 Response (dependent) variable \(y_i, \,i=1,2,\ldots, n\)
  • $p$ Explanatory (independent) variables \(x_i = (x_{i,1}, \ldots, x_{i,p})^T, \,i=1,2,\ldots, n\)

Goal of Regression Analysis:

  • Extract/exploit relationship between $y_i$ and $x_i$.

Steps for fitting a model:

  1. Propose a model in terms of
    • Response variable $Y$
    • Explanatory variables $X_1, \ldots, X_p$
    • Assumptions about the distribution of $\epsilon$ over the cases.
  2. Specify/define a criterion for judging different estimators.
  3. Characterize the best estimator and apply it to the given data.
  4. Check the assumptions in (1).
  5. If necessary, modify the model and/or assumptions and return to (1).
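
As a rough sketch of steps (1)-(3), the code below fits a simple linear model by ordinary least squares on simulated data; the data-generating values and the choice of least squares as the criterion are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Step 1: propose a linear model y_i = beta_0 + beta_1 * x_i + eps_i
n = 50
x = rng.uniform(0.0, 10.0, size=n)           # one explanatory variable (p = 1)
eps = rng.normal(0.0, 1.0, size=n)           # assumed residual distribution
y = 2.0 + 0.5 * x + eps                      # illustrative "true" coefficients

# Steps 2-3: take least squares as the criterion and compute the estimator
X = np.column_stack([np.ones(n), x])         # design matrix with an intercept column
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Step 4: check the assumptions, e.g., by inspecting the residuals
residuals = y - X @ beta_hat
print("beta_hat:", beta_hat, "residual mean:", residuals.mean())
```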

Specifying Assumptions in (1) for Residual Distribution:

  • Gauss-Markov: zero mean, constant variance, uncorrelated
  • Normal-linear models: $\epsilon_i$ are i.i.d. $N(0, \sigma^2)$ r.v.s
  • Generalized Gauss-Markov: zero mean, and general covariance matrix (possibly correlated, possibly heteroscedastic)
  • Non-normal/non-Gaussian distributions (e.g., Laplace, Pareto, or contaminated normal: some fraction $(1-\delta)$ of the $\epsilon_i$ are i.i.d. $N(0,\sigma^2)$ r.v.s, while the remaining fraction $\delta$ follows some contamination distribution).
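
The snippet below sketches one way to generate contaminated-normal errors: each $\epsilon_i$ comes from $N(0, \sigma^2)$ with probability $1-\delta$ and from a wider contamination distribution otherwise. The choices $\delta = 0.1$ and $N(0, (5\sigma)^2)$ for the contamination are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Contaminated normal errors: fraction (1 - delta) from N(0, sigma^2),
# fraction delta from a contamination distribution, here N(0, (5*sigma)^2).
n, sigma, delta = 1000, 1.0, 0.1             # illustrative values
contaminated = rng.random(n) < delta         # which observations are contaminated
eps = np.where(
    contaminated,
    rng.normal(0.0, 5.0 * sigma, size=n),    # contamination component
    rng.normal(0.0, sigma, size=n),          # "clean" Gaussian component
)
print("empirical variance:", eps.var())      # inflated relative to sigma^2
```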

Time Series Models
Example: Measurement Model with Autoregressive Errors
Model:

  • $X_1, X_2, \ldots, X_n$ are $n$ successive measurements of a physical constant $\mu$
  • $X_i = \mu + e_i, \,i=1,2,\ldots,n$
  • $e_i = \beta e_{i-1} + \epsilon_i, \,i=1,2,\ldots,n$, with $e_0=0$, where the $\epsilon_i$ are i.i.d. with density $f(\cdot)$.

Note:

  • $e_i$ are not i.i.d. but dependent.
  • $X_i$ are dependent as well:
    \(\begin{aligned} &X_i = \mu(1-\beta) + \beta X_{i-1} + \epsilon_i, i=2,\ldots, n \\ &X_1 = \mu + \epsilon_1 \end{aligned}\)
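
A minimal simulation of this measurement model with autoregressive errors, assuming Gaussian innovations $\epsilon_i$ for the density $f(\cdot)$ (the notes only require some i.i.d. density); $\mu = 10$ and $\beta = 0.7$ are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(4)

# Measurement model with AR(1) errors:
#   X_i = mu + e_i,   e_i = beta * e_{i-1} + eps_i,   e_0 = 0
mu, beta, n = 10.0, 0.7, 200                 # illustrative values
eps = rng.normal(0.0, 1.0, size=n)           # i.i.d. innovations standing in for f
e = np.zeros(n)
e[0] = eps[0]                                # e_1 = eps_1 since e_0 = 0
for i in range(1, n):
    e[i] = beta * e[i - 1] + eps[i]
x = mu + e                                   # dependent observations X_1, ..., X_n

# Unlike the i.i.d. one-sample model, successive X_i are correlated
print("lag-1 correlation:", np.corrcoef(x[:-1], x[1:])[0, 1])
```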

Remarks

  1. How do we prove that a given parametrization is identifiable?
  2. How do we determine whether a given model is regular?