A self-learning note for statistical decision theory

As a statistician who often solves problems from a frequentist perspective but with a Bayesian soul (allegedly), I have always found statistical decision theory an intriguing topic. Thanks to the holiday season, I finally got a chance to revisit it, and I hope to build a more systematic understanding of it this time. To help myself master the material, I plan to learn by writing and sharing. As the very first step of this project, I’d like to cover the basic elements in this blog post: (i) Motivation: what value can statistical decision theory offer? (ii) Notations and key concepts; (iii) Optimality criteria: how do we pick a decision among the candidate decisions?

Motivation

Statistical decision theory, as noted by Dr. James O. Berger in his book “Statistical Decision Theory and Bayesian Analysis”, is concerned with the problem of making decisions in the presence of statistical knowledge that sheds light on some of the uncertainties involved in the decision-making problem. Typically, the uncertainties are represented by unknown quantities (parameters). For example, the proportion \(\theta\) of people for whom a drug will prove effective is generally unknown, and one must conduct experiments to obtain statistical information about it. Classical statistics aims at making inferences about the parameters based on the sample information only, whereas statistical decision theory aims at combining the sample information with other relevant knowledge (the possible consequences of different decisions, and a priori knowledge about the possible values of the parameters) to make the decision.

Notations and Key Concepts

Notations:

Observed data: \(X\)
An action: \(a \in \mathcal{A}\)
A decision rule: \(a = \delta(X)\)
Parameters (state of the world): \(\theta \in \Theta\)
A loss function: \(L(\theta, a)\), where \(L(\theta, a) \geqslant 0\) for all \(a \in \mathcal{A}\) and \(\theta \in \Theta\)
Data generating model: \(f(X|\theta)\)

The most commonly used loss functions include

Squared error: \(L(\theta, \delta(X)) = (\theta - \delta(X))^2\)
Absolute error: \(L(\theta, \delta(X)) = |\theta - \delta(X)|\)
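To make the two losses concrete, here is a minimal Python sketch. The function names and the numbers are my own illustrative choices, not part of any library.

```python
def squared_error(theta, action):
    """Squared error loss L(theta, a) = (theta - a)^2."""
    return (theta - action) ** 2


def absolute_error(theta, action):
    """Absolute error loss L(theta, a) = |theta - a|."""
    return abs(theta - action)


# Hypothetical true parameter and three candidate actions (estimates).
theta = 0.3
for action in (0.1, 0.3, 0.8):
    print(action, squared_error(theta, action), absolute_error(theta, action))
```

For misses smaller than 1, the squared error is smaller than the absolute error, and for misses larger than 1 it is larger, so the squared error punishes big mistakes more heavily.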

Frequentist risk

Frequentist decision theory relies on the idea of evaluating how much we would “expect” to lose if we used \(\delta(X)\) repeatedly on varying data arising from the data generating model \(f(X|\theta)\). The risk function of a decision rule is defined as

\[ R(\theta, \delta) = \mathbb{E}_{f(X|\theta)}[L(\theta, \delta(X))]\]

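where the expectation is taken over all possible data. With the risk function in place, a frequentist looks for a decision rule with small risk. Since the risk depends on the unknown \(\theta\), there is usually no single rule that minimizes it for every \(\theta\) simultaneously.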
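The expectation in the risk function can be approximated by simulation. The sketch below assumes, purely for illustration, that the data are \(n\) i.i.d. draws from a Normal(\(\theta\), 1) model and that the loss is the squared error. The helper `monte_carlo_risk` and all parameter values are hypothetical choices of mine.

```python
import random
import statistics


def squared_error(theta, action):
    return (theta - action) ** 2


def monte_carlo_risk(rule, loss, theta, n, n_rep=20000, seed=0):
    """Approximate R(theta, delta) = E[L(theta, delta(X))] by simulation.

    Assumption: X = (X_1, ..., X_n) are i.i.d. Normal(theta, 1).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        total += loss(theta, rule(sample))
    return total / n_rep


# Two decision rules for estimating theta: the sample mean and the sample median.
risk_mean = monte_carlo_risk(statistics.fmean, squared_error, theta=2.0, n=10)
risk_median = monte_carlo_risk(statistics.median, squared_error, theta=2.0, n=10)
print(risk_mean, risk_median)
```

Under these assumptions the exact risk of the sample mean is \(1/n = 0.1\), so the first number should land close to that value, and the sample median should come out with a somewhat larger risk.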

Comparing decision rules

Note that the frequentist risk ranks decision rules along a single dimension only once the value of \(\theta\) is fixed; different values of \(\theta\) may yield different rankings. Therefore, to obtain a global comparison of decision rules, we need to aggregate these \(\theta\)-wise rankings into a single one. We discuss the three most commonly used ideas in this section (the first two are frequentist).

Admissibility

The idea of admissibility builds upon the notion of a dominance relationship between decision rules. A decision rule \(\delta\) is said to dominate another decision rule \(\delta'\) if

\[ R(\theta, \delta) \leqslant R(\theta, \delta') \] for all \(\theta\), and

\[ R(\theta, \delta) < R(\theta, \delta') \]

for at least one \(\theta\). A decision rule that is not dominated by any other decision rule is called admissible, whereas a rule that is dominated by some other rule is called inadmissible. It is worth noting that the dominance relationship only generates a partial ordering among decision rules (not all decision rules are comparable).
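The dominance check is easy to sketch in code once the risk functions are evaluated on a grid of \(\theta\) values. The example below assumes a single observation \(X \sim\) Normal(\(\theta\), 1) with squared error loss and three candidate rules I picked for illustration. Note that it only checks dominance within this small candidate set and on a finite grid, which is much weaker than admissibility among all decision rules.

```python
def dominates(risk_a, risk_b):
    """True if rule a is never worse than rule b on the grid and strictly better somewhere."""
    never_worse = all(ra <= rb for ra, rb in zip(risk_a, risk_b))
    somewhere_better = any(ra < rb for ra, rb in zip(risk_a, risk_b))
    return never_worse and somewhere_better


def admissible_within(risks):
    """Names of the rules that no other rule in the candidate set dominates."""
    kept = []
    for name, risk in risks.items():
        dominated = any(
            dominates(other, risk)
            for other_name, other in risks.items()
            if other_name != name
        )
        if not dominated:
            kept.append(name)
    return kept


# Grid of parameter values and exact squared-error risks (variance + bias^2)
# for one observation X ~ Normal(theta, 1).
thetas = [i / 10 for i in range(-30, 31)]
risks = {
    "X": [1.0 for _ in thetas],                    # unbiased, variance 1
    "X / 2": [0.25 + t ** 2 / 4 for t in thetas],  # variance 1/4, bias -theta/2
    "X + 1": [2.0 for _ in thetas],                # variance 1, bias 1
}
print(admissible_within(risks))
```

Here \(X\) dominates \(X + 1\), while \(X\) and \(X/2\) are not comparable: \(X/2\) has the smaller risk near \(\theta = 0\) and the larger risk far from it. This is the partial ordering in action.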

Minimaxity

The idea of minimaxity is motivated by considering, for each fixed \(\delta\), the worst-case risk over all possible values of \(\theta\),

\[ \tilde{R}(\delta) = \sup_{\theta} R(\theta, \delta) \]

and then comparing decision rules by their worst-case risk. A minimax decision rule, if it exists, solves the problem

\[ \delta^{*} = \arg\min_{\delta} \tilde{R}(\delta) \]
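As an illustration of comparing rules by worst-case risk, consider the textbook setting \(X \sim\) Binomial(\(n, \theta\)) with squared error loss (my choice of example, not something introduced above). The sketch compares the usual estimator \(X/n\) with the shrinkage estimator \((X + \sqrt{n}/2)/(n + \sqrt{n})\), whose risk is constant in \(\theta\). The supremum is approximated by a maximum over a grid, and the code only compares these two rules; it does not prove that either one is minimax.

```python
import math


def risk_mle(theta, n):
    """Exact squared-error risk of X / n when X ~ Binomial(n, theta)."""
    return theta * (1 - theta) / n


def risk_shrunk(theta, n):
    """Exact squared-error risk of (X + sqrt(n)/2) / (n + sqrt(n))."""
    root = math.sqrt(n)
    variance = n * theta * (1 - theta) / (n + root) ** 2
    bias = (root / 2 - root * theta) / (n + root)
    return variance + bias ** 2


def worst_case_risk(risk, n, grid_size=100):
    """Grid approximation of sup over theta in [0, 1] of R(theta, delta)."""
    return max(risk(i / grid_size, n) for i in range(grid_size + 1))


n = 25
worst_mle = worst_case_risk(risk_mle, n)
worst_shrunk = worst_case_risk(risk_shrunk, n)
print(worst_mle, worst_shrunk)
```

With \(n = 25\), the worst case of \(X/n\) is attained at \(\theta = 1/2\) and equals \(1/(4n) = 0.01\), while the shrinkage estimator has risk \(n / (4(n + \sqrt{n})^2) \approx 0.0069\) at every \(\theta\). So the shrinkage estimator wins under the worst-case criterion, even though \(X/n\) has the smaller risk when \(\theta\) is close to 0 or 1.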

Bayes risk

From a Bayesian perspective, as the existence of prior information about \(\theta\) is acknowledged, we can actually leverage the prior to trade off the risk across \(\theta\).

Given a prior distribution \(\pi\) on \(\Theta\), the Bayes risk of a decision rule \(\delta\) is defined as

\[ r(\pi, \delta) = \mathbb{E}_{\pi(\theta)}[R(\theta, \delta)] \]

where the expectation is taken over \(\theta\). A Bayesian seeks a decision rule that minimizes the Bayes risk, and such a decision rule is called a Bayes rule.
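The Bayes risk is just a prior-weighted average of the risk function, so it can be approximated with a simple numerical integral. The sketch below assumes \(X \sim\) Binomial(\(n, \theta\)), squared error loss, and a Uniform(0, 1) prior, all chosen by me for illustration. Under squared error loss the Bayes rule is the posterior mean, which for this prior is \((X + 1)/(n + 2)\).

```python
def risk_mle(theta, n):
    """Exact squared-error risk of X / n when X ~ Binomial(n, theta)."""
    return theta * (1 - theta) / n


def risk_posterior_mean(theta, n):
    """Exact squared-error risk of (X + 1) / (n + 2), the posterior mean
    under a Uniform(0, 1) prior: variance plus squared bias."""
    return (n * theta * (1 - theta) + (1 - 2 * theta) ** 2) / (n + 2) ** 2


def bayes_risk_uniform(risk, n, grid_size=10000):
    """Midpoint-rule approximation of the integral of R(theta, delta)
    against the Uniform(0, 1) prior density."""
    h = 1.0 / grid_size
    return sum(risk((i + 0.5) * h, n) for i in range(grid_size)) * h


n = 10
r_mle = bayes_risk_uniform(risk_mle, n)
r_bayes = bayes_risk_uniform(risk_posterior_mean, n)
print(r_mle, r_bayes)
```

Integrating by hand gives \(1/(6n) = 1/60\) for \(X/n\) and \(1/(6(n + 2)) = 1/72\) for the posterior mean, so the posterior mean has the smaller Bayes risk under this prior, as it should.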

Relationships between these comparison criteria

Admissible rule and Minimax rule

Statement: If \(\delta^{*}\) has constant risk and is admissible, then it is a minimax decision rule.

Proof: Suppose \(\delta^{*}\) is not a minimax decision rule. Then there must exist another decision rule \(\delta'\) whose worst-case risk is smaller than that of \(\delta^{*}\). As a result, for an arbitrary \(\theta'\) we have

\[ R(\theta', \delta') \leqslant \sup_{\theta} R(\theta, \delta') = \tilde{R}(\delta') < \tilde{R}(\delta^{*}) = \sup_{\theta} R(\theta, \delta^{*}) \stackrel{\delta^{*} \ \text{has constant risk}}{=} R(\theta', \delta^{*}) \]

Hence \(\delta'\) has strictly smaller risk than \(\delta^{*}\) at every \(\theta'\), that is, \(\delta'\) dominates \(\delta^{*}\), which contradicts the admissibility of \(\delta^{*}\).

Bayes rule and admissible rule

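Statement: If the prior density \(\pi(\theta)\) is strictly positive on \(\Theta\), the risk function \(R(\theta, \delta)\) is continuous in \(\theta\) for every decision rule \(\delta\), and the Bayes rule \(\delta^{*}\) has finite Bayes risk, then \(\delta^{*}\) is admissible.

Proof (by contradiction): Suppose that \(\delta^{*}\) is not admissible, so that there is another decision rule \(\delta'\) that dominates \(\delta^{*}\). Then \(R(\theta, \delta') \leqslant R(\theta, \delta^{*})\) for all \(\theta\), with strict inequality at some \(\theta_0\). By continuity of the risk functions, the strict inequality holds on a neighborhood of \(\theta_0\), and this neighborhood has positive prior probability because \(\pi\) is strictly positive. This implies

\[ r(\pi, \delta') = \int R(\theta, \delta') \pi(\theta) d\theta < \int R(\theta, \delta^{*}) \pi(\theta) d\theta = r(\pi, \delta^{*}) \]

which contradicts the fact that \(\delta^{*}\) is the Bayes rule.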

TheStatsDude's blog
