A self-learning note for statistical decision theory - part 1
As a statistician who often solves problems from a frequentist perspective but with a Bayesian soul (allegedly), statistical decision theory has always been an intriguing topic to me. Thanks to the holiday season, I finally got a chance to revisit this topic and I hope I can build up a more systematic understanding of it this time. To facilitate my mastery of the knowledge, I plan to learn by writing and sharing. As the very first step of this project, I’d like to cover the basic elements here in this blog post, including - (i) Motivation: what values can statistical decision theory offer? (ii) Notations and key concepts; (iii) Optimality criteria - how do we pick the decision among candidate decisions?
Motivation
The statistical decision theory, as noted by Dr. James O. Berger in his book “Statistical Decision Theory and Bayesian Analysis”, is concerned with the problem of making decisions in the presence of statistical knowledge which can help explain some of the uncertainties involved in the decision-making problem. Typically, the uncertainties are assumed to be represented by unknown quantities (parameters) - for example, the proportion of people for which the drug will prove effective, \(\theta\), which is generally unknown, and one must conduct experiments to obtain statistical information about it. The classical statistics targets at making inference about the parameters based on the sample information only, whereas the statistical decision theory is aimed at combing the sample information with other relevant knowledge (including possible consequences associated with different decisions, a-priori knowledge about the possible values of parameters) to make the decision.
Notations and Key Concepts
Notations:
- Observed data: \(X\)
- An action: \(a \in \mathcal{A}\)
- A decision rule: \(a = \delta(X)\)
- Parameters (state of the world): \(\theta \in \Theta\)
- A loss function: \(L(a, \theta)\), where \(L(a, \theta) \geqslant 0, \forall a \in \mathcal{A}, \theta \in \Theta\)
- Data generating model: \(f(X|\theta)\)
The most commonly used loss functions include
- Squared error: \(L(\theta, \delta(X)) = (\theta - \delta(X))^2\)
- Absolute error: \(L(\theta, \delta(X)) = |\theta - \delta(X)|\)
Frequentist risk
The frequentist decision theory relies upon the idea of evaluating how much we would “expect” to lose if we use \(\delta(X)\) repeatedly with varying data that arise from the data generating model \(f(X|\theta)\). The risk function of a decision rule is defined as
\[ R(\theta, \delta) = \mathbb{E}_{f(X|\theta)}[L(\theta, \delta(X))]\]
where the expectation is taken over all possible data. With the risk function in place, a frequentist picks a decision rule that minimizes the risk.
Comparing decision rules
We note that the ranking provided by the frequentist risk is a uni-dimensional function of decision rules only for every fixed value of \(\theta\). Therefore, in order to obtain a global comparison of the decision rules, we might want to aggregate the multi-dimensional ranking into a global ranking. We shall discuss the three most commonly used ideas (the first two are frequentist) in this section.
Admissibility
The idea of admissibility builds upon the notion of dominance relationship between decision rules. A decision rule is said to dominate another decision rule \(\delta'\) if
\[ R(\delta, \theta) \leqslant R(\delta', \theta) \] for all \(\theta\), and
\[ R(\delta, \theta) < R(\delta', \theta) \]
for at least one \(\theta\). Such decision rules are called admissible, whereas all other decision rules are inadmissible. It is worth noting here that the dominance relationship only generates a partial ordering among decision rules (not all decision rules are comparable).
Minimaxity
The idea of minimaxity is motivated by considering the worst-case risk for each fixed \(\delta\) under all possible values of \(\theta\)
\[ \tilde{R}(\delta) = sup_{\theta} R(\delta, \theta) \]
followed by a comparison on the worse-case risk. A minimax decision rule (if exists), solves the problem below
\[ \delta^{*} = argmin_{\delta} \tilde{R}(\delta) \]
Bayes risk
From a Bayesian perspective, as the existence of prior information about \(\theta\) is acknowledged, we can actually leverage the prior to trade off the risk across \(\theta\).
Given a prior distribution \(\pi\) on \(\Theta\), the Bayes risk of a decision rule \(\delta\) is defined as
\[ r(\pi, \delta) = \mathbb{E}_{\pi(\theta)}[R(\theta, \delta)] \]
where the expectation is taken over \(\theta\). A Bayesian seeks for a decision rule that minimizes the Bayes risk, and such a decision rule is called Bayes rule.
Relationships between these comparison criteria
Admissible rule and Minimax rule
Statement: If \(\delta^{*}\) has constant risk and is admissible, then we can show it is a minimax decision rule.
Proof: Assuming \(\delta^{*}\) it is not a minimax decision rule - thus there must exist another decision rule \(\delta'\) which has smaller minimax risk than \(\delta^{*}\). As a result, for an arbitrary \(\theta'\) we have
\[ R(\theta', \delta') \leqslant sup_{\theta} R(\theta, \delta') = \tilde{R}(\delta') < \tilde{R}(\delta^{*}) = sup_{\theta} R(\theta, \delta^{*}) \stackrel{\delta^{*} \ \text{has constant risk}}{=} R(\theta', \delta^{*}) \]
which contradicts with the admissibility of \(\delta^{*}\).
Bayes rule and admissible rule
Statement: If the prior \(\pi(\theta)\) is strictly positive and the Bayes decision rule \(\delta^{*}\) has finite risk and is continuous in \(\theta\), then it is admissible (proof by contradiction).
Proof: Let’s suppose that \(\delta^{*}\) is not admissible, thus we must have another decision rule \(\delta'\) that dominates \(\delta^{*}\). This implies
\[ r(\pi, \delta') = \int R(\theta, \delta') \pi(\theta) d\theta \stackrel{}{<} \int R(\theta, \delta^{*}) \pi(\theta) d\theta = r(\pi, \delta^{*}) \]
which contradicts with the fact that \(\delta^{*}\) is the Bayes rule.