Previously we considered the testing problem. We showed that, at a tolerable type I error rate alpha, we have sufficient evidence to reject the null and may stop the experiment as soon as e > 1/alpha.
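In code, that stopping rule takes only a few lines; the eValues vector below is hypothetical:
alpha <- 0.05
# Hypothetical e-values computed after each observation
eValues <- c(0.8, 1.5, 4.2, 9.7, 26.3)
# Stop and reject the null as soon as e > 1/alpha (= 20 here)
which(eValues > 1/alpha)[1]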
Here we consider the problem of interval estimation. An anytime-valid 95% interval estimate should contain the true value with at least 95% probability, regardless of when, why, or even whether data collection stopped.
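In symbols (our paraphrase of this guarantee): writing \(\text{CI}_{n}\) for the interval computed after \(n\) observations, anytime validity requires
\[\begin{align} P_{\mu}\left( \mu \in \text{CI}_{n} \text{ for all } n \geq 1 \right) \geq 0.95 \end{align}\]
for every value of \(\mu\), whatever the stopping rule.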
We illustrate the following: (1) classical confidence intervals do not retain their coverage guarantee across time, (2) neither do Bayesian credible intervals, and (3) anytime-valid confidence sequences do retain coverage across time.
For this demonstration we use functions from the safestats package and the following colours:
library("safestats")
freqColours <- c("#E31A1CE6", "#FB9A9980")
bayesColours <- c("#B15928E6", "#FFFF9980")
eColours <- c("#1F78B4E6", "#A6CEE380")
alpha <- 0.05
As a running example we consider the problem of interval estimation of a normal mean with known variance, say, \(\sigma=1\). That is, we want to give an uncertainty estimate of \(\mu\) where
\[\begin{align} X_{i} \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^{2}). \end{align}\]
For all three examples we use the computeConfidenceIntervalZ function from the safestats package, which produces (1) classical confidence intervals if eType="freq", (2) credible intervals if eType="credibleInterval", and (3) an anytime-valid confidence sequence if eType="mom".
To assess the quality of these intervals across time, we fix a true data-generating mean, say, muTrue=8, and we set sigmaTrue=1. For simplicity we take n=100. The following code generates a single data set that we use to highlight the differences between (1) the standard confidence intervals, (2) the Bayesian credible intervals, and (3) the anytime-valid confidence sequence.
n <- 100
muTrue <- 8
sigmaTrue <- 1
set.seed(4)
someDataSingle <- rnorm(n, mean=muTrue,
sd=sigmaTrue)
After each observation a confidence interval can be computed, yielding n=100 intervals. If any of these n=100 intervals does not contain muTrue, we say that the sequence of intervals does not cover the true mean across time.
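As a small helper (our own, not part of the safestats package), this notion of coverage across time can be expressed in a few lines:
# TRUE only if every row [lower, upper] of the interval matrix
# contains the true value
coversAcrossTime <- function(ciMatrix, truth) {
  all(ciMatrix[, 1] <= truth & truth <= ciMatrix[, 2])
}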
The classical confidence interval for a normal mean is directly related to the z-test and its p-value, where the z-statistic is defined as
\[\begin{align} Z := \frac{\sqrt{n} (\bar{x} - \mu)}{\sigma} \end{align}\]
A classical 95% confidence interval consists of all those parameter values \(\mu\) that do not lead to a rejection at level alpha=5%. “Inverting” the z-test uses the fact that
\[\begin{align} \bar{X} \sim \mathcal{N}(\mu, \frac{1}{n}) . \end{align}\]
A 95% confidence interval can be derived from quantiles of a normal distribution with location \(\bar{x}\) and variance \(1/n\).
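For concreteness, here is a minimal sketch of that quantile computation, assuming \(\sigma=1\); the zCi helper is our own and mirrors what computeConfidenceIntervalZ with eType="freq" computes:
# Sketch: classical (1 - alpha) interval from normal quantiles, sigma = 1
zCi <- function(xBar, n, alpha=0.05) {
  xBar + qnorm(c(alpha/2, 1 - alpha/2)) / sqrt(n)
}
zCi(xBar=8.1, n=100)  # approximately 7.904 8.296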
To assess the performance of the classical confidence interval on the data generated with muTrue=8, we run the following code:
# Matrix to hold the lower and upper bound after each observation
freqCi <- matrix(nrow=n, ncol=2)
# Running sample mean after observations 1, ..., n
meanVector <- 1/(1:n)*cumsum(someDataSingle)
for (i in 1:n) {
tempResult <- computeConfidenceIntervalZ(nEff=i,
meanObs=meanVector[i],
eType="freq")
freqCi[i, ] <- tempResult
}
oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes()
plot(1:n, freqCi[, 1], xlim=c(0, n), ylim=c(7, 9),
type="l", xlab = "", ylab = "", cex.lab = 1.3,
cex.axis = 1.3)
polygon(c(1:n, n:1), c(freqCi[, 2], rev(freqCi[, 1])),
col=freqColours[2], border=freqColours[1], lwd=2,
density = NULL, angle = -20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)
The following code shows that between the 21st and 45th observations the true mean lies outside the intervals.
# The indices where the lower bound is above muTrue
which(freqCi[, 1] > muTrue)
# No indices where the upper bound is below muTrue
which(freqCi[, 2] < muTrue)
Hence, classical confidence intervals lead to incorrect conclusions regarding muTrue if data collection happened to stop in this range.
The previous code worked for a specific data set, i.e. the outcomes of a single experiment. Note that this single data set yielded 100 intervals. We now repeat the procedure over 1,000 data sets/experiments. The code below stores 1,000 times 100 classical confidence intervals in the variable allFreqCis. We will see that far fewer than 950 out of these 1,000 data sets/repeated experiments cover the true mean across time:
mIter <- 1000
set.seed(1)
allData <- matrix(rnorm(mIter*n, mean=muTrue, sd=sigmaTrue), nrow=mIter)
allFreqCis <- array(dim=c(mIter, n, 2))
# This indicates whether a simulation yielded an interval that
# does not cover the true mean
freqCiError <- integer(mIter)
# This indicates the first time an interval does not cover the true mean
firstPassageTime <- rep(Inf, times=mIter)
for (sim in 1:mIter) {
someData <- allData[sim, ]
meanVector <- 1/(1:n)*cumsum(someData)
for (i in 1:n) {
tempResult <- computeConfidenceIntervalZ(
nEff=i, meanObs=meanVector[i], eType="freq")
allFreqCis[sim, i, ] <- tempResult
if ((tempResult[1] > muTrue || tempResult[2] < muTrue)
&& freqCiError[sim] != 1) {
freqCiError[sim] <- 1
firstPassageTime[sim] <- i
}
}
}
To count, for each observation i = 1, ..., 100, the proportion of data sets/experiments in which some interval up to observation i did not cover the true mean, we run the following code:
complementCoverageRate <- numeric(n)
for (i in 1:n) {
complementCoverageRate[i] <- mean(firstPassageTime <= i)
}
freqCoverageRate <- 1-complementCoverageRate
oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes()
plot(1:n, 100*freqCoverageRate, type="l", lwd=2, xlab="n",
ylab="Coverage rate (%)", col=freqColours[1], ylim=c(60, 100))
lines(c(1, n), 100*c(1-alpha, 1-alpha), lwd=2, lty=2)
This plot reiterates the point that classical methods (i.e. p-values and confidence intervals) are made for one-shot inference. After the first time the confidence interval is computed, the tolerable 5% error has already been “spent”.
The following code can be run to inspect one of the failed intervals:
failedIndices <- which(freqCiError==1)
someIndex <- failedIndices[3]
oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes()
plot(NULL, xlim=c(0, n), ylim=c(7, 9),
type="l", xlab = "", ylab = "", cex.lab = 1.3,
cex.axis = 1.3)
polygon(c(1:n, n:1), c(allFreqCis[someIndex, , 2],
rev(allFreqCis[someIndex, , 1])),
col=freqColours[2], border=freqColours[1], lwd=2,
density = NULL, angle = -20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)
The (Bayesian) credible interval for a normal mean is easily computed with a normal prior on the mean parameter mu. A normal prior combined with normal data leads to a normal posterior, which is computationally attractive. More precisely, with a normal prior centred at a with variance g, that is,
\[\begin{align} \mu \sim \mathcal{N}(a, g) \end{align}\]
and normal data with variance \(\sigma^{2}=1\) we get the following normal posterior
\[\begin{align} \mu \mid x, n \sim \mathcal{N}\left ( \tfrac{n g}{1 + n g} \bar{x} + \tfrac{1}{1+ n g} a, \frac{g}{1 + n g} \right ) \end{align}\]
For our demonstration we set a=0 and g=1. We discuss sensitivity to the hyperparameter g below. A 95% credible interval can now be derived from quantiles of a normal distribution with the mean and variance shown above.
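A minimal sketch of this computation, with credibleCi as our own helper name; computeConfidenceIntervalZ with eType="credibleInterval" performs the equivalent calculation:
# Sketch: credible interval from the normal posterior with sigma = 1;
# a and g are the prior mean and prior variance
credibleCi <- function(xBar, n, a=0, g=1, alpha=0.05) {
  postMean <- (n*g/(1 + n*g))*xBar + (1/(1 + n*g))*a
  postVar <- g/(1 + n*g)
  postMean + qnorm(c(alpha/2, 1 - alpha/2))*sqrt(postVar)
}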
To assess the performance of the credible interval on the data with muTrue=8, we run the following code:
someA <- 0
someG <- 1
bayesCi <- matrix(nrow=n, ncol=2)
meanVector <- 1/(1:n)*cumsum(someDataSingle)
for (i in 1:n) {
tempResult <- computeConfidenceIntervalZ(
nEff=i, meanObs=meanVector[i],
a=someA, g=someG, eType="credibleInterval")
bayesCi[i, ] <- tempResult
}
oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes()
plot(NULL, xlim=c(0, n), ylim=c(7, 9),
type="l", xlab = "", ylab = "", cex.lab = 1.3,
cex.axis = 1.3)
polygon(c(1:n, n:1), c(bayesCi[, 2], rev(bayesCi[, 1])),
col=bayesColours[2], border=bayesColours[1], lwd=2,
density = NULL, angle = 20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)
Similar to the frequentist confidence intervals, the true mean falls outside the credible intervals between the 21st and the 45th observations, but also at the 11th and 17th observations.
# The indices where the lower bound is above muTrue
which(bayesCi[, 1] > muTrue)
# No indices where the upper bound is below muTrue
which(bayesCi[, 2] < muTrue)
The credible intervals thus provide incorrect information regarding the size of the true population mean if data collection happened to stop in this range. Better coverage is obtained if the prior mean is set equal to the true mean. However, if the true mean were known beforehand, it would not be prior knowledge; in fact, there would then be no need to run the experiment at all. In general, if the hyperparameter a is set close to the true mean, the coverage will be better, but for any a priori guess a there are many true values of mu for which the Bayesian interval has bad coverage. Our point is not to open a discussion on the choice of a, but to discuss coverage across time. We will see that even if a is set exactly equal to muTrue, a 95% credible interval will have a coverage rate below 95% across time.
The previous code worked for a specific data set, i.e. the outcomes of a single experiment. Note that this single data set yielded 100 intervals. We now repeat the procedure over 1,000 data sets/experiments. The code below stores 1,000 times 100 credible intervals in the variable allBayesCis. We will see that far fewer than 950 out of these 1,000 data sets/repeated experiments cover the true mean across time. To show that the failure is not a matter of choosing the prior mean a poorly, we set it equal to the true mean in this demonstration, despite this being unrealistic in practice.
mIter <- 1000
someA <- muTrue
set.seed(1)
allData <- matrix(rnorm(mIter*n, mean=muTrue, sd=sigmaTrue), nrow=mIter)
allBayesCis <- array(dim=c(mIter, n, 2))
# This indicates whether a simulation yielded an interval that
# does not cover the true mean
bayesCiError <- integer(mIter)
# This indicates the first time an interval does not cover the true mean
firstPassageTime <- rep(Inf, times=mIter)
for (sim in 1:mIter) {
someData <- allData[sim, ]
meanVector <- 1/(1:n)*cumsum(someData)
for (i in 1:n) {
tempResult <- computeConfidenceIntervalZ(
nEff=i, meanObs=meanVector[i],
a=someA, g=someG,
eType="credibleInterval")
allBayesCis[sim, i, ] <- tempResult
if ((tempResult[1] > muTrue || tempResult[2] < muTrue)
&& bayesCiError[sim] != 1) {
bayesCiError[sim] <- 1
firstPassageTime[sim] <- i
}
}
}
To count, for each observation i = 1, ..., 100, the proportion of data sets/experiments in which some interval up to observation i did not cover the true mean, we run the following code:
complementCoverageRate <- numeric(n)
for (i in 1:n) {
complementCoverageRate[i] <- mean(firstPassageTime <= i)
}
bayesCoverageRate <- 1-complementCoverageRate
oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes()
plot(1:n, 100*bayesCoverageRate, type="l", lwd=2, xlab="n",
ylab="Coverage rate (%)", col=bayesColours[1], ylim=c(60, 100))
lines(c(1, n), 100*c(1-alpha, 1-alpha), lwd=2, lty=2)
This plot shows that the credible intervals centred at a=muTrue yield even worse coverage than the classical confidence intervals.
The following code can be run to inspect one of the failed intervals:
failedIndices <- which(bayesCiError==1)
someIndex <- failedIndices[3]
oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes()
plot(NULL, xlim=c(0, n), ylim=c(7, 9),
type="l", xlab = "", ylab = "", cex.lab = 1.3,
cex.axis = 1.3)
polygon(c(1:n, n:1), c(allBayesCis[someIndex, , 2],
rev(allBayesCis[someIndex, , 1])),
col=bayesColours[2], border=bayesColours[1], lwd=2,
density = NULL, angle = 20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)
The anytime-valid confidence sequence is obtained by inverting an e-variable based on a so-called non-local moment mixture distribution. In other words, the 95% anytime-valid confidence sequence contains all parameter values mu for which the e-variable is less than 1/alpha = 20. Because the e-variable itself controls the type I error rate across time, so does the resulting confidence sequence.
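The code below passes parameter=designObj$parameter to computeConfidenceIntervalZ, so it assumes a design object created beforehand. A minimal sketch of how such an object might be obtained with the safestats function designSafeZ; the meanDiffMin value is an illustrative assumption:
# Assumed design object; meanDiffMin=1 is illustrative only
designObj <- designSafeZ(meanDiffMin=1, alpha=alpha, beta=0.2)
designObj$parameter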
To assess the performance of the anytime-valid confidence sequence on the data with muTrue=8, we run the following code:
anytimeCi <- matrix(nrow=n, ncol=2)
meanVector <- 1/(1:n)*cumsum(someDataSingle)
for (i in 1:n) {
tempResult <- computeConfidenceIntervalZ(
nEff=i, parameter=designObj$parameter,
meanObs=meanVector[i])
anytimeCi[i, ] <- tempResult
}
oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes()
plot(NULL, xlim=c(0, n), ylim=c(7, 9),
type="l", xlab = "", ylab = "", cex.lab = 1.3,
cex.axis = 1.3)
polygon(c(1:n, n:1), c(anytimeCi[, 2], rev(anytimeCi[, 1])),
col=eColours[2], border=eColours[1], lwd=2,
density = NULL, angle = -20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)
The following code shows that the anytime-valid confidence sequence covers the true mean at each moment in time.
# No indices where the lower bound is above muTrue
which(anytimeCi[, 1] > muTrue)
# No indices where the upper bound is below muTrue
which(anytimeCi[, 2] < muTrue)
The anytime-valid intervals would thus lead to the correct conclusion that muTrue is a viable value for the true mean, regardless of when, why, or even whether data collection is stopped. To compare the three intervals we run the following code.
oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes()
plot(NULL, xlim=c(0, n), ylim=c(7, 9),
type="l", xlab = "", ylab = "", cex.lab = 1.3,
cex.axis = 1.3)
polygon(c(1:n, n:1), c(anytimeCi[, 2], rev(anytimeCi[, 1])),
col=eColours[2], border=eColours[1], lwd=2,
density = NULL, angle = -20)
polygon(c(1:n, n:1), c(freqCi[, 2], rev(freqCi[, 1])),
col=freqColours[2], border=freqColours[1], lwd=2,
density = 60, angle = -20)
polygon(c(1:n, n:1), c(bayesCi[, 2], rev(bayesCi[, 1])),
col=bayesColours[2], border=bayesColours[1], lwd=2,
density = 40, angle = 20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)
For large sample sizes the standard confidence interval and the credible interval become indistinguishable; the difference between them is most prominent at the start.
The previous code worked for a specific data set, i.e. the outcomes of a single experiment. Note that this single data set yielded 100 intervals. We now repeat the procedure over 1,000 data sets/experiments. The code below stores 1,000 times 100 anytime-valid intervals in the variable allAnytimeCis. We will see that more than 950 out of these 1,000 data sets/repeated experiments cover the true mean across time.
mIter <- 1000
set.seed(1)
allData <- matrix(rnorm(mIter*n, mean=muTrue, sd=sigmaTrue), nrow=mIter)
allAnytimeCis <- array(dim=c(mIter, n, 2))
# This indicates whether a simulation yielded an interval that
# does not cover the true mean
anytimeCiError <- integer(mIter)
# This indicates the first time an interval does not cover the true mean
firstPassageTime <- rep(Inf, times=mIter)
for (sim in 1:mIter) {
someData <- allData[sim, ]
meanVector <- 1/(1:n)*cumsum(someData)
for (i in 1:n) {
tempResult <- computeConfidenceIntervalZ(
nEff=i, meanObs=meanVector[i],
parameter=designObj$parameter)
allAnytimeCis[sim, i, ] <- tempResult
if ((tempResult[1] > muTrue || tempResult[2] < muTrue)
&& anytimeCiError[sim] != 1) {
anytimeCiError[sim] <- 1
firstPassageTime[sim] <- i
}
}
}
To count, for each observation i = 1, ..., 100, the proportion of data sets/experiments in which some interval up to observation i did not cover the true mean, we run the following code:
complementCoverageRate <- numeric(n)
for (i in 1:n) {
complementCoverageRate[i] <- mean(firstPassageTime <= i)
}
anytimeCiCoverageRate <- 1-complementCoverageRate
oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes()
plot(1:n, 100*anytimeCiCoverageRate, type="l", lwd=2,
xlab="n", ylab="Coverage rate (%)", col=eColours[1],
ylim=c(60, 100))
lines(c(1, n), 100*c(1-alpha, 1-alpha), lwd=2, lty=2)
This plot shows that the anytime-valid confidence sequences cover the true mean under repeated use as promised.
The following code can be run to inspect one of the failed intervals:
failedIndices <- which(anytimeCiError==1)
someIndex <- failedIndices[3]
oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes()
plot(NULL, xlim=c(0, n), ylim=c(7, 9),
type="l", xlab = "", ylab = "", cex.lab = 1.3,
cex.axis = 1.3)
polygon(c(1:n, n:1), c(allAnytimeCis[someIndex, , 2],
rev(allAnytimeCis[someIndex, , 1])),
col=eColours[2], border=eColours[1], lwd=2,
density = NULL, angle = 20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)
The following code can be used to see all coverage rate profiles.
oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes()
plot(1:n, 100*anytimeCiCoverageRate, type="l", lwd=2,
xlab="n", ylab="Coverage rate (%)", col=eColours[1],
ylim=c(60, 100))
lines(1:n, 100*bayesCoverageRate, col=bayesColours[1],
lwd=2)
lines(1:n, 100*freqCoverageRate, col=freqColours[1], lwd=2)
lines(c(1, n), 100*c(1-alpha, 1-alpha), lwd=2, lty=2)
We provide some further elaboration on how and why an anytime-valid confidence sequence improves over the other intervals. Of importance is that the widths of both the classical confidence intervals and the credible intervals are proportional to \(1/\sqrt{n}\). The width of the anytime-valid confidence sequence, on the other hand, is stretched by an additional logarithmic factor in \(n\). To see this we plot the widths of the intervals as a function of n.
someA <- 0
nDomain <- 1:30
anytimeCiWidth <- credibleIntervalWidth <- freqCiWidth <- numeric(length(nDomain))
for (i in nDomain) {
tempResult <- computeConfidenceIntervalZ(
nEff=i, meanObs=8, eType="freq")
freqCiWidth[i] <- tempResult[2]-tempResult[1]
tempResult <- computeConfidenceIntervalZ(
nEff=i, meanObs=8, a=someA,
g=someG, eType="credibleInterval")
credibleIntervalWidth[i] <- tempResult[2]-tempResult[1]
tempResult <- computeConfidenceIntervalZ(
nEff=i, meanObs=8, parameter=designObj$parameter)
anytimeCiWidth[i] <- tempResult[2]-tempResult[1]
}
oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes()
plot(nDomain, anytimeCiWidth, col=eColours[1], type="l",
     lwd=2, ylim=c(0, 12), xlab="n", ylab="Interval width")
lines(nDomain, credibleIntervalWidth, col=bayesColours[1],
lwd=2)
lines(nDomain, freqCiWidth, col=freqColours[1], lwd=2,
lty=2)
Hence, the anytime-valid confidence sequence is wider than the standard and Bayesian intervals. Equivalently, both standard confidence intervals and (Bayesian) credible intervals are too narrow to be anytime-valid.
Note that the width of the credible intervals behaves similarly to that of the classical confidence intervals. This correspondence becomes exact if the hyperparameter g of the credible interval is taken to be large; in that case, the credible intervals equal the classical confidence intervals.
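This can be checked numerically, for instance as follows; the specific values of nEff, meanObs and g are illustrative:
# For large g the credible interval is nearly identical to
# the classical confidence interval
computeConfidenceIntervalZ(nEff=10, meanObs=8, a=0, g=1e6,
                           eType="credibleInterval")
computeConfidenceIntervalZ(nEff=10, meanObs=8, eType="freq")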
Another difference to highlight is that credible intervals and anytime-valid confidence sequences depend differently on their respective hyperparameters g and gMom. For small g the credible interval narrows around a: the prior belief that a is the true mean then dominates. As noted above, as g grows the credible interval coincides with the classical confidence interval.
For anytime-valid confidence sequences the following holds. For small gMom the interval becomes infinitely wide, reflecting more uncertainty. As gMom increases, the width also increases, though only slowly. The following code illustrates this dependence of the interval widths on g and gMom.
someA <- 0
gDomain <- seq(0.01, 100, by=0.01)
anytimeCiWidth <- credibleIntervalWidth <- numeric(length(gDomain))
for (i in seq_along(gDomain)) {
tempResult <- computeConfidenceIntervalZ(
nEff=1, meanObs=8, a=0, g=gDomain[i],
eType="credibleInterval")
credibleIntervalWidth[i] <- tempResult[2]-tempResult[1]
tempResult <- computeConfidenceIntervalZ(
nEff=1, meanObs=8, parameter=gDomain[i])
anytimeCiWidth[i] <- tempResult[2]-tempResult[1]
}
oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes()
plot(gDomain, anytimeCiWidth, col=eColours[1], type="l",
lwd=2, ylim=c(0, 16), xlab="g/gMom", ylab="Interval width")
lines(gDomain, credibleIntervalWidth, col=bayesColours[1],
lwd=2)
The plot gives an indication of how the widths of the two intervals differ across a range of hyperparameter values, with gMom set equal to g. The increase in the width of the anytime-valid confidence sequence is hardly noticeable, because it grows very slowly (logarithmically), whereas the width of the credible interval stabilises at the width of the classical confidence interval.