Previously we considered the testing problem. There we showed that, while keeping the type I error rate below a tolerable level alpha, we have sufficient evidence to reject the null and may stop the experiment as soon as the e-value satisfies e > 1/alpha.
Here we consider the problem of interval estimation. An anytime-valid 95% interval estimate should contain the true value with at least 95% probability, regardless of when, why, or even whether data collection stopped.
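In symbols, writing \(\text{CI}_{n}\) for the interval estimate computed after \(n\) observations, this time-uniform coverage requirement reads
\[\begin{align} \mathbb{P}_{\mu} \left( \mu \in \text{CI}_{n} \, \text{ for all } n \geq 1 \right) \geq 0.95 , \end{align}\]
whereas a classical 95% confidence interval only guarantees \(\mathbb{P}_{\mu}(\mu \in \text{CI}_{n}) \geq 0.95\) separately for each fixed sample size \(n\).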
We illustrate that classical confidence intervals do not provide this time-uniform guarantee, whereas anytime-valid confidence sequences do.
For this demonstration we use functions from the safestats package and the following colours:
library("safestats")
# freqColours <- c("#E31A1CE6", "#FB9A9980")
freqColours <- c("#FFDA1A", "#DAA52066")
eColours <- c("#1F78B4E6", "#A6CEE380")
alpha <- 0.05
As a running example we consider the problem of interval estimation of a normal mean with known variance, say, \(\sigma=1\). That is, we want to give an uncertainty estimate of \(\mu\) where
\[\begin{align} X_{i} \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^{2}). \end{align}\]
For both examples we use the computeConfidenceIntervalZ function from the safestats package, which produces (1) classical confidence intervals if eType="freq", and (2) an anytime-valid confidence sequence if eType="mom".
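For instance, a single call of the following form (the numbers here are arbitrary and only serve to show the interface as it is used throughout this section) returns a vector with the lower and upper bound of one interval:
# Classical 95% confidence interval after 10 observations with observed mean 8.1
computeConfidenceIntervalZ(nEff=10, meanObs=8.1,
eType="freq", parameter=1)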
To assess the quality of these intervals across time, we fix a true data generating mean, say, muTrue=8, and we set sigmaTrue=1. For simplicity we take n=100. The following code generates a single data set that we use to highlight the difference between the standard confidence intervals and the anytime-valid confidence sequence.
n <- 100
muTrue <- 8
sigmaTrue <- 1
set.seed(4)
someDataSingle <- rnorm(n, mean=muTrue,
sd=sigmaTrue)
After each observation a confidence interval can be computed yielding n=100 intervals. If any of these n=100 intervals does not contain muTrue we say that the sequence of intervals does not cover across time.
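To make this check concrete, it can be written as a small helper function (coversAcrossTime is our own illustrative name and not part of the safestats package):
# TRUE if every interval (row of lower and upper bounds) contains mu
coversAcrossTime <- function(ciMatrix, mu) {
  all(ciMatrix[, 1] <= mu & mu <= ciMatrix[, 2])
}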
The classical confidence interval for a normal mean is directly related to the p-value of the z-test, where the z-statistic is defined as
\[\begin{align} Z := \frac{\sqrt{n} (\bar{x} - \mu)}{\sigma} \end{align}\]
A classical 95% confidence interval consists of all parameter values \(\mu\) that do not lead to a rejection at level alpha=5%. “Inverting” the z-test uses the fact that
\[\begin{align} \bar{X} \sim \mathcal{N}(\mu, \frac{1}{n}) . \end{align}\]
A 95% confidence interval can be derived from quantiles of a normal distribution with location \(\bar{x}\) and variance \(1/n\).
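Concretely, since \(\sigma=1\) is known, the classical 95% confidence interval after \(n\) observations is
\[\begin{align} \bar{x} \pm \frac{z_{0.975}}{\sqrt{n}} \approx \bar{x} \pm \frac{1.96}{\sqrt{n}} , \end{align}\]
where \(z_{0.975}\) denotes the 97.5% quantile of the standard normal distribution.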
To assess the performance of the classical confidence interval on the data generated with muTrue=8, we run the following code:
freqCi <- matrix(nrow=n, ncol=2)
meanVector <- 1/(1:n)*cumsum(someDataSingle)
for (i in 1:n) {
tempResult <- computeConfidenceIntervalZ(nEff=i,
meanObs=meanVector[i],
eType="freq",
parameter=1)
freqCi[i, ] <- tempResult
}
oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();
plot(1:n, freqCi[, 1], xlim=c(0, n), ylim=c(7, 9),
type="l", xlab = "", ylab = "", cex.lab = 1.3,
cex.axis = 1.3)
polygon(c(1:n, n:1), c(freqCi[, 2], rev(freqCi[, 1])),
col=freqColours[2], border=freqColours[1], lwd=2,
density = NULL, angle = -20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)
The following code shows that between the 21st and 45th observation the true mean is outside the intervals.
# The indices where the lower bound is above muTrue
which(freqCi[, 1] > muTrue)
# No indices where the upper bound is below muTrue
which(freqCi[, 2] < muTrue)
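Equivalently, with the helper function defined above:
# FALSE for this data set: the intervals between the 21st and 45th
# observation miss muTrue
coversAcrossTime(freqCi, muTrue)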
Hence, the classical confidence intervals lead to incorrect conclusions regarding muTrue if data collection happened to stop in this range.
The previous code worked for a specific data set, that is, the outcomes of a single experiment. Note that this one data set yielded 100 intervals. We now repeat the procedure, but over 1,000 data sets/experiments. The code below stores 1,000 times 100 classical confidence intervals in the variable allFreqCis. We will see that far fewer than 950 out of these 1,000 data sets/repeated experiments cover the true mean across time.
The following code stores the 1,000 times 100 confidence intervals:
mIter <- 1000
set.seed(1)
allData <- matrix(rnorm(mIter*n, mean=muTrue), nrow=mIter)
allFreqCis <- array(dim=c(mIter, n, 2))
# This indicates whether a simulation yielded an interval that
# does not cover the true mean
freqCiError <- integer(mIter)
# This indicates the first time an interval does not cover the true mean
firstPassageTime <- rep(Inf, times=mIter)
for (sim in 1:mIter) {
someData <- allData[sim, ]
meanVector <- 1/(1:n)*cumsum(someData)
for (i in 1:n) {
tempResult <- computeConfidenceIntervalZ(
nEff=i, meanObs=meanVector[i], eType="freq",
parameter=1)
allFreqCis[sim, i, ] <- tempResult
if ((tempResult[1] > muTrue || tempResult[2] < muTrue)
&& freqCiError[sim] != 1) {
freqCiError[sim] <- 1
firstPassageTime[sim] <- i
}
}
}
To compute, for each observation i from 1 to 100, the proportion of data sets/experiments that yielded at least one interval not covering the true mean up to and including observation i, we run the following code:
complementCoverageRate <- numeric(n)
for (i in 1:n) {
complementCoverageRate[i] <- mean(firstPassageTime <= i)
}
freqCoverageRate <- 1-complementCoverageRate
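The proportion of the 1,000 simulated experiments in which all 100 intervals cover the true mean, that is, the time-uniform coverage at n=100, can also be read off directly:
# Proportion of experiments whose intervals all contain muTrue
mean(freqCiError == 0)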
oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();
plot(1:n, 100*freqCoverageRate, type="l", lwd=2, xlab="n",
ylab="Coverage rate (%)", col=freqColours[1], ylim=c(60, 100))
lines(c(1, n), 100*c(1-alpha, 1-alpha), lwd=2, lty=2)
This plot reiterates the point that classical methods (i.e. p-values and confidence intervals) are designed for one-shot inference: after the first time the confidence interval is computed, the tolerable 5% error has already been “spent”.
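In other words, even though each individual interval satisfies \(\mathbb{P}_{\mu}(\mu \in \text{CI}_{n}) \geq 0.95\), the error events accumulate over time:
\[\begin{align} \mathbb{P}_{\mu} \left( \mu \notin \text{CI}_{n} \, \text{ for some } n \leq 100 \right) \gg 0.05 , \end{align}\]
which is exactly what the simulated coverage rates above show.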
The following code can be run to inspect one of the failed intervals:
failedIndices <- which(freqCiError==1)
someIndex <- failedIndices[3]
oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();
plot(NULL, xlim=c(0, n), ylim=c(7, 9),
type="l", xlab = "", ylab = "", cex.lab = 1.3,
cex.axis = 1.3)
polygon(c(1:n, n:1), c(allFreqCis[someIndex, , 2],
rev(allFreqCis[someIndex, , 1])),
col=freqColours[2], border=freqColours[1], lwd=2,
density = NULL, angle = -20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)
The anytime-valid confidence sequence inverts the e-variable based on a so-called non-local moment distribution mixture. In other words, the 95% anytime-valid confidence sequence contains all parameter values \(\mu\) for which the e-variable is less than 1/alpha = 20. Because the e-variable itself controls the type I error rate across time, so does the resulting confidence sequence.
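Schematically, writing \(E_{n}(\mu)\) for the e-variable of the null hypothesis that the true mean equals \(\mu\), computed after \(n\) observations, the confidence sequence is
\[\begin{align} \text{CI}_{n} := \left\{ \mu : E_{n}(\mu) < \frac{1}{\alpha} \right\} . \end{align}\]
Since \(\mathbb{P}_{\mu}\left( E_{n}(\mu) \geq 1/\alpha \text{ for some } n \right) \leq \alpha\) when \(\mu\) is the true mean, the probability that the true mean ever falls outside the running interval is at most \(\alpha\), which gives the time-uniform coverage guarantee.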
To assess the performance of the anytime-valid confidence sequence on the data with muTrue=8, we run the following code (designObj is the design object of the safe z-test from the testing problem considered previously):
anytimeCi <- matrix(nrow=n, ncol=2)
meanVector <- 1/(1:n)*cumsum(someDataSingle)
for (i in 1:n) {
tempResult <- computeConfidenceIntervalZ(
nEff=i, parameter=designObj$parameter,
meanObs=meanVector[i])
anytimeCi[i, ] <- tempResult
}
oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();
plot(NULL, xlim=c(0, n), ylim=c(7, 9),
type="l", xlab = "", ylab = "", cex.lab = 1.3,
cex.axis = 1.3)
polygon(c(1:n, n:1), c(anytimeCi[, 2], rev(anytimeCi[, 1])),
col=eColours[2], border=eColours[1], lwd=2,
density = NULL, angle = -20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)
The following code shows that the anytime-valid confidence sequence covers the true mean at each moment in time.
# No indices where the lower bound is above muTrue
which(anytimeCi[, 1] > muTrue)
# No indices where the upper bound is below muTrue
which(anytimeCi[, 2] < muTrue)
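Again with the helper function from above:
# TRUE: every interval contains muTrue
coversAcrossTime(anytimeCi, muTrue)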
The anytime-valid intervals would thus lead to the correct conclusion that muTrue is a viable value for the true mean, regardless of when, whether, or why data collection is stopped. To compare the two intervals we run the following code.
oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();
plot(NULL, xlim=c(0, n), ylim=c(7, 9),
type="l", xlab = "", ylab = "", cex.lab = 1.3,
cex.axis = 1.3)
polygon(c(1:n, n:1), c(anytimeCi[, 2], rev(anytimeCi[, 1])),
col=eColours[2], border=eColours[1], lwd=2,
density = NULL, angle = -20)
polygon(c(1:n, n:1), c(freqCi[, 2], rev(freqCi[, 1])),
col=freqColours[2], border=freqColours[1], lwd=2,
density = 60, angle = -20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)
The previous code worked for a specific data set, that is, the outcomes of a single experiment. Note that this one data set yielded 100 intervals. We now repeat the procedure, but over 1,000 data sets/experiments. The code below stores 1,000 times 100 anytime-valid confidence intervals in the variable allAnytimeCis.
We will see that more than 950 out of these 1,000 data sets/repeated experiments cover the true mean across time.
mIter <- 1000
set.seed(1)
allData <- matrix(rnorm(mIter*n, mean=muTrue), nrow=mIter)
allAnytimeCis <- array(dim=c(mIter, n, 2))
# This indicates whether a simulation yielded an interval that
# does not cover the true mean
anytimeCiError <- integer(mIter)
# This indicates the first time an interval does not cover the true mean
firstPassageTime <- rep(Inf, times=mIter)
for (sim in 1:mIter) {
someData <- allData[sim, ]
meanVector <- 1/(1:n)*cumsum(someData)
for (i in 1:n) {
tempResult <- computeConfidenceIntervalZ(
nEff=i, meanObs=meanVector[i],
parameter=designObj$parameter)
allAnytimeCis[sim, i, ] <- tempResult
if ((tempResult[1] > muTrue || tempResult[2] < muTrue)
&& anytimeCiError[sim] != 1) {
anytimeCiError[sim] <- 1
firstPassageTime[sim] <- i
}
}
}
To compute, for each observation i from 1 to 100, the proportion of data sets/experiments that yielded at least one interval not covering the true mean up to and including observation i, we run the following code:
complementCoverageRate <- numeric(n)
for (i in 1:n) {
complementCoverageRate[i] <- mean(firstPassageTime <= i)
}
anytimeCiCoverageRate <- 1-complementCoverageRate
oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();
plot(1:n, 100*anytimeCiCoverageRate, type="l", lwd=2,
xlab="n", ylab="Coverage rate (%)", col=eColours[1],
ylim=c(60, 100))
lines(c(1, n), 100*c(1-alpha, 1-alpha), lwd=2, lty=2)
This plot shows that the anytime-valid confidence sequences cover the true mean under repeated use as promised.
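The time-uniform coverage over the full 100 observations can again be read off directly as the proportion of simulations in which no interval missed the true mean:
# Proportion of experiments whose intervals all contain muTrue
mean(anytimeCiError == 0)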
The following code can be run to inspect one of the failed intervals:
failedIndices <- which(anytimeCiError==1)
someIndex <- failedIndices[3]
oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();
plot(NULL, xlim=c(0, n), ylim=c(7, 9),
type="l", xlab = "", ylab = "", cex.lab = 1.3,
cex.axis = 1.3)
polygon(c(1:n, n:1), c(allAnytimeCis[someIndex, , 2],
rev(allAnytimeCis[someIndex, , 1])),
col=eColours[2], border=eColours[1], lwd=2,
density = NULL, angle = 20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)
The following code can be used to see all coverage rate profiles.
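For instance, a minimal sketch using only the objects computed above:
oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes()
plot(1:n, 100*freqCoverageRate, type="l", lwd=2, xlab="n",
ylab="Coverage rate (%)", col=freqColours[1], ylim=c(60, 100))
lines(1:n, 100*anytimeCiCoverageRate, lwd=2, col=eColours[1])
lines(c(1, n), 100*c(1-alpha, 1-alpha), lwd=2, lty=2)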
We provide some further elaboration on how the anytime-valid confidence sequence differs from the classical intervals. Of importance is that the width of the classical confidence interval is proportional to \(1/\sqrt{n}\), whereas the width of the anytime-valid confidence interval is inflated by the square root of a logarithmic function of \(n\). To see this, we plot the widths of the intervals as a function of n.
someA <- 0
nDomain <- 1:30
anytimeCiWidth <- freqCiWidth <- numeric(30)
for (i in nDomain) {
tempResult <- computeConfidenceIntervalZ(
nEff=i, meanObs=8, eType="freq", parameter=1)
freqCiWidth[i] <- tempResult[2]-tempResult[1]
tempResult <- computeConfidenceIntervalZ(
nEff=i, meanObs=8, parameter=designObj$parameter)
anytimeCiWidth[i] <- tempResult[2]-tempResult[1]
}
oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();
plot(nDomain, anytimeCiWidth, col=eColours[1], type="l",
lwd=2, ylim=c(0, 12), xlab="n", ylab="Interval width")
lines(nDomain, freqCiWidth, col=freqColours[1], lwd=2,
lty=2)
Hence, the anytime-valid confidence sequence is wider than the standard interval. Equivalently, the standard confidence interval is too narrow for it to be anytime-valid.
For anytime-valid confidence sequences the following holds: for small gMom the interval becomes infinitely wide, resulting in more uncertainty. As gMom increases, the width also increases, though only slowly. The following code plots the width of an anytime-valid confidence interval as a function of gMom, illustrating this dependency.
gDomain <- seq(0.01, 100, by=0.01)
anytimeCiWidth <- freqIntervalWidth <- numeric(length(gDomain))
for (i in seq_along(gDomain)) {
tempResult <- computeConfidenceIntervalZ(
nEff=1, meanObs=8, a=0, g=gDomain[i],
eType="freq", parameter=1)
freqIntervalWidth[i] <- tempResult[2]-tempResult[1]
tempResult <- computeConfidenceIntervalZ(
nEff=1, meanObs=8, parameter=gDomain[i])
anytimeCiWidth[i] <- tempResult[2]-tempResult[1]
}
oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();
plot(gDomain, anytimeCiWidth, col=eColours[1], type="l",
lwd=2, ylim=c(0, 16), xlab="gMom", ylab="Interval width")
lines(gDomain, freqIntervalWidth, col=freqColours[1],
lwd=3)
The plot provides an indication of how the widths of the two intervals differ across a range of values of the hyperparameter gMom. The increase in width of the anytime-valid confidence sequence is hardly noticeable, because it grows very slowly (logarithmically).