II. Interval estimation: Classical confidence intervals and anytime-valid confidence sequences

Alexander Ly, Rosanne Turner, Udo Boehm, Rianne de Heide, Peter Grunwald

Previously we considered the testing problem. We showed that within a tolerable type I error rate alpha, we have sufficient evidence to reject the null, and stop the experiment as soon as e > 1/alpha.

Here we consider the problem of interval estimation. An anytime-valid 95% interval estimate should contain the true value with at least 95% chance –regardless of when, why or even if data collection stopped.

We illustrate the following:

  1. Classical 95% confidence intervals fail to cover the true mean with 95% chance across time
  2. The 95% anytime-valid confidence sequence/intervals will always cover the true mean with at least 95% chance

For this demonstration we use functions from the safestats package and the following colours:

library("safestats")
# freqColours <- c("#E31A1CE6", "#FB9A9980")
freqColours <- c("#FFDA1A", "#DAA52066")
eColours <- c("#1F78B4E6", "#A6CEE380")
alpha <- 0.05

Interval estimation for a normal mean

As a running example we consider the problem of interval estimation of a normal mean with known variance, say, \(\sigma=1\). That is, we want to give an uncertainty estimate of \(\mu\) where

\[\begin{align} X_{i} \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^{2}). \end{align}\]

For both examples we use the computeConfidenceIntervalZ function from the safestats package, which produces (1) classical confidence intervals if eType="freq", and (2) an anytime-valid confidence sequence if eType="mom".

To assess the quality of these intervals across time, we fix a true data generating mean, say, muTrue=8, and we set sigmaTrue=1. For simplicity we take n=100 and the following code generates a single data set that we use to highlight the difference between the standard confidence intervals and the anytime-valid confidence sequence.

n <- 100
muTrue <- 8
sigmaTrue <- 1
  
set.seed(4)
someDataSingle <- rnorm(n, mean=muTrue,
                        sd=sigmaTrue)

After each observation a confidence interval can be computed yielding n=100 intervals. If any of these n=100 intervals does not contain muTrue we say that the sequence of intervals does not cover across time.

1. Classical 95% confidence intervals fail to cover the true mean with 95% chance across time

The classical confidence interval for a normal mean is directly related to the p-value test z-test, where the z-statistic is defined as

\[\begin{align} Z := \frac{\sqrt{n} (\bar{x} - \mu)}{\sigma} \end{align}\]

All parameter values \(\mu\) in a classical 95% confidence intervals are all those \(\mu\) that did not lead to a rejection at level alpha=5%. “Inverting” the z-test shows that

\[\begin{align} \bar{X} \sim \mathcal{N}(\mu, \frac{1}{n}) . \end{align}\]

A 95% confidence interval can be derived from quantiles of a normal distribution with location \(\bar{x}\) and variance \(1/n\).

One sample path

To assess the performance of the classical confidence interval on the data generated with muTrue=8, we run the following code:

freqCi <- matrix(nrow=n, ncol=2)

meanVector <- 1/(1:n)*cumsum(someDataSingle)

for (i in 1:n) {
  tempResult <- computeConfidenceIntervalZ(nEff=i,
                                           meanObs=meanVector[i],
                                           eType="freq",
                                           parameter=1)
  freqCi[i, ] <- tempResult
}

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(1:n, freqCi[, 1], xlim=c(0, n), ylim=c(7, 9),
     type="l", xlab = "", ylab = "", cex.lab = 1.3, 
     cex.axis = 1.3)

polygon(c(1:n, n:1), c(freqCi[, 2], rev(freqCi[, 1])),
        col=freqColours[2], border=freqColours[1], lwd=2,
        density = NULL, angle = -20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)

The following code shows that between the 21st and 45th observation the true mean is outside the intervals.

# The indices where the lower bound is above muTrue
which(freqCi[, 1] > muTrue)

# No indices where the upper bound is below muTrue
which(freqCi[, 2] < muTrue)

Hence, the classical confidence intervals lead to incorrect conclusions regarding muTrue, if data collection happened to stop in this range.

The behaviour of classical confidence intervals across time under multiple use

The previous code worked for a specific data set/outcomes of a single experiment. Note that this 1 data set yielded 100 intervals. We now repeat the procedure, but over 1,000 data sets/experiments. The code below stores 1,000 times 100 classical confidence intervals in the variable allFreqCis. We will see that far fewer than 950 out of these 1,000 data sets/repeated experiments will cover the true mean across time.

The following code stores a 1,000 times 100 confidence intervals:

mIter <- 1000

set.seed(1)
allData <- matrix(rnorm(mIter*n, mean=muTrue), nrow=mIter)
allFreqCis <- array(dim=c(mIter, n, 2))


# This indicates whether a simulation yielded an interval that 
# does not cover the true mean
freqCiError <- integer(mIter)

# This indicates the first time an interval does not cover the true mean
firstPassageTime <- rep(Inf, times=mIter)

for (sim in 1:mIter) {
  someData <- allData[sim, ]
  meanVector <- 1/(1:n)*cumsum(someData)
  
  for (i in 1:n) {
    tempResult <- computeConfidenceIntervalZ(
      nEff=i, meanObs=meanVector[i], eType="freq",
      parameter=1)
    allFreqCis[sim, i, ] <- tempResult
    
    if ((tempResult[1] > muTrue || tempResult[2] < muTrue)
        && freqCiError[sim] != 1) {
      freqCiError[sim] <- 1
      firstPassageTime[sim] <- i
    }
  }
}

To count the number of data sets/experiments that yielded an interval that did not cover the true mean at observation 1 to 100, we run the following code:

complementCoverageRate <- numeric(n)

for (i in 1:n) {
  complementCoverageRate[i] <- mean(firstPassageTime <= i)
}

freqCoverageRate <- 1-complementCoverageRate

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(1:n, 100*freqCoverageRate, type="l", lwd=2, xlab="n",
     ylab="Coverage rate (%)", col=freqColours[1], ylim=c(60, 100))
lines(c(1, n), 100*c(1-alpha, 1-alpha), lwd=2, lty=2)

This plot reiterates the point that classical methods (i.e. p-values and confidence intervals) are made for one-shot inference. After the first time the confidence interval is computed the tolerable 5% error was already “spent”.

The following code can be run to inspect one of the failed intervals:

failedIndeces <- which(freqCiError==1)

someIndex <- failedIndeces[3]

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(NULL, xlim=c(0, n), ylim=c(7, 9),
     type="l", xlab = "", ylab = "", cex.lab = 1.3, 
     cex.axis = 1.3)

polygon(c(1:n, n:1), c(allFreqCis[someIndex, , 2],
                       rev(allFreqCis[someIndex, , 1])),
        col=freqColours[2], border=freqColours[1], lwd=2,
        density = NULL, angle = -20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)

2. The 95% anytime-valid confidence sequence/intervals will always cover the true mean with at least 95% chance

The anytime-valid confidence sequence inverts the e-variable based on a so-called non-local moment distribution mixture. In other words, the 95% anytime-valid confidence sequence contain all parameter values mu for which the e-variable is less than 1/alpha = 20. Because the e-variable itself retains a type I error rate across time, so will the resulting confidence sequence.

designObj <- designSafeZ(meanDiffMin=0.5,
                         testType="oneSample",
                         sigma=sigmaTrue)
designObj

One sample path

To assess the performance of the anytime-valid confidence sequence on the data with muTrue=8, we run the following code:

anytimeCi <- matrix(nrow=n, ncol=2)

meanVector <- 1/(1:n)*cumsum(someDataSingle)

for (i in 1:n) {
  tempResult <- computeConfidenceIntervalZ(
    nEff=i, parameter=designObj$parameter, 
    meanObs=meanVector[i])
  anytimeCi[i, ] <- tempResult
}

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(NULL, xlim=c(0, n), ylim=c(7, 9),
     type="l", xlab = "", ylab = "", cex.lab = 1.3, 
     cex.axis = 1.3)
polygon(c(1:n, n:1), c(anytimeCi[, 2], rev(anytimeCi[, 1])),
        col=eColours[2], border=eColours[1], lwd=2,
        density = NULL, angle = -20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)

The following code shows that the anytime-valid confidence sequence covers the true mean at each moment in time.

# No indices where the lower bound is above muTrue
which(anytimeCi[, 1] > muTrue)

# No indices where the upper bound is below muTrue
which(anytimeCi[, 2] < muTrue)

The anytime-valid intervals would thus lead to the correct conclusion that muTrue is a viable value for the true mean, regardless if, whether, or why data collection is stopped. To compare the two intervals we run the following code.

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(NULL, xlim=c(0, n), ylim=c(7, 9),
     type="l", xlab = "", ylab = "", cex.lab = 1.3, 
     cex.axis = 1.3)

polygon(c(1:n, n:1), c(anytimeCi[, 2], rev(anytimeCi[, 1])),
        col=eColours[2], border=eColours[1], lwd=2,
        density = NULL, angle = -20)
polygon(c(1:n, n:1), c(freqCi[, 2], rev(freqCi[, 1])),
        col=freqColours[2], border=freqColours[1], lwd=2,
        density = 60, angle = -20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)

The behaviour of anytime-valid confidence sequences across time under multiple use

The previous code worked for a specific data set/outcomes of a single experiment. Note that this 1 data set yielded 100 intervals. We now repeat the procedure, but over 1,000 data sets/experiments. The code below stores 1,000 times 100 credible intervals in the variable allAnytimeCis.

We will see that more than 950 out of these 1,000 data sets/repeated experiments will cover the true mean across time.

mIter <- 1000

set.seed(1)
allData <- matrix(rnorm(mIter*n, mean=muTrue), nrow=mIter)
allAnytimeCis <- array(dim=c(mIter, n, 2))


# This indicates whether a simulation yielded an interval that 
# does not cover the true mean
anytimeCiError <- integer(mIter)

# This indicates the first time an interval does not cover the true mean
firstPassageTime <- rep(Inf, times=mIter)

for (sim in 1:mIter) {
  someData <- allData[sim, ]
  meanVector <- 1/(1:n)*cumsum(someData)
  
  for (i in 1:n) {
    tempResult <- computeConfidenceIntervalZ(
      nEff=i, meanObs=meanVector[i],
      parameter=designObj$parameter)
    
    allAnytimeCis[sim, i, ] <- tempResult
    
    if ((tempResult[1] > muTrue || tempResult[2] < muTrue) 
        && anytimeCiError[sim] != 1) {
      anytimeCiError[sim] <- 1
      firstPassageTime[sim] <- i
    }
  }
}

To count the number of data sets/experiments that yielded an interval that did not cover the true mean at observation 1 to 100, we run the following code:

complementCoverageRate <- numeric(n)

for (i in 1:n) {
  complementCoverageRate[i] <- mean(firstPassageTime <= i)
}

anytimeCiCoverageRate <- 1-complementCoverageRate

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(1:n, 100*anytimeCiCoverageRate, type="l", lwd=2,
     xlab="n", ylab="Coverage rate (%)", col=eColours[1],
     ylim=c(60, 100))
lines(c(1, n), 100*c(1-alpha, 1-alpha), lwd=2, lty=2)

This plot shows that the anytime-valid confidence sequences cover the true mean under repeated use as promised.

The following code can be run to inspect one of the failed intervals:

failedIndeces <- which(anytimeCiError==1)

someIndex <- failedIndeces[3]

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(NULL, xlim=c(0, n), ylim=c(7, 9),
     type="l", xlab = "", ylab = "", cex.lab = 1.3, 
     cex.axis = 1.3)

polygon(c(1:n, n:1), c(allAnytimeCis[someIndex, , 2],
                       rev(allAnytimeCis[someIndex, , 1])),
        col=eColours[2], border=eColours[1], lwd=2,
        density = NULL, angle = 20)
lines(c(1, n), c(muTrue, muTrue), lwd=2)

The following code can be used to see all coverage rate profiles.

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(1:n, 100*anytimeCiCoverageRate, type="l", lwd=2,
     xlab="n", ylab="Coverage rate (%)", col=eColours[1],
     ylim=c(60, 100))
lines(1:n, 100*freqCoverageRate, col=freqColours[1], lwd=2)
lines(c(1, n), 100*c(1-alpha, 1-alpha), lwd=2, lty=2)

Sensitivity wrt to the parameter gMom: Width of the confidence intervals versus anytime-valid confidence intervals

We provide some further elaboration on how/why an anytime-valid confidence sequence improves over the classical intervals. Of importance is that the widths of the classical confidence intervals is proportional to \(1/\sqrt{n}\). On the other hand, the width of the anytime-valid confidence interval is stretched by the square root of a logarithmic function of \(n\). To see this we plot the widths of the intervals as a function of n. 

someA <- 0

nDomain <- 1:30
anytimeCiWidth <- freqCiWidth <- numeric(30)

for (i in nDomain) {
  tempResult <- computeConfidenceIntervalZ(
    nEff=i, meanObs=8,  eType="freq", parameter=1)
  freqCiWidth[i] <- tempResult[2]-tempResult[1]
  
  tempResult <- computeConfidenceIntervalZ(
    nEff=i, meanObs=8, parameter=designObj$parameter)
  anytimeCiWidth[i] <- tempResult[2]-tempResult[1]
}

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();
plot(nDomain, anytimeCiWidth, col=eColours[1], type="l",
     lwd=2, ylim=c(0, 12))
lines(nDomain, freqCiWidth, col=freqColours[1], lwd=2,
      lty=2)

Hence, the anytime-valid confidence sequence is wider than the standard interval. Equivalently, the standard confidence interval is too narrow for it to be anytime-valid.

For anytime-valid confidence sequences the following holds: For gMom small, the interval becomes infinitely wide resulting in more uncertainty. As gMom increases the width also increases, though, slowly. The following code plots the width of an anytime-valid confidence interval as a function of gMom. The following code illustrates the dependency of the interval widths on gMom.

gDomain <- seq(0.01, 100, by=0.01)
anytimeCiWidth <- freqIntervalWidth <- numeric(30)

for (i in seq_along(gDomain)) {
  tempResult <- computeConfidenceIntervalZ(
    nEff=1, meanObs=8, a=0, g=gDomain[i], 
    eType="freq", parameter=1)
  
  freqIntervalWidth[i] <- tempResult[2]-tempResult[1]
  
  tempResult <- computeConfidenceIntervalZ(
    nEff=1, meanObs=8, parameter=gDomain[i])
  anytimeCiWidth[i] <- tempResult[2]-tempResult[1]
}

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();
plot(gDomain, anytimeCiWidth, col=eColours[1], type="l",
     lwd=2, ylim=c(0, 16), xlab="gMom", ylab="Interval width")
lines(gDomain, freqIntervalWidth, col=freqColours[1],
      lwd=3)

The plot provides an indication on how the widths of the two intervals differ for a range of hyper parameter values gMom. The increase in width of the anytime-valid confidence sequence is hardly noticeable, because it increases very slowly (logarithmically).