I. Testing: P-values and e-variables

Alexander Ly, Rosanne Turner, Udo Boehm, Rianne de Heide, Peter Grunwald

Using the two-sample z-test for comparing two means as a guiding example, we illustrate the following:

  1. Sequential use of p-values over-inflates the type I error
  2. Sequential use of a safe test/e-variable will not over-inflate the type I error

The first point is well-known, but we believe it is good to see actual examples as a wake-up call, since in our experience researchers tend to play down the extent of this over-inflation. The main point is that with e-variables type I error control is retained regardless of the sample size. In other words, we can safely monitor the e-value and stop the experiment (early) whenever the e-value provides compelling evidence for the presence over the absence of an effect. Stopping early puts fewer participants at risk, in particular the patients assigned to the control condition when in reality the treatment is more effective. Safe tests also allow for optional continuation, that is, the extension of an experiment regardless of the motivation: for instance, because more funds become available, or because the evidence looks promising and the funding agency, a reviewer, an editor, or a manager urges the experimenter to collect more data.

Importantly, for the safe tests presented here neither optional stopping nor continuation leads to the test exceeding the tolerable type I error \(\alpha\). As the results do not depend on the planned, current, or future sample sizes, safe tests allow for anytime valid inferences.

For this demonstration we use functions from the safestats package and the following colours:

library("safestats")
# freqColours <- c("#E31A1CE6", "#FB9A9980")
freqColours <- c("#FFDA1A", "#DAA52066")
eColours <- c("#1F78B4E6", "#A6CEE380")

The third markdown document is concerned with power and the design of experiments.

The two-sample z-test

The running example will be the two-sample z-test discussed before. Suppose a pre-university educational programme is developed that claims to increase IQ scores by 10 points. The null hypothesis is that this educational programme does not change the population average IQ score. To test this null hypothesis, n1 secondary school students are assigned to the treatment group, and n2 students are assigned to the control group. The null and alternative hypotheses are respectively given by

\[\begin{align} \mathcal{H}_{0} : \mu_{1} = \mu_{2} \text{ and } \mathcal{H}_{1} : \mu_{1} \neq \mu_{2} \end{align}\]

The IQ score of each secondary school student is assumed to be drawn from a normal distribution, thus,

\[\begin{align} X_{1i} \overset{\text{iid}}{\sim} \mathcal{N} \left (\mu_{1}, \sigma^2 \right ), \text{ and } X_{2i} \overset{\text{iid}}{\sim} \mathcal{N} \left (\mu_{2}, \sigma^2 \right ) . \end{align}\]

For simplicity we take n1=n2=100. Data under the null hypothesis of no difference in IQ scores between the treatment and the control group can be generated as follows:

n1 <- n2 <- 100
sigmaTrue <- 12
alpha <- 0.05

muGlobal <- 112
  
set.seed(2)
dataGroup1 <- rnorm(n1, mean=muGlobal, sd=sigmaTrue)
set.seed(3)
dataGroup2 <- rnorm(n2, mean=muGlobal, sd=sigmaTrue)

Note that the data are indeed generated from the null as the difference between the two population means is zero. The variable muGlobal can take on any value under the null, and is referred to as a nuisance parameter. In this case both the null and the alternative hypothesis/model are composite.

1. Sequential use of p-values over-inflates the type I error

For the p-value demonstration we use the pValueZTest function from the safestats package. The details provided here are only included for those particularly interested in its derivation. This subsection can be safely skipped, as the use of the pValueZTest function becomes clear in the next subsections.

A p-value analysis relies on the sampling distribution of a test statistic. For the case at hand, the natural statistic of interest is the difference between the sample means. The sample mean \(\bar{X}_{1}\) has sampling distribution

\[\begin{align} \bar{X}_{1} \sim \mathcal{N} \left (\mu_{1}, \frac{\sigma^{2}}{n_{1}} \right ). \end{align}\]

The sample mean \(\bar{X}_{2}\) has an analogous sampling distribution, with subscript \(2\) instead of \(1\). The difference between the two sample means has distribution

\[\begin{align} \bar{X}_{1} - \bar{X}_{2} \sim \mathcal{N} \left (\mu_{1}- \mu_{2}, \big ( \tfrac{1}{n_{1}} + \tfrac{1}{n_{2}} \big ) \sigma^{2} \right ). \end{align}\]

Hence, we can define the z-statistic as the rescaled version

\[\begin{align} Z:= \frac{ \sqrt{n_{\text{eff}}} ( \bar{X}_{1} - \bar{X}_{2})}{\sigma} \sim \mathcal{N} \left (\tfrac{ \sqrt{n_{\text{eff}}} (\mu_{1}- \mu_{2})}{\sigma}, 1 \right ), \end{align}\]

where

\[\begin{align} n_{\text{eff}} = \big ( \frac{1}{n_{1}} + \frac{1}{n_{2}} \big )^{-1} = \frac{n_{1} n_{2}}{n_{1} + n_{2}} \end{align}\]

can be referred to as the effective sample size. Hence, under the null hypothesis \(\mu_{1} = \mu_{2}\), \(Z\) has a standard normal distribution. Its p-value can thus be easily derived from the standard R function pnorm, which is effectively what the functions pValueZTest and pValueFromZStat do.
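
As a minimal sketch of this relationship (the helper variable zStat below is ours, not part of the package), the two-sided p-value for the data generated above can be computed directly with pnorm:

# z-statistic based on all n1 = n2 = 100 observations; under the null it is
# standard normal, so the two-sided p-value equals 2 * P(Z <= -|z|)
zStat <- sqrt(n1*n2/(n1 + n2)) * (mean(dataGroup1) - mean(dataGroup2)) / sigmaTrue
2 * pnorm(-abs(zStat))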

Sequential analysis

The sequential use of a p-value entails computing a p-value after each observation and claiming an effect as soon as the p-value dips below the threshold alpha, say, alpha=0.05. This procedure therefore allows for the possibility of stopping an experiment early. The problem, however, is that this procedure is not safe, as it will lead to false rejections with more than 5% chance. This is well-known, but we will show it with a simulation. The algorithmic description of this sequential p-value analysis is as follows:

# # PSEUDO CODE
# # 
# #     THIS CODE DOES NOT RUN
# # 
# # Initiate
# n <- 1
# pValue <- 1
# 
# while (pValue > alpha) {
#   pValue <- computePValueZTest(x=x[1:n], y=y[1:n])
#   
#   if (pValue < alpha) {
#     "Reject the null"
#     stop()
#   } else {
#     "Increase sample size and test again 
#           at the start of the while loop"
#     n <- n + 1
#   }
# }
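
As a runnable sketch of this pseudo code (reusing dataGroup1, dataGroup2, sigmaTrue and alpha defined above, and capping the loop at n1 looks so that it always terminates), the stopping rule could be written as:

# Runnable version of the pseudo code above for the single data set
# generated earlier; the loop is capped at n1 looks
n <- 1
pValue <- 1

while (pValue > alpha && n <= n1) {
  pValue <- pValueZTest(x=dataGroup1[1:n], y=dataGroup2[1:n],
                        sigma=sigmaTrue)$pValue
  
  if (pValue < alpha) {
    cat("Reject the null at n =", n, "\n")
  } else {
    n <- n + 1
  }
}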

One sample path

We demonstrate this procedure for a single data set of n1=n2=100 samples, and we compute the p-value 100 times, each time after a new observation from each group has come in.

pValueVector <- vector("numeric", length=n1)

for (i in 1:n1) {
  pValueVector[i] <- pValueZTest(x=dataGroup1[1:i],
                                 y=dataGroup2[1:i],
                                 sigma=sigmaTrue)$pValue
}

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();

plot(1:n1, pValueVector, type="l", lwd=2, xlab="n",
     ylab="p-value",
     col=freqColours[1])
abline(h=0.05, lty=2, lwd=2)

Using p < alpha = 0.05 as a stopping rule and as an indication of an effect leads to the experiment being stopped early in favour of the presence of an effect, despite the data being sampled under the null.
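
For this particular sample path, the look at which the p-value first dips below alpha can be read off from pValueVector:

# First look at which the p-value dips below alpha for this sample path
which(pValueVector < alpha)[1]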

This incorrect conclusion happened to occur for the specific data set we used, but it can be mathematically proven that, if we sample indefinitely under the null, this procedure is guaranteed to eventually yield a significant p-value. This is also known as sampling to a foregone conclusion.

The behaviour of the sequential p-value procedure under multiple/repeated use

The previous code dealt with a specific data set, that is, the outcomes of a single experiment. Note that this one data set yielded 100 p-values. We now repeat the procedure, but for 1,000 data sets/experiments. The code below stores 1,000 times 100 p-values in the variable allPValues.

We will see that far more than 50 out of these 1,000 data sets/repeated experiments will yield a “significant” result if we compare against alpha=0.05, despite the data being sampled under the null: far more than what we would want for a test that should retain a type I error rate guarantee at level 0.05.

The following code stores 1,000 times 100 p-values:

mIter <- 1000

allData <- generateNormalData(c(n1, n2), muGlobal=muGlobal,
                              nSim=mIter,
                              meanDiffTrue=0, seed=1,
                              sigmaTrue=sigmaTrue)
allPValues <- matrix(nrow=mIter, ncol=n1)

# This indicates whether a simulation yielded a "significant" p-value
pValueUnderAlpha <- vector("integer", length=mIter)

# This indicates the first time an experiment yielded a "significant" p-value
# Default is Inf, which indicates that the p-value didn't dip below alpha
firstPassageTime <- rep(Inf, times=mIter)

# Used to vectorise the computations for the z-statistic
n1Vector <- 1:n1
n2Vector <- 1:n2
nEffVector <- (1/n1Vector+1/n2Vector)^(-1)


for (sim in 1:mIter) {
  dataGroup1 <- allData$dataGroup1[sim, ]
  dataGroup2 <- allData$dataGroup2[sim, ]
  
  x1BarVector <- 1/n1Vector*cumsum(dataGroup1)
  x2BarVector <- 1/n2Vector*cumsum(dataGroup2)
  zVector <- sqrt(nEffVector)*(x1BarVector - x2BarVector)/sigmaTrue
  
  for (i in 1:n1) {
    currentPValue <- pValueFromZStat(zVector[i])
    allPValues[sim, i] <- currentPValue
    
    if (currentPValue < alpha && pValueUnderAlpha[sim]!=1) {
      pValueUnderAlpha[sim] <- 1
      firstPassageTime[sim] <- i
    }
  }
}

To count the number of data sets/experiments that yielded a false positive result at observations 1 to 100, we run the following code:

numberOfDippingExperimentsAtTimeN <- integer(n1)

for (i in 1:n1) {
  numberOfDippingExperimentsAtTimeN[i] <- sum(firstPassageTime <= i)
}

pValueFalseRejects <- numberOfDippingExperimentsAtTimeN/mIter

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();
plot(1:n1, 100*pValueFalseRejects, type="l", xlab="n",
     ylab="False positive rate (%)",
     lwd=2, col=freqColours[1])
lines(c(1, n1), c(5, 5), lwd=2, lty=2)

The type I error over-inflates because classical p-value methods were developed for fixed designs. For these designs, the sample size at which the data collection is both stopped and analysed has to be set before the experiment. Deviating from this plan by taking multiple looks at the data results in an over-inflation of the type I error. The plot also shows that “all alpha is spent” at the first look, where the procedure already attains the tolerable type I error. In other words, the use of classical p-values confines researchers to analysing the data once, and only once.
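
The claim that essentially all alpha is spent at the first look can be checked from the simulation output, for instance by comparing the false positive rate at the first and at the last look:

# False positive rate (in %) at the first look and after n1 = 100 looks
100 * pValueFalseRejects[c(1, n1)]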

2. Sequential use of a safe test/e-variable will not over-inflate the type I error

For the e-variable demonstration we rely on the safeZTest function from the safestats package.

To run safeZTest a design object needs to be created. For the problem at hand we provide the design function with the minimal clinically relevant effect size, say, a mean difference of 10 (the claimed increase in IQ score). The following code creates such a design object.

designObj <- designSafeZ(meanDiffMin=10,
                         testType="twoSample",
                         sigma=sigmaTrue)
designObj
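
The design object stores, among other things, the parameter that the simulation code further down passes to safeZTestStat; it can be inspected directly:

# The design parameter used by safeZTestStat in the simulations below
designObj$parameter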

Sequential analysis

The sequential use of e-variables entails computing an e-value after each observation and claiming an effect as soon as the e-value in favour of the alternative over the null exceeds the evidence threshold of 1/alpha, say, 20 when alpha=0.05. This procedure is safe, as it will never lead to a false rejection rate of more than 5%. The algorithmic description of the sequential e-value procedure is as follows:

# # PSEUDO CODE
# # 
# #     THIS CODE DOES NOT RUN
# # 
# # Initiate
# n <- 1
# eValue <- 1
# 
# while (eValue < 1/alpha) {
#   eValue <- safeZTest(x=x[1:n], y=y[1:n],
#                       designObj=designObj)
#   
#   if (eValue > 1/alpha) {
#     "Reject the null"
#     stop()
#   } else {
#     "Increase sample size and test again 
#           at the start of the while loop"
#     n <- n + 1
#   }
# }

One sample path

We demonstrate this procedure for a single data set of n1=n2=100 samples, and we monitor the e-value 100 times, each time after a new observation from each group has come in.

# Single sample path data set as in the p-value example
set.seed(2)
dataGroup1 <- rnorm(n1, mean=muGlobal, sd=sigmaTrue)
set.seed(3)
dataGroup2 <- rnorm(n2, mean=muGlobal, sd=sigmaTrue)

result <- safeZTest(dataGroup1, dataGroup2, designObj=designObj)
plot(result)

This sample path does not exceed the evidence threshold of 20, and under the null fewer than 5% of sample paths ever will, even if we increase the sample size indefinitely.
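
As a runnable sketch of the pseudo code above, applied to this same sample path, the stopping rule could look as follows. This assumes that the object returned by safeZTest exposes the e-value through an eValue element; for this data set the loop simply runs through all n1 looks without rejecting.

# Sequential monitoring sketch; the loop is capped at n1 looks and stops
# as soon as e > 1/alpha (assumes safeZTest returns an eValue element)
n <- 1
eValue <- 1

while (eValue < 1/alpha && n <= n1) {
  eValue <- safeZTest(x=dataGroup1[1:n], y=dataGroup2[1:n],
                      designObj=designObj)$eValue
  
  if (eValue > 1/alpha) {
    cat("Reject the null at n =", n, "\n")
  } else {
    n <- n + 1
  }
}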

The behaviour of the e-variable procedure under multiple/repeated use

The previous code dealt with a specific data set, that is, the outcomes of a single experiment. Note that this one data set yielded 100 e-values. We now repeat the procedure, but for 1,000 data sets/experiments. The code below stores 1,000 times 100 e-values in the variable allEValues.

We will see that fewer than 50 out of these 1,000 repeated experiments/data sets generated under the null will yield a false rejection, if we reject the null as soon as eValue > 20.

The following code stores 1,000 times 100 e-values:

mIter <- 1000

allData <- generateNormalData(c(n1, n2),
                              muGlobal=muGlobal,
                              nSim=mIter,
                              meanDiffTrue=0, seed=1, sigmaTrue=sigmaTrue)
allEValues <- matrix(nrow=mIter, ncol=n1)

# This indicates whether a simulation yielded e > 1/alpha
eValueOver <- vector("integer", length=mIter)

# This indicates the first time an experiment yielded e > 1/alpha
# Default is Inf, which indicates that the e didn't cross 1/alpha
firstPassageTime <- rep(Inf, times=mIter)

# Used to vectorise the computations for the z-statistic
n1Vector <- 1:n1
n2Vector <- 1:n2
nEffVector <- (1/n1Vector+1/n2Vector)^(-1)

for (sim in 1:mIter) {
  dataGroup1 <- allData$dataGroup1[sim, ]
  dataGroup2 <- allData$dataGroup2[sim, ]
  
  x1BarVector <- 1/n1Vector*cumsum(dataGroup1)
  x2BarVector <- 1/n2Vector*cumsum(dataGroup2)
  zVector <- sqrt(nEffVector)*(x1BarVector - x2BarVector)/sigmaTrue
  
  for (i in 1:n1) {
    currentEValue <- safeZTestStat(zVector[i], 
                                   parameter=designObj$parameter, 
                                   n1=n1Vector[i], n2=n2Vector[i], 
                                   sigma=sigmaTrue, eType=designObj$eType)$eValue
    
    allEValues[sim, i] <- currentEValue
    
    if (currentEValue > 1/alpha && eValueOver[sim]!=1) {
      eValueOver[sim] <- 1
      firstPassageTime[sim] <- i
    }
  }
}

To count the number of data sets/experiments that yielded a false positive result at observations 1 to 100, we run the following code:

numberOfCrossingExperimentsAtTimeN <- integer(n1)

for (i in 1:n1) {
  numberOfCrossingExperimentsAtTimeN[i] <- sum(firstPassageTime <= i)
}

eValueFalseRejects <- numberOfCrossingExperimentsAtTimeN/mIter 

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();
plot(1:n1, 100*eValueFalseRejects, type="l", 
     xlab="n", ylab="False positive rate (%)", lwd=2,
     col=eColours[1], ylim=c(0, 5))
lines(c(1, n1), c(5, 5), lwd=2, lty=2)

Hence, after n1=n2=100 observations, that is, 100 looks at the data, we falsely reject the null in 36 out of the 1,000 experiments (3.6%). This is still below the tolerable alpha level of 5%.
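
These numbers can be read off from the simulation output directly:

# Cumulative number and percentage of false rejections after n1 = 100 looks
numberOfCrossingExperimentsAtTimeN[n1]
100 * eValueFalseRejects[n1]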

We would like to highlight two points. Firstly, the type I error control that e-variables provide does not depend on a maximum sample size. If we increase it to n1=n2=200, we get 2 additional false rejections, resulting in a total of 38 out of 1,000 false rejections (3.8%), still below the 5%. One might wonder whether the chance of a false rejection will ever exceed the tolerable 5% if we simply keep increasing the sample sizes. In a very general framework, Grunwald, de Heide and Koolen (2019) showed that this will not occur, even if we sample indefinitely under the null.

Secondly, type I error control holds for any (hyper)parameter gMom set in the design object. The design document provides some insight into how gMom was set.

Concluding remarks

The following plot summarises the false rejection rates of the two procedures after n1=n2=100 observations/looks: in (IKEA) yellow the sequential p-value procedure, resulting in a type I error of 38.4%, and in blue the e-value procedure, resulting in a type I error of 3.6%:

oldPar <- setSafeStatsPlotOptionsAndReturnOldOnes();
plot(1:n1, 100*pValueFalseRejects, type="l", xlab="n", 
     ylab="False positive rate (%)", 
     bty="n", lwd=2, ylim=c(0, 40), col=freqColours[1])
lines(c(1, n1), c(5, 5), lwd=2, lty=2)
lines(1:n1, 100*eValueFalseRejects, col=eColours[1], 
      lwd=2)
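
Since the plotting helper returns the previously active graphical parameters, they can be restored once we are done plotting (assuming, as the helper's name suggests, that it returns the old par settings):

# Restore the graphical parameters that were active before the safestats
# plot settings were applied
par(oldPar)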

By simulating under the null, we showed that the sequential use of classical p-values can lead to early stopping at the high cost of drawing incorrect conclusions.