Hypothesis Testing

Hypothesis testing is used to determine how strongly data support a hypothesis, for example, how effective a drug is given the results of a clinical trial.

Basic Form of Inference

The data is either explained by hypothesis H or is due to chance.
It’s unlikely the data is due to chance.
Therefore, it’s likely H is true, other things being equal.

Example

A wine connoisseur claims she can taste the difference between French and California Cabernets. In a blind taste test she correctly identifies 16 out of 20 glasses of the wines. Can she taste the difference or is she just lucky? The latter hypothesis can be modeled as a probability distribution, giving the probabilities of the possible number of correct identifications. The probability of her making at least 16 correct identifications by chance is 6 out of 1,000. It’s thus reasonable to believe she can taste the difference between the Cabernets, other things being equal.
The inference
- Either the wine connoisseur identified 16 out of 20 glasses because she can taste the difference or it was a matter of chance.
- The probability she identified at least 16 out of 20 glasses by chance is 6/1,000.
- Therefore, it’s very likely she can taste the difference between the wines, other things being equal.

Key Difference between Observational Studies and Randomized Controlled Trials

An observational study investigates a hypothesized causal connection by comparing naturally-occurring groups.
A randomized controlled trial investigates a hypothesized causal connection by comparing randomized groups.
The key difference is how subjects wind up in the groups they’re in.
- In an observational study subjects decide which group they’re in.
- In a randomized controlled trial chance decides which group they’re in.

Observational Study

An observational study investigates a hypothesized causal connection by comparing naturally-occurring groups.
To prevent colds some people take Echinacea, a herb supposed to stimulate the body’s immune system and ward off infections, particularly the common cold.
Consider a hypothetical observational study of Echinacea, specifically a cohort study:
- 100 people taking Echinacea are contacted and asked to contact an assigned doctor if they think they have a cold.
- A control group is created by finding, for each person taking Echinacea, a person not taking Echinacea with the same age, health, and gender.
- After a year data is compiled and results computed.
The hypothetical results:
- 27 of the 100 people not taking Echinacea got colds
- 10 of the 100 Echinacea takers got colds
The probability of a difference this large arising by chance is about 1 in 500. So it seems reasonable to infer that Echinacea helps prevent colds.
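The page does not say how the 1-in-500 figure was computed. A minimal Python sketch, assuming a pooled two-proportion z-test with a two-tailed p-value (an assumption, as is the function name `two_proportion_p_value`), gives a number of that size:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-tailed p-value for a difference in proportions (pooled z-test)."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    # Standard error of the difference if both groups share the pooled rate
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Probability of a difference at least this large in either direction
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical study: 10 of 100 Echinacea-takers vs 27 of 100 controls got colds
p_value = two_proportion_p_value(10, 100, 27, 100)
print(f"p-value = {p_value:.4f}")
```

The result is roughly 0.002, or about 1 in 500. Other tests (for example an exact test) would give a somewhat different value.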
But there’s a problem with this inference. Observational studies are subject to confounding. People who take Echinacea may be more concerned about their health than the general population; and may therefore do additional things to prevent colds, like washing their hands often. It’s thus possible that the Echinacea-takers in the study got fewer colds, not because Echinacea works, but because they washed their hands. Washing hands would therefore be a confounding (or lurking) variable.
The possibility of confounding means you can’t infer that Echinacea prevents colds from the fact that the chance of the 27-versus-10-colds result is 1 out of 500.
Here’s why:
- The argument that Echinacea works is:
  1. Either Echinacea prevents colds or it doesn’t.
  2. If it does, there should be significantly fewer colds among Echinacea-takers than among the controls.
  3. If Echinacea doesn’t prevent colds, there should be no significant difference.
  4. Echinacea-takers had significantly fewer colds, 10 versus 27.
  5. Therefore, Echinacea prevents colds.
- Premise 3 is false.
  - A significant difference in colds may be due to a confounding variable like hand-washing rather than to Echinacea.
- Bottom line:
  - There are three possible explanations of the significant difference in colds:
    - Echinacea prevents colds
    - Chance
    - A confounding variable prevents colds.
  - To prove Echinacea works you have to eliminate the possibility of confounding.
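Confounding variables can be controlled for. For example, subjects could be asked to track how often they wash their hands. Hand-washers taking Echinacea would then be compared to hand-washers in the control group, and likewise for non-hand-washers.
- Suppose 65 Echinacea-takers wash their hands, versus 10 in the control group. If no hand-washers got colds, the breakdown would be:
  - Echinacea group, hand-washers: 0 of 65 got colds (0%)
  - Echinacea group, non-hand-washers: 10 of 35 got colds (29%)
  - Control group, hand-washers: 0 of 10 got colds (0%)
  - Control group, non-hand-washers: 27 of 90 got colds (30%)
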
The percent of hand-washers with colds is zero for both groups. The percent of non-hand-washers with colds for the two groups is virtually the same: 29% versus 30%. Controlling for hand-washing, the results of the study no longer support Echinacea’s efficacy.
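The percentages can be checked with a short Python sketch. The counts are the ones implied by the example: 100 subjects per group, 65 and 10 hand-washers, and all 10 and 27 colds occurring among non-hand-washers. The dictionary layout and the name `cold_rate` are illustrative only.

```python
# (group, washes hands) -> (number of subjects, number with colds)
counts = {
    ("Echinacea", True): (65, 0),
    ("Echinacea", False): (35, 10),
    ("Control", True): (10, 0),
    ("Control", False): (90, 27),
}

def cold_rate(group, washes_hands):
    """Fraction of subjects in one stratum who got colds."""
    subjects, colds = counts[(group, washes_hands)]
    return colds / subjects

# Compare like with like: hand-washers to hand-washers, and so on
for washes_hands in (True, False):
    label = "hand-washers" if washes_hands else "non-hand-washers"
    for group in ("Echinacea", "Control"):
        print(f"{group:9s} {label:16s} {cold_rate(group, washes_hands):.0%}")
```

Within each stratum the two groups have nearly the same cold rate, which is what controlling for hand-washing means here.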
You can only control for the variables you know about. No matter how hard you try, the results of an observational study may still be due to an unknown confounding variable. Unknown confounding variables are the Achilles heel of observational studies.
The possibility of unknown confounding variables means you still can’t infer that Echinacea prevents colds from the fact that the probability of the 27-versus-10-colds result is 1/500.

Randomized Controlled Trial

A randomized controlled trial (RCT) investigates a hypothesized causal connection by comparing randomized groups.
A double-blind randomized controlled trial (DB-RCT) is the gold standard for testing causal hypotheses.
Consider a hypothetical DB-RCT of Echinacea:
- 200 subjects are selected.
- A computer randomly divides them into a treatment group taking Echinacea and a control group taking a placebo. Neither the subjects nor experimenters know who’s taking what.
- The subjects take pills for twelve days.
- On day seven they’re challenged with rhinovirus, a nice way of saying that rhinovirus is squirted up their noses.
- On day twelve the subjects are examined to see who has a cold.
- Data is compiled and results computed.
Hypothetical results:
- 27 subjects in the control group got colds
- 10 subjects in the treatment group got colds
The DB-RCT solves the problem of confounding variables.
- In the observational study, Echinacea-takers may have significantly fewer colds, not because Echinacea works, but because more of them wash their hands. But in the DB-RCT hand-washers are (nearly) equally distributed between groups, since the probability a hand-washer winds up in the Echinacea group is 1/2.
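A small simulation illustrates why randomization spreads hand-washers evenly. This is a sketch under assumed numbers: 200 subjects, of whom 75 are hand-washers (the total from the observational example), split at random into two groups of 100. The function name and the seed are illustrative.

```python
import random

def handwashers_in_treatment(n_subjects=200, n_handwashers=75, rng=random):
    """Randomly split subjects into two equal groups; count hand-washers in the first."""
    # 1 marks a hand-washer, 0 a non-hand-washer
    subjects = [1] * n_handwashers + [0] * (n_subjects - n_handwashers)
    rng.shuffle(subjects)
    treatment = subjects[: n_subjects // 2]
    return sum(treatment)

rng = random.Random(0)  # fixed seed so the run is repeatable
counts = [handwashers_in_treatment(rng=rng) for _ in range(10_000)]
average = sum(counts) / len(counts)
print(f"average hand-washers in treatment group: {average:.1f} of 75")
print(f"range over 10,000 randomizations: {min(counts)} to {max(counts)}")
```

On average about half of the hand-washers land in each group. Any single randomization can still be somewhat unbalanced, which is why the text says "nearly".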
This means that, with confounding ruled out, you can infer that Echinacea helps prevent colds from the 1-out-of-500 chance of the 27-versus-10-colds result.
- The argument is the same as before:
  1. Either Echinacea prevents colds or it doesn’t.
  2. If it does, there should be significantly fewer colds among Echinacea-takers than among the controls.
  3. If Echinacea doesn’t prevent colds, there should be no significant difference.
  4. Echinacea-takers had significantly fewer colds, 10 versus 27.
  5. Therefore, Echinacea prevents colds.
- Except now, premise 3 is true. Randomizing subjects has eliminated confounding.
- For a Bayesian version of this inference, see “RCT Inference in Bayesian Form” in the Addendum.
Bottom line
- There are only two possible explanations of the significant difference in colds
  - Echinacea prevents colds
  - Chance
- By eliminating chance, you prove that Echinacea helps prevent colds.
A DB-RCT has the added benefit of preventing bias, since neither experimenters nor subjects know who’s in which group.
A drawback of RCTs is that they sometimes raise ethical concerns, e.g. conducting an RCT to determine whether inhaling asbestos particles causes lung cancer.

Gina Kolata has a nice overview of observational studies and DB-RCTs.
- nytimes.com/2008/09/30/health/30stud.html
Analysis of observational studies and RCTs of Hydroxychloroquine for COVID-19
- No New Revelation on Hydroxychloroquine and COVID-19 factcheck.org

P-values

The key statistic in evaluating the results of a statistical hypothesis test is the p-value, developed in the early 1900s by Karl Pearson and Ronald A. Fisher.
- The p-value for outcome O in a hypothesis test of hypothesis H₁ is the probability of getting an outcome by chance that supports H₁ at least as much as O does.

Wine Taste Test

In the taste test above, the p-value for 16 identifications is the probability of correctly identifying 16, 17, 18, 19, or 20 glasses of wine by chance.
The p-value is computed using the Binomial Distribution, which gives the probabilities of the possible outcomes of a series of binary tests, called Bernoulli Trials, where the result in each trial is either one thing or another: success or failure, right or wrong, heads or tails, 1 or 0.
The Binomial Distribution takes two input parameters:
- the number of trials
- the probability of success on a trial.
The graph is for a Binomial Distribution for
- 20 trials
- probability 0.5
The red line is drawn at 16 successes.
The p-value is 0.0059, the probability that the number of successes is 16 or more, that is, at or to the right of the line.
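The 0.0059 figure can be reproduced by summing the binomial probabilities for 16 through 20 successes. A minimal Python sketch (the function name `binomial_upper_tail` is illustrative):

```python
from math import comb

def binomial_upper_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# 20 glasses, chance of a correct guess 0.5, at least 16 correct
p_value = binomial_upper_tail(16, 20)
print(f"{p_value:.4f}")  # 0.0059
```

With p = 0.5 the sum is 6,196 of the 1,048,576 equally likely guess sequences.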

A curious scenario is where the wine connoisseur gets all 20 identifications wrong. The probability of this happening by chance is about one in a million, the same as getting all 20 identifications right. The most plausible explanation, though counter-intuitive, is that she can taste the difference between the wines but has the names mixed up. If so, the p-value for 16 correct identifications is the probability of getting at least 16 identifications right or at least 16 wrong (that is, at most 4 right): 0.012. The p-value in this case is two-tailed.
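Both numbers in this paragraph follow from counting guess sequences. A short Python sketch, assuming pure guessing so that each of the 2^20 sequences is equally likely:

```python
from math import comb

trials = 20
# ways[k] = number of guess sequences with exactly k of the 20 glasses right
ways = [comb(trials, k) for k in range(trials + 1)]
total = 2**trials

all_wrong = ways[0] / total
upper_tail = sum(ways[16:]) / total  # 16 or more right
lower_tail = sum(ways[:5]) / total   # 4 or fewer right
two_tailed = upper_tail + lower_tail

print(f"P(all 20 wrong) = {all_wrong:.2e}")
print(f"two-tailed p-value = {two_tailed:.3f}")
```

Because the distribution is symmetric when the chance of a correct guess is 0.5, the two-tailed value is exactly twice the one-tailed value.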

In the 1920s, R. A. Fisher proposed that a p-value ≤ 0.05 be the criterion of statistical significance. Thus both the one-tailed and two-tailed p-values for the wine connoisseur would be statistically significant.
- wikipedia.org/wiki/P-value
Yet drawing a line at 0.05 or anywhere else seems arbitrary.
- 800 scientists say it’s time to abandon “statistical significance” Vox

Finally, it’s key to remember that the p-value is not the probability of the null hypothesis given the test results. Rather, it’s the probability of the results given the null hypothesis.
Using the notation of conditional probability:
- p-value = P(test results or better | null hypothesis)
- p-value ≠ P(null hypothesis | test results)
  - “P(A | B)” means the probability of A given B.

An RCT of Echinacea from the New England Journal of Medicine

An RCT of Echinacea was published in the New England Journal of Medicine.
- nejm.org/doi/full/10.1056/NEJMoa044441
The hypothesis being tested was
- Taking Echinacea has an effect on infection with a rhinovirus.
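Results
- 42 of the 52 subjects in the treatment group got colds (81%)
- 88 of the 103 subjects in the control group got colds (85%)
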
The paper calculated a two-tailed p-value of 0.46.
The conclusion of the study was that Echinacea does “not have clinically significant effects on infection with a rhinovirus.”

How p-values are calculated

Calculation using Monte Carlo Simulation in Mathematica
- Idea
  - Do the following 100,000 times.
    - Select 52 ones and zeros randomly, where the probability of a one is (42 + 88) / (52 + 103).
      - Ones and zeros represent subjects who get sick or not.
    - Select 103 ones and zeros randomly, where the probability of a one is also (42 + 88) / (52 + 103).
    - Calculate the percentage of ones in each group, i.e. the percent who got sick.
    - Count how many times the absolute difference between the two proportions is at least as large as the observed difference of 0.0467.
  - Divide the count by 100,000 to get the p-value.
- Code
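The steps above can be sketched in Python. This is not the page's Mathematica code; the function name, the fixed seed, and the choice to count exact ties as extreme (in line with the "at least as much" definition of the p-value) are assumptions.

```python
import random

COLDS_TREAT, TOTAL_TREAT = 42, 52
COLDS_CONTROL, TOTAL_CONTROL = 88, 103

def monte_carlo_p_value(n_runs=100_000, seed=0):
    """Estimate the two-tailed p-value by simulating both groups under the null."""
    rng = random.Random(seed)
    # Under the null hypothesis both groups share the pooled probability of getting sick
    pooled = (COLDS_TREAT + COLDS_CONTROL) / (TOTAL_TREAT + TOTAL_CONTROL)
    observed = abs(COLDS_TREAT / TOTAL_TREAT - COLDS_CONTROL / TOTAL_CONTROL)
    extreme = 0
    for _ in range(n_runs):
        # Each subject is a one (sick) with probability `pooled`, else a zero
        sick_treat = sum(rng.random() < pooled for _ in range(TOTAL_TREAT))
        sick_control = sum(rng.random() < pooled for _ in range(TOTAL_CONTROL))
        diff = abs(sick_treat / TOTAL_TREAT - sick_control / TOTAL_CONTROL)
        if diff >= observed:  # ties count as at least as extreme
            extreme += 1
    return extreme / n_runs

p_value = monte_carlo_p_value()
print(f"estimated p-value: {p_value:.3f}")
```

The estimate varies with the seed and the number of runs, and should land in the neighborhood of the 0.46 reported in the paper.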

Calculation in Mathematica using the standard error and Z-statistic
- Data
  - ColdsTreat = 42;
  - TotalTreat = 52;
  - ColdsControl = 88;
  - TotalControl = 103;
- Proportions with colds
  - PctTreat = ColdsTreat / TotalTreat = 0.808
  - PctControl = ColdsControl / TotalControl = 0.854
- Proportion of pooled data with colds
  - PctPooled = (ColdsTreat + ColdsControl)/(TotalTreat + TotalControl) = 0.83871
- Standard error for proportion of treatment group
  - StdErrTreat = Sqrt[Variance[BinomialDistribution[1, PctPooled ]] / TotalTreat] = 0.051
- Standard error for proportion of control group
  - StdErrControl = Sqrt[Variance[BinomialDistribution[1, PctPooled ]] / TotalControl] = 0.036
- Standard error for difference between proportions
  - StdErrDiff = Sqrt[StdErrTreat^2 + StdErrControl^2] = 0.0626
- Z-statistic
  - ZScore = (PctTreat - PctControl) / StdErrDiff = -0.746
- P-value for Z-statistic
  - PValue = N[Probability[x <= ZScore || x >= -ZScore, x \[Distributed] NormalDistribution[]]] = 0.456
- See stattrek.com/hypothesis-test/difference-in-proportions.aspx
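For readers without Mathematica, the same calculation can be followed step by step in Python. This is a sketch that mirrors the quantities above using only the standard library; the snake_case names are illustrative renamings, and the one-tailed value is added for comparison.

```python
from math import sqrt
from statistics import NormalDist

colds_treat, total_treat = 42, 52
colds_control, total_control = 88, 103

pct_treat = colds_treat / total_treat
pct_control = colds_control / total_control
pct_pooled = (colds_treat + colds_control) / (total_treat + total_control)

# Variance of a single Bernoulli trial with the pooled success probability
bernoulli_var = pct_pooled * (1 - pct_pooled)
std_err_treat = sqrt(bernoulli_var / total_treat)
std_err_control = sqrt(bernoulli_var / total_control)
std_err_diff = sqrt(std_err_treat**2 + std_err_control**2)

z_score = (pct_treat - pct_control) / std_err_diff
standard_normal = NormalDist()
# Two tails: z at least this far from zero in either direction
p_two_tailed = 2 * standard_normal.cdf(-abs(z_score))
# One tail: treatment proportion at least this far below control
p_one_tailed = standard_normal.cdf(z_score)

print(f"z = {z_score:.3f}")
print(f"two-tailed p = {p_two_tailed:.3f}, one-tailed p = {p_one_tailed:.3f}")
```

The z-statistic is negative because it is defined as treatment minus control, so the tail that favors prevention is the lower one.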

Why is the p-value two-tailed?
- The p-value is two-tailed, meaning that it includes not only extreme cases where the echinacea group has fewer colds than the control group but also extreme cases where the echinacea group has more colds. The latter cases support the hypothesis that echinacea causes colds. What gives?
- The conclusion of the NEJM paper is that Echinacea “had no significant effect on rhinovirus infection.” Having an effect includes both prevention and causation. The two-tailed P-value is thus appropriate.
- The one-tailed p-value, 0.23, is appropriate for the hypothesis that Echinacea prevents colds.

Addendum

RCT Inference in Bayesian Form

Bayes’ Theorem
Let
- H₁ = Echinacea prevents colds
- H₂ = Echinacea doesn’t prevent colds
- E = The percentage of Echinacea-takers with colds is at least 17 points less than that of non-Echinacea-takers with colds
P(H₁) = 0.5
- This is the probability of H₁, apart from the results of the study. We assume it’s 1/2.
P(H₂)= 0.5
- This is the probability of H₂, apart from the results of the study.
P(E | H₁) = 1
- This is the probability of E if H₁ is true. If Echinacea works there should be significantly fewer colds among Echinacea takers.
P(E | H₂) = 0.002
- This is the p-value for the results of the study, that is, the probability of E if it arises by chance. The premise is true only if the study is an RCT.
Therefore P(H₁| E) = 0.998
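The final step is Bayes' theorem applied to two exhaustive hypotheses. A minimal Python sketch (the function name `posterior` is illustrative, and the second call uses a hypothetical prior of 0.1 that does not come from the text):

```python
def posterior(prior_h1, likelihood_h1, likelihood_h2):
    """P(H1 | E) by Bayes' theorem for two exhaustive, mutually exclusive hypotheses."""
    prior_h2 = 1 - prior_h1
    # Total probability of the evidence E under both hypotheses
    evidence = likelihood_h1 * prior_h1 + likelihood_h2 * prior_h2
    return likelihood_h1 * prior_h1 / evidence

# Values from the text: P(H1) = 0.5, P(E | H1) = 1, P(E | H2) = 0.002
p_h1_given_e = posterior(prior_h1=0.5, likelihood_h1=1.0, likelihood_h2=0.002)
print(f"P(H1 | E) = {p_h1_given_e:.3f}")  # 0.998

# A more skeptical (hypothetical) prior of 0.1 lowers the posterior somewhat
skeptical = posterior(prior_h1=0.1, likelihood_h1=1.0, likelihood_h2=0.002)
print(f"P(H1 | E) with prior 0.1 = {skeptical:.3f}")
```

The posterior depends on the prior as well as on the p-value, which is why the p-value alone is not the probability of either hypothesis.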
