Education and debate |
Department of Social Medicine, University of Bristol, Bristol BS8 2PR
Jonathan A C Sterne senior lecturer in medical statistics
George Davey Smith professor of clinical epidemiology
Correspondence to: J Sterne jonathan.sterne{at}bristol.ac.uk
Accepted November 9, 2000
| Introduction |
|---|
One contributory factor is that the medical literature shows a strong tendency to accentuate the positive; positive outcomes are more likely to be reported than null results.2-4 By this means alone a host of purely chance findings will be published, as by conventional reasoning examining 20 associations will produce, on average, one result that is "significant at P = 0.05" by chance alone. If only positive findings are published then they may be mistakenly considered to be of importance rather than being the necessary chance results produced by the application of criteria for meaningfulness based on statistical significance. As many studies contain long questionnaires collecting information on hundreds of variables, and measure a wide range of potential outcomes, several false positive findings are virtually guaranteed. The high volume and often contradictory nature5 of medical research findings, however, is not only because of publication bias. A more fundamental problem is the widespread misunderstanding of the nature of statistical significance.
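The arithmetic behind the 20 associations can be sketched as follows. This is a minimal illustration only: it assumes that every null hypothesis is true and that the tests are independent, and the function names are ours rather than anything from the paper.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of "significant" results when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float) -> float:
    """Chance of one or more "significant" results, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


# Twenty associations examined at the conventional 5% threshold.
print(expected_false_positives(20, 0.05))
print(round(prob_at_least_one_false_positive(20, 0.05), 3))
```

Under these assumptions the expected number of chance findings is one, and the probability of at least one is roughly two in three.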
In this paper we consider how the practice of significance testing emerged; an arbitrary division of results as "significant" or "non-significant" (according to the commonly used threshold of P = 0.05) was not the intention of the founders of statistical inference. P values need to be much smaller than 0.05 before they can be considered to provide strong evidence against the null hypothesis; this implies that more powerful studies are needed. Reporting of medical research should continue to move from the idea that results are significant or non-significant to the interpretation of findings in the context of the type of study and other available evidence. Editors of medical journals are in an excellent position to encourage such changes, and we conclude with proposed guidelines for reporting and interpretation that we want to evaluate.
| P values and significance testing: a brief history |
|---|
|
|
|---|
| Summary points: P values, or significance levels, measure the strength of the evidence against the null hypothesis; the smaller the P value, the stronger the evidence against the null hypothesis. An arbitrary division of results, into "significant" or "non-significant" according to the P value, was not the intention of the founders of statistical inference. A P value of 0.05 need not provide strong evidence against the null hypothesis, but it is reasonable to say that P < 0.001 does. In the results sections of papers the precise P value should be presented, without reference to arbitrary thresholds. Results of medical research should not be reported as "significant" or "non-significant" but should be interpreted in the context of the type of study and other available evidence. Bias or confounding should always be considered for findings with low P values. To stop the discrediting of medical research by chance findings we need more powerful studies.
|
Fisher saw the P value as an index measuring the strength of evidence against the null hypothesis (in our example, the hypothesis that the drug does not affect survival rates). He advocated P < 0.05 (5% significance) as a standard level for concluding that there is evidence against the hypothesis tested, though not as an absolute rule. "If P is between 0.1 and 0.9 there is certainly no reason to suspect the hypothesis tested. If it is below 0.02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at 0.05 ...."9 Importantly, Fisher argued strongly that interpretation of the P value was ultimately for the researcher. For example, a P value of around 0.05 might lead to neither belief nor disbelief in the null hypothesis but to a decision to perform another experiment.
Dislike of the subjective interpretation inherent in this approach led Neyman and Pearson to propose what they called "hypothesis tests," which were designed to replace the subjective view of the strength of evidence against the null hypothesis provided by the P value with an objective, decision based approach to the results of experiments.10 Neyman and Pearson argued that there were two types of error that could be made in interpreting the results of an experiment (table 1). Fisher's approach concentrates on the type I error: the probability of rejecting the null hypothesis (that the treatment has no effect) if it is in fact true. Neyman and Pearson were also concerned about the type II error: the probability of accepting the null hypothesis (and thus failing to use the new treatment) when in fact it is false (the treatment works). By fixing, in advance, the rates of type I and type II error, the number of mistakes made over many different experiments would be limited. These ideas will be familiar to anyone who has performed a power calculation to find the number of participants needed in a clinical trial; in such calculations we aim to ensure that the study is large enough to allow both type I and type II error rates to be small.
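A power calculation of the kind mentioned above can be sketched as follows. This is an illustration under stated assumptions, not a method given in the paper: it compares two proportions using the usual normal approximation, and the function name and the example risks (20% and 10%) are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control: float, p_treated: float,
                          alpha: float = 0.05, power: float = 0.9) -> int:
    """Approximate participants per arm for a two sided comparison of two risks.

    alpha is the type I error rate; power is 1 minus the type II error rate.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the type I error
    z_beta = NormalDist().inv_cdf(power)           # value fixing the type II error
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treated) ** 2)


# Hypothetical trial: risk of death 20% on placebo and 10% on the new drug.
print(sample_size_per_group(0.20, 0.10, alpha=0.05, power=0.9))
print(sample_size_per_group(0.20, 0.10, alpha=0.05, power=0.5))
```

Fixing both error rates in advance determines the size of the study; asking for higher power, or a smaller type I error rate, increases the number of participants needed.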
|
Thus, in the Neyman-Pearson approach we decide on a decision rule for interpreting the results of our experiment in advance, and the result of our analysis is simply the rejection or acceptance of the null hypothesis. In contrast with Fisher's more subjective view (Fisher strongly disagreed with the Neyman-Pearson approach11), we make no attempt to interpret the P value to assess the strength of evidence against the null hypothesis in an individual study.
To use the Neyman-Pearson approach we must specify a precise alternative hypothesis. In other words, it is not enough to say that the treatment works; we have to say by how much the treatment works, for example, that our drug reduces mortality by 60%. The researcher is free to change the decision rule by specifying the alternative hypothesis and type I and type II error rates, but this must be done in advance of the experiment. Unfortunately researchers find it difficult to live up to these ideals. With the exception of the primary question in randomised trials, they rarely have in mind a precise value of the treatment effect under the alternative hypothesis before they carry out their studies or specify their analyses. Instead, only the easy part of Neyman and Pearson's approach (that the null hypothesis can be rejected if P < 0.05, a type I error rate of 5%) has been widely adopted. This has led to the misleading impression that the Neyman-Pearson approach is similar to Fisher's.
In practice, and partly because of the requirements of regulatory bodies and medical journals,12 the use of statistics in medicine became dominated by a division of results into significant or not significant, with little or no consideration of the type II error rate. Two common and potentially serious consequences of this are that possibly clinically important differences observed in small studies are denoted as nonsignificant and ignored, while all significant findings are assumed to result from real treatment effects.
These problems, noted long ago13 and many times since,14-17 led to the successful campaign to augment the presentation of statistical analyses by presenting confidence intervals in addition to, or in place of, P values.18-20 By focusing on the results of the individual comparison, confidence intervals should move us away from a mechanistic accept-reject dichotomy. For small studies, they may remind us that our results are consistent with both the null hypothesis and an important beneficial, or harmful, treatment effect (and often both). For P values of around 0.05 they also emphasise the possibility of the effect being much smaller, or larger, than estimated. 95% confidence intervals, however, implicitly use the 5% cut off, and this still leads to confusion in their interpretation if they are used simply as a means of assessing significance (according to whether the confidence interval includes the null value) rather than to look at a plausible range for the magnitude of the population difference. We suggest that medical researchers should stop thinking of 5% significance (P < 0.05) as having any particular importance. One way to encourage this would be to adopt a different standard confidence level.
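To see how much the reading of an interval can depend on the chosen confidence level, consider the following sketch. The trial counts are invented for illustration, the function name is ours, and the interval uses the standard large sample approximation on the log risk ratio scale.

```python
from math import exp, sqrt
from statistics import NormalDist


def risk_ratio_ci(deaths_treated: int, n_treated: int,
                  deaths_control: int, n_control: int,
                  level: float = 0.95) -> tuple[float, float, float]:
    """Risk ratio with an approximate confidence interval at the given level."""
    rr = (deaths_treated / n_treated) / (deaths_control / n_control)
    # Standard error of log(risk ratio), large sample approximation.
    se = sqrt(1 / deaths_treated - 1 / n_treated
              + 1 / deaths_control - 1 / n_control)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return rr, rr * exp(-z * se), rr * exp(z * se)


# Hypothetical small trial: 15/100 deaths on the drug, 25/100 on placebo.
for level in (0.90, 0.95):
    rr, lower, upper = risk_ratio_ci(15, 100, 25, 100, level)
    print(f"{level:.0%} interval: {rr:.2f} ({lower:.2f} to {upper:.2f})")
```

With these invented counts the 90% interval excludes a risk ratio of 1 while the 95% interval includes it, although the data are identical. Reading either interval as a plausible range for the effect, rather than as a significance test, avoids that artificial distinction.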
| Misinterpretation of P values and significance tests |
|---|
Firstly, we will assume that the proportion of null hypotheses that are in fact false is 10%; that is, 90% of hypotheses tested are incorrect. This is consistent with the epidemiological literature: by 1985 nearly 300 risk factors for coronary heart disease had been identified, and it is unlikely that more than a small fraction of these actually increase the risk of the disease.21 Our second assumption is that because studies are often too small the average power (= 1 - type II error rate) of studies reported in medical literature is 50%. This is consistent with published surveys of the size of trials.22-24
Suppose now that we test hypotheses in 1000 studies and reject the null hypothesis if P < 0.05. The first assumption means that in 100 studies the null hypothesis is in fact false. Because the type II error rate is 50% (second assumption) we reject the null hypothesis in 50 of these 100 studies. For the 900 studies in which the null hypothesis is true (that is, there is no treatment effect) we use 5% significance levels and so reject the null hypothesis in 45 (see table 2, adapted from Oakes25).
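The counts in this worked example follow directly from the two assumptions and can be reproduced with a few lines of arithmetic. The sketch below is ours (function and key names are hypothetical); it also derives the share of "significant" results that are false alarms, which follows from the same numbers.

```python
def significance_counts(n_studies: int, prop_false_null: float,
                        power: float, alpha: float) -> dict[str, float]:
    """Expected numbers of studies rejecting the null hypothesis at level alpha."""
    n_false_null = n_studies * prop_false_null   # a real effect exists
    n_true_null = n_studies - n_false_null       # no effect exists
    true_positives = n_false_null * power        # real effects detected
    false_positives = n_true_null * alpha        # chance findings
    significant = true_positives + false_positives
    return {
        "null false": n_false_null,
        "null true": n_true_null,
        "true positives": true_positives,
        "false positives": false_positives,
        "share of significant results that are false": false_positives / significant,
    }


# Assumptions from the text: 1000 studies, 10% false nulls, 50% power, P < 0.05.
for name, value in significance_counts(1000, 0.10, 0.50, 0.05).items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions 95 studies reject the null hypothesis, and in 45 of them (about 47%) the null hypothesis is in fact true.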
|
The ideas illustrated in table 2 are similar in spirit to the bayesian approach to statistical inference, in which we start with an a priori belief about the probability of different possible values for the treatment effect and modify this belief in the light of the data. Bayesian arguments have been used to show that the usual P < 0.05 threshold need not constitute strong evidence against the null hypothesis.27,28 Various authors over the years have proposed that more widespread use of bayesian statistics would prevent the mistaken interpretation of P < 0.05 as showing that the null hypothesis is unlikely to be true or even act as a panacea that would dramatically improve the quality of medical research.26,29-32 Differences between the dominant ("classic" or "frequentist") and bayesian approaches to statistical inference are summarised in box 1.
| Box 1: Comparison of frequentist and bayesian approaches to statistical inference. Let us assume that we want to evaluate whether a new drug improves one year survival after myocardial infarction by using data from a placebo controlled trial. We do this by estimating the risk ratio: the risk of death in patients treated with the new drug divided by the risk of death in the control group. If the risk ratio is 0.5 then the new drug reduces the risk of death by 50%. If the risk ratio is 1 then the drug has no effect. Frequentist statistics: Like Mulder and Scully in The X-Files, frequentist statisticians believe that "the truth is out there." We use the data to make inferences about the true (but unknown) population value of the risk ratio. The 95% confidence interval gives us a plausible range of values for the population risk ratio; 95% of the times we derive such a range it will contain the true (but unknown) population value. The P value is the probability, if the null hypothesis is true, of getting a risk ratio at least as far from the null value of 1 as the one found in our study. Bayesian statistics: Bayesians take a subjective approach. We start with our prior opinion about the risk ratio, expressed as a probability distribution. We use the data to modify that opinion (we derive the posterior probability distribution for the risk ratio based on both the data and the prior distribution). A 95% credible interval is one that has a 95% chance of containing the population risk ratio. The posterior distribution can be used to derive direct probability statements about the risk ratio, for example, the probability that the drug increases the risk of death. If our prior opinion about the risk ratio is vague (we consider a wide range of values to be equally likely) then the results of a frequentist analysis are similar to the results of a bayesian analysis; both are based on what statisticians call the likelihood for the data:
|
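The relation between the two approaches described in box 1 can be illustrated with a simple sketch. It rests on assumptions that are ours, not the paper's: the log risk ratio is treated as normally distributed, the prior is normal (so the posterior has a closed form), and the estimate, standard error, and priors are invented for illustration.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def posterior_log_risk_ratio(prior_mean: float, prior_sd: float,
                             estimate: float, standard_error: float) -> tuple[float, float]:
    """Normal prior combined with a normal likelihood, on the log risk ratio scale."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = 1 / standard_error ** 2
    posterior_variance = 1 / (prior_precision + data_precision)
    posterior_mean = posterior_variance * (prior_precision * prior_mean
                                           + data_precision * estimate)
    return posterior_mean, sqrt(posterior_variance)


def interval_on_ratio_scale(mean: float, sd: float, level: float = 0.95) -> tuple[float, float]:
    """Central interval for the risk ratio from a normal distribution of its log."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return exp(mean - z * sd), exp(mean + z * sd)


def prob_drug_increases_risk(mean: float, sd: float) -> float:
    """Posterior probability that the risk ratio exceeds 1 (log risk ratio above 0)."""
    return 1 - NormalDist(mean, sd).cdf(0.0)


# Hypothetical trial result: risk ratio 0.6, standard error of its log 0.29.
estimate, standard_error = log(0.6), 0.29
frequentist_ci = interval_on_ratio_scale(estimate, standard_error)

vague = posterior_log_risk_ratio(0.0, 100.0, estimate, standard_error)    # very uncertain prior
sceptical = posterior_log_risk_ratio(0.0, 0.2, estimate, standard_error)  # prior centred on no effect

print("95% confidence interval:", frequentist_ci)
print("95% credible interval, vague prior:", interval_on_ratio_scale(*vague))
print("95% credible interval, sceptical prior:", interval_on_ratio_scale(*sceptical))
print("P(drug increases risk), vague prior:", prob_drug_increases_risk(*vague))
print("P(drug increases risk), sceptical prior:", prob_drug_increases_risk(*sceptical))
```

With the vague prior the credible interval is numerically almost identical to the confidence interval; with the sceptical prior the posterior is pulled towards a risk ratio of 1.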
| How significant is significance? |
|---|
It is often perfectly possible to increase the power of studies by increasing either the sample size or the precision of the measurements. Table 3 shows the predictive value of different P value thresholds under different assumptions about both the power of studies and the proportion of meaningful hypotheses. For any choice of P value, the proportion of "significant" results that are false positives is greatly reduced as power increases. Table 3 suggests that unless we are very pessimistic about the proportion of meaningful hypotheses, it is reasonable to regard P values less than 0.001 as providing strong evidence against the null hypothesis.
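The kind of calculation that underlies such a table can be sketched as follows. The grid of thresholds, powers, and proportions below is hypothetical and is not a reproduction of table 3; the sketch also makes the simplifying assumption that power is held fixed as the threshold changes, whereas in a real study a stricter threshold lowers power unless the study is enlarged.

```python
def false_positive_share(alpha: float, power: float, prop_false_null: float) -> float:
    """Share of results below the threshold for which the null hypothesis is true."""
    false_positives = (1 - prop_false_null) * alpha
    true_positives = prop_false_null * power
    return false_positives / (false_positives + true_positives)


thresholds = (0.05, 0.01, 0.001)       # P value thresholds
powers = (0.2, 0.5, 0.8)               # hypothetical study power
proportions = (0.5, 0.1, 0.01)         # hypothetical share of false null hypotheses

for prop in proportions:
    for power in powers:
        shares = ", ".join(
            f"P<{alpha}: {false_positive_share(alpha, power, prop):.1%}"
            for alpha in thresholds
        )
        print(f"false nulls {prop:.0%}, power {power:.0%} -> {shares}")
```

For any row, the share of false positives falls as power rises and as the threshold is tightened; it becomes large at the strictest threshold only when very few of the hypotheses tested are meaningful.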
|
| Interpreting P values: opinions, decisions, and the role of external evidence |
|---|
It is rare that studies examine issues about which nothing is already known. Increasing recognition of this is reflected in the growth of formal methods of research synthesis,35 including the presentation of updated meta-analyses in the discussion section of original research papers.36 Here the prior evidence is simply the results of previous studies of the same issue. Other forms of evidence are, of course, admissible: findings from domains as different as animal studies and tissue cultures on the one hand and secular trends and ecological differences in human disease rates on the other will all influence a final decision as to how to act in the light of study findings.37
In many ways the general public is ahead of medical researchers in its interpretation of new "evidence." The reaction to "lifestyle scares" is usually cynicism, which, for many reasons, may well be rational.38 Popular reactions can be seen to reflect a subconscious bayesianism in which the prior belief is that what medical researchers, and particularly epidemiologists, produce is gobbledegook. In medical research the periodic calls for a wholesale switch to the use of bayesian statistical inference have been largely ignored. A major reason is that prior belief can be difficult to quantify. How much weight should be given to a particular constellation of biological evidence as against the concordance of a study finding with international differences in disease rates, for example? Similarly, the predictive value of P < 0.05 for a meaningful hypothesis is easy to calculate on the basis of an assumed proportion of "meaningful" hypotheses in the study domain, but in reality it will be impossible to know what this proportion is. Tables 2 and 3 are, unfortunately, for illustration only. If we try to avoid the problem of quantification of prior evidence by making our prior opinion extremely uncertain then the results of a bayesian analysis become similar to those in a standard analysis. On the other hand, it would be reasonable to interpret P = 0.008 for the main effect in a clinical trial differently to the same P value for one of many findings from an observational study on the basis that the proportion of meaningful hypotheses tested is probably higher in the former case and that bias and confounding are less likely.
| What is to be done? |
|---|
While there is no simple or single solution, it is possible to reduce the risk of being misled by the results of hypothesis tests. This lies partly in the hands of journal editors. Important changes in the presentation of statistical analyses were achieved after guidelines insisting on presentation of confidence intervals were introduced during the 1980s. A similar shift in the presentation of hypothesis tests is now required. We suggest that journal editors require that authors of research reports follow the guidelines outlined in box 2.
Box 2 Suggested guidelines for the reporting of results of statistical analyses in medical journals
|
We are grateful to Professor S Goodman, Dr M Hills, and Dr K Abrams for helpful comments on previous versions of the manuscript; this does not imply their endorsement of our views. Bristol is the lead centre of the MRC Health Services Research Collaboration.
Funding: None.
Competing interests: Both authors have misused the word significance in the past and may have overestimated the strength of the evidence for their hypotheses.
Nuffield College, Oxford OX1 1NF
D R Cox professor
david.cox{at}nuf.ox.ac.uk
The cartoons in Sterne and Davey Smith's paper describe implicitly a double threat to progress. Firstly, there is the bombardment of an apparently nervous and litigious public with ill based stories. This leads on to the undermining of meticulous studies that may indeed point towards improved health. Statistical methods, sensibly and modestly used, are both some protection against false alarms and, more importantly, an aid, via principles of study design and analysis, to well founded investigations and ultimately to enhanced health.
To comment here on detailed statistical issues would be out of place. While guidelines too rigidly enforced are potentially dangerous, the thoughtful recommendations of box 2 are consistent with mainstream statistical thinking. That is, design to minimise bias is crucial and the estimation of magnitudes of effects, relative risks, or whatever, is central and best done by limits of error, confidence or posterior limits, or estimates and standard errors. Statistical significance testing has a limited role, usually as a supplement to estimates. Quantitative notions of personalistic probability may have some place, especially perhaps in the planning stage of an investigation, but seem out of place in the general reporting of conclusions.
The authors' castigation of the search for subgroup effects in largely null studies is indeed thoroughly justified. All reports of large effects confined to Aston Villa supporters over the age of 75 and living south of Birmingham should go into the wastepaper basket, however great the interest in that particular subgroup, or, in less extreme cases, be put into the pile of topics for future independent investigation. More might be made of a limited and preplanned search for effect modifiers, what in statistical jargon rather misleadingly tends to be called interaction. Even the most carefully planned and implemented randomised controlled trial with full compliance estimates only an average effect across the population of patients giving informed consent. The basis for extending the conclusions to different populations and to individual patients often lies primarily in scientific understanding of the mode of action of the treatments concerned but is reinforced by some check of the stability of any effect found, even if such checks are relatively insensitive.
All these issues are essentially ones of public education about the nature of scientific inquiry and the uncertainties involved. As the authors note, modern statistical thinking owes much to the statistician and geneticist R A Fisher, in particular for two books.1,2 In the second, in the same year that Karl Popper introduced the hypotheticodeductive method, Fisher wrote "Every experiment may be said to exist only to give the facts the chance of disproving the null hypothesis." On the 25th anniversary of the publication of the first book, Fisher's friend F Yates wrote an assessment of its impact, in particular criticising Fisher for his emphasis on significance testing.3 In one form or another this criticism has been repeated many times since. To understand the issues it helps to distinguish several types of hypothesis that might be tested.4 In the research laboratory it may be possible to set up an experiment for which the outcome can be predicted if the understanding of an underlying process is correct. The key issue is then consistency with that prediction. On the other hand, in many epidemiological studies and randomised controlled trials, with rare exceptions (mobile phones and brain tumours, for instance), there may be no reason for expecting the effect to be null. The issue tends more to be whether the direction of an effect has been reasonably firmly established and whether the magnitude of any effect is such as to make it of public health or clinical importance.