Statistical hypothesis test: Difference between revisions

m Rollback edit(s) by Letscontributes (talk): LLM output (UV 0.1.6)
 
(181 intermediate revisions by 81 users not shown)
Line 1:
{{Short description|Method of statistical inference}}
[[File:Common_Test_Statistics_Chart.png|thumb|The above image shows a table with some of the most common [[test statistic]]s and their corresponding tests or models.]]
{{Redirect|Critical region|the computer science notion of a "critical section", sometimes called a "critical region"|critical section}}
A '''statistical hypothesis test''' is a method of statistical inference used to decide whether the data provide sufficient evidence to reject a particular hypothesis. A statistical hypothesis test typically involves a calculation of a [[test statistic]]. Then a decision is made, either by comparing the test statistic to a [[Critical value (statistics)|critical value]] or equivalently by evaluating a [[p-value|''p''-value]] computed from the test statistic. Roughly 100 [[list of statistical tests|specialized statistical tests]] are in use and noteworthy.<ref>{{cite book |last1=Lewis |first1=Nancy D. |last2=Lewis |first2=Nigel Da Costa |last3=Lewis |first3=N. D. |title=100 Statistical Tests in R: What to Choose, how to Easily Calculate, with Over 300 Illustrations and Examples |date=2013 |publisher=Heather Hills Press |isbn=978-1-4840-5299-0 |url=https://books.google.com/books?id=wIs7mwEACAAJ |language=en}}</ref><ref>{{cite book |last1=Kanji |first1=Gopal K. |title=100 Statistical Tests |date=18 July 2006 |publisher=SAGE |isbn=978-1-4462-2250-8 |url=https://books.google.com/books?id=c16MhjA4pHgC |language=en}}</ref>
{{use mdy dates|date=November 2016}}
A '''statistical hypothesis''' is a [[hypothesis]] that is testable on the basis of [[Observable variable|observed]] data [[statistical model|modelled]] as the realised values taken by a collection of [[random variable]]s.<ref>Stuart A., Ord K., Arnold S. (1999), ''Kendall's Advanced Theory of Statistics: Volume&nbsp;2A&mdash;Classical Inference & the Linear Model'' ([[Edward Arnold (publisher)|Arnold]]) §20.2.</ref> A set of data is modelled as being realised values of a collection of random variables having a joint probability distribution in some set of possible joint distributions. The hypothesis being tested is exactly that set of possible probability distributions. A '''statistical hypothesis test''' is a method of [[statistical inference]]. An [[alternative hypothesis]] is proposed for the probability distribution of the data, either explicitly or only informally. The comparison of the two models is deemed ''[[statistically significant]]'' if, according to a threshold probability—the significance level—the data would be unlikely to occur if the [[null hypothesis]] were true. A hypothesis test specifies which outcomes of a study may lead to a rejection of the null hypothesis at a pre-specified level of significance, while using a pre-chosen measure of deviation from that hypothesis (the test statistic, or goodness-of-fit measure). The pre-chosen level of significance is the maximal allowed "false positive rate". One wants to control the risk of incorrectly rejecting a true null hypothesis.
==History==
{{See also|History of probability}}
While hypothesis testing was popularized early in the 20th century, early forms were used in the 1700s. The first use is credited to [[John Arbuthnot]] (1710),<ref name="Bellhouse2001">{{Citation|last=Bellhouse|first=P.|title=in Statisticians of the Centuries by C.C. Heyde and E. Seneta|pages=39–42|year=2001|chapter=John Arbuthnot|publisher=Springer|isbn=978-0-387-95329-8}}
</ref> followed by [[Pierre-Simon Laplace]] (1770s), in analyzing the [[human sex ratio]] at birth; see {{slink||Human sex ratio}}.
 
===Choice of null hypothesis===
The process of distinguishing between the null hypothesis and the [[alternative hypothesis]] is aided by considering two types of errors. A [[Type I and type II errors|Type I error]] occurs when a true null hypothesis is rejected. A [[Type I and type II errors|Type II error]] occurs when a false null hypothesis is not rejected.
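The two error probabilities can be computed exactly for a simple discrete test. The following is a minimal sketch, not taken from the sources cited here: the coin-tossing setup, the sample size, the alternative value 0.7 and the rejection threshold are all illustrative assumptions.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def upper_tail(k_min, n, p):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(k, n, p) for k in range(k_min, n + 1))

# Hypothetical test: toss a coin n times and reject H0 (p = 0.5)
# whenever the number of heads is at least `threshold`.
n = 20
threshold = 15

# Type I error: rejecting H0 although it is true (p = 0.5).
alpha = upper_tail(threshold, n, 0.5)

# Type II error: not rejecting H0 although the assumed
# alternative (p = 0.7) is true.
beta = 1 - upper_tail(threshold, n, 0.7)

print(f"Type I error probability:  {alpha:.4f}")
print(f"Type II error probability: {beta:.4f}")
```

Raising the threshold lowers the Type I error probability and raises the Type II error probability, which is the trade-off the two error types are meant to make explicit.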
 
[[Paul Meehl]] has argued that the [[epistemological]] importance of the choice of null hypothesis has gone largely unacknowledged. When the null hypothesis is predicted by theory, a more precise experiment will be a more severe test of the underlying theory. When the null hypothesis defaults to "no difference" or "no effect", a more precise experiment is a less severe test of the theory that motivated performing the experiment.<ref>{{cite journal|last=Meehl|first=P|year=1990|title=Appraising and Amending Theories: The Strategy of Lakatosian Defense and Two Principles That Warrant It|url=http://rhowell.ba.ttu.edu/meehl1.pdf|journal=Psychological Inquiry|volume=1|issue=2|pages=108–141|doi=10.1207/s15327965pli0102_1}}</ref> An examination of the origins of the latter practice may therefore be useful:
Hypothesis tests based on statistical significance are another way of expressing [[confidence interval]]s (more precisely, confidence sets). In other words, every hypothesis test based on significance can be obtained via a confidence interval, and every confidence interval can be obtained via a hypothesis test based on significance.<ref>{{cite book| first= John A. | last= Rice | title= Mathematical Statistics and Data Analysis | edition= 3rd | year= 2007 | publisher= [[Thomson Brooks/Cole]] | at= §9.3}}</ref>
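The correspondence can be illustrated for a two-sided test of a mean with known standard deviation. This is a sketch under assumed values (sample mean, standard deviation and sample size are hypothetical): a candidate value of the mean is rejected at level ''α'' exactly when it lies outside the 1 − ''α'' confidence interval.

```python
from math import sqrt
from statistics import NormalDist

def z_test_rejects(xbar, mu0, sigma, n, alpha=0.05):
    """Two-sided z-test of H0: mean == mu0 with known sigma."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(z) > z_crit

def confidence_interval(xbar, sigma, n, alpha=0.05):
    """(1 - alpha) confidence interval for the mean with known sigma."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z_crit * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

# Hypothetical summary statistics.
xbar, sigma, n = 10.4, 2.0, 50
low, high = confidence_interval(xbar, sigma, n)

results = []
for mu0 in (9.5, 10.0, 10.5, 11.0):
    inside_interval = low <= mu0 <= high
    rejected = z_test_rejects(xbar, mu0, sigma, n)
    results.append((mu0, inside_interval, rejected))
    print(f"mu0={mu0}: inside interval={inside_interval}, rejected={rejected}")
```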
 
'''1778:''' [[Pierre Laplace]] compares the birthrates of boys and girls in multiple European cities. He states: "it is natural to conclude that these possibilities are very nearly in the same ratio". Thus, the null hypothesis in this case is that the birthrates of boys and girls should be equal, given "conventional wisdom".<ref name="Laplace 1778" />
Significance-based hypothesis testing is the most common framework for statistical hypothesis testing. An alternative framework for statistical hypothesis testing is to specify a set of [[statistical model]]s, one for each candidate hypothesis, and then use [[model selection]] techniques to choose the most appropriate model.<ref>{{Cite book |last1=Burnham |first1=K. P. |last2=Anderson |first2=D. R. |year=2002 |title=Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach |edition=2nd |publisher=Springer-Verlag |isbn=978-0-387-95364-9 |url-access=registration |url=https://archive.org/details/modelselectionmu0000burn }}</ref> The most common selection techniques are based on either [[Akaike information criterion]] (=AIC) or [[Bayesian information criterion]] (=BIC).
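As an illustration of the model-selection framework, the sketch below compares two candidate models for a sequence of coin tosses using the standard definitions AIC = 2''k'' − 2 ln ''L'' and BIC = ''k'' ln ''n'' − 2 ln ''L''. The data (62 heads in 100 tosses) and the two candidate models are assumptions chosen for the example.

```python
from math import log

def bernoulli_log_lik(heads, n, p):
    """Log-likelihood of `heads` successes in n Bernoulli(p) trials.

    The binomial coefficient is omitted: it is the same for every
    model and cancels when the criteria are compared.
    """
    return heads * log(p) + (n - heads) * log(1 - p)

def aic(log_lik, k):
    """Akaike information criterion for a model with k free parameters."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * log(n) - 2 * log_lik

# Hypothetical data.
heads, n = 62, 100

# Model A: fair coin, no free parameters.
ll_fair = bernoulli_log_lik(heads, n, 0.5)
# Model B: success probability estimated from the data, one free parameter.
ll_free = bernoulli_log_lik(heads, n, heads / n)

print(f"AIC fair={aic(ll_fair, 0):.2f}  free={aic(ll_free, 1):.2f}")
print(f"BIC fair={bic(ll_fair, 0, n):.2f}  free={bic(ll_free, 1, n):.2f}")
```

The model with the lower criterion value is preferred; BIC penalizes the extra parameter more heavily than AIC once ''n'' exceeds about 7.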
 
'''1900:''' [[Karl Pearson]] develops the [[chi squared test]] to determine "whether a given form of frequency curve will effectively describe the samples drawn from a given population." Thus the null hypothesis is that a population is described by some distribution predicted by theory. He uses as an example the numbers of fives and sixes in the [[Walter Frank Raphael Weldon|Weldon dice throw data]].<ref name="Pearson 1900">{{cite journal|last=Pearson|first=K|year=1900|title=On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling|url=http://www.economics.soton.ac.uk/staff/aldrich/1900.pdf|journal=The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science|volume=5|issue=50|pages=157–175|doi=10.1080/14786440009463897}}</ref>
==The testing process==
In the statistics literature, statistical hypothesis testing plays a fundamental role.<ref name=LR/> There are two mathematically equivalent processes that can be used.<ref>{{cite book|last=Triola|first=Mario|title=Elementary statistics|publisher=Addison-Wesley|location=Boston|year=2001|isbn=978-0-201-61477-0|edition=8|page=[https://archive.org/details/elementarystatis00trio/page/388 388]|url=https://archive.org/details/elementarystatis00trio/page/388}}</ref>
 
'''1904:''' [[Karl Pearson]] develops the concept of "[[contingency table|contingency]]" in order to determine whether outcomes are [[statistical independence|independent]] of a given categorical factor. Here the null hypothesis is by default that two things are unrelated (e.g. scar formation and death rates from smallpox).<ref name="Pearson 1904">{{cite journal|last=Pearson|first=K|year=1904|title=On the Theory of Contingency and Its Relation to Association and Normal Correlation|url=https://archive.org/details/cu31924003064833|journal=Drapers' Company Research Memoirs Biometric Series|volume=1|pages=1–35}}</ref> The null hypothesis in this case is no longer predicted by theory or conventional wisdom, but is instead the [[principle of indifference]] that led [[Ronald Fisher|Fisher]] and others to dismiss the use of "inverse probabilities".<ref>{{cite journal|last=Zabell|first=S|year=1989|title=R. A. Fisher on the History of Inverse Probability|journal=Statistical Science|volume=4|issue=3|pages=247–256|doi=10.1214/ss/1177012488|jstor=2245634|doi-access=free}}</ref>
The usual line of reasoning is as follows:
# There is an initial research hypothesis of which the truth is unknown.
# The first step is to state the relevant '''null''' and '''alternative hypotheses'''. This is important, as mis-stating the hypotheses will muddy the rest of the process.
# The second step is to consider the [[statistical assumption]]s being made about the sample in doing the test; for example, assumptions about the [[statistical independence]] or about the form of the distributions of the observations. This is equally important as invalid assumptions will mean that the results of the test are invalid.
# Decide which test is appropriate, and state the relevant '''[[test statistic]]''' <var>T</var>.
# Derive the distribution of the test statistic under the null hypothesis from the assumptions. In standard cases this will be a well-known result. For example, the test statistic might follow a [[Student's t distribution]] with known degrees of freedom, or a [[normal distribution]] with known mean and variance. If the distribution of the test statistic is completely fixed by the null hypothesis, the hypothesis is called simple; otherwise it is called composite.
# Select a significance level (''α''), a probability threshold below which the null hypothesis will be rejected. Common values are 5% and 1%.
# The distribution of the test statistic under the null hypothesis partitions the possible values of <var>T</var> into those for which the null hypothesis is rejected—the so-called ''critical region''—and those for which it is not. The probability of the critical region is ''α''. In the case of a composite null hypothesis, the maximal probability of the critical region is ''α''.
# Compute from the observations the observed value <var>t</var><sub>obs</sub> of the test statistic <var>T</var>.
# Decide to either reject the null hypothesis in favor of the alternative or not reject it. The decision rule is to reject the null hypothesis <var>H</var><sub>0</sub> if the observed value <var>t</var><sub>obs</sub> is in the critical region, and not to reject the null hypothesis otherwise.
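The steps above can be sketched for a one-sided test of a mean with known standard deviation. All numbers (null mean, standard deviation, sample size, observed sample mean) are hypothetical and serve only to show where each step enters the computation.

```python
from math import sqrt
from statistics import NormalDist

# Hypotheses (assumed for illustration): H0: mean = 100, H1: mean > 100.
mu0 = 100.0
# Assumptions: independent observations, normal with known sigma.
sigma = 15.0
n = 36

# Under H0 the statistic T = (sample mean - mu0) / (sigma / sqrt(n))
# follows a standard normal distribution.
null_distribution = NormalDist(0.0, 1.0)

# Significance level and the resulting critical region T > critical_value.
alpha = 0.05
critical_value = null_distribution.inv_cdf(1 - alpha)

# Observed value of the test statistic (hypothetical sample mean).
sample_mean = 105.0
t_obs = (sample_mean - mu0) / (sigma / sqrt(n))

# Decision rule: reject H0 if t_obs falls in the critical region.
reject_h0 = t_obs > critical_value

print(f"critical value = {critical_value:.3f}, t_obs = {t_obs:.3f}")
print("reject H0" if reject_h0 else "do not reject H0")
```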
 
=={{anchor|Controversy}}Modern origins and early controversy==
A common alternative formulation of this process goes as follows:
Modern significance testing is largely the product of [[Karl Pearson]] ([[p-value|''p''-value]], [[Pearson's chi-squared test]]), [[William Sealy Gosset]] ([[Student's t-distribution]]), and [[Ronald Fisher]] ("[[null hypothesis]]", [[analysis of variance]], "[[statistical significance|significance test]]"), while hypothesis testing was developed by [[Jerzy Neyman]] and [[Egon Pearson]] (son of Karl). Ronald Fisher began his life in statistics as a Bayesian (Zabell 1992), but Fisher soon grew disenchanted with the subjectivity involved (namely use of the [[principle of indifference]] when determining prior probabilities), and sought to provide a more "objective" approach to inductive inference.<ref name="ftp.isds.duke">Raymond Hubbard, [[M. J. Bayarri]], ''[http://ftp.isds.duke.edu/WorkingPapers/03-26.pdf P Values are not Error Probabilities] {{webarchive|url=https://web.archive.org/web/20130904000350/http://ftp.isds.duke.edu/WorkingPapers/03-26.pdf|date=September 4, 2013}}''. A working paper that explains the difference between Fisher's evidential ''p''-value and the Neyman–Pearson Type I error rate <math>\alpha</math>.</ref>
# Compute from the observations the observed value <var>t</var><sub>obs</sub> of the test statistic <var>T</var>.
# Calculate the [[p-value|''p''-value]]. This is the probability, under the null hypothesis, of sampling a test statistic at least as extreme as that which was observed (the maximal probability of that event, if the hypothesis is composite).
# Reject the null hypothesis, in favor of the alternative hypothesis, if and only if the ''p''-value is less than (or equal to) the significance level ''α'' (the selected probability threshold), for example 0.05 or 0.01.
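The ''p''-value formulation can be sketched for a one-sided test of a mean with known standard deviation. The function name and all input values are illustrative assumptions, not part of any cited source.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_p_value(sample_mean, mu0, sigma, n):
    """P(T >= t_obs) under H0: mean = mu0, for known sigma.

    T is the standardized sample mean, which is standard normal
    under the null hypothesis.
    """
    t_obs = (sample_mean - mu0) / (sigma / sqrt(n))
    return 1 - NormalDist().cdf(t_obs)

# Hypothetical data: H0 mean 100, sigma 15, 36 observations, mean 105.
p_value = one_sided_p_value(sample_mean=105.0, mu0=100.0, sigma=15.0, n=36)

for alpha in (0.05, 0.01):
    decision = "reject H0" if p_value <= alpha else "do not reject H0"
    print(f"p = {p_value:.4f}, alpha = {alpha}: {decision}")
```

With these inputs the same data lead to rejection at the 5% level but not at the 1% level, which is why reporting the ''p''-value itself is more informative than reporting the decision alone.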
 
Fisher emphasized rigorous experimental design and methods to extract a result from few samples assuming [[Gaussian distribution]]s. Neyman (who teamed with the younger Pearson) emphasized mathematical rigor and methods to obtain more results from many samples and a wider range of distributions. Modern hypothesis testing is an inconsistent hybrid of the Fisher and Neyman/Pearson formulations, methods and terminology developed in the early 20th century.
The former process was advantageous in the past when only tables of test statistics at common probability thresholds were available. It allowed a decision to be made without the calculation of a probability. It was adequate for classwork and for operational use, but it was deficient for reporting results. The latter process relied on extensive tables or on computational support not always available. The explicit calculation of a probability is useful for reporting. The calculations are now trivially performed with appropriate software.
 
Fisher popularized the "significance test". He required a null hypothesis (corresponding to a population frequency distribution) and a sample. His (now familiar) calculations determined whether to reject the null hypothesis or not. Significance testing did not utilize an alternative hypothesis, so there was no concept of a [[Type II error]] (false negative).
The difference in the two processes applied to the Radioactive suitcase example (below):
 
The ''p''-value was devised as an informal, but objective, index meant to help a researcher determine (based on other knowledge) whether to modify future experiments or strengthen one's [[Fiducial inference|faith]] in the null hypothesis.<ref name="Fisher 1955 69–78">{{cite journal|last=Fisher|first=R|year=1955|title=Statistical Methods and Scientific Induction|url=http://www.phil.vt.edu/dmayo/PhilStatistics/Triad/Fisher%201955.pdf|journal=Journal of the Royal Statistical Society, Series B|volume=17|issue=1|pages=69–78|doi=10.1111/j.2517-6161.1955.tb00180.x}}</ref> Hypothesis testing (and Type I/II errors) was devised by Neyman and Pearson as a more objective alternative to Fisher's ''p''-value, also meant to determine researcher behaviour, but without requiring any [[inductive inference]] by the researcher.<ref name="Neyman 289–337">{{cite journal|last1=Neyman|first1=J|last2=Pearson|first2=E. S.|date=January 1, 1933|title=On the Problem of the most Efficient Tests of Statistical Hypotheses|journal=[[Philosophical Transactions of the Royal Society A]]|volume=231|issue=694–706|pages=289–337|bibcode=1933RSPTA.231..289N|doi=10.1098/rsta.1933.0009|doi-access=free}}</ref><ref>{{cite journal|last=Goodman|first=S N|date=June 15, 1999|title=Toward evidence-based medical statistics. 1: The P Value Fallacy|journal=Ann Intern Med|volume=130|issue=12|pages=995–1004|doi=10.7326/0003-4819-130-12-199906150-00008|pmid=10383371|s2cid=7534212}}</ref>
 
Neyman & Pearson considered a different problem to Fisher (which they called "hypothesis testing"). They initially considered two simple hypotheses (both with frequency distributions). They calculated two probabilities and typically selected the hypothesis associated with the higher probability (the hypothesis more likely to have generated the sample). Their method always selected a hypothesis. It also allowed the calculation of both types of error probabilities.
 
Fisher and Neyman/Pearson clashed bitterly. Neyman/Pearson considered their formulation to be an improved generalization of significance testing (the defining paper<ref name="Neyman 289–337" /> was [[Neyman–Pearson lemma|abstract]]; mathematicians have generalized and refined the theory for decades<ref name="Lehmann93" />). Fisher thought that it was not applicable to scientific research because often, during the course of the experiment, it is discovered that the initial assumptions about the null hypothesis are questionable due to unexpected sources of error. He believed that the use of rigid reject/accept decisions based on models formulated before data is collected was incompatible with this common scenario faced by scientists, and that attempts to apply this method to scientific research would lead to mass confusion.<ref>{{cite journal|last=Fisher|first=R N|year=1958|title=The Nature of Probability|url=http://www.york.ac.uk/depts/maths/histstat/fisher272.pdf|journal=Centennial Review|volume=2|pages=261–274|quote=We are quite in danger of sending highly trained and highly intelligent young men out into the world with tables of erroneous numbers under their arms, and with a dense fog in the place where their brains ought to be. In this century, of course, they will be working on guided missiles and advising the medical profession on the control of disease, and there is no limit to the extent to which they could impede every sort of national effort.}}
</ref>
 
The dispute between Fisher and Neyman–Pearson was waged on philosophical grounds, characterized by a philosopher as a dispute over the proper role of models in statistical inference.<ref name="Lenhard">{{cite journal|last=Lenhard|first=Johannes|year=2006|title=Models and Statistical Inference: The Controversy between Fisher and Neyman–Pearson|journal=Br. J. Philos. Sci.|volume=57|pages=69–91|doi=10.1093/bjps/axi152|s2cid=14136146}}</ref>
 
Events intervened: Neyman accepted a position in the [[University of California, Berkeley]] in 1938, breaking his partnership with Pearson and separating the disputants (who had occupied the same building). [[World War II]] provided an intermission in the debate. The dispute between Fisher and Neyman terminated (unresolved after 27 years) with Fisher's death in 1962. Neyman wrote a well-regarded eulogy.<ref>{{cite journal|last1=Neyman|first1=Jerzy|year=1967|title=RA Fisher (1890—1962): An Appreciation.|journal=Science|volume=156|issue=3781|pages=1456–1460|bibcode=1967Sci...156.1456N|doi=10.1126/science.156.3781.1456|pmid=17741062|s2cid=44708120}}</ref> Some of Neyman's later publications reported ''p''-values and significance levels.<ref>{{cite journal|last1=Losavich|first1=J. L.|last2=Neyman|first2=J.|last3=Scott|first3=E. L.|last4=Wells|first4=M. A.|year=1971|title=Hypothetical explanations of the negative apparent effects of cloud seeding in the Whitetop Experiment.|journal=Proceedings of the National Academy of Sciences of the United States of America|volume=68|issue=11|pages=2643–2646|bibcode=1971PNAS...68.2643L|doi=10.1073/pnas.68.11.2643|pmc=389491|pmid=16591951|doi-access=free}}</ref>
 
==={{anchor|NHST}}Null hypothesis significance testing (NHST)===
The modern version of hypothesis testing is generally called '''null hypothesis significance testing (NHST)'''<ref name=nickerson /> and is a hybrid of the Fisher approach with the Neyman–Pearson approach. In 2000, [[Raymond S. Nickerson]] wrote an article stating that NHST was (at the time) "arguably the most widely used method of analysis of data collected in psychological experiments and has been so for about 70 years" and that it was at the same time "very controversial".<ref name=nickerson />
 
This fusion resulted from confusion by writers of statistical textbooks (as predicted by Fisher) beginning in the 1940s<ref name="Halpin 625–653">{{cite journal|last1=Halpin|first1=P F|last2=Stam|first2=HJ|date=Winter 2006|title=Inductive Inference or Inductive Behavior: Fisher and Neyman: Pearson Approaches to Statistical Testing in Psychological Research (1940–1960)|journal=The American Journal of Psychology|volume=119|issue=4|pages=625–653|doi=10.2307/20445367|jstor=20445367|pmid=17286092}}</ref> (but [[Detection theory|signal detection]], for example, still uses the Neyman/Pearson formulation). Great conceptual differences and many caveats in addition to those mentioned above were ignored. Neyman and Pearson provided the stronger terminology, the more rigorous mathematics and the more consistent philosophy, but the subject taught today in introductory statistics has more similarities with Fisher's method than theirs.<ref name="Gigerenzer">{{cite book|last=Gigerenzer|first=Gerd|title=The Empire of Chance: How Probability Changed Science and Everyday Life|author2=Zeno Swijtink|author3=Theodore Porter|author4=Lorraine Daston|author5=John Beatty|author6=Lorenz Kruger|publisher=Cambridge University Press|year=1989|isbn=978-0-521-39838-1|pages=70–122|chapter=Part 3: The Inference Experts}}</ref>
 
Sometime around 1940,<ref name="Halpin 625–653" /> authors of statistical textbooks began combining the two approaches by using the ''p''-value in place of the [[test statistic]] (or data) to test against the Neyman–Pearson "significance level".
 
{| class="wikitable"
|+ A comparison between Fisherian and frequentist (Neyman–Pearson) testing
|-
! #
! Fisher's null hypothesis testing !! Neyman–Pearson decision theory
|-
| 1
| Set up a statistical null hypothesis. The null need not be a nil hypothesis (i.e., zero difference).
| Set up two statistical hypotheses, H1 and H2, and decide about α, β, and sample size before the experiment, based on subjective cost-benefit considerations. These define a rejection region for each hypothesis.
|-
| 2
| Report the exact level of significance (e.g. p = 0.051 or p = 0.049). Do not refer to "accepting" or "rejecting" hypotheses. If the result is "not significant", draw no conclusions and make no decisions, but suspend judgement until further data is available.
| If the data falls into the rejection region of H1, accept H2; otherwise accept H1. Accepting a hypothesis does not mean that you believe in it, but only that you act as if it were true.
|-
| 3
| Use this procedure only if little is known about the problem at hand, and only to draw provisional conclusions in the context of an attempt to understand the experimental situation.
| The usefulness of the procedure is limited, among others, to situations where you have a disjunction of hypotheses (e.g. either μ1 = 8 or μ2 = 10 is true) and where you can make meaningful cost-benefit trade-offs for choosing alpha and beta.
|}
 
==Philosophy==
Hypothesis testing and philosophy intersect. [[Inferential statistics]], which includes hypothesis testing, is applied probability. Both probability and its application are intertwined with philosophy. Philosopher [[David Hume]] wrote, "All knowledge degenerates into probability." Competing practical definitions of [[Probability#Interpretations|probability]] reflect philosophical differences. The most common application of hypothesis testing is in the scientific interpretation of experimental data, which is naturally studied by the [[philosophy of science]].
 
Fisher and Neyman opposed the subjectivity of probability. Their views contributed to the objective definitions. The core of their historical disagreement was philosophical.
 
Many of the philosophical criticisms of hypothesis testing are discussed by statisticians in other contexts, particularly [[correlation does not imply causation]] and the [[design of experiments]].
Hypothesis testing is of continuing interest to philosophers.<ref name="Lenhard" /><ref name="doi10.1093/bjps/axl003">
{{Cite journal|last1=Mayo|first1=D. G.|last2=Spanos|first2=A.|year=2006|title=Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction|journal=The British Journal for the Philosophy of Science|volume=57|issue=2|pages=323–357|citeseerx=10.1.1.130.8131|doi=10.1093/bjps/axl003|s2cid=7176653}}</ref>
 
==Education==
{{main|Statistics education}}
 
Statistics is increasingly being taught in schools with hypothesis testing being one of the elements taught.<ref>[http://www.corestandards.org/the-standards/mathematics/hs-statistics-and-probability/introduction/ Mathematics > High School: Statistics & Probability > Introduction] {{webarchive|url=https://archive.today/20120728122912/http://www.corestandards.org/the-standards/mathematics/hs-statistics-and-probability/introduction/|date=July 28, 2012}} Common Core State Standards Initiative (relates to USA students)</ref><ref>[http://www.collegeboard.com/student/testing/ap/sub_stats.html College Board Tests > AP: Subjects > Statistics] The College Board (relates to USA students)</ref> Many conclusions reported in the popular press (political opinion polls to medical studies) are based on statistics. Some writers have stated that statistical analysis of this kind allows for thinking clearly about problems involving mass data, as well as the effective reporting of trends and inferences from said data, but caution that writers for a broad public should have a solid understanding of the field in order to use the terms and concepts correctly.<ref name="Huff8">{{cite book|last=Huff|first=Darrell|url=https://archive.org/details/howtoliewithstat00huff/page/8|title=How to lie with statistics|publisher=Norton|year=1993|isbn=978-0-393-31072-6|location=New York|page=[https://archive.org/details/howtoliewithstat00huff/page/8 8]}}'Statistical methods and statistical terms are necessary in reporting the mass data of social and economic trends, business conditions, "opinion" polls, the census. But without writers who use the words with honesty and readers who know what they mean, the result can only be semantic nonsense.'</ref><ref name="S&C">{{cite book|last1=Snedecor|first1=George W.|title=Statistical Methods|last2=Cochran|first2=William G.|publisher=Iowa State University Press|year=1967|edition=6|location=Ames, Iowa|page=3}} "...the basic ideas in statistics assist us in thinking clearly about the problem, provide some guidance about the conditions that must be satisfied if sound inferences are to be made, and enable us to detect many inferences that have no good logical foundation."</ref> An introductory college statistics class places much emphasis on hypothesis testing – perhaps half of the course. Such fields as literature and divinity now include findings based on statistical analysis (see the [[Bible Analyzer]]). An introductory statistics class teaches hypothesis testing as a cookbook process. Hypothesis testing is also taught at the postgraduate level. Statisticians learn how to create good statistical test procedures (like ''z'', Student's ''t'', ''F'' and chi-squared). Statistical hypothesis testing is considered a mature area within statistics,<ref name="Lehmann97" /> but a limited amount of development continues.
 
An academic study states that the cookbook method of teaching introductory statistics leaves no time for history, philosophy or controversy. Hypothesis testing has been taught as a received unified method. Surveys showed that graduates of the class were filled with philosophical misconceptions (on all aspects of statistical inference) that persisted among instructors.<ref>{{cite journal|last1=Sotos|first1=Ana Elisa Castro|last2=Vanhoof|first2=Stijn|last3=Noortgate|first3=Wim Van den|last4=Onghena|first4=Patrick|year=2007|title=Students' Misconceptions of Statistical Inference: A Review of the Empirical Evidence from Research on Statistics Education|url=https://lirias.kuleuven.be/bitstream/123456789/136347/1/CastroSotos.pdf|journal=Educational Research Review|volume=2|issue=2|pages=98–113|doi=10.1016/j.edurev.2007.04.001}}</ref> While the problem was addressed more than a decade ago,<ref>{{cite journal|last=Moore|first=David S.|year=1997|title=New Pedagogy and New Content: The Case of Statistics|url=http://www.stat.auckland.ac.nz/~iase/publications/isr/97.Moore.pdf|journal=International Statistical Review|volume=65|issue=2|pages=123–165|doi=10.2307/1403333|jstor=1403333}}</ref> and calls for educational reform continue,<ref>{{Cite journal |last1=Hubbard |first1=Raymond|last2=Armstrong|first2=J. Scott|author-link2=J. Scott Armstrong|year=2006|title=Why We Don't Really Know What Statistical Significance Means: Implications for Educators|journal=Journal of Marketing Education|volume=28 |issue=2|pages=114–120 |doi=10.1177/0273475306288399 |hdl-access=free |hdl=2092/413 |s2cid=34729227}}</ref> students still graduate from statistics classes holding fundamental misconceptions about hypothesis testing.<ref>{{cite journal|last1=Sotos|first1=Ana Elisa Castro|last2=Vanhoof|first2=Stijn|last3=Noortgate|first3=Wim Van den|last4=Onghena|first4=Patrick|year=2009|title=How Confident Are Students in Their Misconceptions about Hypothesis Tests?|journal=Journal of Statistics Education|volume=17|doi=10.1080/10691898.2009.11889514|doi-access=free|number=2}}</ref> Ideas for improving the teaching of hypothesis testing include encouraging students to search for statistical errors in published papers, teaching the history of statistics and emphasizing the controversy in a generally dry subject.<ref name="Gigerenzer 2004 391–408">{{cite book|last=Gigerenzer|first=G.|title=The SAGE Handbook of Quantitative Methodology for the Social Sciences|year=2004|isbn=9780761923596|pages=391–408|chapter=The Null Ritual What You Always Wanted to Know About Significant Testing but Were Afraid to Ask|doi=10.4135/9781412986311|chapter-url=http://library.mpib-berlin.mpg.de/ft/gg/GG_Null_2004.pdf}}</ref>
 
Raymond S. Nickerson commented:
{{block quote|The debate about NHST has its roots in unresolved disagreements among major contributors to the development of theories of inferential statistics on which modern approaches are based. [[Gigerenzer]] et al. (1989) have reviewed in considerable detail the controversy between R. A. Fisher on the one hand and Jerzy Neyman and Egon Pearson on the other as well as the disagreements between both of these views and those of the followers of Thomas Bayes. They noted the remarkable fact that little hint of the historical and ongoing controversy is to be found in most textbooks that are used to teach NHST to its potential users. The resulting lack of an accurate historical perspective and understanding of the complexity and sometimes controversial philosophical foundations of various approaches to statistical inference may go a long way toward explaining the apparent ease with which statistical tests are misused and misinterpreted.<ref name=nickerson />}}
 
==Performing a frequentist hypothesis test in practice==
The typical steps involved in performing a frequentist hypothesis test in practice are:
# Define a hypothesis (claim which is testable using data).
# Select a relevant statistical test with associated [[test statistic]] <var>T</var>.
# Derive the distribution of the test statistic under the null hypothesis from the assumptions. In standard cases this will be a well-known result. For example, the test statistic might follow a [[Student's t distribution]] with known degrees of freedom, or a [[normal distribution]] with known mean and variance.
# Select a significance level (''α''), the maximum acceptable [[false positive rate]]. Common values are 5% and 1%.
# Compute from the observations the observed value <var>t</var><sub>obs</sub> of the test statistic <var>T</var>.
# Decide to either reject the null hypothesis in favor of the alternative or not reject it. The [[Neyman–Pearson lemma|Neyman-Pearson]] decision rule is to reject the null hypothesis <var>H</var><sub>0</sub> if the observed value <var>t</var><sub>obs</sub> is in the critical region, and not to reject the null hypothesis otherwise.<ref>{{Cite journal |date=2005 |title=Testing Statistical Hypotheses |url=https://link.springer.com/book/10.1007/0-387-27605-X |journal=Springer Texts in Statistics |language=en |doi=10.1007/0-387-27605-x |isbn=978-0-387-98864-1 |issn=1431-875X|url-access=subscription }}</ref>
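
The steps above can be sketched in Python for a two-sided, one-sample ''z''-test. This is only an illustration under stated assumptions: the sample values, the hypothesized mean <code>mu0</code>, the supposedly known standard deviation <code>sigma</code> and the helper name <code>z_test</code> are all made up for the example and do not come from any source.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided one-sample z-test of H0: population mean == mu0.

    Assumes (for illustration) that the population standard deviation
    sigma is known, so the statistic is standard normal under H0.
    """
    n = len(sample)
    # Step 3: distribution of the test statistic under the null hypothesis.
    null_dist = NormalDist(0.0, 1.0)
    # Step 4: the significance level alpha fixes the critical value.
    critical = null_dist.inv_cdf(1 - alpha / 2)
    # Step 5: observed value of the test statistic.
    t_obs = (mean(sample) - mu0) / (sigma / sqrt(n))
    # Equivalent p-value formulation of the same decision.
    p_value = 2 * (1 - null_dist.cdf(abs(t_obs)))
    # Step 6: reject H0 if the observed value lies in the critical region.
    reject = abs(t_obs) >= critical
    return t_obs, critical, p_value, reject


# Hypothetical measurements, invented for this sketch.
sample = [5.3, 4.9, 5.6, 5.1, 5.8, 5.4, 5.2, 5.7]
t_obs, critical, p_value, reject = z_test(sample, mu0=5.0, sigma=0.5)
print(f"t_obs={t_obs:.3f}, critical={critical:.3f}, p={p_value:.4f}, reject={reject}")
```

Comparing the statistic with the critical value and comparing the ''p''-value with ''α'' lead to the same decision, which is why the sketch returns both.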
 
=== Practical example ===
The difference between the two processes, applied to the radioactive suitcase example (below), is:
* "The Geiger-counter reading is 10. The limit is 9. Check the suitcase."
* "The Geiger-counter reading is high; 97% of safe suitcases have lower readings. The limit is 95%. Check the suitcase."
The former report is adequate; the latter gives a more detailed explanation of the data and the reason why the suitcase is being checked.
 
Not rejecting the null hypothesis does not mean the null hypothesis is "accepted" per se (though Neyman and Pearson used that word in their original writings; see the [[Statistical hypothesis testing#Interpretation|Interpretation]] section).
 
The processes described here are perfectly adequate for computation. They seriously neglect the [[design of experiments]] considerations.<ref>{{cite book|author1=Hinkelmann, Klaus|author2=Kempthorne, Oscar|author-link2=Oscar Kempthorne|year=2008|title=Design and Analysis of Experiments|volume=I and II|edition=Second|publisher=Wiley|isbn=978-0-470-38551-7}}</ref><ref>{{cite book|last=Montgomery|first=Douglas|title=Design and analysis of experiments|publisher=Wiley|location=Hoboken, N.J.|year=2009|isbn=978-0-470-12866-4}}</ref>
 
It is particularly critical that appropriate sample sizes be estimated before conducting the experiment.
 
===Interpretation===
When the null hypothesis is true and statistical assumptions are met, the probability that the p-value will be less than or equal to the significance level <math>\alpha</math> is at most <math>\alpha</math>. This ensures that the hypothesis test maintains its specified false positive rate (provided that statistical assumptions are met).<ref name="LR" />
 
The ''p''-value is the probability that a test statistic which is at least as extreme as the one obtained would occur under the null hypothesis. At a significance level of 0.05, a test of a fair coin would be expected to (incorrectly) reject the null hypothesis (that the coin is fair) in 1 out of 20 tests on average. The ''p''-value does not provide the probability that either the null hypothesis or its opposite is correct (a common source of confusion).<ref>{{Cite journal|last=Nuzzo|first=Regina|author-link= Regina Nuzzo |date=2014|title=Scientific method: Statistical errors|journal=Nature|volume=506|issue=7487|pages=150–152|bibcode=2014Natur.506..150N|doi=10.1038/506150a|pmid=24522584|doi-access=free|hdl=11573/685222|hdl-access=free}}</ref>
 
If the ''p''-value is less than the chosen significance threshold (equivalently, if the observed test statistic is in the critical region), then we say the null hypothesis is rejected at the chosen level of significance. If the ''p''-value is ''not'' less than the chosen significance threshold (equivalently, if the observed test statistic is outside the critical region), then the null hypothesis is not rejected at the chosen level of significance.
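
The claim that a test at level ''α'' rejects a true null hypothesis in roughly a fraction ''α'' of repeated experiments can be checked by simulation. The following is a minimal sketch under assumed conditions: the data are drawn from a standard normal distribution (so the null hypothesis of zero mean is true by construction), and the function name, sample size, number of trials and seed are arbitrary choices made for the illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def false_positive_rate(alpha=0.05, n=20, trials=2000, seed=1):
    """Estimate how often a two-sided z-test rejects a true null hypothesis."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    rejections = 0
    for _ in range(trials):
        # Data generated under H0: mean 0, known standard deviation 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = mean(sample) * sqrt(n)
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        if p_value <= alpha:
            rejections += 1
    return rejections / trials


rate = false_positive_rate()
print(f"Estimated false positive rate: {rate:.3f}")
```

The estimate fluctuates around the chosen level because the number of simulated experiments is finite.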
 
In the "lady tasting tea" example (below), Fisher required the lady to properly categorize all of the cups of tea to justify the conclusion that the result was unlikely to result from chance. His test revealed that if the lady was effectively guessing at random (the null hypothesis), there was a 1.4% chance that the observed results (perfectly ordered tea) would occur.
Rejecting the hypothesis that a large paw print originated from a bear does not immediately prove the existence of [[Bigfoot]]. Hypothesis testing emphasizes the rejection, which is based on a probability, rather than the acceptance.
 
"The probability of rejecting the null hypothesis is a function of five factors: whether the test is one- or two-tailed, the level of significance, the standard deviation, the amount of deviation from the null hypothesis, and the number of observations."<ref name=bakan66>
{{cite journal
| last = Bakan
| first = David
| title = The test of significance in psychological research
| journal = Psychological Bulletin
| volume = 66 | issue = 6 | pages = 423–437
| year = 1966
| doi=10.1037/h0020412
| pmid = 5974619
}}</ref>
 
===Use and importance===
A statistical analysis of misleading data produces misleading conclusions. The issue of data quality can be more subtle. In [[forecasting]] for example, there is no agreement on a measure of forecast accuracy. In the absence of a consensus measurement, no decision based on measurements will be without controversy.
 
Publication bias: Statistically nonsignificant results may be less likely to be published, which can bias the literature.
The book ''[[How to Lie with Statistics]]''<ref>{{cite book|last=Huff|first=Darrell|title=How to lie with statistics|publisher=Norton|location=New York|year=1993|isbn=978-0-393-31072-6|url=https://archive.org/details/howtoliewithstat00huff}}</ref><ref>{{cite book|last=Huff|first=Darrell|title=How to Lie with Statistics|publisher=Penguin Books|location=London|year=1991|isbn=978-0-14-013629-6}}</ref> is the most popular book on statistics ever published.<ref name="fiftyyears">"Over the last fifty years, How to Lie with Statistics has sold more copies than any other statistical text." J. M. Steele. [http://www-stat.wharton.upenn.edu/~steele/Publications/PDF/TN148.pdf "Darrell Huff and Fifty Years of ''How to Lie with Statistics''"]. ''Statistical Science'', 20 (3), 2005, 205–209.</ref> It does not much consider hypothesis testing, but its cautions are applicable, including: Many claims are made on the basis of samples too small to convince. If a report does not mention sample size, be doubtful.
 
Multiple testing: When multiple true null hypothesis tests are conducted at once without adjustment, the overall probability of Type I error is higher than the nominal alpha level.<ref>{{cite journal |last1=Ranganathan |first1=Priya |last2=Pramesh |first2=C. S |last3=Buyse |first3=Marc |title=Common pitfalls in statistical analysis: The perils of multiple testing |journal=Perspect Clin Res |date=April–June 2016 |volume=7 |issue=2 |pages=106–107 |doi=10.4103/2229-3485.179436 |pmid=27141478|pmc=4840791 |doi-access=free }}</ref>
Hypothesis testing acts as a filter of statistical conclusions; only those results meeting a probability threshold are publishable. Economics also acts as a publication filter; only those results favorable to the author and funding source may be submitted for publication. The impact of filtering on publication is termed [[publication bias]]. A related problem is that of [[multiple testing]] (sometimes linked to [[data mining]]), in which a variety of tests for a variety of possible effects are applied to a single data set and only those yielding a significant result are reported. These are often dealt with by using multiplicity correction procedures that control the [[family wise error rate]] (FWER) or the [[false discovery rate]] (FDR).
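
The inflation of the Type I error rate under multiple testing can be made concrete with a short calculation. The sketch below assumes that the tests are independent and that every null hypothesis is true, which is a simplification; the function names are hypothetical. It also shows the Bonferroni adjustment, one simple procedure for controlling the family-wise error rate.

```python
def familywise_error_rate(alpha, m):
    """Probability of at least one false rejection among m independent
    tests of true null hypotheses, each run at level alpha."""
    return 1 - (1 - alpha) ** m


def bonferroni_level(alpha, m):
    """Per-test level that keeps the family-wise error rate at or below alpha."""
    return alpha / m


for m in (1, 5, 20):
    unadjusted = familywise_error_rate(0.05, m)
    adjusted = familywise_error_rate(bonferroni_level(0.05, m), m)
    print(f"m={m:2d}  unadjusted={unadjusted:.3f}  Bonferroni-adjusted={adjusted:.3f}")
```

With 20 unadjusted tests at the 5% level, the chance of at least one false positive under these assumptions is about 64%.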
 
Those making critical decisions based on the results of a hypothesis test are prudent to look at the details rather than the conclusion alone. In the physical sciences most results are fully accepted only when independently confirmed. The general advice concerning statistics is, "Figures never lie, but liars figure" (anonymous).
 
==Definition of terms==
{{See also|Notation in probability and statistics}}
The following definitions are mainly based on the exposition in the book by Lehmann and Romano:<ref name="LR">{{cite book|title=Testing Statistical Hypotheses|edition=3E|isbn=978-0-387-98864-1|last1=Lehmann|first1=E. L.|first2=Joseph P.|last2=Romano|year=2005|publisher=Springer|location=New York}}</ref>
 
*'''Statistical hypothesis''': A statement about the parameters describing a [[Statistical population|population]] (not a [[Statistical sample|sample]]).
*Test statistic: A value calculated from a sample without any unknown parameters, often to summarize the sample for comparison purposes.
*{{visible anchor|Simple hypothesis}}: Any hypothesis which specifies the population distribution completely.
*Composite hypothesis: Any hypothesis which does ''not'' specify the population distribution completely.
*[[Null hypothesis]] (H<sub>0</sub>)
*Positive data: Data that enable the investigator to reject a null hypothesis.
*[[Alternative hypothesis]] (H<sub>1</sub>)
[[File:One tailed critical value with significance level alpha.jpg|thumb|260x260px|Suppose the data can be realized from an N(0,1) distribution. For example, with a chosen significance level α = 0.05, from the Z-table, a one-tailed critical value of approximately 1.645 can be obtained. The one-tailed critical value C<sub>α</sub> ≈ 1.645 corresponds to the chosen significance level. The critical region [C<sub>α</sub>, ∞) is realized as the tail of the standard normal distribution.]]
*'''{{vanchor|Critical value}}s''' of a statistical test are the boundaries of the acceptance region of the test.<ref>{{cite book |first1=Ann J. |last1=Hughes |first2=Dennis E. |last2=Grawoig |title=Statistics: A Foundation for Analysis |location=Reading, Mass. |publisher=Addison-Wesley |year=1971 |isbn=0-201-03021-7 |page=[https://archive.org/details/trent_0116302260611/page/191 191] |url=https://archive.org/details/trent_0116302260611 |url-access=registration }}</ref> The acceptance region is the set of values of the test statistic for which the null hypothesis is not rejected. Depending on the shape of the acceptance region, there can be one or more than one critical value.
**'''{{vanchor|Region of rejection}}''' / '''{{vanchor|Critical region}}''': The set of values of the test statistic for which the null hypothesis is rejected.
*'''[[statistical power|Power of a test]] (1&nbsp;−&nbsp;''β'')'''
* [[Size (statistics)|'''Size''']]: For simple hypotheses, this is the test's probability of ''incorrectly'' rejecting the null hypothesis. The [[false positive]] rate. For composite hypotheses this is the supremum of the probability of rejecting the null hypothesis over all cases covered by the null hypothesis. The complement of the false positive rate is termed '''specificity''' in [[biostatistics]]. ("This is a specific test. Because the result is positive, we can confidently say that the patient has the condition.") See [[sensitivity and specificity]] and [[type I and type II errors]] for exhaustive definitions.
*[[Significance level]] of a test (''α'')
*'''[[p-value|''p''-value]]'''
*'''{{vanchor|Statistical significance test}}''': A predecessor to the statistical hypothesis test (see the Origins section). An experimental result was said to be [[statistical significance|statistically significant]] if a sample was sufficiently inconsistent with the (null) hypothesis. This was variously considered common sense, a pragmatic heuristic for identifying meaningful experimental results, a convention establishing a threshold of statistical evidence or a method for drawing conclusions from data. The statistical hypothesis test added mathematical rigor and philosophical consistency to the concept by making the alternative hypothesis explicit. The term is loosely used for the modern version which is now part of statistical hypothesis testing.
*Conservative test: A test is conservative if, when constructed for a given nominal significance level, the true probability of ''incorrectly'' rejecting the null hypothesis is never greater than the nominal level.
*[[Exact test]]
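
The one-tailed critical value quoted in the figure caption can be reproduced with the Python standard library. This is a small sketch for a standard normal test statistic only; the variable and function names are chosen for the example.

```python
from statistics import NormalDist

alpha = 0.05                      # chosen significance level
std_normal = NormalDist(mu=0.0, sigma=1.0)

# One-tailed critical value: the point with upper-tail probability alpha.
critical_value = std_normal.inv_cdf(1 - alpha)


def in_critical_region(z):
    """True if z falls in the rejection region [critical_value, infinity)."""
    return z >= critical_value


# Size of the test: probability of landing in the critical region under H0.
size = 1 - std_normal.cdf(critical_value)

print(f"critical value = {critical_value:.3f}, size = {size:.3f}")
```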
 
A statistical hypothesis test compares a test statistic (''z'' or ''t'', for example) to a threshold. The test statistic (the formula found in the table below) is based on optimality. For a fixed level of Type I error rate, use of these statistics minimizes Type II error rates (equivalent to maximizing power). The following terms describe tests in terms of such optimality:
 
*Most powerful test: For a given ''size'' or ''significance level'', the test with the greatest power (probability of rejection) for a given value of the parameter(s) being tested, contained in the alternative hypothesis.
*[[Uniformly most powerful test]] (UMP)
 
==Nonparametric bootstrap hypothesis testing==
{{main|Bootstrapping (statistics)}}
 
Bootstrap-based [[Resampling (statistics)|resampling]] methods can be used for null hypothesis testing. A bootstrap creates numerous simulated samples by randomly resampling (with replacement) the original, combined sample data, assuming the null hypothesis is correct. The bootstrap is very versatile as it is distribution-free and it does not rely on restrictive parametric assumptions, but rather on empirical approximate methods with asymptotic guarantees. Traditional parametric hypothesis tests are more computationally efficient but make stronger structural assumptions. In situations where computing the probability of the test statistic under the null hypothesis is hard or impossible (due to perhaps inconvenience or lack of knowledge of the underlying distribution), the bootstrap offers a viable method for statistical inference.<ref>Hall, P. and Wilson, S.R., 1991. Two guidelines for bootstrap hypothesis testing. Biometrics, pp.757-762.
</ref><ref>Tibshirani, R.J. and Efron, B., 1993. An introduction to the bootstrap. Monographs on statistics and applied probability, 57(1).
</ref><ref>Martin, M.A., 2007. Bootstrap hypothesis testing for some common statistical problems: A critical evaluation of size and power properties. Computational Statistics & Data Analysis, 51(12), pp.6321-6342.
</ref><ref>Horowitz, J.L., 2019. Bootstrap methods in econometrics. Annual Review of Economics, 11, pp.193-224.</ref>
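
A pooled bootstrap test of the kind described above can be sketched as follows. This is an illustrative sketch, not a reference implementation: it assumes a two-sample setting, uses the absolute difference in sample means as the test statistic, and the function name, number of resamples, seed and data are invented for the example.

```python
import random
from statistics import mean


def bootstrap_mean_diff_test(x, y, n_resamples=2000, seed=0):
    """Approximate p-value for H0: x and y come from the same distribution."""
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    # Under H0 the group labels carry no information, so pool the data.
    pooled = list(x) + list(y)
    extreme = 0
    for _ in range(n_resamples):
        # Resample both groups with replacement from the pooled sample.
        xs = rng.choices(pooled, k=len(x))
        ys = rng.choices(pooled, k=len(y))
        if abs(mean(xs) - mean(ys)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_resamples + 1)


# Hypothetical measurements from two groups.
group_a = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8]
group_b = [6.9, 7.1, 7.3, 6.8, 7.0, 7.2]
p_different = bootstrap_mean_diff_test(group_a, group_b)
p_identical = bootstrap_mean_diff_test(group_a, group_a)
print(p_different, p_identical)
```

The resampling loop replaces the analytical null distribution of the statistic with an empirical one, at the cost of extra computation.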
 
==Examples==
===Human sex ratio===
{{main|Human sex ratio}}
The earliest use of statistical hypothesis testing is generally credited to the question of whether male and female births are equally likely (null hypothesis), which was addressed in the 1700s by [[John Arbuthnot]] (1710),<ref>{{cite journal|author=John Arbuthnot |year=1710|title=An argument for Divine Providence, taken from the constant regularity observed in the births of both sexes|url=http://www.york.ac.uk/depts/maths/histstat/arbuthnot.pdf|journal=[[Philosophical Transactions of the Royal Society of London]] | volume=27| issue=325–336|pages=186–190|doi=10.1098/rstl.1710.0011|doi-access=free|s2cid=186209819}}</ref> and later by [[Pierre-Simon Laplace]] (1770s).<ref>{{cite book |last1=Brian|first1=Éric|last2=Jaisson|first2=Marie|title=The Descent of Human Sex Ratio at Birth|url=https://archive.org/details/descenthumansexr00bria |url-access=limited|publisher=Springer Science & Business Media|year=2007|isbn=978-1-4020-6036-6|pages=[https://archive.org/details/descenthumansexr00bria/page/n17 1]–25 |chapter=Physico-Theology and Mathematics (1710–1794)}}</ref>
 
Arbuthnot examined birth records in London for each of the 82 years from 1629 to 1710, and applied the [[sign test]], a simple [[non-parametric test]].<ref name="Conover1999">{{Citation|last=Conover|first=W.J.|title=Practical Nonparametric Statistics|pages=157–176|year=1999|chapter=Chapter 3.4: The Sign Test|edition=Third|publisher=Wiley|isbn=978-0-471-16068-7}}</ref><ref name="Sprent1989">{{Citation|last=Sprent|first=P.|title=Applied Nonparametric Statistical Methods|year=1989|edition=Second|publisher=Chapman & Hall|isbn=978-0-412-44980-2}}</ref><ref>{{cite book|last=Stigler|first=Stephen M.|title=The History of Statistics: The Measurement of Uncertainty Before 1900|publisher=Harvard University Press|year=1986|isbn=978-0-67440341-3|pages=[https://archive.org/details/historyofstatist00stig/page/225 225–226]}}</ref> In every year, the number of males born in London exceeded the number of females. Considering more male or more female births as equally likely, the probability of the observed outcome is 0.5<sup>82</sup>, or about 1 in 4,836,000,000,000,000,000,000,000; in modern terms, this is the ''p''-value. Arbuthnot concluded that this is too small to be due to chance and must instead be due to divine providence: "From whence it follows, that it is Art, not Chance, that governs." In modern terms, he rejected the null hypothesis of equally likely male and female births at the ''p''&nbsp;=&nbsp;1/2<sup>82</sup> significance level.
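
Arbuthnot's figure can be checked with exact arithmetic. The sketch below simply evaluates (1/2)<sup>82</sup>, the probability that all 82 years show more male than female births if each outcome is equally likely in every year; the variable names are chosen for the example.

```python
from fractions import Fraction

years = 82
# Probability that males exceed females in all 82 years under the null hypothesis.
p_value = Fraction(1, 2) ** years

print(f"1 in {p_value.denominator:,}")
print(f"approximately {float(p_value):.3e}")
```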
 
Laplace considered the statistics of almost half a million births. The statistics showed an excess of boys compared to girls.<ref name="Laplace 1778">{{cite journal| last=Laplace| first=P.| year=1778|title=Mémoire sur les probabilités|url=https://portal.getty.edu/books/bnf_bd6t54192707f|journal=Mémoires de l'Académie Royale des Sciences de Paris| volume=9| pages=227–332}}</ref><ref name="Reprinted in 1878">{{cite book| last=Laplace| first=P.| title=Oeuvres complètes de Laplace |volume=9|pages=383–488|chapter=Mémoire sur les probabilités (XIX, XX)| chapter-url=http://gallica.bnf.fr/ark:/12148/bpt6k77597p/f386|publisher=Gauthier-Villars|year=1878–1912}} English translation: {{cite web|last=Laplace|first=P.|title=Mémoire sur les probabilités|translator-first=Richard J.|translator-last=Pulskam|date=August 21, 2010|url=http://cerebro.xu.edu/math/Sources/Laplace/memoir_probabilities.pdf|archive-date=April 27, 2015|archive-url=https://web.archive.org/web/20150427142452/http://cerebro.xu.edu/math/Sources/Laplace/memoir_probabilities.pdf|url-status=dead}}</ref> He concluded by calculation of a ''p''-value that the excess was a real, but unexplained, effect.<ref>{{cite book|last=Stigler|first=Stephen M.|url=https://archive.org/details/historyofstatist00stig/page/134|title=The History of Statistics: The Measurement of Uncertainty before 1900|publisher=Belknap Press of Harvard University Press|year=1986|isbn=978-0-674-40340-6|location=Cambridge, Mass|page=[https://archive.org/details/historyofstatist00stig/page/134 134]}}</ref>
 
===Lady tasting tea===
{{main|Lady tasting tea}}
 
In a famous example of hypothesis testing, known as the ''Lady tasting tea'',<ref name="fisher">{{cite book|last=Fisher|first=Sir Ronald A.|title=The World of Mathematics, volume 3|publisher=Courier Dover Publications|year=2000|isbn=978-0-486-41151-4|editor=James Roy Newman|trans-title=Design of Experiments|chapter=Mathematics of a Lady Tasting Tea|author-link=Ronald Fisher|orig-year=1935|chapter-url=https://books.google.com/books?id=oKZwtLQTmNAC&q=%22mathematics+of+a+lady+tasting+tea%22&pg=PA1512}} Originally from Fisher's book ''Design of Experiments''.</ref> Dr. [[Muriel Bristol]], a colleague of Fisher, claimed to be able to tell whether the tea or the milk was added first to a cup. Fisher proposed to give her eight cups, four of each variety, in random order. One could then ask what the probability was for her getting the number she got correct, but just by chance. The null hypothesis was that the lady had no such ability. The test statistic was a simple count of the number of successes in selecting the 4 cups. The critical region was the single case of 4 successes of 4 possible based on a conventional probability criterion (<&nbsp;5%). A pattern of 4 successes corresponds to 1 out of 70 possible combinations (''p''&nbsp;≈&nbsp;1.4%). Fisher asserted that no alternative hypothesis was (ever) required. The lady correctly identified every cup,<ref>{{cite book|last=Box|first=Joan Fisher|title=R.A. Fisher, The Life of a Scientist|publisher=Wiley|year=1978|isbn=978-0-471-09300-8|location=New York|page=134}}</ref> which would be considered a statistically significant result.
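
The figure of 1 in 70 follows from counting the ways of choosing 4 cups out of 8. The short sketch below reproduces it and also shows why 3 correct selections would not have met the 5% criterion; the function name is chosen for the example.

```python
from math import comb

# Number of ways to choose which 4 of the 8 cups had (say) milk added first.
total_selections = comb(8, 4)


def prob_correct(k):
    """Probability of exactly k correct selections out of 4 under random guessing
    (choose k of the 4 true cups and 4 - k of the 4 others)."""
    return comb(4, k) * comb(4, 4 - k) / total_selections


p_all_four = prob_correct(4)
p_three_or_more = prob_correct(3) + prob_correct(4)

print(f"{total_selections} possible selections")
print(f"P(4 correct)   = {p_all_four:.4f}")
print(f"P(>=3 correct) = {p_three_or_more:.4f}")
```

Because the probability of three or more successes is well above 5%, only the single outcome of four successes can serve as the critical region at that level.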
 
===Courtroom trial===
In the start of the procedure, there are two hypotheses <math>H_0</math>: "the defendant is not guilty", and <math>H_1</math>: "the defendant is guilty". The first one, <math>H_0</math>, is called the ''[[null hypothesis]]''. The second one, <math>H_1</math>, is called the ''alternative hypothesis''. It is the alternative hypothesis that one hopes to support.
 
The hypothesis of innocence is rejected only when an error is very unlikely, because one does not want to convict an innocent defendant. Such an error is called ''[[error of the first kind]]'' (i.e., the conviction of an innocent person), and the occurrence of this error is controlled to be rare. As a consequence of this asymmetric behaviour, an ''[[error of the second kind]]'' (acquitting a person who committed the crime) is more common.
 
{|class="wikitable"
 
A criminal trial can be regarded as either or both of two decision processes: guilty vs not guilty or evidence vs a threshold ("beyond a reasonable doubt"). In one view, the defendant is judged; in the other view the performance of the prosecution (which bears the burden of proof) is judged. A hypothesis test can be regarded as either a judgment of a hypothesis or as a judgment of evidence.
 
===Philosopher's beans===
The following example was produced by a philosopher describing scientific methods generations before hypothesis testing was formalized and popularized.<ref>{{cite journal|author=C. S. Peirce|date=August 1878|title=Illustrations of the Logic of Science VI: Deduction, Induction, and Hypothesis|journal=Popular Science Monthly|volume=13|access-date=March 30, 2012|url=http://en.wikisource.org/w/index.php?oldid=3592335}}</ref>
 
<blockquote>
Few beans of this handful are white.<br />
Most beans in this bag are white.<br />
Therefore: Probably, these beans were taken from another bag.<br />
This is an hypothetical inference.
</blockquote>
 
The beans in the bag are the population. The handful are the sample. The null hypothesis is that the sample originated from the population. The criterion for rejecting the null hypothesis is the "obvious" difference in appearance (an informal difference in the mean). The interesting result is that consideration of a real population and a real sample produced an imaginary bag. The philosopher was considering logic rather than probability. To be a real statistical hypothesis test, this example requires the formalities of a probability calculation and a comparison of that probability to a standard.
 
A simple generalization of the example considers a mixed bag of beans and a handful that contains either very few or very many white beans. The generalization considers both extremes. It requires more calculations and more comparisons to arrive at a formal answer, but the core philosophy is unchanged: if the composition of the handful is greatly different from that of the bag, then the sample probably originated from another bag. The original example is termed a one-sided or a one-tailed test while the generalization is termed a two-sided or two-tailed test.
 
The statement also relies on the inference that the sampling was random. If someone had been picking through the bag to find white beans, then it would explain why the handful had so many white beans, and also explain why the number of white beans in the bag was depleted (although the bag is probably intended to be assumed much larger than one's hand).
 
===Clairvoyant card game===
A person (the subject) is tested for [[clairvoyance]]. They are shown the back face of a randomly chosen playing card 25 times and asked which of the four [[Suit (cards)|suits]] it belongs to. The number of hits, or correct answers, is called ''X''.
 
As we try to find evidence of their clairvoyance, for the time being the null hypothesis is that the person is not clairvoyant.<ref>{{cite book|last1=Jaynes|first1=E. T.|title=Probability theory : the logic of science|date=2007|publisher=Cambridge Univ. Press|isbn=978-0-521-59271-0|edition=5. print.|location=Cambridge [u.a.]}}</ref> The alternative is: the person is (more or less) clairvoyant.
 
If the null hypothesis is valid, the only thing the test person can do is guess. For every card, the probability (relative frequency) of any single suit appearing is 1/4. If the alternative is valid, the test subject will predict the suit correctly with probability greater than 1/4. We will call the probability of guessing correctly ''p''. The hypotheses, then, are:
When the test subject correctly predicts all 25 cards, we will consider them clairvoyant, and reject the null hypothesis. Thus also with 24 or 23 hits. With only 5 or 6 hits, on the other hand, there is no cause to consider them so. But what about 12 hits, or 17 hits? What is the critical number, ''c'', of hits, at which point we consider the subject to be clairvoyant? How do we determine the critical value ''c''? With the choice ''c''=25 (i.e. we only accept clairvoyance when all cards are predicted correctly) we're more critical than with ''c''=10. In the first case almost no test subjects will be recognized to be clairvoyant, in the second case, a certain number will pass the test. In practice, one decides how critical one will be. That is, one decides how often one accepts an error of the first kind – a [[false positive]], or Type I error. With ''c'' = 25 the probability of such an error is:
 
:{{nowrap|<math>P(\text{reject }H_0 \mid H_0 \text{ is valid}) = P\left(X = 25\mid p=\tfrac 14\right)=\left(\tfrac 14\right)^{25}\approx10^{-15},</math>}}
 
and hence, very small. The probability of a false positive is the probability of randomly guessing correctly all 25 times.
 
Being less critical, with ''c'' = 10, gives:
 
:{{nowrap|<math>P(\text{reject }H_0 \mid H_0 \text{ is valid}) = P\left(X \ge 10 \mid p=\tfrac 14\right) = \sum_{k=10}^{25}P\left(X=k\mid p=\tfrac 14\right) = \sum_{k=10}^{25} \binom{25}{k}\left( 1- \tfrac 14\right)^{25-k} \left(\tfrac 14\right)^k \approx 0{.}0713.</math>}}
 
Thus, ''c'' = 10 yields a much greater probability of false positive.
Before the test is actually performed, the maximum acceptable probability of a Type I error (''α'') is determined. Typically, values in the range of 1% to 5% are selected. (If the maximum acceptable error rate is zero, an infinite number of correct guesses is required.) Depending on this Type I error rate, the critical value ''c'' is calculated. For example, if we select an error rate of 1%, ''c'' is calculated thus:
 
:{{nowrap|<math>P(\text{reject }H_0 \mid H_0 \text{ is valid}) = P\left(X \ge c\mid p=\tfrac 14\right) \le 0{.}01.</math>}}
 
From all the numbers ''c'' with this property, we choose the smallest, in order to minimize the probability of a Type II error, a [[false negative]]. For the above example, we select: <math>c=13</math>.
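
The tail probabilities and the critical value in this example can be recomputed directly from the binomial distribution. The following is a small sketch; the function names are chosen for the illustration, and the defaults (25 cards, guessing probability 1/4) are taken from the example.

```python
from math import comb


def binom_tail(c, n=25, p=0.25):
    """P(X >= c) for X ~ Binomial(n, p): the Type I error rate when H0 is
    rejected for c or more hits."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))


def critical_value(alpha, n=25, p=0.25):
    """Smallest c with P(X >= c) <= alpha."""
    for c in range(n + 2):  # c = n + 1 gives an empty tail with probability 0
        if binom_tail(c, n, p) <= alpha:
            return c


print(f"P(X >= 25) = {binom_tail(25):.2e}")
print(f"P(X >= 10) = {binom_tail(10):.4f}")
print(f"critical value at the 1% level: c = {critical_value(0.01)}")
```

Choosing the smallest admissible ''c'' keeps the Type I error rate within the limit while making the rejection region as large as possible.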
But what if the subject did not guess any cards at all? Having zero correct answers is clearly an oddity too. Without any clairvoyant skills the probability.
 
:<math>P(X=0 \mid H_0 \text{ is valid}) = P\left(X = 0\mid p=\tfrac 14\right) = \left(1-\tfrac 14\right)^{25} \approx 0{.}00075.</math>
 
This is highly unlikely (less than 1 in a 1000 chance). While the subject can't guess the cards correctly, dismissing H<sub>0</sub> in favour of H<sub>1</sub> would be an error. In fact, the result would suggest a trait on the subject's part of avoiding calling the correct card. A test of this could be formulated: for a selected 1% error rate the subject would have to answer correctly at least twice, for us to believe that card calling is based purely on guessing. -->
 
===Radioactive suitcase===
As an example, consider determining whether a suitcase contains some radioactive material. Placed under a [[Geiger counter]], it produces 10 counts per minute. The null hypothesis is that no radioactive material is in the suitcase and that all measured counts are due to ambient radioactivity typical of the surrounding air and harmless objects. We can then calculate how likely it is that we would observe 10 counts per minute if the null hypothesis were true. If the null hypothesis predicts (say) on average 9 counts per minute, then according to the [[Poisson distribution]] typical for [[radioactive decay]] there is about 41% chance of recording 10 or more counts. Thus we can say that the suitcase is compatible with the null hypothesis (this does not guarantee that there is no radioactive material, just that we don't have enough evidence to suggest there is). On the other hand, if the null hypothesis predicts 3 counts per minute (for which the Poisson distribution predicts only 0.1% chance of recording 10 or more counts) then the suitcase is not compatible with the null hypothesis, and there are likely other factors responsible to produce the measurements.
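
The two probabilities quoted above can be reproduced from the Poisson distribution. This is a minimal sketch; the function name is chosen for the example, and the background rates of 9 and 3 counts per minute are the ones assumed in the text.

```python
from math import exp, factorial


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    return 1 - sum(exp(-lam) * lam**i / factorial(i) for i in range(k))


# Chance of 10 or more counts per minute under each assumed background rate.
p_background_9 = poisson_tail(10, 9.0)
p_background_3 = poisson_tail(10, 3.0)

print(f"background 9/min: P(X >= 10) = {p_background_9:.3f}")
print(f"background 3/min: P(X >= 10) = {p_background_3:.4f}")
```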
 
The test does not directly assert the presence of radioactive material. A ''successful'' test asserts that the claim of no radioactive material present is unlikely given the reading (and therefore ...). The double negative (disproving the null hypothesis) of the method is confusing, but using a counter-example to disprove is standard mathematical practice. The attraction of the method is its practicality. We know (from experience) the expected range of counts with only ambient radioactivity present, so we can say that a measurement is ''unusually'' large. Statistics just formalizes the intuitive by using numbers instead of adjectives. We probably do not know the characteristics of the radioactive suitcases; we just assume that they produce larger readings.
 
To slightly formalize intuition: radioactivity is suspected if the Geiger-count with the suitcase is among or exceeds the greatest (5% or 1%) of the Geiger-counts made with ambient radiation alone. This makes no assumptions about the distribution of counts. Many ambient radiation observations are required to obtain good probability estimates for rare events.
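
This distribution-free rule can be sketched as a comparison against a record of ambient readings. In the Python sketch below, the function names and the list of ambient counts are hypothetical and serve only to illustrate the rule; a real application would need many more ambient observations.

```python
def empirical_tail_fraction(ambient_counts, observed):
    """Fraction of ambient-only readings that are at least as large as `observed`."""
    at_least = sum(1 for c in ambient_counts if c >= observed)
    return at_least / len(ambient_counts)

def is_suspicious(ambient_counts, observed, alpha=0.05):
    """True if `observed` is among or exceeds the greatest `alpha` share of ambient readings."""
    return empirical_tail_fraction(ambient_counts, observed) <= alpha

# Hypothetical record of 100 one-minute ambient readings (made-up values)
ambient = [5] * 10 + [7] * 30 + [9] * 40 + [11] * 15 + [14] * 4 + [16] * 1

print(is_suspicious(ambient, 11))  # False: 20 of 100 ambient readings are >= 11
print(is_suspicious(ambient, 15))  # True: only 1 of 100 ambient readings is >= 15
```

No distribution is fitted here; the threshold comes entirely from the recorded ambient readings, which is why estimating rare-event probabilities this way needs a large record.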
 
The test described here is more fully the null-hypothesis statistical significance test. The null hypothesis represents what we would believe by default, before seeing any evidence. [[Statistical significance]] is a possible finding of the test, declared when the observed [[Sample (statistics)|sample]] is unlikely to have occurred by chance if the null hypothesis were true. The name of the test describes its formulation and its possible outcome. One characteristic of the test is its crisp decision: to reject or not reject the null hypothesis. A calculated value is compared to a threshold, which is determined from the tolerable risk of error.
 
==Definition of terms==
The following definitions are mainly based on the exposition in the book by Lehmann and Romano:<ref name=LR>{{cite book|title=Testing Statistical Hypotheses|edition=3E|isbn=978-0-387-98864-1|last1=Lehmann|first1=E. L.|first2=Joseph P.|last2=Romano|year=2005|publisher=Springer|location=New York}}</ref>
 
; Statistical hypothesis : A statement about the parameters describing a population (not a sample).
; Statistic : A value calculated from a sample without any unknown parameters, often to summarize the sample for comparison purposes.
; {{visible anchor|Simple hypothesis}} : Any hypothesis which specifies the population distribution completely.
; Composite hypothesis : Any hypothesis which does ''not'' specify the population distribution completely.
; [[Null hypothesis]] (H<sub>0</sub>) : A hypothesis associated with a contradiction to a theory one would like to prove.
; Positive data : Data that enable the investigator to reject a null hypothesis.
; [[Alternative hypothesis]] (H<sub>1</sub>) : A hypothesis (often composite) associated with a theory one would like to prove.
; Statistical test : A procedure whose inputs are samples and whose result is a hypothesis.
; Region of rejection / Critical region: The set of values of the test statistic for which the null hypothesis is rejected.
; [[Critical value#Statistics|Critical value]]: The threshold value of the test statistic for rejecting the null hypothesis.
; [[statistical power|Power of a test]] (1&nbsp;−&nbsp;''β''): The test's probability of correctly rejecting the null hypothesis when the alternative hypothesis is true. The complement of the [[false negative]] rate, ''β''. Power is termed '''sensitivity''' in [[biostatistics]]. ("This is a sensitive test. Because the result is negative, we can confidently say that the patient does not have the condition.") See [[sensitivity and specificity]] and [[Type I and type II errors]] for exhaustive definitions.
; [[Size (statistics)|Size]]: For simple hypotheses, this is the test's probability of ''incorrectly'' rejecting the null hypothesis. The [[false positive]] rate. For composite hypotheses this is the supremum of the probability of rejecting the null hypothesis over all cases covered by the null hypothesis. The complement of the false positive rate is termed '''specificity''' in [[biostatistics]]. ("This is a specific test. Because the result is positive, we can confidently say that the patient has the condition.") See [[sensitivity and specificity]] and [[Type I and type II errors]] for exhaustive definitions.
; Significance level of a test (''α''): It is the upper bound imposed on the size of a test. Its value is chosen by the statistician prior to looking at the data or choosing any particular test to be used. It is the maximum exposure to erroneously rejecting H<sub>0</sub> that they are ready to accept. Testing H<sub>0</sub> at significance level ''α'' means testing H<sub>0</sub> with a test whose size does not exceed ''α''. In most cases, one uses tests whose size is equal to the significance level.
; [[p-value|''p''-value]]: The probability, computed under the assumption that the null hypothesis is true, of observing a test statistic at least as extreme as the one actually observed.
; [[Statistical significance]] test :A predecessor to the statistical hypothesis test (see the Origins section). An experimental result was said to be statistically significant if a sample was sufficiently inconsistent with the (null) hypothesis. This was variously considered common sense, a pragmatic heuristic for identifying meaningful experimental results, a convention establishing a threshold of statistical evidence or a method for drawing conclusions from data. The statistical hypothesis test added mathematical rigor and philosophical consistency to the concept by making the alternative hypothesis explicit. The term is loosely used for the modern version which is now part of statistical hypothesis testing.
; Conservative test : A test is conservative if, when constructed for a given nominal significance level, the true probability of ''incorrectly'' rejecting the null hypothesis is never greater than the nominal level.
; [[Exact test]]: A test in which the significance level or critical value can be computed exactly, i.e., without any approximation. In some contexts this term is restricted to tests applied to [[categorical data]] and to [[permutation tests]], in which computations are carried out by complete enumeration of all possible outcomes and their probabilities.
 
A statistical hypothesis test compares a test statistic (''z'' or ''t'', for example) to a threshold. The test statistic (the formula found in the table below) is based on optimality. For a fixed level of Type I error rate, use of these statistics minimizes Type II error rates (equivalent to maximizing power). The following terms describe tests in terms of such optimality:
 
; Most powerful test: For a given ''size'' or ''significance level'', the test with the greatest power (probability of rejection) for a given value of the parameter(s) being tested, contained in the alternative hypothesis.
; [[Uniformly most powerful test]] (UMP): A test with the greatest ''power'' for all values of the parameter(s) being tested, contained in the alternative hypothesis.
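
Several of the terms defined above (critical value, size, significance level, power, conservative test) can be illustrated with a one-sided binomial test. The Python sketch below assumes a simple null hypothesis ''p'' = 0.25 and a simple alternative ''p'' = 0.5 for 25 trials; these numbers and the function names are illustrative assumptions only.

```python
import math

def binom_tail(c: int, n: int, p: float) -> float:
    """Return P(X >= c) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))

def critical_value(n: int, p0: float, alpha: float) -> int:
    """Smallest c such that rejecting when X >= c has size at most alpha."""
    for c in range(n + 2):
        # For c = n + 1 the tail is empty (probability 0), so the loop always returns.
        if binom_tail(c, n, p0) <= alpha:
            return c

n, p0, p1, alpha = 25, 0.25, 0.5, 0.01   # assumed, illustrative values

c = critical_value(n, p0, alpha)   # critical value of the test statistic
size = binom_tail(c, n, p0)        # probability of rejecting when H0 is true
power = binom_tail(c, n, p1)       # probability of rejecting when H1 is true

print(f"critical value: {c}, size: {size:.4f}, power: {power:.3f}")
```

Because the test statistic is discrete, the size generally falls below the nominal significance level rather than equalling it, so a test built this way is conservative.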
 
==Common test statistics==
{{main|Test statistic}}
 
== Variations and sub-classes ==
Statistical hypothesis testing is a key technique of both [[frequentist inference]] and [[Bayesian inference]], although the two types of inference have notable differences. Statistical hypothesis tests define a procedure that controls (fixes) the probability of incorrectly ''deciding'' that a default position ([[null hypothesis]]) is incorrect. The procedure is based on how likely it would be for a set of observations to occur if the null hypothesis were true. This probability of making an incorrect decision is ''not'' the probability that the null hypothesis is true, nor the probability that any specific alternative hypothesis is true. This contrasts with other possible techniques of [[decision theory]] in which the null and [[alternative hypothesis]] are treated on a more equal basis.
 
One naïve [[Bayesian statistics|Bayesian]] approach to hypothesis testing is to base decisions on the [[posterior probability]],<ref>Schervish, M (1996) ''Theory of Statistics'', p. 218. Springer {{isbn|0-387-94546-6}}</ref><ref>{{cite book|title=Reference Manual on Scientific Evidence|publisher=National Academies Press|chapter=Reference Guide on Statistics|first1=David H.|last1=Kaye|first2=David A.|last2=Freedman|chapter-url=http://www.nap.edu/openbook.php?record_id=13163&page=211|location=Washington, D.C.|year=2011|edition=3rd|page=259|isbn=978-0-309-21421-6}}</ref> but this fails when comparing point and continuous hypotheses. Other approaches to decision making, such as [[Bayesian decision theory]], attempt to balance the consequences of incorrect decisions across all possibilities, rather than concentrating on a single null hypothesis. A number of other approaches to reaching a decision based on data are available via [[decision theory]] and [[optimal decision]]s, some of which have desirable properties. Hypothesis testing, though, is a dominant approach to data analysis in many fields of science. Extensions to the theory of hypothesis testing include the study of the [[statistical power|power]] of tests, i.e. the probability of correctly rejecting the null hypothesis given that it is false. Such considerations can be used for the purpose of [[sample size determination]] prior to the collection of data.
 
==History==
===Early use===
While hypothesis testing was popularized early in the 20th century, early forms were used in the 1700s. The first use is credited to [[John Arbuthnot]] (1710),<ref name="Bellhouse2001">{{Citation
|last=Bellhouse
|first=P.
|title=in Statisticians of the Centuries by C.C. Heyde and E. Seneta
|year=2001
|publisher=Springer
|isbn=978-0-387-95329-8
|pages=39–42
|chapter=John Arbuthnot}}
</ref> followed by [[Pierre-Simon Laplace]] (1770s), in analyzing the [[human sex ratio]] at birth; see {{slink||Human sex ratio}}.
 
==={{anchor|Controversy}}Modern origins and early controversy===
Modern significance testing is largely the product of [[Karl Pearson]] ([[p-value|''p''-value]], [[Pearson's chi-squared test]]), [[William Sealy Gosset]] ([[Student's t-distribution]]), and [[Ronald Fisher]] ("[[null hypothesis]]", [[analysis of variance]], "[[statistical significance|significance test]]"), while hypothesis testing was developed by [[Jerzy Neyman]] and [[Egon Pearson]] (son of Karl). Ronald Fisher began his life in statistics as a Bayesian (Zabell 1992), but Fisher soon grew disenchanted with the subjectivity involved (namely use of the [[principle of indifference]] when determining prior probabilities), and sought to provide a more "objective" approach to inductive inference.<ref name="ftp.isds.duke">Raymond Hubbard, [[M. J. Bayarri]], ''[http://ftp.isds.duke.edu/WorkingPapers/03-26.pdf P Values are not Error Probabilities] {{webarchive|url=https://web.archive.org/web/20130904000350/http://ftp.isds.duke.edu/WorkingPapers/03-26.pdf |date=September 4, 2013 }}''. A working paper that explains the difference between Fisher's evidential ''p''-value and the Neyman–Pearson Type I error rate <math>\alpha</math>.</ref>
 
Fisher was an agricultural statistician who emphasized rigorous experimental design and methods to extract a result from few samples assuming Gaussian distributions. Neyman (who teamed with the younger Pearson) emphasized mathematical rigor and methods to obtain more results from many samples and a wider range of distributions. Modern hypothesis testing is an inconsistent hybrid of the Fisher vs Neyman/Pearson formulation, methods and terminology developed in the early 20th century.
 
Fisher popularized the "significance test". He required a null-hypothesis (corresponding to a population frequency distribution) and a sample. His (now familiar) calculations determined whether to reject the null-hypothesis or not. Significance testing did not utilize an alternative hypothesis so there was no concept of a Type II error.
 
The ''p''-value was devised as an informal, but objective, index meant to help a researcher determine (based on other knowledge) whether to modify future experiments or strengthen one's [[Fiducial inference|faith]] in the null hypothesis.<ref name="Fisher 1955 69–78">{{cite journal|last=Fisher|first=R|title=Statistical Methods and Scientific Induction|journal=Journal of the Royal Statistical Society, Series B|year=1955 |volume=17|issue=1|pages=69–78|url=http://www.phil.vt.edu/dmayo/PhilStatistics/Triad/Fisher%201955.pdf}}</ref> Hypothesis testing (and Type I/II errors) was devised by Neyman and Pearson as a more objective alternative to Fisher's ''p''-value, also meant to determine researcher behaviour, but without requiring any [[inductive inference]] by the researcher.<ref name="Neyman 289–337">{{cite journal|last1=Neyman|first1=J|title=On the Problem of the most Efficient Tests of Statistical Hypotheses|journal=[[Philosophical Transactions of the Royal Society A]]|date=January 1, 1933|volume=231|issue=694–706|pages=289–337|doi=10.1098/rsta.1933.0009|last2=Pearson|first2=E. S.|bibcode=1933RSPTA.231..289N|doi-access=free}}</ref><ref>{{cite journal|last=Goodman|first=S N|title=Toward evidence-based medical statistics. 1: The P Value Fallacy|journal=Ann Intern Med|date=June 15, 1999|volume=130|issue=12|pages=995–1004|doi=10.7326/0003-4819-130-12-199906150-00008|pmid=10383371|s2cid=7534212}}</ref>
 
Neyman & Pearson considered a different problem (which they called "hypothesis testing"). They initially considered two simple hypotheses (both with frequency distributions). They calculated two probabilities and typically selected the hypothesis associated with the higher probability (the hypothesis more likely to have generated the sample). Their method always selected a hypothesis. It also allowed the calculation of both types of error probabilities.
 
Fisher and Neyman/Pearson clashed bitterly. Neyman/Pearson considered their formulation to be an improved generalization of significance testing. (The defining paper<ref name="Neyman 289–337"/> was [[Neyman–Pearson lemma|abstract]]. Mathematicians have generalized and refined the theory for decades.<ref name="Lehmann93" />) Fisher thought that it was not applicable to scientific research because often, during the course of the experiment, it is discovered that the initial assumptions about the null hypothesis are questionable due to unexpected sources of error. He believed that the use of rigid reject/accept decisions based on models formulated before data is collected was incompatible with this common scenario faced by scientists, and that attempts to apply this method to scientific research would lead to mass confusion.<ref>{{cite journal|last=Fisher|first=R N|title=The Nature of Probability|journal=Centennial Review|year=1958|volume=2|pages=261–274|url=http://www.york.ac.uk/depts/maths/histstat/fisher272.pdf}}"We are quite in danger of sending highly trained and highly intelligent young men out into the world with tables of erroneous numbers under their arms, and with a dense fog in the place where their brains ought to be. In this century, of course, they will be working on guided missiles and advising the medical profession on the control of disease, and there is no limit to the extent to which they could impede every sort of national effort."
</ref>
 
The dispute between Fisher and Neyman–Pearson was waged on philosophical grounds, characterized by a philosopher as a dispute over the proper role of models in statistical inference.<ref name=Lenhard>{{cite journal|last=Lenhard|first=Johannes|title=Models and Statistical Inference: The Controversy between Fisher and Neyman–Pearson|journal=Br. J. Philos. Sci.|volume=57|pages=69–91|year=2006|doi=10.1093/bjps/axi152|s2cid=14136146}}</ref>
 
Events intervened: Neyman accepted a position in the western hemisphere, breaking his partnership with Pearson and separating disputants (who had occupied the same building) by much of the planetary diameter. World War II provided an intermission in the debate. The dispute between Fisher and Neyman terminated (unresolved after 27 years) with Fisher's death in 1962. Neyman wrote a well-regarded eulogy.<ref>{{cite journal|last1=Neyman|first1=Jerzy|title=RA Fisher (1890—1962): An Appreciation.|journal=Science|volume=156|issue=3781|pages=1456–1460|year=1967|doi=10.1126/science.156.3781.1456|pmid=17741062|bibcode=1967Sci...156.1456N|s2cid=44708120}}</ref> Some of Neyman's later publications reported ''p''-values and significance levels.<ref>{{cite journal|last1=Losavich|first1=J. L.|last2=Neyman|first2=J.|last3=Scott|first3=E. L.|last4=Wells|first4=M. A.|title=Hypothetical explanations of the negative apparent effects of cloud seeding in the Whitetop Experiment.|journal=Proceedings of the National Academy of Sciences of the United States of America|year=1971|volume=68|issue=11|pages=2643–2646|doi=10.1073/pnas.68.11.2643|pmid=16591951|pmc=389491|bibcode=1971PNAS...68.2643L|doi-access=free}}</ref>
 
The modern version of hypothesis testing is a hybrid of the two approaches that resulted from confusion by writers of statistical textbooks (as predicted by Fisher) beginning in the 1940s.<ref name="Halpin 625–653">{{cite journal|last1=Halpin|first1=P F|title=Inductive Inference or Inductive Behavior: Fisher and Neyman: Pearson Approaches to Statistical Testing in Psychological Research (1940–1960)|journal=The American Journal of Psychology|date=Winter 2006 |volume=119|issue=4|pages=625–653|jstor=20445367|doi=10.2307/20445367|pmid=17286092|last2=Stam|first2=HJ}}</ref> (But [[Detection theory|signal detection]], for example, still uses the Neyman/Pearson formulation.) Great conceptual differences and many caveats in addition to those mentioned above were ignored. Neyman and Pearson provided the stronger terminology, the more rigorous mathematics and the more consistent philosophy, but the subject taught today in introductory statistics has more similarities with Fisher's method than theirs.<ref name=Gigerenzer>{{cite book|title=The Empire of Chance: How Probability Changed Science and Everyday Life|last=Gigerenzer|first=Gerd|author2=Zeno Swijtink |author3=Theodore Porter |author4=Lorraine Daston |author5=John Beatty |author6=Lorenz Kruger |year=1989|publisher=Cambridge University Press|chapter=Part 3: The Inference Experts|isbn=978-0-521-39838-1|pages=70–122}}</ref>
 
Sometime around 1940,<ref name="Halpin 625–653" /> authors of statistical textbooks began combining the two approaches by using the ''p''-value in place of the [[test statistic]] (or data) to test against the Neyman–Pearson "significance level".
 
{|class="wikitable"
|+ A comparison between the Fisherian and frequentist (Neyman–Pearson) approaches
|-
! #
! Fisher's null hypothesis testing !! Neyman–Pearson decision theory
|-
| 1
| Set up a statistical null hypothesis. The null need not be a nil hypothesis (i.e., zero difference).
| Set up two statistical hypotheses, H1 and H2, and decide about α, β, and sample size before the experiment, based on subjective cost-benefit considerations. These define a rejection region for each hypothesis.
|-
| 2
| Report the exact level of significance (e.g. p = 0.051 or p = 0.049). Do not use a conventional 5% level, and do not talk about accepting or rejecting hypotheses. If the result is "not significant", draw no conclusions and make no decisions, but suspend judgement until further data is available.
| If the data falls into the rejection region of H1, accept H2; otherwise accept H1. Note that accepting a hypothesis does not mean that you believe in it, but only that you act as if it were true.
|-
| 3
| Use this procedure only if little is known about the problem at hand, and only to draw provisional conclusions in the context of an attempt to understand the experimental situation.
| The usefulness of the procedure is limited among others to situations where you have a disjunction of hypotheses (e.g. either μ1 = 8 or μ2 = 10 is true) and where you can make meaningful cost-benefit trade-offs for choosing alpha and beta.
|}
 
===Early choices of null hypothesis===
[[Paul Meehl]] has argued that the [[epistemological]] importance of the choice of null hypothesis has gone largely unacknowledged. When the null hypothesis is predicted by theory, a more precise experiment will be a more severe test of the underlying theory. When the null hypothesis defaults to "no difference" or "no effect", a more precise experiment is a less severe test of the theory that motivated performing the experiment.<ref>{{cite journal| last=Meehl| first=P| title=Appraising and Amending Theories: The Strategy of Lakatosian Defense and Two Principles That Warrant It|journal=Psychological Inquiry|year=1990| volume=1| issue=2| pages=108–141| url=http://rhowell.ba.ttu.edu/meehl1.pdf| doi=10.1207/s15327965pli0102_1}}</ref> An examination of the origins of the latter practice may therefore be useful:
 
'''1778:''' [[Pierre Laplace]] compares the birthrates of boys and girls in multiple European cities. He states: "it is natural to conclude that these possibilities are very nearly in the same ratio". Thus Laplace's null hypothesis is that the birthrates of boys and girls should be equal given "conventional wisdom".<ref name="Laplace 1778"/>
 
'''1900:''' [[Karl Pearson]] develops the [[chi squared test]] to determine "whether a given form of frequency curve will effectively describe the samples drawn from a given population." Thus the null hypothesis is that a population is described by some distribution predicted by theory. He uses as an example the numbers of fives and sixes in the [[Walter Frank Raphael Weldon|Weldon dice throw data]].<ref name="Pearson 1900">{{cite journal| last=Pearson| first=K| title= On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling|year=1900| journal= The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science| volume=5| issue=50| pages=157–175| url=http://www.economics.soton.ac.uk/staff/aldrich/1900.pdf | doi=10.1080/14786440009463897}}</ref>
 
'''1904:''' [[Karl Pearson]] develops the concept of "[[contingency table|contingency]]" in order to determine whether outcomes are [[statistical independence|independent]] of a given categorical factor. Here the null hypothesis is by default that two things are unrelated (e.g. scar formation and death rates from smallpox).<ref name="Pearson 1904">{{cite journal| last=Pearson| first=K| title= On the Theory of Contingency and Its Relation to Association and Normal Correlation|year=1904| journal= Drapers' Company Research Memoirs Biometric Series| volume=1| pages=1–35| url=https://archive.org/details/cu31924003064833}}</ref> The null hypothesis in this case is no longer predicted by theory or conventional wisdom, but is instead the [[principle of indifference]] that led [[Ronald Fisher|Fisher]] and others to dismiss the use of "inverse probabilities".<ref>{{cite journal| last=Zabell| first=S| title= R. A. Fisher on the History of Inverse Probability|year=1989| journal= Statistical Science| volume=4| issue=3| pages=247–256| jstor=2245634| doi=10.1214/ss/1177012488| doi-access=free}}</ref>
 
==Neyman–Pearson hypothesis testing==
An example of Neyman–Pearson hypothesis testing (or null hypothesis statistical significance testing) can be made by a change to the radioactive suitcase example. If the "suitcase" is actually a shielded container for the transportation of radioactive material, then a test might be used to select among three hypotheses: no radioactive source present, one present, two (all) present. The test could be required for safety, with actions required in each case. The [[Neyman–Pearson lemma]] of hypothesis testing says that a good criterion for the selection of hypotheses is the ratio of their probabilities (a [[likelihood-ratio test|likelihood ratio]]). A simple method of solution is to select the hypothesis with the highest probability for the Geiger counts observed. The typical result matches intuition: few counts imply no source, many counts imply two sources and intermediate counts imply one source. Notice also that usually there are problems for [[Philosophic burden of proof#Proving a negative|proving a negative]]. Null hypotheses should be at least [[Falsifiability|falsifiable]].
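
The simple method of selecting the hypothesis with the highest probability for the observed count can be sketched as follows. The Python example assumes Poisson-distributed counts, and the mean count rates attached to the three hypotheses are hypothetical values chosen only for illustration.

```python
import math

def poisson_pmf(k: int, mean: float) -> float:
    """Return P(X = k) for X ~ Poisson(mean)."""
    return math.exp(-mean) * mean**k / math.factorial(k)

# Hypothetical mean counts per minute under each hypothesis (assumed values)
HYPOTHESES = {
    "no source": 3.0,
    "one source": 12.0,
    "two sources": 21.0,
}

def most_likely_hypothesis(count: int) -> str:
    """Select the hypothesis under which the observed count is most probable."""
    return max(HYPOTHESES, key=lambda name: poisson_pmf(count, HYPOTHESES[name]))

for observed in (2, 12, 30):
    print(observed, "->", most_likely_hypothesis(observed))
```

Comparing the probabilities pairwise is equivalent to comparing likelihood ratios against a threshold of 1. Under these assumed rates, low counts select "no source", intermediate counts "one source" and high counts "two sources".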
 
Neyman–Pearson theory can accommodate both prior probabilities and the costs of actions resulting from decisions.<ref name="Ash">{{cite book | last = Ash | first = Robert | title = Basic probability theory | publisher = Wiley | location = New York | year = 1970 | isbn = 978-0471034506 }}Section 8.2</ref> The former allows each test to consider the results of earlier tests (unlike Fisher's significance tests). The latter allows the consideration of economic issues (for example) as well as probabilities. A likelihood ratio remains a good criterion for selecting among hypotheses.
 
==Criticism==
{{see also|p-value#Misuse}}
 
Criticism of statistical hypothesis testing fills volumes.<ref name=morrison>{{cite book|orig-year=1970|year=2006|title=The Significance Test Controversy|editor1=Morrison, Denton |editor2=Henkel, Ramon |publisher=Aldine Transaction |isbn=978-0-202-30879-1}}</ref><ref>{{cite book|last=Oakes|first=Michael|title=Statistical Inference: A Commentary for the Social and Behavioural Sciences|publisher=Wiley|location=Chichester New York|year=1986|isbn=978-0471104438}}</ref><ref name=chow>{{cite book|first=Siu L.|last=Chow|year=1997|title=Statistical Significance: Rationale, Validity and Utility|isbn=978-0-7619-5205-3}}</ref><ref name=harlow>{{cite book|year=1997|title=What If There Were No Significance Tests?|editor1=Harlow, Lisa Lavoie |editor2=Stanley A. Mulaik |editor3=James H. Steiger |publisher=Lawrence Erlbaum Associates|isbn=978-0-8058-2634-0}}</ref><ref name=kline>{{cite book|last=Kline|first=Rex|title=Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research|publisher=American Psychological Association|location=Washington, D.C. |year=2004|isbn=9781591471189 }}</ref><ref name=mccloskey>{{cite book|last= McCloskey|first=Deirdre N.|author2=Stephen T. Ziliak |year=2008|title=The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives|publisher=University of Michigan Press|isbn=978-0-472-05007-9}}</ref> Much of the criticism can be summarized by the following issues:
* The interpretation of a ''p''-value is dependent upon [[stopping rule]] and definition of multiple comparison. The former often changes during the course of a study and the latter is unavoidably ambiguous. (i.e. "p values depend on both the (data) observed and on the other possible (data) that might have been observed but weren't").<ref>{{cite journal|last=Cornfield|first=Jerome|title=Recent Methodological Contributions to Clinical Trials| journal=American Journal of Epidemiology|volume=104|issue=4|pages=408–421|year=1976|url=http://www.epidemiology.ch/history/PDF%20bg/Cornfield%20J%201976%20recent%20methodological%20contributions.pdf|doi=10.1093/oxfordjournals.aje.a112313|pmid= 788503}}</ref>
* Confusion resulting (in part) from combining the methods of Fisher and Neyman–Pearson which are conceptually distinct.<ref name="Tukey60">{{cite journal|last=Tukey|first=John W.|title=Conclusions vs decisions|journal= Technometrics|volume=26|issue=4|pages=423–433|year=1960|doi=10.1080/00401706.1960.10489909}} "Until we go through the accounts of testing hypotheses, separating [Neyman–Pearson] decision elements from [Fisher] conclusion elements, the intimate mixture of disparate elements will be a continual source of confusion." ... "There is a place for both "doing one's best" and "saying only what is certain," but it is important to know, in each instance, both which one is being done, and which one ought to be done."</ref>
* Emphasis on statistical significance to the exclusion of estimation and confirmation by repeated experiments.<ref>{{cite journal|last=Yates|first=Frank|title=The Influence of Statistical Methods for Research Workers on the Development of the Science of Statistics|journal=Journal of the American Statistical Association|volume=46|issue=253|pages=19–34|year=1951|doi=10.1080/01621459.1951.10500764}} "The emphasis given to formal tests of significance throughout [R.A. Fisher's] Statistical Methods ... has caused scientific research workers to pay undue attention to the results of the tests of significance they perform on their data, particularly data derived from experiments, and too little to the estimates of the magnitude of the effects they are investigating." ... "The emphasis on tests of significance and the consideration of the results of each experiment in isolation, have had the unfortunate consequence that scientific workers have often regarded the execution of a test of significance on an experiment as the ultimate objective."</ref>
* Rigidly requiring statistical significance as a criterion for publication, resulting in [[publication bias]].<ref>{{cite journal|last1=Begg|first1=Colin B.|last2=Berlin|first2=Jesse A.|title=Publication bias: a problem in interpreting medical data|journal=Journal of the Royal Statistical Society, Series A|volume=151|issue=3|pages=419–463|year=1988|doi=10.2307/2982993|jstor=2982993|s2cid=121054702 }}</ref> Most of the criticism is indirect. Rather than being wrong, statistical hypothesis testing is misunderstood, overused and misused.
* When used to detect whether a difference exists between groups, a paradox arises. As improvements are made to experimental design (e.g. increased precision of measurement and sample size), the test becomes more lenient. Unless one accepts the absurd assumption that all sources of noise in the data cancel out completely, the chance of finding statistical significance in either direction approaches 100%.<ref>{{cite journal|last=Meehl|first=Paul E.|title= Theory-Testing in Psychology and Physics: A Methodological Paradox|journal=Philosophy of Science|volume=34|issue=2|pages=103–115|year=1967|url=http://mres.gmu.edu/pmwiki/uploads/Main/Meehl1967.pdf|doi=10.1086/288135|s2cid=96422880| url-status=dead|archive-url=https://web.archive.org/web/20131203010657/http://mres.gmu.edu/pmwiki/uploads/Main/Meehl1967.pdf|archive-date=December 3, 2013|df=mdy-all}} Thirty years later, Meehl acknowledged statistical significance theory to be mathematically sound while continuing to question the default choice of null hypothesis, blaming instead the "social scientists' poor understanding of the logical relation between theory and fact" in "The Problem Is Epistemology, Not Statistics: Replace Significance Tests by Confidence Intervals and Quantify Accuracy of Risky Numerical Predictions" (Chapter 14 in Harlow (1997)).</ref> However, this absurd assumption that the mean difference between two groups cannot be zero implies that the data cannot be independent and identically distributed (i.i.d.) because the expected difference between any two subgroups of i.i.d. random variates is zero; therefore, the i.i.d. assumption is also absurd.
*Layers of philosophical concerns. The probability of statistical significance is a function of decisions made by experimenters/analysts.<ref name="bakan66">
{{cite journal |last=Bakan |first=David |year=1966 |title=The test of significance in psychological research |journal=Psychological Bulletin |volume=66 |issue=6 |pages=423–437 |doi=10.1037/h0020412 |pmid=5974619}}</ref> If the decisions are based on convention they are termed arbitrary or mindless<ref name="Gigerenzer 587–606">{{cite journal|last=Gigerenzer|first=G|title=Mindless statistics|journal=The Journal of Socio-Economics|date=November 2004|volume=33|issue=5|pages=587–606|doi=10.1016/j.socec.2004.09.033}}</ref> while those not so based may be termed subjective. To minimize type II errors, large samples are recommended. In psychology practically all null hypotheses are claimed to be false for sufficiently large samples so "...it is usually nonsensical to perform an experiment with the ''sole'' aim of rejecting the null hypothesis."<ref>{{cite journal | last = Nunnally | first = Jum | title = The place of statistics in psychology | journal = Educational and Psychological Measurement | volume = 20 | number = 4 | pages = 641–650 | year = 1960 | doi=10.1177/001316446002000401| s2cid = 144813784}}</ref> "Statistically significant findings are often misleading" in psychology.<ref>{{cite journal | last = Lykken | first = David T. | title = What's wrong with psychology, anyway? | journal = Thinking Clearly About Psychology | volume = 1 | pages = 3–39 | year = 1991}}</ref> Statistical significance does not imply practical significance, and [[correlation does not imply causation]]. Casting doubt on the null hypothesis is thus far from directly supporting the research hypothesis.
*"[I]t does not tell us what we want to know".<ref name=cohen94>{{cite journal|author=Jacob Cohen|title=The Earth Is Round (p < .05)|journal=American Psychologist|volume=49|issue=12|pages=997–1003|date=December 1994|doi=10.1037/0003-066X.49.12.997|s2cid=380942}} This paper led to the review of statistical practices by the APA. Cohen was a member of the Task Force that did the review.</ref> Lists of dozens of complaints are available.<ref name=kline>{{cite book|last=Kline|first=Rex|title=Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research|publisher=American Psychological Association|___location=Washington, D.C. |year=2004|isbn=9781591471189 }}</ref><ref name="nickerson">{{cite journal|author=Nickerson, Raymond S.|title=Null Hypothesis Significance Tests: A Review of an Old and Continuing Controversy|journal=Psychological Methods|volume=5|issue=2|pages=241–301|year=2000|url=https://psycnet.apa.org/doiLanding?doi=10.1037%2F1082-989X.5.2.241|doi=10.1037/1082-989X.5.2.241|pmid=10937333|s2cid=28340967|archive-url= https://is.muni.cz/el/1423/jaro2010/PSY117/um/_Nickerson_-_NHST_controversy_review.pdf|archive-date=2000-02-23}}</ref><ref name="branch">{{cite journal|author=Branch, Mark|title=Malignant side effects of null hypothesis significance testing|journal=Theory & Psychology|volume=24|issue=2|pages=256–277|year=2014|doi=10.1177/0959354314525282|s2cid=40712136|url=https://semanticscholar.org/paper/48f8711f3ca3535192ce695fa987847725374b0e}}</ref>
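
The large-sample paradox in the list above can be illustrated numerically. The following is a minimal sketch, assuming a two-sided two-sample ''z''-test with equal group sizes and known unit variance; the function name <code>power_two_sided_z</code> and the effect size of 0.01 standard deviations are hypothetical choices made for illustration, not taken from the cited sources.

```python
from statistics import NormalDist
import math


def power_two_sided_z(effect_size, n_per_group, alpha=0.05):
    """Probability that a two-sided two-sample z-test rejects the null
    when the true standardized mean difference is `effect_size`.

    Assumes equal group sizes and known unit variance in both groups.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic under the true effect.
    shift = effect_size * math.sqrt(n_per_group / 2)
    # Rejection can happen in either tail.
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)


# A practically negligible effect becomes almost certain to reach
# significance once the sample is large enough.
for n in (100, 10_000, 1_000_000, 100_000_000):
    print(n, round(power_two_sided_z(0.01, n), 4))
```

With a true effect of exactly zero the rejection probability stays at the significance level; with any nonzero effect, however small, it rises toward 1 as the sample grows.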
 
Critics and supporters are largely in factual agreement regarding the characteristics of null hypothesis significance testing (NHST): While it can provide critical information, it is ''inadequate as the sole tool for statistical analysis''. ''Successfully rejecting the null hypothesis may offer no support for the research hypothesis.'' The continuing controversy concerns the selection of the best statistical practices for the near-term future given the existing practices. However, adequate research design can minimize this issue. Critics would prefer to ban NHST completely, forcing a complete departure from those practices,<ref>{{cite journal |last1=Hunter |first1=John E. |title=Needed: A Ban on the Significance Test |journal=Psychological Science |date=January 1997 |volume=8 |issue=1 |pages=3–7 |doi=10.1111/j.1467-9280.1997.tb00534.x|s2cid=145422959 }}</ref> while supporters suggest a less absolute change.{{citation needed|date=December 2015}}
 
Controversy over significance testing, and its effects on publication bias in particular, has produced several results. The [[American Psychological Association]] has strengthened its statistical reporting requirements after review,<ref name=wilkinson>{{cite journal|author=Wilkinson, Leland|title=Statistical Methods in Psychology Journals; Guidelines and Explanations|journal=American Psychologist|volume=54|issue=8|pages=594–604|year=1999|doi=10.1037/0003-066X.54.8.594|s2cid=428023 }} "Hypothesis tests. It is hard to imagine a situation in which a dichotomous accept-reject decision is better than reporting an actual p value or, better still, a confidence interval." (p 599). The committee used the cautionary term "forbearance" in describing its decision against a ban of hypothesis testing in psychology reporting. (p 603)</ref> [[medical journal]] publishers have recognized the obligation to publish some results that are not statistically significant to combat publication bias,<ref>{{cite web|url=http://www.icmje.org/publishing_1negative.html|title=ICMJE: Obligation to Publish Negative Studies|access-date=September 3, 2012|quote=Editors should seriously consider for publication any carefully done study of an important question, relevant to their readers, whether the results for the primary or any additional outcome are statistically significant. Failure to submit or publish findings because of lack of statistical significance is an important cause of publication bias.|url-status=dead|archive-url=https://web.archive.org/web/20120716211637/http://www.icmje.org/publishing_1negative.html|archive-date=July 16, 2012|df=mdy-all}}</ref> and a journal (''Journal of Articles in Support of the Null Hypothesis'') has been created to publish such results exclusively.<ref name=JASNH>''Journal of Articles in Support of the Null Hypothesis'' website: [http://www.jasnh.com/ JASNH homepage]. 
Volume 1 number 1 was published in 2002, and all articles are on psychology-related subjects.</ref> Textbooks have added some cautions,<ref>{{cite book|title=Statistical Methods for Psychology|last=Howell|first=David|year=2002|publisher=Duxbury|edition=5|isbn=978-0-534-37770-0|page=[https://archive.org/details/statisticalmetho0000howe/page/94 94]|url= https://archive.org/details/statisticalmetho0000howe/page/94}}</ref> and increased coverage of the tools necessary to estimate the size of the sample required to produce significant results. Few major organizations have abandoned use of significance tests although some have discussed doing so.<ref name=wilkinson/> For instance, in 2023, the editors of the [[Journal of Physiology]] "strongly recommend the use of estimation methods for those publishing in The Journal" (meaning the magnitude of the [[effect size]], to allow readers to judge whether a finding has practical, physiological, or clinical relevance, and [[confidence intervals]], to convey the precision of that estimate), saying "Ultimately, it is the physiological importance of the data that those publishing in The Journal of Physiology should be most concerned with, rather than the statistical significance."<ref name="WilliamsToth2023">{{cite journal |last1=Williams |first1=S. |last2=Carson |first2=R. |last3=Tóth |first3=K. |title=Moving beyond P values in The Journal of Physiology: A primer on the value of effect sizes and confidence intervals |journal=J Physiol |date=October 10, 2023 |volume=601 |issue=23 |pages=5131–5133 |doi=10.1113/JP285575 |pmid=37815959 |s2cid=263827430 |doi-access=free }}</ref>
 
==Alternatives==
{{See also|Confidence interval#Statistical hypothesis testing}}
 
A unifying position of critics is that statistics should not lead to an accept-reject conclusion or decision, but to an estimated value with an [[interval estimate]]; this data-analysis philosophy is broadly referred to as [[estimation statistics]]. Estimation statistics can be accomplished with either frequentist<ref>{{Cite journal |last=Ho |first=Joses |last2=Tumkaya |first2=Tayfun |last3=Aryal |first3=Sameer |last4=Choi |first4=Hyungwon |last5=Claridge-Chang |first5=Adam |date=June 19, 2019 |title=Moving beyond P values: data analysis with estimation graphics |url=https://www.nature.com/articles/s41592-019-0470-3 |journal=Nature Methods |language=en |volume=16 |issue=7 |pages=565–566 |doi=10.1038/s41592-019-0470-3 |issn=1548-7091|url-access=subscription }}</ref> or Bayesian methods.<ref name="Kruschke 2012">{{cite journal|last=Kruschke|first=J K|author-link=John K. Kruschke|title=Bayesian Estimation Supersedes the T Test|journal=Journal of Experimental Psychology: General|date=July 9, 2012 |volume=142|issue=2|pages=573–603|doi=10.1037/a0029146|pmid=22774788|s2cid=5610231 |url=https://jkkweb.sitehost.iu.edu/articles/Kruschke2012JEPG.pdf}}</ref><ref name="Kruschke 2018">{{cite journal|last=Kruschke|first=J K|author-link=John K. Kruschke|title=Rejecting or Accepting Parameter Values in Bayesian Estimation|journal=Advances in Methods and Practices in Psychological Science|date=May 8, 2018|volume=1|issue=2|pages=270–280|doi=10.1177/2515245918771304|s2cid=125788648 |url=https://jkkweb.sitehost.iu.edu/articles/Kruschke2018RejectingOrAcceptingParameterValuesWithSupplement.pdf}}</ref>
 
Critics of significance testing have advocated basing inference less on ''p''-values and more on confidence intervals for effect sizes (for importance), prediction intervals (for confidence), replications and extensions (for replicability), and meta-analyses (for generality).<ref name=Armstrong1>{{cite journal|author=Armstrong, J. Scott|title=Significance tests harm progress in forecasting|journal=International Journal of Forecasting|volume=23|pages=321–327|year=2007|url=http://repository.upenn.edu/cgi/viewcontent.cgi?article=1104&context=marketing_papers|doi=10.1016/j.ijforecast.2007.03.004|issue=2|citeseerx=10.1.1.343.9516|s2cid=1550979}}</ref> But none of these suggested alternatives inherently produces a conclusion/decision. Lehmann said that hypothesis testing theory can be presented in terms of conclusions/decisions, probabilities, or confidence intervals: "The distinction between the ... approaches is largely one of reporting and interpretation."<ref name=Lehmann97>{{cite journal|author=E. L. Lehmann|title=Testing Statistical Hypotheses: The Story of a Book|journal=Statistical Science|volume=12|issue=1|pages=48–52|year=1997|doi=10.1214/ss/1029963261|doi-access=free}}</ref>
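
As a rough illustration of the estimation approach described above, the sketch below reports a mean difference together with an interval estimate rather than an accept-reject verdict. It assumes a large-sample normal approximation with unpooled variances; the function name <code>mean_difference_ci</code> and the two small data sets are hypothetical, and for samples this small a ''t''-based interval would be somewhat wider.

```python
from statistics import NormalDist, mean, variance
import math


def mean_difference_ci(group_a, group_b, confidence=0.95):
    """Return the difference in means (b minus a) and an approximate
    confidence interval, using a normal approximation with unpooled
    sample variances."""
    diff = mean(group_b) - mean(group_a)
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, (diff - z * standard_error, diff + z * standard_error)


# Hypothetical measurements, invented purely for illustration.
control = [1.0, 2.0, 3.0, 4.0, 5.0]
treated = [2.0, 3.0, 4.0, 5.0, 6.0]

estimate, (lower, upper) = mean_difference_ci(control, treated)
print(f"difference = {estimate:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The interval conveys both the size of the estimated effect and its precision; a reader can still see whether zero lies inside it, which is the sense in which Lehmann describes the approaches as differing mainly in reporting and interpretation.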
 
[[Bayesian inference]] is one proposed alternative to significance testing. (Nickerson cited 10 sources suggesting it, including Rozeboom (1960)).<ref name="nickerson"/> For example, Bayesian [[parameter estimation]] can provide rich information about the data from which researchers can draw inferences, while using uncertain [[Prior probability|priors]] that exert only minimal influence on the results when enough data is available. Psychologist [[John K. Kruschke]] has suggested Bayesian estimation as an alternative for the [[Student's t-test|''t''-test]]<ref name="Kruschke 2012" /> and has also contrasted Bayesian estimation for assessing null values with Bayesian model comparison for hypothesis testing.<ref name="Kruschke 2018" /> Two competing models/hypotheses can be compared using [[Bayes factors]].<ref>{{cite report |last=Kass |first=R. E. |title=Bayes factors and model uncertainty |year=1993|url=http://www.stat.washington.edu/research/reports/1993/tr254.pdf |publisher=Department of Statistics, University of Washington}}</ref> Bayesian methods could be criticized for requiring information that is seldom available in the cases where significance testing is most heavily used. Neither the prior probabilities nor the [[probability distribution]] of the test statistic under the alternative hypothesis is often available in the social sciences.<ref name="nickerson"/>

On one "alternative" there is no disagreement: Fisher himself said,<ref name=fisher /> "In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result." Cohen, an influential critic of significance testing, concurred,<ref name=cohen94>{{cite journal|author=Jacob Cohen|title=The Earth Is Round (p < .05)|journal=American Psychologist|volume=49|issue=12|pages=997–1003|date=December 1994|doi=10.1037/0003-066X.49.12.997|s2cid=380942|url=https://semanticscholar.org/paper/2cc7be3d5161e865807e13de7975c9d77fbd2815}} This paper led to the review of statistical practices by the APA. Cohen was a member of the Task Force that did the review.</ref> "... don't look for a magic alternative to NHST ''[null hypothesis significance testing]'' ... It doesn't exist." "... given the problems of statistical induction, we must finally rely, as have the older sciences, on replication." The "alternative" to significance testing is repeated testing. The easiest way to decrease statistical uncertainty is by obtaining more data, whether by increased sample size or by repeated tests. Nickerson claimed to have never seen the publication of a literally replicated experiment in psychology.<ref name=nickerson /> An indirect approach to replication is [[meta-analysis]].
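
One common way to combine repeated studies is fixed-effect inverse-variance pooling, sketched below. This is a simplified illustration that assumes every study estimates the same underlying effect; the function name <code>fixed_effect_pool</code> and the study values are hypothetical, and real meta-analyses often use random-effects models instead.

```python
import math


def fixed_effect_pool(estimates, standard_errors):
    """Combine study estimates by inverse-variance weighting.

    Returns the pooled estimate and its standard error, assuming all
    studies measure one common effect (a fixed-effect model).
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical replications of the same experiment.
study_estimates = [0.2, 0.4]
study_standard_errors = [0.1, 0.1]

pooled_estimate, pooled_standard_error = fixed_effect_pool(
    study_estimates, study_standard_errors
)
print(pooled_estimate, pooled_standard_error)
```

The pooled standard error is smaller than that of any single study, which is the sense in which more data, whether from a larger sample or from repeated tests, decreases statistical uncertainty.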
 
Advocates of a Bayesian approach sometimes claim that the goal of a researcher is most often to [[objectivity (science)|objectively]] assess the [[probability]] that a [[hypothesis]] is true based on the data they have collected.<ref>{{Cite journal | last = Rozeboom | first = William W
| title = The Case for Objective Bayesian Analysis
| journal = Bayesian Analysis | volume = 1 | issue = 3 | pages = 385–402 | year = 2006 | doi=10.1214/06-ba115| doi-access = free }} In listing the competing definitions of "objective" Bayesian analysis, "A major goal of statistics (indeed science) is to find a completely coherent objective Bayesian methodology for learning from data." The author expressed the view that this goal "is not attainable".</ref> Neither [[Ronald Fisher|Fisher]]'s significance testing nor [[Neyman–Pearson lemma|Neyman–Pearson]] hypothesis testing can provide this information, and neither claims to. The probability that a hypothesis is true can only be derived from use of [[Bayes' Theorem]], which was unsatisfactory to both the Fisher and Neyman–Pearson camps due to the explicit use of [[subjectivity]] in the form of the [[prior probability]].<ref name="Neyman 289–337"/><ref>{{cite journal|last=Aldrich|first=J|title=R. A. Fisher on Bayes and Bayes' theorem|journal=Bayesian Analysis|year=2008|volume=3|issue=1|pages=161–170|doi=10.1214/08-BA306|df=mdy-all|doi-access=free}}</ref> Fisher's strategy is to sidestep this with the [[p-value|''p''-value]] (an objective ''index'' based on the data alone) followed by ''inductive inference'', while Neyman–Pearson devised their approach of ''inductive behaviour''.
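
The dependence on the prior can be made concrete with a small sketch. It assumes two simple (point) hypotheses about a binomial success probability, a deliberately simplified setting; the function names and the numbers (14 successes in 20 trials, 0.5 against 0.7) are hypothetical and chosen only for illustration.

```python
from math import comb


def binomial_likelihood(successes, trials, p):
    """Probability of observing `successes` in `trials` when the
    success probability is `p`."""
    return comb(trials, successes) * p ** successes * (1 - p) ** (trials - successes)


def posterior_probability_h0(successes, trials, p0, p1, prior_h0):
    """Posterior probability of H0 (p = p0) against H1 (p = p1),
    computed with Bayes' theorem from a stated prior for H0."""
    likelihood_h0 = binomial_likelihood(successes, trials, p0)
    likelihood_h1 = binomial_likelihood(successes, trials, p1)
    numerator = prior_h0 * likelihood_h0
    return numerator / (numerator + (1 - prior_h0) * likelihood_h1)


# Same data, different priors: the posterior for H0 changes with the prior.
for prior in (0.5, 0.9):
    print(prior, round(posterior_probability_h0(14, 20, 0.5, 0.7, prior), 3))
```

The same data yield different posterior probabilities under different priors, which is the explicit subjectivity that both camps found unsatisfactory.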
 
==Philosophy==
Hypothesis testing and philosophy intersect. [[Inferential statistics]], which includes hypothesis testing, is applied probability. Both probability and its application are intertwined with philosophy. Philosopher [[David Hume]] wrote, "All knowledge degenerates into probability." Competing practical definitions of [[Probability#Interpretations|probability]] reflect philosophical differences. The most common application of hypothesis testing is in the scientific interpretation of experimental data, which is naturally studied by the [[philosophy of science]].
 
Fisher and Neyman opposed the subjectivity of probability. Their views contributed to the objective definitions. The core of their historical disagreement was philosophical.
 
Many of the philosophical criticisms of hypothesis testing are discussed by statisticians in other contexts, particularly [[correlation does not imply causation]] and the [[design of experiments]].
Hypothesis testing is of continuing interest to philosophers.<ref name=Lenhard/><ref name="doi10.1093/bjps/axl003">
{{Cite journal | last1 = Mayo | first1 = D. G. | last2 = Spanos | first2 = A. | title = Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction | doi = 10.1093/bjps/axl003 | journal = The British Journal for the Philosophy of Science | volume = 57 | issue = 2 | pages = 323–357 | year = 2006 | citeseerx = 10.1.1.130.8131 | s2cid = 7176653 }}</ref>
 
==Education==
{{main|Statistics education}}
Statistics is increasingly being taught in schools with hypothesis testing being one of the elements taught.<ref>[http://www.corestandards.org/the-standards/mathematics/hs-statistics-and-probability/introduction/ Mathematics > High School: Statistics & Probability > Introduction] {{webarchive|url=https://archive.is/20120728122912/http://www.corestandards.org/the-standards/mathematics/hs-statistics-and-probability/introduction/ |date=July 28, 2012 }} Common Core State Standards Initiative (relates to USA students)</ref><ref>[http://www.collegeboard.com/student/testing/ap/sub_stats.html College Board Tests > AP: Subjects > Statistics] The College Board (relates to USA students)</ref> Many conclusions reported in the popular press (political opinion polls to medical studies) are based on statistics. Some writers have stated that statistical analysis of this kind allows for thinking clearly about problems involving mass data, as well as the effective reporting of trends and inferences from said data, but caution that writers for a broad public should have a solid understanding of the field in order to use the terms and concepts correctly.<ref name=Huff8>{{cite book|last=Huff|first=Darrell|title=How to lie with statistics|publisher=Norton|___location=New York|year=1993|isbn=978-0-393-31072-6|page=[https://archive.org/details/howtoliewithstat00huff/page/8 8]|url=https://archive.org/details/howtoliewithstat00huff/page/8}}'Statistical methods and statistical terms are necessary in reporting the mass data of social and economic trends, business conditions, "opinion" polls, the census. 
But without writers who use the words with honesty and readers who know what they mean, the result can only be semantic nonsense.'</ref><ref name=S&C>{{cite book|last1=Snedecor|first1=George W.|last2=Cochran|first2=William G.|title=Statistical Methods|publisher=Iowa State University Press|___location=Ames, Iowa|year=1967|edition=6|page=3}} "...the basic ideas in statistics assist us in thinking clearly about the problem, provide some guidance about the conditions that must be satisfied if sound inferences are to be made, and enable us to detect many inferences that have no good logical foundation."</ref>{{citation needed|date=April 2012}} An introductory college statistics class places much emphasis on hypothesis testing – perhaps half of the course. Such fields as literature and divinity now include findings based on statistical analysis (see the [[Bible Analyzer]]). An introductory statistics class teaches hypothesis testing as a cookbook process. Hypothesis testing is also taught at the postgraduate level. Statisticians learn how to create good statistical test procedures (like ''z'', Student's ''t'', ''F'' and chi-squared). Statistical hypothesis testing is considered a mature area within statistics,<ref name=Lehmann97/> but a limited amount of development continues.
 
An academic study states that the cookbook method of teaching introductory statistics leaves no time for history, philosophy or controversy. Hypothesis testing has been taught as a received unified method. Surveys showed that graduates of the class were filled with philosophical misconceptions (on all aspects of statistical inference) that persisted among instructors.<ref>{{cite journal|last1=Sotos|first1=Ana Elisa Castro|last2=Vanhoof|first2=Stijn|last3=Noortgate|first3=Wim Van den|last4=Onghena|first4=Patrick|title=Students' Misconceptions of Statistical Inference: A Review of the Empirical Evidence from Research on Statistics Education|journal=Educational Research Review|volume=2|issue=2|pages=98–113|year=2007|doi=10.1016/j.edurev.2007.04.001|url=https://lirias.kuleuven.be/bitstream/123456789/136347/1/CastroSotos.pdf}}</ref> While the problem was addressed more than a decade ago,<ref>{{cite journal|last=Moore|first=David S.|title=New Pedagogy and New Content: The Case of Statistics|journal=International Statistical Review|volume=65|issue=2|pages=123–165|year=1997|doi=10.2307/1403333|url=http://www.stat.auckland.ac.nz/~iase/publications/isr/97.Moore.pdf|jstor=1403333}}</ref> and calls for educational reform continue,<ref>{{Cite journal|last1=Hubbard |first1=Raymond |last2=Armstrong |first2=J. Scott |author-link2=J. Scott Armstrong |title=Why We Don't Really Know What Statistical Significance Means: Implications for Educators |doi=10.1177/0273475306288399 |url=http://hops.wharton.upenn.edu/ideas/pdf/Armstrong/StatisticalSignificance.pdf |journal=Journal of Marketing Education |volume=28 |issue=2 |pages=114–120 |year=2006 |url-status=unfit |archive-url=https://web.archive.org/web/20060518054857/http://hops.wharton.upenn.edu/ideas/pdf/Armstrong/StatisticalSignificance.pdf |archive-date=May 18, 2006 |hdl=2092/413 |s2cid=34729227 |hdl-access=free }} [http://escholarshare.drake.edu/bitstream/handle/2092/413/WhyWeDon't.pdf Preprint]</ref> students still graduate from statistics classes holding fundamental misconceptions about hypothesis testing.<ref>{{cite journal|last1=Sotos|first1=Ana Elisa Castro|last2=Vanhoof|first2=Stijn|last3=Noortgate|first3=Wim Van den|last4=Onghena|first4=Patrick|title=How Confident Are Students in Their Misconceptions about Hypothesis Tests?|journal=Journal of Statistics Education|volume=17|number=2|year=2009|doi=10.1080/10691898.2009.11889514|doi-access=free}}</ref> Ideas for improving the teaching of hypothesis testing include encouraging students to search for statistical errors in published papers, teaching the history of statistics and emphasizing the controversy in a generally dry subject.<ref name="Gigerenzer 2004 391–408">{{cite book |last=Gigerenzer |first=G. |chapter=The Null Ritual What You Always Wanted to Know About Significant Testing but Were Afraid to Ask |title=The SAGE Handbook of Quantitative Methodology for the Social Sciences |year=2004 |pages=391–408 |chapter-url=http://library.mpib-berlin.mpg.de/ft/gg/GG_Null_2004.pdf |doi=10.4135/9781412986311|isbn=9780761923596 }}</ref>
 
==See also==
* [[Look-elsewhere effect]]
* [[Modifiable areal unit problem]]
* [[Modifiable temporal unit problem]]
* [[Multivariate hypothesis testing]]
* [[Omnibus test]]
* [[Dichotomous thinking]]
* [[Almost sure hypothesis testing]]
* [[Akaike information criterion]]
* [[Bayesian information criterion]]
* [[E-values]]
{{div col end}}
 
==References==
==External links==
{{Commons category|Hypothesis testing}}
{{Wikiversity|at=Introduction to Statistical Analysis/Unit 5 Content}}
* {{springer|title=Statistical hypotheses, verification of|id=p/s087400}}
* {{Cite web|title=Hypothesis Testing |last=Wilson González |first=Georgina |author2=Kay Sankaran |work=Environmental Sampling & Monitoring Primer |url=http://www.webapps.cee.vt.edu/ewr/environmental/teach/smprimer/hypotest/ht.html |publisher=Virginia Tech |date=September 10, 1997 }}
* [http://www.cs.ucsd.edu/users/goguen/courses/275f00/stat.html Bayesian critique of classical hypothesis testing]
* [https://web.archive.org/web/20051124221846/http://www.npwrc.usgs.gov/resource/methods/statsig/stathyp.htm Critique of classical hypothesis testing highlighting long-standing qualms of statisticians]
* Dallal GE (2007) [http://www.tufts.edu/~gdallal/LHSP.HTM The Little Handbook of Statistical Practice] (A good tutorial)
* [http://core.ecu.edu/psyc/wuenschk/StatHelp/NHST-SHIT.htm References for arguments for and against hypothesis testing]
* [https://web.archive.org/web/20091029162244/http://www.wiwi.uni-muenster.de/ioeb/en/organisation/pfaff/stat_overview_table.html Statistical Tests Overview:] How to choose the correct statistical test
* [https://arxiv.org/abs/1401.2851 Statistical Analysis based Hypothesis Testing Method in Biological Knowledge Discovery]; Md. Naseef-Ur-Rahman Chowdhury, Suvankar Paul, Kazi Zakia Sultana
 
===Online calculators===
* [http://www.mbastats.net MBAStats confidence interval and hypothesis test calculators]
* Some [http://www.schramm.cc/link/Statistics-calculator.php p-value and hypothesis test calculators].
 
{{Statistics|inference||state=collapsed}}
{{Public health}}
{{Authority control}}
 
{{DEFAULTSORT:Statistical Hypothesis Testing}}
[[Category:Statistical hypothesis testing| ]]
[[Category:Design of experiments]]
[[Category:Psychometrics]]
[[Category:Logic and statistics]]
[[Category:Mathematical and quantitative methods (economics)]]