Bad Stats: Not What It Seems
Towards a Statistical Reform in HCI and Visualization
|This page provides arguments and reading material to explain why it would be beneficial for human-computer interaction and information visualization (HCI and infovis) to stop doing mindless null hypothesis significance testing (NHST) and start reporting informative charts with effect sizes and interval estimates, as well as offering more nuanced interpretations of our results. Our scientific standards can also be greatly improved by planning analyses and sharing experimental material online.|
To Start with
Listen to this 15-minute speech from Geoff Cumming, a prominent researcher in statistical cognition. This will give you a good overview of what follows.
CHI '16 – Special Interest Group on Transparent Statistics in HCI
With Matthew Kay, Steve Haroz and Shion Guha, I co-organized a SIG meeting at CHI '16 on transparent statistical reporting. The meeting generated lots of interest and we are now looking at forming a community and developing some concrete recommendations to move the field forward. See here for more info.
Book Chapter – Fair Statistical Communication in HCI
I wrote a book chapter titled Fair Statistical Communication in HCI that attempts to explain why and how we can do stats without p-values in HCI and Vis. The book is titled Modern Statistical Methods for HCI and is edited by Judy Robertson and Maurits Kaptein. I previously released an early draft as a research report titled HCI Statistics without p-values. However, I recommend that you use the Springer chapter instead, as it's more up-to-date and much improved.
BELIV 2014 Keynote – Bad Stats are Miscommunicated Stats
I gave a keynote talk entitled "Bad Stats are Miscommunicated Stats" at BELIV 2014, a biennial workshop. More information on the workshop page.
Summary of the Talk
When reporting on user studies, we often need to do stats. But many of us have little training in statistics, and we are just as anxious about doing it right as we are eager to incriminate others for any flaw we might spot. Violations of statistical assumptions, too small samples, uncorrected multiple comparisons—deadly sins abound. But our obsession with flaw-spotting in statistical procedures makes us miss far more serious issues and the real purpose of statistics. Stats are here to help us communicate about our experimental results for the purpose of advancing scientific knowledge. Science is a cumulative and collective enterprise, so miscommunication, confusion and obfuscation are much more damaging than moderately inflated Type I error rates.
In my talk, I argue that the most common form of bad stats is miscommunicated stats. I also explain why we all have been faring terribly according to this criterion, mostly due to our blind endorsement of the concept of statistical significance. This idea promotes a form of dichotomous thinking that not only gives a highly misleading view of the uncertainty in our data, but also encourages questionable practices such as selective data analysis and various other contortions to reach the sacred .05 level. While researchers’ reliance on mechanical statistical testing rituals has been both deeply entrenched and severely criticized in a range of disciplines for more than 50 years, it is particularly striking that it has been so easily endorsed by our community. We repeatedly stress the crucial role of human judgment when analyzing data, but do the opposite when we conduct or review statistical analyses from user experiments. I believe that we can cure our schizophrenia and substantially improve our scientific production by banning p-values, by reporting empirical data using clear figures with effect sizes and confidence intervals, and by learning to provide nuanced interpretations of our results. We can also dramatically raise our scientific standards by pre-specifying our analyses, fully disclosing our results, and sharing extensive replication material online. These are small but important reforms that are much more likely to improve science than methodological nitpicking on statistical testing procedures.
Download the slides (PDF 3MB)
Alt.CHI 2014 Paper – Running an HCI Experiment in Multiple Parallel Universes
Pierre Dragicevic, Fanny Chevalier and Stéphane Huot (2014) Running an HCI Experiment in Multiple Parallel Universes. In ACM CHI Conference on Human Factors in Computing Systems (alt.chi). Toronto, Canada, Apr 2014.
Why this Strange Article?
The purpose of this alt.chi paper was to raise the community's awareness of the need to question the statistical procedures we currently use to interpret and communicate our experiment results, i.e., the ritual of null hypothesis significance testing (NHST). The paper does not elaborate on the numerous problems of NHST, nor does it discuss how statistics should ideally be done, as these points have already been covered in hundreds of articles published across decades in many disciplines. The interested reader can find a non-exhaustive list of references below. For more references, simply google NHST criticism.
A comment often elicited by this paper is that HCI researchers should learn to use NHST properly, or that they should use larger sample sizes. This is not the point we are trying to make. Also, we are not implying that we should reject inferential statistics altogether, or that we should stop caring about statistical error. See the discussion thread from the alt.chi open review process. Our personal position is that HCI research can and should get rid of NHST procedures and p-values, and instead switch to reporting (preferably unstandardized) effect sizes with interval estimates — e.g., 95% confidence intervals — as recommended by many methodologists.
Download the slides (PDF 5MB)
In our presentation we take an HCI perspective to rebut common arguments against the discontinuation of NHST, namely: i) the problem is NHST misuse, ii) the problem is low statistical power, iii) NHST is needed to test hypotheses or make decisions, iv) p-values and confidence intervals are equivalent anyways, v) we need both.
Quotes About Null Hypothesis Significance Testing (NHST)
The number of papers stressing the deep flaws of NHST is simply bewildering. Sadly, the awareness of this literature seems very low in HCI and Infovis. Yet most criticisms are not about the theoretical underpinnings of NHST, but about its usability. Will HCI and Infovis set a good example for other disciplines?
|“...no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.”|
Sir Ronald A. Fisher (1956), quoted by Gigerenzer (2004).
|“[NHST] is based upon a fundamental misunderstanding of the nature of rational inference, and is seldom if ever appropriate to the aims of scientific research.”|
Rozeboom (1960), quoted by Levine et al. (2008)
|“Statistical significance is perhaps the least important attribute of a good experiment; it is never a sufficient condition for claiming that a theory has been usefully corroborated, that a meaningful empirical fact has been established, or that an experimental report ought to be published.”|
Lykken (1968), quoted by Levine et al. (2008)
|“Small wonder that students have trouble [learning significance testing]. They may be trying to think.”|
Deming (1975), quoted by Ziliak and McCloskey (2009)
|“I believe that the almost exclusive reliance on merely refuting the null hypothesis as the standard method for corroborating substantive theories is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology. I am not making some nit-picking statistician’s correction. I am saying that the whole business is so radically defective as to be scientifically almost pointless.”|
Meehl (1978), quoted by Levine et al. (2008)
|“After 4 decades of severe criticism, the ritual of null hypothesis significance testing -- mechanical dichotomous decisions around a sacred .05 criterion -- still persists.”|
Jacob Cohen (1994)
|“Statistical significance testing retards the growth of scientific knowledge; it never makes a positive contribution.”|
Schmidt and Hunter (1997)
|“Logically and conceptually, the use of statistical significance testing in the analysis of research data has been thoroughly discredited.”|
Schmidt and Hunter (1997), quoted by Levine et al. (2008)
|“D. Anderson, Burnham and W.Thompson (2000) recently found more than 300 articles in different disciplines about the indiscriminate use of NHST, and W. Thompson (2001) lists more than 400 references on this topic. [...] After review of the debate about NHST, I argue that the criticisms have sufficient merit to support the minimization or elimination of NHST.”|
Rex B Kline (2004)
|“Our unfortunate historical commitment to significance tests forces us to rephrase good questions in the negative, attempt to reject those nullities, and be left with nothing we can logically say about the questions.”|
Killeen (2005), quoted by Levine et al. (2008)
|“If these arguments are sound, then the continuing popularity of significance tests in our peer-reviewed journals is at best embarrassing and at worst intellectually dishonest.”|
|“Many scientific disciplines either do not understand how seriously weird, deficient, and damaging is their reliance on null hypothesis significance testing (NHST), or they are in a state of denial.”|
Geoff Cumming, in his open review of our alt.chi paper
|“A ritual is a collective or solemn ceremony consisting of actions performed in a prescribed order. It typically involves sacred numbers or colors, delusions to avoid thinking about why one is performing the actions, and fear of being punished if one stops doing so. The null ritual contains all these features.”|
To learn more about issues with NHST and how they can be easily addressed with interval estimates, here are two good starting points that will only take 25 minutes of your time in total.
- (Cumming, 2011) Significant Does not Equal Important: Why we Need the New Statistics (radio interview).
A short podcast accessible to a wide audience. Focuses on the understandability of p values vs. confidence intervals. In an ideal world, this argument alone should suffice to convince people to whom usability matters.
- (Cumming, 2013) The Dance of p values (video).
A nice video demonstrating how unreliable p values are across replications. Methodologist and statistical cognition researcher Geoff Cumming has communicated a lot on this particular problem, although many other problems have been covered in previous literature (references below). An older version of this video was the main source of inspiration for our alt.chi paper.
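To get a feel for this unreliability without watching the video, here is a minimal simulation sketch. It is not Cumming's own demo: the sample size, the true effect, and the use of a z-test with a known population SD are all simplifying assumptions chosen for illustration, and `replication_p_value` is a hypothetical helper name.

```python
import random
from statistics import NormalDist, mean

def replication_p_value(rng, n=32, true_diff=0.5, sd=1.0):
    """Simulate one two-group experiment and return a two-sided p-value.

    Assumption: the population SD is known, so a simple z-test is used
    instead of a t-test. All parameter values are made up.
    """
    group_a = [rng.gauss(0.0, sd) for _ in range(n)]
    group_b = [rng.gauss(true_diff, sd) for _ in range(n)]
    standard_error = sd * (2.0 / n) ** 0.5
    z = (mean(group_b) - mean(group_a)) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# The same experiment, with the same true effect, replicated 25 times.
rng = random.Random(1)
p_values = [replication_p_value(rng) for _ in range(25)]
for p in p_values:
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"p = {p:.4f}  ({verdict})")
```

Nothing changes between runs except sampling variability, yet the verdict flips back and forth between replications.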
These are suggested initial readings. They are often quite accessible and provide good coverage of the problems with NHST, as well as their solutions.
- (Cohen, 1994) The Earth is Round (p < 0.05).
Perhaps the most famous paper criticizing NHST.
- (Loftus, 1993) A picture is worth a thousand p values: On the irrelevance of hypothesis testing in the microcomputer age.
Argues for using figures with means and standard errors (or CIs) instead of NHST. Illustrates with two nice examples.
- (Schmidt and Hunter, 1997) Eight common but false objections to the discontinuation of significance testing in the analysis of research data.
Rebuts common arguments for keeping NHST. Among other things, explains why increasing sample size and power will not work as a solution, why educating researchers to minimize NHST misuse won't work either, and why dichotomous accept/reject hypothesis testing is futile. Recommends using confidence intervals instead and emphasizing meta-analysis. This is a very convincing paper but be warned that it is sometimes quoted as making excessive claims.
- (Lambdin, 2012) Significance tests as sorcery: Science is empirical—significance tests are not.
Another strong criticism of NHST. Summarizes arguments against NHST very nicely, and provides many references.
- (Giner-Sorolla, 2012). Science or art? How aesthetic standards grease the way through the publication bottleneck but undermine science.
An excellent paper that explains how the publication system forces us to trade off good science for nice stories.
- (Cumming and Finch, 2005) Inference by Eye: Confidence Intervals and How to Read Pictures of Data.
Clarifies how to read confidence intervals and their equivalence with NHST.
- (Cumming, 2014) The New Statistics: Why and How.
Covers many topics on how to do good empirical research. Highly recommended as a general reference. The proposed guidelines have recently been adopted by the editorial committee of the journal Psychological Science.
Other references, some of which are rather accessible while others may be a bit technical for non-statisticians like most of us. It is best to skip the difficult parts and return to the paper later on. I have plenty of other references that I won't have time to add here. For more, see my book chapter draft.
- (Kline, 2004) Beyond significance testing: Reforming data analysis methods in behavioral research (Chapter 3).
A broad overview of the problems of NHST and of publications questioning significance testing (~300 since the 1950s). Recommends downplaying or eliminating NHST, reporting effect sizes with confidence intervals instead, and focusing on replication.
- (Gigerenzer, 2004) Mindless statistics.
Another (better) version of Gigerenzer's paper The Null Ritual - What You Always Wanted to Know About Significance Testing but Were Afraid to Ask. Explains why the NHST ritual is a confusing combination of Fisher and Neyman/Pearson methods. Explains why few people understand p, problems with meaningless nil hypotheses, problems with the use of "post-hoc alpha values", pressures on students by researchers, researchers by editors, etc. Important paper although not a very easy read.
- (Ziliak and McCloskey, 2009) The Cult of Statistical Significance.
A harsh criticism of NHST, from the authors of the book "The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives". Nicely explains why effect size is more important than precision (which refers to both p values and widths of confidence intervals).
- (Gelman, 2013) Interrogating p-values.
Explains towards the end why p-values are inappropriate even for making decisions, and that effect sizes should be used instead.
- (Cumming, 2008) p Values Predict the Future Only Vaguely, but Confidence Intervals Do Much Better.
Examines the unreliability of p values and introduces p intervals. For example, when one obtains p = .05, there is an 80% chance that the next replication yields p inside [.00008, .44], and a 20% chance it's outside.
- (Cumming, 2012) Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis.
Geoff Cumming's book: the first chapter contains a pedagogical explanation of why estimation is superior to NHST. Further chapters explain in detail how to build, plot and read confidence intervals. Recommended as a textbook, although it mostly covers simple experiment designs and assumes a basic knowledge of traditional stats. Geoff Cumming is currently writing a general stats intro textbook.
As Cumming's popularity is growing his detractors are proliferating, but they sadly miss the point of Cumming's philosophy.
Effect size does not mean complex statistics. If you report mean task completion times, or differences in mean task completion times, you're reporting effect sizes.
- (Baguley, 2009). Standardized or simple effect size: What should be reported?
Explains why it is often preferable to report effect sizes in the original units rather than computing standardized effect sizes like Cohen's d.
- (Dragicevic, 2012). My Technique is 20% Faster: Problems with Reports of Speed Improvements in HCI.
Explains why terms such as "20% faster" are confusing.
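To make this concrete, here is a small sketch that computes a simple (unstandardized) effect size, a standardized one, and two different "percent" readings of the same data. The completion times are made up, and the two percent readings are my own illustration of one possible source of confusion, not a summary of the papers above.

```python
from statistics import mean, stdev

# Hypothetical task completion times in seconds (made-up data).
old_technique = [10.2, 9.8, 11.5, 10.9, 9.6, 10.4, 11.1, 10.5]
new_technique = [8.1, 8.9, 7.6, 9.2, 8.4, 7.9, 8.8, 8.3]

def cohens_d(a, b):
    """Standardized mean difference, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

# Simple effect size: a difference in means, in the original unit (seconds).
diff = mean(old_technique) - mean(new_technique)

# Two ways of turning the same means into a percentage.
time_saved = diff / mean(old_technique)                     # share of time saved
speed_gain = mean(old_technique) / mean(new_technique) - 1  # ratio of mean times, read as a speed-up

print(f"Mean difference: {diff:.1f} s")
print(f"Cohen's d: {cohens_d(old_technique, new_technique):.2f}")
print(f"Time saved: {time_saved:.0%}, speed gain: {speed_gain:.0%}")
```

The difference in seconds can be read directly, whereas the same data can honestly be described with two different percentages depending on whether one reasons about time or about speed.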
Bootstrapping is an easy method for computing confidence intervals for all kinds of distributions. It's too computationally intensive to have been even conceivable in the past, but now we have this thing called a computer. Oh, and it's non-deterministic — if this disturbs you, then maybe take another look at our alt.chi paper!
- Confidence Intervals Using Bootstrapping: Statistics Help (video).
A video explaining bootstrapping.
- Confidence intervals for the mean using bootstrapping.
Short explanation on how to compute bootstrap CIs in R (provides several methods).
- BootES: An R Package for Bootstrap Confidence Intervals on Effect Sizes
A scientific article introducing a powerful R package for computing all sorts of CIs. Discusses the theoretical support behind bootstrapping while remaining very accessible (for a psychology audience). Advocates the use of contrasts instead of ANOVAs.
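For readers who prefer code to videos, here is a minimal sketch of the simplest variant, the percentile bootstrap, for a 95% CI of the mean. The resources above cover several better methods. The data are made up, `bootstrap_ci` is a hypothetical helper name, and the fixed seed is only there to make the example reproducible.

```python
import random
from statistics import mean

def bootstrap_ci(data, stat=mean, n_resamples=5000, confidence=0.95, seed=None):
    """Percentile bootstrap confidence interval for a statistic.

    Resamples the data with replacement, computes the statistic on each
    resample, and returns the matching percentiles of those estimates.
    """
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    alpha = 1.0 - confidence
    low_index = int((alpha / 2.0) * n_resamples)
    high_index = int((1.0 - alpha / 2.0) * n_resamples) - 1
    return estimates[low_index], estimates[high_index]

# Hypothetical task completion times in seconds (made-up data).
times = [12.1, 9.8, 14.3, 11.0, 10.5, 13.7, 9.2, 12.9, 11.6, 10.1, 15.0, 11.8]

low, high = bootstrap_ci(times, seed=42)
print(f"Mean = {mean(times):.2f} s, 95% CI [{low:.2f}, {high:.2f}]")
```

Without a seed, the interval limits change slightly from run to run. With small samples the percentile method can be too narrow, which is one reason to look at the other methods mentioned above.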
In addition to having been greatly over-emphasized, the notion of type I error is misleading. One issue is that the nil hypothesis of no effect is almost always false, so it's practically impossible to commit a type I error. On the other hand, many NHST critics stress the absurdity of testing nil hypotheses, but this is rarely what people (especially in HCI) are really doing. Some articles on this page clarify this. Here are more articles specifically on the topic of statistical errors. More to come.
- (Gelman, 2004) Type 1, type 2, type S, and type M errors (Blog post).
Coins type S errors (errors of sign) and type M errors (errors of magnitude). Points to an article that is more technical.
- (Pollard, 1987) On the Probability of Making Type I Errors.
Explains why the alpha level does not give you the probability of making a type I error.
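Errors of sign and magnitude are easy to explore by simulation. The sketch below is my own illustration, not code from the articles above: it assumes a small true effect measured with a large, known standard error (both values are made up), keeps only the "statistically significant" estimates, and checks how often they have the wrong sign and how much they overstate the true effect.

```python
import random
from statistics import NormalDist, mean

def simulate_type_s_m(true_effect=0.1, se=0.5, n_sims=20000, alpha=0.05, seed=0):
    """Estimate power, type S rate and exaggeration among significant results.

    Assumption: each study yields one normally distributed estimate with a
    known standard error, tested with a two-sided z-test.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    estimates = [rng.gauss(true_effect, se) for _ in range(n_sims)]
    significant = [e for e in estimates if abs(e) > z_crit * se]
    power = len(significant) / n_sims
    # Type S: the significant estimate has the opposite sign of the true effect.
    type_s = mean([1.0 if e * true_effect < 0 else 0.0 for e in significant])
    # Type M: average factor by which significant estimates overstate the effect.
    exaggeration = mean([abs(e) for e in significant]) / abs(true_effect)
    return power, type_s, exaggeration

power, type_s, exaggeration = simulate_type_s_m()
print(f"Share of significant results: {power:.3f}")
print(f"Wrong sign among significant results: {type_s:.2f}")
print(f"Average exaggeration factor: {exaggeration:.1f}")
```

In this deliberately noisy setting, a significant result says little about the direction of the effect and grossly overstates its size, even though the nil hypothesis is false.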
How to deal with multiple comparisons? Use a simple experiment design, pick meaningful comparisons in advance, and use interval estimates.
- (Perneger, 1998) What’s wrong with Bonferroni adjustments.
Adjustment procedures for multiple comparisons are not uncontroversial. One issue is that they dramatically inflate type II errors. This author recommends simply reporting what we analyzed and why, and relying on our (and our readers') judgment. This judgment is naturally greatly facilitated by the use of an estimation approach.
- (Smith et al., 2002) The High Cost of Complexity in Experimental Design and Data Analysis -- Type I and Type II Error Rates in Multiway ANOVA.
This paper complements the one above: to avoid the problem of multiple comparisons, it is best to design simpler experiments and decide on what is worth comparing. Related to pre-specification of analyses (Cumming, 2014) and to contrast analysis.
Still looking for an accessible introduction. Suggestions welcome.
HCI is a young field and rarely has deep hypotheses to test. If you only have superficial research questions, frame them in a quantitative manner (e.g., is my technique better and if so, to what extent?) and answer them using estimation. If you have something deeper to test, be careful how you do it. It's hard.
- (Meehl, 1967) Theory-testing in Psychology and Physics: A Methodological Paradox.
Contrasts substantive theories with statistical hypotheses, and explains why testing the latter provides only a very weak confirmation of the former. Also clarifies that point nil hypotheses in NHST are really used for testing directional hypotheses (see also section 'Errors' above). Paul Meehl is an influential methodologist and has written other important papers, but his style is a bit dry. I will try to add more references from him on this page later on.
- (Kruschke, 2010) Doing Bayesian Data Analysis: A Tutorial with R and BUGS.
This book has been recommended to me several times as being an accessible introduction to Bayesian methods, although I have not read it yet. If you are willing to invest time, Bayesian (credible) intervals are worth studying as they have several advantages over confidence intervals.
- (Kruschke and Liddell, 2015) The Bayesian New Statistics: Two historical trends converge.
Explains why estimation (rather than null hypothesis testing) and Bayesian (rather than frequentist) methods are complementary, and why we should do both.
Other papers cited in the BELIV talk
- (Dawkins, 2011) The Tyranny of the Discontinuous Mind.
A short but insightful article by Richard Dawkins that explains why dichotomous thinking, or more generally categorical thinking, can be problematic.
- (Rosenthal and Gaito, 1964) Further evidence for the cliff effect in interpretation of levels of significance.
A short summary of early research in statistical cognition showing that people trust p-values that are slightly below 0.05 much more than p-values that are slightly above. This seems to imply that Fisher's suggestion to minimize the use of decision cut-offs and simply report p-values as a measure of strength of evidence may not work.
Papers (somehow) in favor of NHST
To fight confirmation bias, here is an attempt to provide credible references that go against the discontinuation of NHST. Many are valuable in that they clarify what NHST is really about.
- (Abelson, 1995) Statistics as Principled Argument (Chapter 1).
A book lauded by (Cairns, 2007) below. The link points to a free version of the first chapter. Focuses on the arguments and stories that stats serve to build. Acknowledges that NHST is only a starting point (the book was reviewed by Jacob Cohen). Quite interesting and insightful.
- (Abelson, 1997) A Retrospective on the Significance Test Ban of 1999 (If There Were no Significance Tests, they Would be Invented).
This paper is intended as a defense of NHST, and uses a dystopian scenario to make its case (1999 is after the article's publication date). Both the presentation of NHST (which clarifies that NHST typically serves to test directional, not nil hypotheses) and the discussion of its misuses are excellent. However, the defense of NHST is rather unconvincing: if confidence intervals must be provided, then it is not clear why p-values should be kept.
- (Levine et al, 2008) A Critical Assessment of Null Hypothesis Significance Testing in Quantitative Communication Research.
Although this paper focuses on summarizing known issues with NHST, it does not argue for its discontinuation. It contains a very good paragraph summarizing NHST.
- (Levine et al, 2008) A Communication Researchers' Guide to Null Hypothesis Significance Testing and Alternatives.
The companion article of the reference above. Again, says NHST should be complemented with effect size estimates and confidence intervals, but does not explain why NHST should be kept. Discusses "effect tests" and equivalence testing.
Papers against confidence intervals
For several decades Bayesians have (more or less strongly) objected to the use of confidence intervals, but their claim that "CIs don't solve the problems of NHST" is misleading. Confidence intervals and credible (Bayesian) intervals are both interval estimates, thus a great step forward compared to dichotomous testing. Credible intervals arguably rest on more solid theoretical foundations and have a number of advantages, but they also have a higher switching cost. By claiming that confidence intervals are invalid and should never be used, Bayesian idealists may just give NHST users an excuse for sticking to their ritual.
- (Morey et al, in press) The Fallacy of Placing Confidence in Confidence Intervals.
This is the most recent paper on the topic, with many authors and with the boldest claims. It is very informative and shows that CIs can behave in pathological ways in specific situations. Unfortunately, it is hard to assess the real consequences, i.e., how much we are off when we interpret CIs as plausible values in realistic user studies. Also see this long discussion thread on an older revision, especially toward the end.
Papers from the HCI Community
There have been several papers in HCI questioning current practices, although none of them calls for a discontinuation or even minimization of NHST, none of them mentions the poor replicability of p-values, and none provides a satisfactory treatment of interval estimation.
- (Cairns, 2007) HCI... not as it should be: inferential statistics in HCI research
A good summary of sloppy use of NHST methods in HCI, with a nice explanation of NHST in the introduction and an interesting discussion at the end. Overall in favor of NHST but duly acknowledges that it is not the only method.
- (Wilson et al, 2011) RepliCHI -- CHI should be replicating and validating results more.
The CHI workshop on replication that started the repliCHI initiative. Replication is important, and as explained in (Cumming, 2012), estimation promotes replication and makes meta-analyses possible. NHST does neither. I have been told that the RepliCHI initiative has been discontinued.
- (Robertson, 2011) Stats: We're Doing It Wrong (blog post).
Covers problems related to statistical assumptions and the need to report effect size, although it does not discuss deeper problems with NHST.
- (Kaptein and Robertson, 2012) Rethinking statistical analysis methods for CHI.
Reviews a few common problems with NHST, but does not cover how some of these issues can be addressed by taking an estimation approach.
- (Hornbaek et al, 2014) Is Once Enough? On the Extent and Content of Replications in Human-Computer Interaction.
Helpfully clarifies what a replication is in HCI, what it is good for, and to what extent it has been used in the past.
Examples of HCI and VIS papers without p-values
- The CHI 2013 paper by Jansen, Dragicevic and Fekete Evaluating the Efficiency of Physical Visualizations makes only limited use of NHST and comes with a page with replication material.
- Yvonne Jansen's PhD thesis Physical and Tangible Information Visualization analyzes all experimental data using estimation (Chapters 5 and 6). The thesis does not contain a single p-value.
- The VIS 2014 paper by Chevalier, Dragicevic and Franconeri The Not So Staggering Effect of Staggered Animations makes no use of p-values, separates pre-planned from exploratory analyses, and has a page with extensive replication material.
- The VIS 2014 paper by Talbot, Setlur, and Anand Four Experiments on the Perception of Bar Charts makes no use of p-values and has replication material online. Their paper replicates a well-known 1984 study by Cleveland and McGill that also used estimation. Cleveland and McGill used bootstrap confidence intervals only a few years after Efron introduced them.
- The CHI 2015 study by Willett, Jenny, Isenberg and Dragicevic Lightweight Relief Shearing for Enhanced Terrain Perception on Interactive Maps has received a best paper award despite having not a single p-value. It has replication material online.
- The CHI 2015 study by Wacharamanotham, Subramanian, Volkel and Borchers Statsplorer: Guiding Novices in Statistical Analysis also bases all of its analyses on confidence intervals and has no p-value.
- The CHI 2015 study by Goffin, Bezerianos, Willett and Isenberg (published as an extended abstract) Exploring the Effect of Word-Scale Visualizations on Reading Behavior uses estimation and plots instead of p-values.
- The VIS 2015 paper by Yu, Efstathiou, Isenberg and Isenberg CAST: Effective and Efficient User Interaction for Context-Aware Selection in 3D Particle Clouds also has no p-value.
- The VIS 2015 paper by Jansen and Hornbæk A Psychophysical Investigation of Size as a Physical Variable has no p-value and shares replication/reanalysis material online.
- The VIS 2015 paper by Kay and Heer Beyond Weber’s Law: A Second Look at Ranking Visualizations of Correlation uses Bayesian estimation and only has a single p-value. All analyses are available online. The paper has received a Best Paper Honorable Mention.
- The VIS 2015 paper by Boy, Eveillard, Detienne and Fekete Suggested Interactivity: Seeking Perceived Affordances for Information Visualization has no p-value either.
- The CHI 2016 paper by Zhao, Glueck, Chevalier, Wu and Khan Egocentric Analysis of Dynamic Networks with EgoLines uses estimation and descriptive statistics (no p-value), and has received an honorable mention award.
- Contact us if you know of more papers. We'll be happy to add yours! Make sure the PDF is available online and not behind a paywall.
The papers we co-author are by no means prescriptive. We are still polishing our methods and learning. We are grateful to Geoff Cumming for his help and support.
List of people currently involved in this initiative:
- Pierre Dragicevic, Inria
- Fanny Chevalier, Inria
- Stéphane Huot, Inria
- Yvonne Jansen, University of Copenhagen
- Wesley Willett, University of Calgary
- Charles Perin, University of Calgary
- Petra Isenberg, Inria
- Jeremy Boy, New York University
- Email us if you wish to contribute!
|All material on this page is CC-BY-SA. You can reuse or adapt it provided you link to www.aviz.fr/badstats.|