Bad Stats: Not What It Seems

Towards a Statistical Reform in HCI and Visualization

Pierre Dragicevic and colleagues

This page provides arguments and reading material to explain why it would be beneficial for human-computer interaction and information visualization (HCI and infovis) to stop doing mindless null hypothesis significance testing (NHST), and to start reporting informative charts with effect sizes and interval estimates, as well as offering more nuanced interpretations of our results. Our scientific standards can also be greatly improved by planning analyses in advance and sharing experimental material online. The bottom of this page lists papers published at CHI and VIS that contain not a single p-value, some of which received best paper or honorable mention awards.
Content:
Special Interest Group on Transparent Statistics in HCI (CHI SIG Meeting)
Fair Statistical Communication in HCI (book chapter)
Bad Stats are Miscommunicated Stats (BELIV 2014 Keynote)
Running an HCI Experiment in Multiple Parallel Universes (Alt.CHI 2014 Paper)
Quotes about null hypothesis significance testing (NHST)
Links
Reading List
More Readings
Papers (somehow) in favor of NHST
Papers against confidence intervals
Papers from the HCI Community
Examples of HCI and VIS papers without p-values

To Start with

Listen to this 15-minute speech from Geoff Cumming, a prominent researcher in statistical cognition. This will give you a good overview of what follows.

CHI '16 – Special Interest Group on Transparent Statistics in HCI

With Matthew Kay, Steve Haroz and Shion Guha, I co-organized a SIG meeting at CHI '16 on transparent statistical reporting. The meeting generated a lot of interest, and we are now looking into forming a community and developing concrete recommendations to move the field forward. See here for more info.

Book Chapter – Fair Statistical Communication in HCI

I wrote a book chapter titled Fair Statistical Communication in HCI that attempts to explain why and how we can do stats without p-values in HCI and Vis. The book is titled Modern Statistical Methods for HCI and is edited by Judy Robertson and Maurits Kaptein. I previously released an early draft as a research report titled HCI Statistics without p-values. However, I recommend that you use the Springer chapter instead (pre-print link below), as it's more up-to-date and much improved.

Get the final book chapter from Springer
Download author version (PDF 1.8 MB)
See errata, new / updated tips, and comments.

BELIV 2014 Keynote – Bad Stats are Miscommunicated Stats

I gave a keynote talk at the biennial BELIV 2014 workshop, entitled "Bad Stats are Miscommunicated Stats". More information on the workshop page.

Summary of the Talk

When reporting on user studies, we often need to do stats. But many of us have little training in statistics, and we are just as anxious about doing it right as we are eager to incriminate others for any flaw we might spot. Violations of statistical assumptions, samples that are too small, uncorrected multiple comparisons: deadly sins abound. But our obsession with flaw-spotting in statistical procedures makes us miss far more serious issues, as well as the real purpose of statistics: to help us communicate about our experimental results in order to advance scientific knowledge. Science is a cumulative and collective enterprise, so miscommunication, confusion and obfuscation are much more damaging than moderately inflated Type I error rates.

In my talk, I argue that the most common form of bad stats is miscommunicated stats. I also explain why we have all been faring terribly by this criterion, mostly due to our blind endorsement of the concept of statistical significance. This idea promotes a form of dichotomous thinking that not only gives a highly misleading view of the uncertainty in our data, but also encourages questionable practices such as selective data analysis and various other contortions to reach the sacred .05 level. Researchers' reliance on mechanical statistical testing rituals is deeply entrenched, yet it has been severely criticized in a range of disciplines for more than 50 years; it is thus particularly striking that it has been so easily endorsed by our community. We repeatedly stress the crucial role of human judgment when analyzing data, but do the opposite when we conduct or review statistical analyses from user experiments. I believe that we can cure our schizophrenia and substantially improve our scientific production by banning p-values, by reporting empirical data using clear figures with effect sizes and confidence intervals, and by learning to provide nuanced interpretations of our results. We can also dramatically raise our scientific standards by pre-specifying our analyses, fully disclosing our results, and sharing extensive replication material online. These are small but important reforms that are much more likely to improve science than methodological nitpicking on statistical testing procedures.

Presentation Slides

Download the slides (PDF 3MB)

Alt.CHI 2014 Paper – Running an HCI Experiment in Multiple Parallel Universes

Pierre Dragicevic, Fanny Chevalier and Stéphane Huot (2014) Running an HCI Experiment in Multiple Parallel Universes. In ACM CHI Conference on Human Factors in Computing Systems (alt.chi). Toronto, Canada, April 2014.

Why this Strange Article?

The purpose of this alt.chi paper was to raise the community's awareness of the need to question the statistical procedures we currently use to interpret and communicate our experimental results, i.e., the ritual of null hypothesis significance testing (NHST). The paper does not elaborate on the numerous problems of NHST, nor does it discuss how statistics should ideally be done, as these points have already been covered in hundreds of articles published across decades in many disciplines. The interested reader can find a non-exhaustive list of references below. For more references, simply google "NHST criticism".

A comment often elicited by this paper is that HCI researchers should learn to use NHST properly, or that they should use larger sample sizes. Neither is the point we are trying to make. Nor are we implying that we should reject inferential statistics altogether, or that we should stop caring about statistical error. See the discussion thread from the alt.chi open review process. Our personal position is that HCI research can and should get rid of NHST procedures and p-values, and instead switch to reporting (preferably unstandardized) effect sizes with interval estimates, e.g., 95% confidence intervals, as recommended by many methodologists.

Video Teaser

Presentation Slides

Download the slides (PDF 5MB)

In our presentation we take an HCI perspective to rebut common arguments against the discontinuation of NHST, namely: i) the problem is NHST misuse, ii) the problem is low statistical power, iii) NHST is needed to test hypotheses or make decisions, iv) p-values and confidence intervals are equivalent anyway, v) we need both.

Quotes About Null Hypothesis Significance Testing (NHST)

The number of papers stressing the deep flaws of NHST is simply bewildering. Sadly, awareness of this literature seems very low in HCI and Infovis. Yet most criticisms are not about the theoretical underpinnings of NHST, but about its usability. Will HCI and Infovis set a good example for other disciplines?

“...no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.”
   Sir Ronald A. Fisher (1956), quoted by Gigerenzer (2004).
“[NHST] is based upon a fundamental misunderstanding of the nature of rational inference, and is seldom if ever appropriate to the aims of scientific research.”
   Rozeboom (1960), quoted by Levine et al. (2008)
“Statistical significance is perhaps the least important attribute of a good experiment; it is never a sufficient condition for claiming that a theory has been usefully corroborated, that a meaningful empirical fact has been established, or that an experimental report ought to be published.”
   Lykken (1968), quoted by Levine et al. (2008)
“Small wonder that students have trouble [learning significance testing]. They may be trying to think.”
   Deming (1975), quoted by Ziliak and McCloskey (2009)
“I believe that the almost exclusive reliance on merely refuting the null hypothesis as the standard method for corroborating substantive theories is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology. I am not making some nit-picking statistician’s correction. I am saying that the whole business is so radically defective as to be scientifically almost pointless.”
   Meehl (1978), quoted by Levine et al. (2008)
“After 4 decades of severe criticism, the ritual of null hypothesis significance testing -- mechanical dichotomous decisions around a sacred .05 criterion -- still persists.”
   Jacob Cohen (1994)
“Statistical significance testing retards the growth of scientific knowledge; it never makes a positive contribution.”
   Schmidt and Hunter (1997)
“Logically and conceptually, the use of statistical significance testing in the analysis of research data has been thoroughly discredited.”
   Schmidt and Hunter (1997), quoted by Levine et al. (2008)
“D. Anderson, Burnham and W.Thompson (2000) recently found more than 300 articles in different disciplines about the indiscriminate use of NHST, and W. Thompson (2001) lists more than 400 references on this topic. [...] After review of the debate about NHST, I argue that the criticisms have sufficient merit to support the minimization or elimination of NHST.”
   Rex B Kline (2004)
“Our unfortunate historical commitment to significance tests forces us to rephrase good questions in the negative, attempt to reject those nullities, and be left with nothing we can logically say about the questions.”
   Killeen (2005), quoted by Levine et al. (2008)
“If these arguments are sound, then the continuing popularity of significance tests in our peer-reviewed journals is at best embarrassing and at worst intellectually dishonest.”
   Lambdin (2012)
“Many scientific disciplines either do not understand how seriously weird, deficient, and damaging is their reliance on null hypothesis significance testing (NHST), or they are in a state of denial.”
   Geoff Cumming, in his open review of our alt.chi paper
“A ritual is a collective or solemn ceremony consisting of actions performed in a prescribed order. It typically involves sacred numbers or colors, delusions to avoid thinking about why one is performing the actions, and fear of being punished if one stops doing so. The null ritual contains all these features.”
   Gigerenzer, 2015

Links

To learn more about issues with NHST and how they can be easily addressed with interval estimates, here are two good starting points that will only take 25 minutes of your time in total.

  • (Cumming, 2011) Significant Does not Equal Important: Why we Need the New Statistics (radio interview).
    A short podcast accessible to a wide audience. Focuses on the understandability of p values vs. confidence intervals. In an ideal world, this argument alone should suffice to convince people to whom usability matters.
  • (Cumming, 2013) The Dance of p values (video).
    A nice video demonstrating how unreliable p values are across replications (a small simulation of this "dance" is sketched below). Methodologist and statistical cognition researcher Geoff Cumming has communicated a lot on this particular problem, although many other problems have been covered in previous literature (references below). An older version of this video was the main source of inspiration for our alt.chi paper.
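If you would like to see this "dance" for yourself, here is a minimal Python sketch, not taken from Cumming's materials: it simulates repeated replications of the same hypothetical two-group experiment (all parameters are made up) and prints the p value obtained each time.

```python
# A minimal sketch of the "dance of p values": simulate many replications
# of the same two-group experiment and watch p fluctuate wildly, even
# though a real effect is present. All parameters are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, true_effect, sd = 20, 0.5, 1.0  # made-up sample size and effect

for rep in range(10):
    control = rng.normal(0.0, sd, n)
    treatment = rng.normal(true_effect, sd, n)
    t, p = stats.ttest_ind(control, treatment)
    print(f"replication {rep + 1:2d}: p = {p:.3f}")
```

Across runs, p typically jumps between values well below and well above .05, which is precisely the point of the video.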

Reading List

These are suggested initial readings. They are generally quite accessible and provide good coverage of the problems with NHST, as well as of their solutions.

More Readings

Other references, some of which are rather accessible, while others may be a bit technical for non-statisticians like most of us. It is best to skip the difficult parts and return to the paper later on. I have plenty of other references that I won't have time to add here; for more, see my book chapter.

  • (Kline, 2004) Beyond significance testing: Reforming data analysis methods in behavioral research (Chapter 3).
    A broad overview of the problems of NHST and of the publications questioning significance testing (~300 since the 1950s). Recommends downplaying or eliminating NHST, reporting effect sizes with confidence intervals instead, and focusing on replication.
  • (Gigerenzer, 2004) Mindless statistics.
    Another (better) version of Gigerenzer's paper The Null Ritual - What You Always Wanted to Know About Significance Testing but Were Afraid to Ask. Explains why the NHST ritual is a confusing combination of the Fisher and Neyman-Pearson methods, why few people understand p, the problems with meaningless nil hypotheses and with "post hoc alpha values", and the pressures exerted on students by researchers and on researchers by editors. An important paper, although not a very easy read.
  • (Ziliak and McCloskey, 2009) The Cult of Statistical Significance.
    A harsh criticism of NHST, from the authors of the book "The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives". Nicely explains why effect size is more important than precision (which refers to both p values and the widths of confidence intervals).
  • (Gelman, 2013) Interrogating p-values.
    Explains towards the end why p-values are inappropriate even for making decisions, and that effect sizes should be used instead.
  • (Cumming, 2008) p Values Predict the Future Only Vaguely, but Confidence Intervals Do Much Better.
    Examines the unreliability of p values and introduces p intervals. For example, when one obtains p = .05, there is an 80% chance that the next replication yields p inside [.00008, .44], and a 20% chance that it falls outside.
  • (Cumming, 2012) Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis.
    Geoff Cumming's book: the first chapter contains a pedagogical explanation of why estimation is superior to NHST. Further chapters explain in detail how to build, plot and read confidence intervals. Recommended as a textbook, although it mostly covers simple experiment designs and assumes a basic knowledge of traditional stats. Geoff Cumming is currently writing a general stats intro textbook.
    As Cumming's popularity grows, his detractors (mostly from the Bayesian camp) are proliferating, but they sadly miss the point of his philosophy.

Effect size

Effect size does not mean complex statistics. If you report mean task completion times, or differences in mean task completion times, you're reporting effect sizes.
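As a concrete illustration, here is a minimal Python sketch reporting exactly that kind of effect size: a difference in mean task completion times, together with a 95% confidence interval. The numbers are made-up illustration values, not data from any real study.

```python
# Minimal sketch: an unstandardized effect size is just the difference in
# mean completion times, best reported with an interval estimate.
# The data below are hypothetical.
import numpy as np
from scipy import stats

technique_a = np.array([12.1, 10.4, 13.2, 11.8, 12.9, 10.7, 11.5, 12.3])  # seconds
technique_b = np.array([9.8, 8.9, 10.5, 9.1, 10.2, 9.6, 8.7, 9.9])

diff = technique_a.mean() - technique_b.mean()  # the effect size itself
var_a = technique_a.var(ddof=1) / len(technique_a)
var_b = technique_b.var(ddof=1) / len(technique_b)
se = np.sqrt(var_a + var_b)
# Welch-Satterthwaite degrees of freedom (no equal-variance assumption)
df = se**4 / (var_a**2 / (len(technique_a) - 1) + var_b**2 / (len(technique_b) - 1))
half_width = stats.t.ppf(0.975, df) * se
print(f"mean difference: {diff:.2f} s, "
      f"95% CI [{diff - half_width:.2f}, {diff + half_width:.2f}]")
```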

Bootstrapping

Bootstrapping is an easy method for computing confidence intervals for all kinds of distributions. It used to be far too computationally intensive to be practical, but now we have this thing called a computer. Oh, and it's non-deterministic: if this disturbs you, then maybe take another look at our alt.chi paper!
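Here is a minimal sketch of a percentile bootstrap for a difference in means, assuming the same kind of hypothetical completion-time data as above; plain numpy is all that is needed.

```python
# Minimal sketch: percentile bootstrap 95% CI for a difference in means.
# Resampling is random, so the interval changes slightly from run to run.
import numpy as np

rng = np.random.default_rng()
a = np.array([12.1, 10.4, 13.2, 11.8, 12.9, 10.7, 11.5, 12.3])  # hypothetical data
b = np.array([9.8, 8.9, 10.5, 9.1, 10.2, 9.6, 8.7, 9.9])

# Resample each group with replacement and recompute the statistic
boot = [rng.choice(a, len(a)).mean() - rng.choice(b, len(b)).mean()
        for _ in range(10_000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"bootstrap 95% CI for the mean difference: [{lo:.2f}, {hi:.2f}]")
```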

Statistical errors

In addition to having been greatly over-emphasized, the notion of Type I error is misleading: the nil hypothesis of no effect is almost always false, so it is practically impossible to commit a Type I error. At the same time, many NHST critics stress the absurdity of testing nil hypotheses, but this is rarely what people (especially in HCI) are really doing. Some articles on this page clarify this; here are more articles specifically on the topic of statistical errors. More to come.

Multiple comparisons

How to deal with multiple comparisons? Use a simple experiment design, pick meaningful comparisons in advance, and use interval estimates.
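For illustration, here is a minimal sketch (condition names and data are hypothetical) that reports interval estimates for two comparisons chosen before looking at the data, instead of testing every possible pair.

```python
# Minimal sketch: planned comparisons reported as interval estimates.
# All condition names and numbers are hypothetical.
import numpy as np
from scipy import stats

conditions = {
    "baseline": np.array([14.2, 13.1, 15.0, 13.8, 14.5, 13.4]),
    "technique1": np.array([11.9, 12.4, 11.1, 12.8, 11.5, 12.0]),
    "technique2": np.array([12.5, 13.0, 12.1, 13.3, 12.8, 12.2]),
}
# Decided in advance: each technique vs. the baseline, nothing else
planned = [("technique1", "baseline"), ("technique2", "baseline")]

for x, y in planned:
    a, b = conditions[x], conditions[y]
    diff = a.mean() - b.mean()
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    df = len(a) + len(b) - 2  # rough approximation; Welch df would also do
    h = stats.t.ppf(0.975, df) * se
    print(f"{x} - {y}: {diff:.2f} s, 95% CI [{diff - h:.2f}, {diff + h:.2f}]")
```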

Contrast analysis

Still looking for an accessible introduction. Suggestions welcome.

Hypotheses

HCI is a young field and rarely has deep hypotheses to test. If you only have superficial research questions, frame them in a quantitative manner (e.g., is my technique better and if so, to what extent?) and answer them using estimation. If you have something deeper to test, be careful how you do it. It's hard.

  • (Meehl, 1967) Theory-testing in Psychology and Physics: A Methodological Paradox.
    Contrasts substantive theories with statistical hypotheses, and explains why testing the latter provides only very weak confirmation of the former. Also clarifies that point nil hypotheses in NHST are really used for testing directional hypotheses (see also the section 'Statistical errors' above). Paul Meehl was an influential methodologist who wrote other important papers, but his style is a bit dry. I will try to add more references from him on this page later on.

Bayesian statistics

  • (Kruschke, 2010) Doing Bayesian Data Analysis: A Tutorial with R and BUGS.
    This book has been recommended to me several times as an accessible introduction to Bayesian methods, although I have not read it yet. If you are willing to invest time, Bayesian (credible) intervals are worth studying, as they have several advantages over confidence intervals (a minimal code sketch follows this list).
  • (McElreath, 2015) Statistical Rethinking.
    This book is recommended by Matthew Kay for its very accessible introduction to Bayesian analysis. Pick this one or Kruschke's.
  • (Kruschke and Liddell, 2015) The Bayesian New Statistics: Two historical trends converge.
    Explains why estimation (rather than null hypothesis testing) and Bayesian (rather than frequentist) methods are complementary, and why we should do both.
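As mentioned in the first item above, here is a minimal sketch of a Bayesian credible interval, computed by grid approximation with a flat prior and the sample standard deviation treated as known. It is only meant to convey the flavor of the approach, not to replace the books above; the data are hypothetical.

```python
# Minimal sketch: a 95% credible interval for a mean difference via grid
# approximation, with a flat prior. The sample standard deviation is
# treated as known for simplicity; the data are hypothetical.
import numpy as np

diffs = np.array([2.3, 1.1, 2.9, 1.8, 2.5, 0.9, 1.6, 2.1])  # within-subject differences
grid = np.linspace(-1, 5, 2001)  # candidate values for the true mean
sd = diffs.std(ddof=1)

# Unnormalized posterior: flat prior times normal likelihood of the data
log_lik = np.array([np.sum(-0.5 * ((diffs - mu) / sd) ** 2) for mu in grid])
posterior = np.exp(log_lik - log_lik.max())
posterior /= posterior.sum()

cdf = np.cumsum(posterior)
lo = grid[np.searchsorted(cdf, 0.025)]
hi = grid[np.searchsorted(cdf, 0.975)]
print(f"95% credible interval for the mean difference: [{lo:.2f}, {hi:.2f}]")
```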

Other papers cited in the BELIV talk

  • (Dawkins, 2011) The Tyranny of the Discontinuous Mind.
    A short but insightful article by Richard Dawkins that explains why dichotomous thinking, or more generally categorical thinking, can be problematic.
  • (Rosenthal and Gaito, 1964) Further evidence for the cliff effect in interpretation of levels of significance.
    A short summary of early research in statistical cognition showing that people trust p-values that are slightly below 0.05 much more than p-values that are slightly above. This seems to imply that Fisher's suggestion to minimize the use of decision cut-offs and simply report p-values as a measure of strength of evidence may not work.

Papers (somehow) in favor of NHST

To fight confirmation bias, here is an attempt to provide credible references that argue against the discontinuation of NHST. Many are valuable in that they clarify what NHST is really about.

Papers against confidence intervals

For several decades Bayesians have (more or less strongly) objected to the use of confidence intervals, but their claim that "CIs don't solve the problems of NHST" is misleading. Confidence intervals and credible (Bayesian) intervals are both interval estimates, thus a great step forward compared to dichotomous testing. Credible intervals arguably rest on more solid theoretical foundations and have a number of advantages, but they also have a higher switching cost. By claiming that confidence intervals are invalid and should never be used, Bayesian idealists may just give NHST users an excuse for sticking to their ritual.

  • (Morey et al, in press) The Fallacy of Placing Confidence in Confidence Intervals.
    This is the most recent paper on the topic, with many authors and the boldest claims. It is very informative and shows that CIs can behave in pathological ways in specific situations. Unfortunately, it is hard to assess the real consequences, i.e., how far off we are when we interpret CIs as plausible values in realistic user studies. Also see this long discussion thread on an older revision, especially toward the end.

Papers from the HCI Community

There have been several papers in HCI questioning current practices, although none of them calls for a discontinuation or even a minimization of NHST, none of them mentions the poor replicability of p-values, and none provides a satisfactory treatment of interval estimation.

Examples of HCI and VIS papers without p-values

The papers we co-author are by no means prescriptive. We are still polishing our methods and learning. We are grateful to Geoff Cumming for his help and support.

Contact

List of people currently involved in this initiative:

License

All material on this page is CC-BY-SA. You can reuse or adapt it provided you link to www.aviz.fr/badstats.