Bad Stats: Not What It Seems

Pierre Dragicevic and colleagues

This web page provides arguments and reading material explaining why it would be beneficial for human-computer interaction and information visualization to move beyond mindless null hypothesis significance testing (NHST), and to focus instead on presenting informative charts with effect sizes and their interval estimates. Our scientific standards can also be greatly improved by planning analyses in advance and sharing experimental material online. At the bottom of this page you will find studies published at CHI and VIS without any p-value, some of which have received best paper awards.

# CHI 2017 Workshop – Moving Transparent Statistics Forward

With Matthew Kay, Steve Haroz, Shion Guha and Chat Wacharamanotham, I co-organized a CHI workshop titled **Moving Transparent Statistics Forward @CHI 2017** as a follow-up to our Special Interest Group on Transparent Statistics. The goal of this workshop was to start developing concrete guidelines for improving statistical practice in HCI. Contact us if you'd like to contribute.

# BioVis 2016 Primer Keynote – Statistical Dances

I gave a keynote talk at the BioVis 2016 symposium titled **Statistical Dances: Why No Statistical Analysis is Reliable and What To Do About It**.

I gave a similar talk in Grenoble in June 2017 that you can **watch here** (download link here). Thanks to François Bérard and Renaud Blanch for inviting me, and to Arnaud Legrand for organizing this.

## Summary of the Talk

It is now widely recognized that we need to improve the way we report empirical data in our scientific papers. More formal training in statistics is not enough. We also need good "intuition pumps" to develop our statistical thinking skills. In this talk I explore the basic concept of the statistical dance. The dance analogy has been used by Geoff Cumming to describe the variability of p-values and confidence intervals across replications. I explain why any statistical analysis and any statistical chart dances across replications. I discuss why most attempts at stabilizing statistical dances (e.g., increasing power or applying binary accept/reject criteria) are either insufficient or misguided. The solution is to embrace the uncertainty and messiness in our data. We need to develop a good intuition of this uncertainty and communicate it faithfully to our peers. I give tips for conveying and interpreting interval estimates in our papers in an honest and truthful way.

## Material

- Video of the talk (Grenoble version). Download here in case you experience problems with Flash. © GRICAD.
- Watch the slides on your Web browser (with animations)
- Watch the collective dance on youtube
- Download the slides (PDF 5MB)
- Download the videos from the slides (ZIP 3.5MB)
- Bibliography for the talk.

Animated plots by Pierre Dragicevic and Yvonne Jansen.

## Erratum

In the video linked above I said we need to square the sample size to get dances twice as small. I should have said that we need to multiply the sample size by four, since the width of a confidence interval is proportional to 1/√n.
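To see why quadrupling works, note that the standard error (and hence the half-width of a normal-approximation confidence interval) shrinks with the square root of the sample size. A minimal sketch, using a hypothetical standard deviation of 100 ms:

```python
import math

def ci_half_width(sd, n, z=1.96):
    """Half-width of a normal-approximation 95% CI for a mean."""
    return z * sd / math.sqrt(n)

w20 = ci_half_width(sd=100, n=20)   # hypothetical sd of 100 ms, n = 20
w80 = ci_half_width(sd=100, n=80)   # quadrupled sample size

print(round(w20 / w80, 2))  # 2.0: four times the sample, half the width
```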

# CHI 2016 – Special Interest Group on Transparent Statistics in HCI

With Matthew Kay, Steve Haroz and Shion Guha, I co-organized a SIG meeting at CHI '16 titled **Transparent Statistics in HCI**. We propose to define transparent statistics as a philosophy of statistical reporting whose purpose is scientific advancement rather than persuasion. The meeting generated lots of interest and we are now looking at forming a community and developing some concrete recommendations to move the field forward. If you are interested, join our mailing list.

# Book Chapter – Fair Statistical Communication in HCI

I wrote a book chapter titled **Fair Statistical Communication in HCI** that attempts to explain why and how we can do stats without p-values in HCI and Vis. The book is titled Modern Statistical Methods for HCI and is edited by Judy Robertson and Maurits Kaptein. I previously released an early draft as a research report titled **HCI Statistics without p-values**. However, I recommend that you use the Springer chapter instead (pre-print link below), as it's more up-to-date and much improved.

Get the final book chapter from Springer, or **download the author version** (PDF 1.8 MB).

See errata, updated tips, and responses.

# BELIV 2014 Keynote – Bad Stats are Miscommunicated Stats

I gave a keynote talk at the BELIV 2014 biennial workshop, entitled **Bad Stats are Miscommunicated Stats**.

## Summary of the Talk

When reporting on user studies, we often need to do stats. But many of us have little training in statistics, and we are just as anxious about doing it right as we are eager to incriminate others for any flaw we might spot. Violations of statistical assumptions, samples that are too small, uncorrected multiple comparisons—deadly sins abound. But our obsession with spotting flaws in statistical procedures makes us miss far more serious issues, and the real purpose of statistics. Stats are there to help us communicate our experimental results for the purpose of advancing scientific knowledge. Science is a cumulative and collective enterprise, so miscommunication, confusion and obfuscation are much more damaging than moderately inflated Type I error rates.

In my talk, I argue that the most common form of bad stats are miscommunicated stats. I also explain why we all have been faring terribly according to this criterion—mostly due to our blind endorsement of the concept of statistical significance. This idea promotes a form of dichotomous thinking that not only gives a highly misleading view of the uncertainty in our data, but also encourages questionable practices such as selective data analysis and various other contortions to reach the sacred .05 level. While researchers’ reliance on mechanical statistical testing rituals is both deeply entrenched and severely criticized in a range of disciplines—and has been so for more than 50 years—it is particularly striking that it has been so easily endorsed by our community. We repeatedly stress the crucial role of human judgment when analyzing data, but do the opposite when we conduct or review statistical analyses from user experiments. I believe that we can cure our schizophrenia and substantially improve our scientific production by banning p-values, by reporting empirical data using clear figures with effect sizes and confidence intervals, and by learning to provide nuanced interpretations of our results. We can also dramatically raise our scientific standards by pre-specifying our analyses, fully disclosing our results, and sharing extensive replication material online. These are small but important reforms that are much more likely to improve science than methodological nitpicking on statistical testing procedures.

## Presentation Slides

# Alt.CHI 2014 Paper – Running an HCI Experiment in Multiple Parallel Universes

Pierre Dragicevic, Fanny Chevalier and Stéphane Huot (2014) **Running an HCI Experiment in Multiple Parallel Universes**. In ACM CHI Conference on Human Factors in Computing Systems (alt.chi). Toronto, Canada, April 2014.

## What's the Point of this Article?

The purpose of this alt.chi paper was to raise the community's awareness of the need to question the statistical procedures we currently use to interpret and communicate our experimental results, i.e., the ritual of null hypothesis significance testing (NHST). The paper does not elaborate on the numerous problems of NHST, nor does it discuss how statistics should ideally be done, as these points have already been covered in hundreds of articles published over decades in many disciplines. The interested reader can find a non-exhaustive list of references below.

A comment often elicited by this paper is that HCI researchers should learn to use NHST properly, or that they should use larger sample sizes. This is not the point we are trying to make. Also, we are not implying that we should reject inferential statistics altogether, or that we should stop caring about statistical error. See the discussion thread from the alt.chi open review process. Our personal position is that HCI research can and should get rid of NHST procedures and p-values, and instead switch to reporting (preferably unstandardized) effect sizes with interval estimates — e.g., 95% confidence intervals — as recommended by many methodologists.

## Presentation Slides

In our presentation we take an HCI perspective to rebut common arguments against the discontinuation of NHST, namely: i) the problem is NHST misuse, ii) the problem is low statistical power, iii) NHST is needed to test hypotheses or make decisions, iv) p-values and confidence intervals are equivalent anyway, v) we need both.

# Quotes About Null Hypothesis Significance Testing (NHST)

The number of papers stressing the deep flaws of NHST is simply bewildering. Sadly, awareness of this literature seems very low in HCI and Infovis. Yet most criticisms are not about the theoretical underpinnings of NHST, but about its usability. Will HCI and Infovis set a good example for other disciplines?

“...no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.” Sir Ronald A. Fisher (1956), quoted by Gigerenzer (2004).

“[NHST] is based upon a fundamental misunderstanding of the nature of rational inference, and is seldom if ever appropriate to the aims of scientific research.” Rozeboom (1960), quoted by Levine et al. (2008).

“Statistical significance is perhaps the least important attribute of a good experiment; it is never a sufficient condition for claiming that a theory has been usefully corroborated, that a meaningful empirical fact has been established, or that an experimental report ought to be published.” Lykken (1968), quoted by Levine et al. (2008).

“Small wonder that students have trouble [learning significance testing]. They may be trying to think.” Deming (1975), quoted by Ziliak and McCloskey (2009).

“I believe that the almost exclusive reliance on merely refuting the null hypothesis as the standard method for corroborating substantive theories is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology. I am not making some nit-picking statistician’s correction. I am saying that the whole business is so radically defective as to be scientifically almost pointless.” Paul Meehl (1978), quoted by Levine et al. (2008).

“One of the most frustrating aspects of the journal business is the null hypothesis. It just will not go away. [...]. It is almost impossible to drag authors away from their p values [...]. It is not uncommon for over half the space in a results section to be composed of parentheses inside of which are test statistics, degrees of freedom, and p values. Perhaps p values are like mosquitos. They have an evolutionary niche somewhere and no amount of scratching, swatting, or spraying will dislodge them.” John Campbell (1982).

“After 4 decades of severe criticism, the ritual of null hypothesis significance testing -- mechanical dichotomous decisions around a sacred .05 criterion -- still persists.” Jacob Cohen (1994).

“Statistical significance testing retards the growth of scientific knowledge; it never makes a positive contribution.” Schmidt and Hunter (1997).

“Logically and conceptually, the use of statistical significance testing in the analysis of research data has been thoroughly discredited.” Schmidt and Hunter (1997), quoted by Levine et al. (2008).

“D. Anderson, Burnham and W. Thompson (2000) recently found more than 300 articles in different disciplines about the indiscriminate use of NHST, and W. Thompson (2001) lists more than 400 references on this topic. [...] After review of the debate about NHST, I argue that the criticisms have sufficient merit to support the minimization or elimination of NHST.” Rex B. Kline (2004).

“Our unfortunate historical commitment to significance tests forces us to rephrase good questions in the negative, attempt to reject those nullities, and be left with nothing we can logically say about the questions.” Killeen (2005), quoted by Levine et al. (2008).

“If these arguments are sound, then the continuing popularity of significance tests in our peer-reviewed journals is at best embarrassing and at worst intellectually dishonest.” Lambdin (2012).

“Many scientific disciplines either do not understand how seriously weird, deficient, and damaging is their reliance on null hypothesis significance testing (NHST), or they are in a state of denial.” Geoff Cumming, in his open review of our alt.chi paper.

“A ritual is a collective or solemn ceremony consisting of actions performed in a prescribed order. It typically involves sacred numbers or colors, delusions to avoid thinking about why one is performing the actions, and fear of being punished if one stops doing so. The null ritual contains all these features.” Gerd Gigerenzer (2015).

“In the post p<0.05 era, scientific argumentation is not based on whether a p-value is small enough or not. Attention is paid to effect sizes and confidence intervals. Evidence is thought of as being continuous rather than some sort of dichotomy.” Ron Wasserstein (2016), executive director of the American Statistical Association.

# Links

To learn more about issues with NHST and how they can be easily addressed with interval estimates, here are two good starting points that will only take 25 minutes of your time in total.

- (Cumming, 2011) Significant Does not Equal Important: Why we Need the New Statistics (radio interview).
  *A short podcast accessible to a wide audience. Focuses on the understandability of p values vs. confidence intervals. In an ideal world, this argument alone should suffice to convince people to whom usability matters.*
- (Cumming, 2013) The Dance of p values (video).
  *A nice video demonstrating how unreliable p values are across replications. Methodologist and statistical cognition researcher Geoff Cumming has communicated a lot about this particular problem, although many other problems were covered in earlier literature (references below). An **older version** of this video was the main source of inspiration for our alt.chi paper.*

# Reading List

These are suggested initial readings. They are often quite accessible and provide good coverage of the problems with NHST, as well as their solutions.

- (Cohen, 1994) The Earth is Round (p < 0.05).
  *Perhaps the most famous paper criticizing NHST.*
- (Loftus, 1993) A picture is worth a thousand p values: On the irrelevance of hypothesis testing in the microcomputer age.
  *Argues for using figures with means and standard errors (or CIs) instead of NHST. Illustrates with two nice examples.*
- (Schmidt and Hunter, 1997) Eight common but false objections to the discontinuation of significance testing in the analysis of research data.
  *Rebuts common arguments for keeping NHST. Among other things, explains why increasing sample size and power will not work as a solution, why educating researchers to minimize NHST misuse won't work either, and why dichotomous accept/reject hypothesis testing is futile. Recommends using confidence intervals instead and emphasizing meta-analysis. This is a very convincing paper, but be warned that it is sometimes quoted as making excessive claims.*
- (Lambdin, 2012) Significance tests as sorcery: Science is empirical—significance tests are not.
  *Another strong criticism of NHST. Summarizes arguments against NHST very nicely, and provides many references.*
- (Giner-Sorolla, 2012) Science or art? How aesthetic standards grease the way through the publication bottleneck but undermine science.
  *An excellent paper that explains how the publication system forces us to trade off good science for nice stories.*
- (Cumming and Finch, 2005) Inference by Eye: Confidence Intervals and How to Read Pictures of Data.
  *Clarifies how to read confidence intervals and explains their equivalence with NHST.*
- (Cumming, 2014) The New Statistics: Why and How.
  *Covers many topics on how to do good empirical research. Highly recommended as a general reference. The proposed guidelines have recently been adopted by the **editorial committee of the journal Psychological Science**.*
- (Cumming and Calin-Jageman, 2016) Introduction to the New Statistics: Estimation, Open Science, and Beyond.
  *The first introductory statistics textbook to use an estimation approach from the start. Also discusses open science and explains NHST so that students can understand published research. Very easy to read, with plenty of exercises.*

# More Readings

Here are other references, some of which are rather accessible, while others may be a bit technical for non-statisticians like most of us. It is best to skip the difficult parts and return to them later. I have plenty of other references that I won't have time to add here; for more, see my book chapter.

- (Kline, 2004) What's Wrong With Statistical Tests--And Where We Go From Here.
  *A broad overview of the problems with NHST and of the publications questioning significance testing (~300 since the 1950s). Recommends downplaying or eliminating NHST, reporting effect sizes with confidence intervals instead, and focusing on replication.*
- (Amrhein et al., 2017) The earth is flat (p>0.05): Significance thresholds and the crisis of unreplicable research.
  *An up-to-date and exhaustive review of arguments against the use of a significance threshold with p-values. The authors are not for banning p-values, but provide a very strong and convincing critique of binary significance, with many references. Still at the pre-print stage.*
- (Gigerenzer, 2004) Mindless statistics.
  *Another (better) version of Gigerenzer's paper **The Null Ritual - What You Always Wanted to Know About Significance Testing but Were Afraid to Ask**. Explains why the NHST ritual is a confusing combination of Fisher's and Neyman/Pearson's methods, why few people understand p, the problems with meaningless nil hypotheses and with the use of "post-hoc alpha values", and the pressures put on students by researchers and on researchers by editors. An important paper, although not a very easy read.*
- (Ziliak and McCloskey, 2009) The Cult of Statistical Significance.
  *A harsh criticism of NHST, from the authors of the book "The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives". Nicely explains why effect size is more important than precision (a term that refers to both p values and the widths of confidence intervals).*
- (Gelman, 2013) Interrogating p-values.
  *Explains towards the end why p-values are inappropriate even for making decisions, and why effect sizes should be used instead.*
- (Cumming, 2008) p Values Predict the Future Only Vaguely, but Confidence Intervals Do Much Better.
  *Examines the unreliability of p values and introduces p intervals. For example, when one obtains p = .05, there is an 80% chance that the next replication yields p inside [.00008, .44], and a 20% chance that it falls outside.*
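The unreliability of p values across replications is easy to see by simulation. Below is a minimal sketch, assuming a one-sample z-test with known standard deviation and a hypothetical true effect of 0.5 SD with n = 30 (parameters chosen for illustration only, roughly 78% power): even though every simulated experiment is identical, p varies over several orders of magnitude.

```python
import math
import random

def p_value_two_sided(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def simulate_replication(true_effect, n, rng):
    """One simulated experiment: n observations, z-test of the mean against zero."""
    sample = [rng.gauss(true_effect, 1) for _ in range(n)]
    mean = sum(sample) / n
    z = mean * math.sqrt(n)  # known sd = 1, so the standard error is 1/sqrt(n)
    return p_value_two_sided(z)

rng = random.Random(42)
ps = [simulate_replication(true_effect=0.5, n=30, rng=rng) for _ in range(1000)]
print(min(ps), max(ps))  # p ranges from tiny to clearly "non-significant"
```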

## Effect size

Effect size does not mean complex statistics. If you report mean task completion times, or differences in mean task completion times, you're reporting effect sizes.

- (Baguley, 2009) Standardized or simple effect size: What should be reported?
  *Explains why it is often preferable to report effect sizes in the original units rather than computing standardized effect sizes like Cohen's d.*
- (Dragicevic, 2012) My Technique is 20% Faster: Problems with Reports of Speed Improvements in HCI.
  *Explains why terms such as "20% faster" are confusing.*
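To make the point above concrete, here is a minimal sketch with made-up completion times: the unstandardized effect size is just a difference in seconds, while the standardized version (Cohen's d with a pooled SD) is unit-free but harder to interpret for readers who think in seconds.

```python
import math
import statistics

# Hypothetical completion times (seconds) for two techniques
old = [11.2, 12.8, 10.5, 13.1, 11.9, 12.4, 10.9, 12.0]
new = [9.4, 10.1, 8.8, 10.9, 9.9, 10.4, 9.1, 10.0]

# Unstandardized effect size: a difference in seconds, directly meaningful
diff = statistics.mean(old) - statistics.mean(new)

# Standardized effect size (Cohen's d with a pooled SD): unit-free,
# but one step removed from the quantity readers actually care about
pooled_sd = math.sqrt((statistics.variance(old) + statistics.variance(new)) / 2)
d = diff / pooled_sd

print(f"{diff:.1f} s faster (Cohen's d = {d:.1f})")
```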

## Bootstrapping

Bootstrapping is an easy method for computing confidence intervals for many kinds of distributions. It would have been too computationally intensive to be practical in the past, but now we have this thing called a computer. Oh, and it's non-deterministic — if this disturbs you, then maybe take another look at our alt.chi paper!

- Confidence Intervals Using Bootstrapping: Statistics Help (video).
  *A video explaining bootstrapping.*
- Confidence intervals for the mean using bootstrapping.
  *A short explanation of how to compute bootstrap CIs in R (provides several methods).*
- BootES: An R Package for Bootstrap Confidence Intervals on Effect Sizes.
  *A scientific article introducing a powerful R package for computing all sorts of CIs. Discusses the theoretical support behind bootstrapping while remaining very accessible (for a psychology audience). Advocates the use of contrasts instead of ANOVAs.*
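The core of the method is simpler than the references above may suggest. As a minimal sketch (percentile bootstrap only, with hypothetical task completion times; real analyses would typically use a dedicated package such as BootES and a bias-corrected variant):

```python
import random
import statistics

def bootstrap_ci_mean(data, n_boot=10000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean: resample with replacement,
    recompute the mean each time, then take the central quantiles."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(data, k=len(data))) for _ in range(n_boot)
    )
    lo = means[int((1 - level) / 2 * n_boot)]
    hi = means[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi

# Hypothetical task completion times in seconds
times = [12.1, 9.8, 14.3, 11.0, 10.5, 13.7, 9.2, 12.9, 11.8, 10.1]
lo, hi = bootstrap_ci_mean(times)
print(f"mean = {statistics.mean(times):.1f} s, 95% CI [{lo:.1f}, {hi:.1f}]")
```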

## Statistical errors

In addition to having been greatly over-emphasized, the notion of type I error is misleading. One issue is that the nil hypothesis of no effect is almost always false, so it is practically impossible to commit a type I error. At the same time, many NHST critics stress the absurdity of testing nil hypotheses, but this is rarely what people (especially in HCI) are really doing. Some articles on this page clarify this; here are more articles specifically on the topic of statistical errors. More to come.

- (Gelman, 2004) Type 1, type 2, type S, and type M errors (blog post).
  *Coins type S errors (errors of sign) and type M errors (errors of magnitude). Points to a more technical article.*
- (Pollard, 1987) On the Probability of Making Type I Errors.
  *Explains why the alpha level does not give you the probability of making a type I error.*

## Multiple comparisons

How to deal with multiple comparisons? Use a simple experiment design, pick meaningful comparisons in advance, and use interval estimates.

- (Perneger, 1998) What’s wrong with Bonferroni adjustments.
  *Adjustment procedures for multiple comparisons are not uncontroversial. One issue is that they dramatically inflate type II errors. The author recommends simply reporting what we analyzed and why, and relying on our (and our readers') judgment. This judgment is greatly facilitated by the use of an estimation approach.*
- (Smith et al., 2002) The High Cost of Complexity in Experimental Design and Data Analysis -- Type I and Type II Error Rates in Multiway ANOVA.
  *This paper complements the one above: to avoid the problem of multiple comparisons, it is best to design simpler experiments and decide in advance what is worth comparing. Related to the pre-specification of analyses (Cumming, 2014) and to contrast analysis.*
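The advice above can be sketched in a few lines. Assuming made-up task times for a baseline and two hypothetical techniques, instead of running all pairwise NHST comparisons we pre-specify the two comparisons of interest and report each as an estimate with an interval (normal-approximation CIs here, for brevity):

```python
import math
import statistics

def diff_ci(a, b, z=1.96):
    """Difference in means between two samples, with a normal-approximation 95% CI."""
    d = statistics.mean(a) - statistics.mean(b)
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return d, d - z * se, d + z * se

# Hypothetical task times (s) for a baseline and two new techniques
baseline = [14.2, 13.5, 15.1, 14.8, 13.9, 14.5]
tech_a = [12.1, 11.8, 12.9, 12.4, 11.5, 12.6]
tech_b = [13.8, 14.1, 13.2, 14.6, 13.5, 14.0]

# Only the comparisons planned in advance, each reported as an estimate
results = {}
for name, sample in [("A", tech_a), ("B", tech_b)]:
    d, lo, hi = diff_ci(baseline, sample)
    results[name] = (d, lo, hi)
    print(f"baseline - {name}: {d:.1f} s, 95% CI [{lo:.1f}, {hi:.1f}]")
```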

## Contrast analysis

*Still looking for an accessible introduction. Suggestions welcome.*

## Hypotheses

HCI is a young field and rarely has deep hypotheses to test. If you only have superficial research questions, frame them in a quantitative manner (e.g., is my technique better, and if so, to what extent?) and answer them using estimation. If you have something deeper to test, be careful how you do it. It's hard.

- (Meehl, 1967) Theory-testing in Psychology and Physics: A Methodological Paradox.
  *Contrasts substantive theories with statistical hypotheses, and explains why testing the latter provides only very weak confirmation of the former. Also clarifies that point nil hypotheses in NHST are really used for testing directional hypotheses (see also the section 'Statistical errors' above). Paul Meehl is an influential methodologist and has written other important papers, but his style is a bit dry. I will try to add more references from him to this page later on.*

## Bayesian statistics

- (Kruschke, 2010) Doing Bayesian Data Analysis: A Tutorial with R and BUGS.
  *This book has been recommended to me several times as an accessible introduction to Bayesian methods, although I have not read it yet. If you are willing to invest the time, Bayesian (credible) intervals are worth studying, as they have several advantages over confidence intervals.*
- (McElreath, 2015) Statistical Rethinking.
  *This book is recommended by Matthew Kay for its very accessible introduction to Bayesian analysis. Pick this one or Kruschke's.*
- (Kruschke and Liddell, 2015) The Bayesian New Statistics: Two historical trends converge.
  *Explains why estimation (rather than null hypothesis testing) and Bayesian (rather than frequentist) methods are complementary, and why we should do both.*

## Other papers cited in the BELIV talk

- (Dawkins, 2011) The Tyranny of the Discontinuous Mind.
  *A short but insightful article by Richard Dawkins that explains why dichotomous thinking, or more generally categorical thinking, can be problematic.*
- (Rosenthal and Gaito, 1964) Further evidence for the cliff effect in interpretation of levels of significance.
  *A short summary of early research in statistical cognition showing that people trust p-values that are slightly below 0.05 much more than p-values that are slightly above. This seems to imply that Fisher's suggestion to minimize the use of decision cut-offs and simply report p-values as a measure of strength of evidence may not work.*

# Papers (somewhat) in favor of NHST

To fight confirmation bias, here is an attempt to provide credible references that argue against the discontinuation of NHST. Many are valuable in that they clarify what NHST is really about.

- (Abelson, 1995) Statistics as Principled Argument (Chapter 1).
  *A book lauded by (Cairns, 2007) below. The link points to a free version of the first chapter. Focuses on the arguments and stories that stats serve to build. Acknowledges that NHST is only a starting point (the book was reviewed by Jacob Cohen). Quite interesting and insightful.*
- (Abelson, 1997) A Retrospective on the Significance Test Ban of 1999 (If There Were no Significance Tests, they Would be Invented).
  *This paper is meant as a defense of NHST, and uses a dystopian scenario to achieve this (1999 was after the article's publication). Both the presentation of NHST (which clarifies that NHST typically serves to test directional, not nil, hypotheses) and the discussion of its misuses are excellent. However, the defense of NHST itself is rather unconvincing: if confidence intervals must be provided, it is not clear why p-values should be kept.*
- (Levine et al., 2008) A Critical Assessment of Null Hypothesis Significance Testing in Quantitative Communication Research.
  *Although this paper focuses on summarizing known issues with NHST, it does not argue for its discontinuation. It contains a very good paragraph summarizing NHST.*
- (Levine et al., 2008) A Communication Researchers' Guide to Null Hypothesis Significance Testing and Alternatives.
  *The companion article to the reference above. Again, it says NHST should be complemented with effect size estimates and confidence intervals, but does not explain why NHST should be kept. Discusses "effect tests" and equivalence testing.*

# Papers against confidence intervals

For several decades Bayesians have (more or less strongly) objected to the use of confidence intervals, but their claim that "CIs don't solve the problems of NHST" is misleading. Confidence intervals and credible (Bayesian) intervals are both interval estimates, thus a great step forward compared to dichotomous testing. Credible intervals arguably rest on more solid theoretical foundations and have a number of advantages, but they also have a higher switching cost. By claiming that confidence intervals are invalid and should never be used, Bayesian idealists may just give NHST users an excuse for sticking to their ritual.

- (Morey et al., in press) The Fallacy of Placing Confidence in Confidence Intervals.
  *This is the most recent paper on the topic, with many authors and the boldest claims. It is very informative and shows that CIs can behave in pathological ways in specific situations. Unfortunately, it is hard to assess the real consequences, i.e., how far off we are when we interpret CIs as ranges of plausible parameter values in typical HCI user studies. Also see the long discussion thread on an older revision, especially toward the end.*

# Papers from the HCI Community (before 2015)

There have been several papers in HCI questioning current practices, although (until recently) none of them called for a discontinuation or even a minimization of NHST, none mentioned the poor replicability of p-values, and none provided a satisfactory treatment of interval estimation.

- (Cairns, 2007) HCI... not as it should be: inferential statistics in HCI research.
  *A good summary of the sloppy use of NHST methods in HCI, with a nice explanation of NHST in the introduction and an interesting discussion at the end. Overall in favor of NHST, but duly acknowledges that it is not the only method.*
- (Wilson et al., 2011) RepliCHI -- CHI should be replicating and validating results more.
  *The CHI workshop on replication that started the RepliCHI initiative. Replication is important, and as explained in (Cumming, 2012), estimation promotes replication and makes meta-analyses possible; NHST does neither. I have been told that the RepliCHI initiative has been discontinued.*
- (Robertson, 2011) Stats: We're Doing It Wrong (blog post).
  *Covers problems related to statistical assumptions and the need to report effect sizes, although it does not discuss the deeper problems with NHST.*
- (Kaptein and Robertson, 2012) Rethinking statistical analysis methods for CHI.
  *Reviews a few common problems with NHST, but does not cover how some of these issues can be addressed by taking an estimation approach.*
- (Hornbæk et al., 2014) Is Once Enough? On the Extent and Content of Replications in Human-Computer Interaction.
  *Helpfully clarifies what a replication is in HCI, what it is good for, and to what extent it has been used in the past.*

# Examples of HCI and VIS studies without p-values

**CHI 2013**- The study by Jansen, Dragicevic and Fekete Evaluating the Efficiency of Physical Visualizations makes only limited use of NHST and comes with a page with replication material.

**2014 theses**- Yvonne Jansen's PhD thesis Physical and Tangible Information Visualization analyzes all experimental data using estimation (Chapters 5 and 6). The PhD thesis has no single p-value and has received an
**honorable mention**at the Gilles Kahn PhD thesis awards.

- Yvonne Jansen's PhD thesis Physical and Tangible Information Visualization analyzes all experimental data using estimation (Chapters 5 and 6). The PhD thesis has no single p-value and has received an
**VIS 2014**- The study by Chevalier, Dragicevic and Franconeri The Not So Staggering Effect of Staggered Animations makes no use of p-values, separates planned from exploratory analyses, and has a page with extensive replication material.
- The study by Talbot, Setlur, and Anand Four Experiments on the Perception of Bar Charts makes no use of p-values and has replication material online. Their paper replicates a well-known 1984 study by Cleveland and McGill that also used estimation. Cleveland and McGill used bootstrap confidence intervals only a few years after Efron introduced them.

**CHI 2015**- The study by Willett, Jenny, Isenberg and Dragicevic Lightweight Relief Shearing for Enhanced Terrain Perception on Interactive Maps has received a
**best paper award**despite having not a single p-value. It has replication material online. - The study by Wacharamanotham, Subramanian, Volkel and Borchers Statsplorer: Guiding Novices in Statistical Analysis also bases all of its analyses on confidence intervals and has no p-value.
- The study by Goffin, Bezerianos, Willett and Isenberg (published as an extended abstract) Exploring the Effect of Word-Scale Visualizations on Reading Behavior uses estimation and plots instead of p-values.

- The study by Willett, Jenny, Isenberg and Dragicevic Lightweight Relief Shearing for Enhanced Terrain Perception on Interactive Maps has received a
**VIS 2015**

- The study by Yu, Efstathiou, Isenberg and Isenberg CAST: Effective and Efficient User Interaction for Context-Aware Selection in 3D Particle Clouds also has no p-value.
- The study by Jansen and Hornbæk A Psychophysical Investigation of Size as a Physical Variable has no p-value and shares replication/reanalysis material online.
- The study by Kay and Heer Beyond Weber’s Law: A Second Look at Ranking Visualizations of Correlation uses Bayesian estimation and has only a single p-value. All analyses are available online. The paper has received a best paper **honorable mention award**.
- The study by Boy, Eveillard, Detienne and Fekete Suggested Interactivity: Seeking Perceived Affordances for Information Visualization has no p-value either.

**2016 theses**

- Chat Wacharamanotham's PhD thesis Drifts, Slips, and Misses: Input Accuracy for Touch Surfaces re-analyzes all its published studies using an estimation paradigm. It discusses the differences with NHST.
- Oleksandr Zinenko's PhD thesis Interactive Program Restructuring analyzes all experimental data using estimation, and has an appendix explaining and justifying the statistical methods used.

**CHI 2016**

- The study by Zhao, Glueck, Chevalier, Wu and Khan Egocentric Analysis of Dynamic Networks with EgoLines uses estimation and descriptive statistics (no p-value), and has received a best paper **honorable mention award**.
- The study by Matejka, Glueck, Grossman, and Fitzmaurice The Effect of Visual Appearance on the Performance of Continuous Sliders and Visual Analogue Scales uses estimation only and has received a **best paper award**.
- The study by Asenov, Hilliges and Müller The Effect of Richer Visualizations on Code Comprehension also has no p-value.

**GI 2016**

- The study by Thudt, Walny, Perin, Rajabiyazdi, MacDonald, Vardeleon, Greenberg, and Carpendale Assessing the Readability of Stacked Graphs reports all results using estimation.

**CG&A 2016**

- The study by Perin, Boy and Vernier Using Gap Charts to Visualize the Temporal Evolution of Ranks and Scores reports all results using estimation.

**AVI 2016**

- The study by Le Goc, Dragicevic, Huron, Boy, and Fekete A Better Grasp on Pictures Under Glass: Comparing Touch and Tangible Object Manipulation using Physical Proxies reports its results using estimation.

**VIS 2016**

- The study by Dimara, Bezerianos and Dragicevic The Attraction Effect in Information Visualization uses estimation only and received a best paper **honorable mention award**. It uses planned analyses, shares experimental material online, and has a companion paper with negative results.

**CHI 2017**

- The study by Dimara, Bezerianos and Dragicevic Narratives in Crowdsourced Evaluation of Visualizations: A Double-Edged Sword? has no p-value and has experimental material online.
- The study by Besançon, Issartel, Ammi and Isenberg Mouse, Tactile, and Tangible Input for 3D Manipulation makes no use of p-values and uses plots with confidence intervals instead.
- The study by Besançon, Ammi and Isenberg Pressure-Based Gain Factor Control for Mobile 3D Interaction using Locally-Coupled Devices makes no use of p-values and uses plots with confidence intervals instead. It received a best paper **honorable mention award**.
- The study by Boy, Pandey, Emerson, Satterthwaite, Nov, and Bertini Showing People Behind Data: Does Anthropomorphizing Visualizations Elicit More Empathy for Human Rights Data? reports its results using confidence intervals.
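Many of the papers listed here replace significance tests with plots of means and 95% confidence intervals. For a single sample, such an interval is just the mean plus or minus a t critical value times the standard error. A minimal sketch of that computation (the data are invented, and the critical value is hard-coded because the Python standard library has no t distribution):

```python
import math
import statistics

def t_ci_mean(data, t_crit):
    """95% CI for a mean: mean ± t* · SE.

    t_crit must be the two-sided 95% critical value for len(data) - 1
    degrees of freedom (look it up in a t table or a stats library).
    """
    m = statistics.mean(data)
    se = statistics.stdev(data) / math.sqrt(len(data))
    return m - t_crit * se, m + t_crit * se

# Hypothetical measurements; t* = 2.365 for 7 degrees of freedom (n = 8).
times = [1.2, 0.9, 1.5, 1.1, 1.3, 0.8, 1.4, 1.0]
low, high = t_ci_mean(times, t_crit=2.365)
```

In a paper, the interval would typically be drawn as an error bar around the mean rather than reported as numbers, so readers can compare conditions at a glance.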

**IHM 2017**

- Emmanuel Dubois and Marcos Serrano published three studies using estimation only at the French-speaking conference IHM 2017. One study, co-authored with Perelman, Picard, and Derras, received the **best paper award**. The other two studies were co-authored by Raynal, and by Cabric.

**VIS 2017**

- The study by Walny, Huron, Perin, Wun, Pusch, and Carpendale Active Reading of Visualizations uses planned analyses, reports all results using estimation and has experimental material online.
- The study by Dragicevic and Jansen Blinded with Science or Informed by Charts? A Replication Study uses planned analyses, reports all results using estimation and has experimental material online.
- The study by Perin, Wun, Pusch, and Carpendale Assessing the Graphical Perception of Time and Speed on 2D+Time Trajectories uses planned analyses, reports all results using estimation and has experimental material online.
- The study by Hullman, Kay, Kim, and Shrestha Imagining Replications: Graphical Prediction & Discrete Visualizations Improve Recall & Estimation of Effect Uncertainty reports all results using Bayesian estimation and has experimental material online.
- The study by Felix, Bertini, and Franconeri Taking Word Clouds Apart: An Empirical Investigation of the Design Space for Keyword Summaries uses planned analyses and reports all results using estimation.
- The study by Dimara, Bezerianos and Dragicevic Conceptual and Methodological Issues in Evaluating Multidimensional Visualizations for Decision Support uses planned analyses and reports all results using estimation.
- The study by Wang, Chu, Bao, Zhu, Deussen, Chen, and Sedlmair EdWordle: Consistency-preserving Word Cloud Editing reports its results using estimation.
- The study by Valdez, Ziefle, and Sedlmair Priming and Anchoring Effects in Visualization reports most of its results using estimation.

Contact us if you know of more papers. We'll be happy to add yours! Make sure the PDF is available online and not behind a paywall.

The papers we co-author are by no means meant to be prescriptive; we are still refining our methods and learning. We are grateful to Geoff Cumming for his help and support.

# Contact

List of people currently involved in this initiative:

- Pierre Dragicevic, Inria
- Fanny Chevalier, Inria
- Stéphane Huot, Inria
- Yvonne Jansen, CNRS
- Wesley Willett, University of Calgary
- Charles Perin, City University of London
- Petra Isenberg, Inria
- Tobias Isenberg, Inria
- Jeremy Boy, United Nations Global Pulse
- Enrico Bertini, New York University
- Lonni Besançon, Inria & LIMSI
- Email us if you wish to contribute!

# License

All material on this page is CC-BY-SA unless specified otherwise. You can reuse or adapt it provided you link to www.aviz.fr/badstats.