NHST at CHI Play - A few months later

So earlier this year, April Tyack, Elisa Mekler, and I submitted my first ever HCI paper, on the state of Null Hypothesis Significance Testing (NHST) and Questionable Research Practices (QRPs) at CHI PLAY. At the end of July we published the accepted preprint, and two weeks ago, in early November, it was officially published at CHI PLAY 2020.

Because the conference was virtual, there is also a 6-minute video presentation of the paper1: https://www.youtube.com/watch?v=QMrDEs5_GR8

If you haven’t read the paper, I recommend having a look, because in this post I want to talk a bit about the behind-the-scenes of the paper, discuss two papers that were released in the meantime, and pick up some discussions from the conference.

CHI PLAY Features Really Heterogeneous Research and That Is Great

Originally, when drafting the pre-registration of the paper, we wanted to get a complete look at the use of inferential statistics at CHI PLAY, not just NHST. After all, the path to knowledge is less a straight path and more the Labyrinth of Minos.

However, once we had read through the corpus and refined our codebook, we found that there was seemingly not a single case of inferential statistics using anything but dichotomous, Neyman-Pearson hypothesis testing (also known as NHST). Later, of course, we realized that was wrong: there was one paper [4] that employed an estimation approach.

This did not come as too much of a surprise. For all intents and purposes, in HCI and the wider behavioral sciences, likelihood-based or Bayesian approaches are rather exotic (from what I can tell, Bayes is a bit more common in physics, for example). They are also much harder to use, due to a lack of established best practices and rules of thumb (such as the rule that p < 0.05 means an effect is significant). What DID surprise me instead was how many papers did NOT use any hypothesis testing in the first place. It was a pleasant realization to see quite a diverse offering of research, from case studies to purely qualitative work to purely descriptive statistics.

Now, while for many people this might not be a big deal, I was educated with the idea that unless you test a hypothesis, you did not do a real study. Qualitative methods were treated somewhat as a red-headed stepchild compared to other methods when I was getting my degree in psychology. So it was super refreshing to see that this misguided ideal of “real study = p-values” is not a problem at CHI PLAY.

…though it did mean that we had to write many more asterisks into our paper, as talking about “research at CHI PLAY”, or even “quantitative research at CHI PLAY”, just did not narrow our scope down sufficiently.

Let’s talk more about ways of doing research other than testing hypotheses:

Scheel et al. 2020 [6] and Non-Confirmatory Work

So this is a big one.

In our paper we bring up the differentiation between exploratory and confirmatory research, namely that NHST is not really designed for exploratory research but instead works best when testing precise and risky hypotheses. Testing merely whether A is different from B with NHST is quite useless; NHST becomes useful when we test, say, whether A is higher than B by 6.42 points. In addition, how informative a significant result is depends heavily on the pre-test likelihood of our hypotheses: after all, getting a false positive result is impossible when an effect truly exists, so the lower the prior probability that our hypothesis is true, the larger the share of significant results that will be false positives.
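To make that last point concrete, here is a back-of-the-envelope calculation (a minimal sketch in Python; the 5% alpha, 80% power, and the example priors are conventional values I picked for illustration, not numbers from our paper):

```python
# Back-of-the-envelope illustration: out of all "significant" results, how many
# are false positives, depending on the prior probability that the tested
# hypothesis is true? Alpha and power are the conventional 5% and 80%.
alpha, power = 0.05, 0.80

for prior in (0.50, 0.25, 0.10):            # share of tested hypotheses that are true
    true_hits = prior * power               # real effects that come out significant
    false_hits = (1 - prior) * alpha        # null effects that come out "significant"
    false_share = false_hits / (true_hits + false_hits)
    print(f"prior = {prior:.2f}: {false_share:.0%} of significant results are false positives")
```

Under these assumptions, a significant result for a long-shot hypothesis (10% prior) is a false positive more than a third of the time, while the very same p < .05 for a well-grounded hypothesis is far more trustworthy.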

So we only want to perform frequentist, confirmatory significance tests if we can formulate a statistical hypothesis with known effect sizes. So far, so rarely achievable in actual, boots-on-the-ground HCI games research.

However, the recently released preprint by Scheel et al. [6] goes deeper into this topic and very clearly argues that testing a hypothesis should be treated more like a once-in-a-PhD-project event than a natural part of every paper.

According to Scheel et al., current confirmatory research is not fit to fully utilize the statistical rigor developed during the replication crisis, because…

  • we usually do not have hypotheses more precise than “not null” (e.g., that there will be a difference between conditions).

  • we rarely know the strength of the effect we test for (seriously: when was the last time you knew, before conducting a study, how large a Cohen’s d to expect?)

  • we are almost never working with a theory detailed enough to falsify.

If we take the theoretical foundation of our research seriously, not knowing the above points is a real problem. When performing a hypothesis test, we set ourselves up to test a hypothesis derived from a theory, in order to either corroborate the theory or falsify it. So if the theory is not precise enough, because, for example, it does not tell us anything about the effect size to expect, we will be hard-pressed to make the right inference at the end. In this example: would a non-significant result mean the theory is falsified? Or is the effect just smaller than expected and our study was not sensitive enough?
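To make the sensitivity point concrete, here is a minimal sketch (my own illustration using statsmodels; the effect sizes are just the conventional small/medium/large benchmarks, not estimates from any theory) of how strongly the required sample size depends on the effect size we assume:

```python
# How many participants per group does a two-sample t-test need at alpha = .05
# and 80% power? The answer depends entirely on the assumed effect size.
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()
for d in (0.2, 0.5, 0.8):  # conventional "small", "medium", "large" benchmarks
    n_per_group = power_analysis.solve_power(effect_size=d, alpha=0.05, power=0.80)
    print(f"Cohen's d = {d}: ~{n_per_group:.0f} participants per group")
```

The required sample sizes differ by more than a factor of ten between the small and the large assumed effect. If the theory cannot tell us which one to expect, a non-significant result stays ambiguous: a falsification, or merely an underpowered study.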

So what is Scheel et al.’s answer to this? They argue that instead of going ahead with studies that do not have enough theoretical support to further our understanding of a given theory (by upholding or falsifying it), we should “strengthen the derivation chain” by performing non-confirmatory groundwork. These works could be case studies, observations, meta-studies, literature reviews, and so on. The goal of these works is to build up the foundation of the subsequent confirmatory test by increasing the pre-test likelihood of our hypothesis being true. I.e., the more precisely my theory predicts an effect, the more likely it is that we will find a significant effect for the derived hypothesis in the end.

This also has the added benefit that we are less likely to stumble upon non-significant results, which are still (considered) more difficult to publish.

Overall, while the paper by Scheel et al. is targeted at and talking about psychologists, many of the same problems apply to HCI. And in my opinion, the whole paper is a great read regardless of whether you’re a psychologist or not! I highly recommend it, as it makes a very good argument for dismantling the way we conduct studies as singular endeavors and replacing it with targeted efforts to systematically construct and test theories.

Cockburn et al. 2020 [2] and The Systemic Nature of The Problem

Cockburn et al.’s 2018 paper “HARK No More” [3] was a big inspiration for our paper, so when “Threats of a Replication Crisis in Empirical Computer Science” [2] was published around the same time our preprint was released2, it quickly popped up on my radar… and honestly, one of the feelings I had after reading it was along the lines of “goddammit, we should have done it that way!”. Of course, after another read it became clear that the two papers have a slightly different focus.

Cockburn et al. very efficiently introduce the Replication Crisis, its causes, the measures meant to mitigate it, and its potential prevalence in HCI. With this wider approach, the authors focus much more on systemic change than we do. In general, I wholeheartedly endorse their recommendations, especially those for more transparency and more education.

Transparency, in particular, is something I really believe can increase trust in and the quality of research without much extra work, and I will talk about that a bit more further down.

And, obviously, the education of reviewers and authors is crucial; this was the goal of our paper, after all. After reading through the corpus, I believe the problem is one of naiveté. Many researchers, if they went through the joys (or pains, your pick) of statistical education at all, will have had lectures and seminars that treat inferential statistics (or, let’s face it, the p-value) as a toolbox. A toolbox that can be used without thinking: if you have a nail, you use a hammer; if you have two groups, you use a t-test. Without awareness of the philosophical underpinnings and the probabilistic nature of inferential statistics, there can be no awareness of why the Replication Crisis is happening. After all, why should one care about HARKing when the significant difference is right there in the data? The difference doesn’t suddenly, falsely, appear when I decide to run an unplanned t-test, does it?3
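It doesn’t, of course (see footnote 3), but the chance of finding some “significant” difference rises quickly with every additional unplanned test. Here is a toy simulation (my own illustration, not taken from either paper) that makes this visible:

```python
# Two groups that truly do not differ, but the researcher "tries" a t-test on
# five different outcome measures and reports whichever comes out significant.
# Repeated 10,000 times to estimate how often at least one test hits p < .05.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2020)
n_sims, n_measures, n_per_group = 10_000, 5, 30
found_something = 0

for _ in range(n_sims):
    group_a = rng.normal(size=(n_measures, n_per_group))  # no real effect anywhere
    group_b = rng.normal(size=(n_measures, n_per_group))
    p_values = [ttest_ind(group_a[i], group_b[i]).pvalue for i in range(n_measures)]
    found_something += min(p_values) < 0.05

print(f"At least one p < .05 in {found_something / n_sims:.0%} of these null studies")
```

With five independent measures, the share of null studies that yield at least one “significant” result lands near 1 - 0.95^5, i.e. roughly 23% rather than the nominal 5%.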

So I think that Cockburn et al. 2020 and our paper are rather good companion pieces. The former explains very well what the problem is and how publishers and the community can address it, while the latter hopefully serves early career researchers, and those not well-versed in statistical analysis, as a starting point for improving their own work.

But, to get a bit more into detail: Where do we go from here?

Where to go from here:

Transparency

So if you like this open science stuff and want to improve your research, the first step, I think, is committing to transparency: that is, Open Data, Open Materials, Open Analyses, and Open Access.

While the latter is hard for some, because our current publishing system puts a hefty price tag on it, publishing Open Access whenever possible should be the goal for every researcher. After all, access to knowledge is a human right [7]. Luckily, many universities have funds available for open access, and grants, too, increasingly include money for OA publishing.

Open Data/Analyses/Materials

Opening your data, analyses, and materials to the public, however, is mostly in your hands and, in my book, has two major advantages. First, it allows others (in the best case, reviewers) to double-check your analysis, find potential errors, and thereby allow you to fix them. Following from this, Open Data, Materials, and Analyses increase trust in your research, as any mistakes that happen during the research process can be detected and fixed (to some extent, at least). The second advantage is that it allows other researchers to build on your work more easily. I.e., your work can more readily serve as a foundation for subsequent work, gather citations, and have more impact.

I always thought this call to action was straightforward. But when I asked about the open data policy of the new PACM model of CHI PLAY, a discussion came up about who we should require to open their data, and… yeah, I guess there are caveats to this.

Interestingly, I neither asked for people to be required to publish their data, nor said directly that a blanket solution should be adopted. In fact, I only asked whether any push towards Open Science was planned. This leads me to believe that I hit some preconceptions I was not aware of, so I would like to address some of them:

Who should publish their data? I would argue that quantitative data should always be published, doubly so if it is used for inferential statistics: it is easy enough to anonymize, and reviewers absolutely need it to judge the quality of submitted work4.
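For most questionnaire-style data sets, anonymizing can be as simple as dropping direct identifiers and recoding participant IDs before uploading. A minimal sketch (with entirely hypothetical file and column names) could look like this:

```python
# Prepare a quantitative data set for sharing by stripping identifying columns
# and replacing participant IDs with arbitrary codes.
import pandas as pd

df = pd.read_csv("raw_data.csv")                  # hypothetical raw export
df = df.drop(columns=["name", "email"])           # remove direct identifiers
codes = {pid: f"P{i:03d}" for i, pid in enumerate(df["participant_id"].unique())}
df["participant_id"] = df["participant_id"].map(codes)
df.to_csv("open_data.csv", index=False)           # this is the file that gets shared
```

Indirect identifiers (say, rare demographic combinations) can require more care, but that is a solvable problem rather than a reason not to share.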

Qualitative data is sometimes more difficult to anonymize and therefore trickier to make openly available. It has been done at CHI PLAY in the past, though (e.g., [8]), and when it is taken into account during data collection (especially when asking participants for consent), it can be done. I am not an expert in qualitative methods, so I do not know whether they can be “double-checked” in the same way that quantitative methods can. It has been argued that in some cases (e.g., when the study draws inferences from the coding) qualitative data should be accompanied by quality measures such as inter-rater reliability [5]. I would say that in those cases open data would also benefit the study. Either way, publishing qualitative data is beneficial for us as a community, as the data sets are often very rich and can answer many more questions than the original authors had in mind.
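Where inter-rater reliability is appropriate, computing and reporting it alongside the shared coding takes only a few lines. A hypothetical sketch (the codes are invented, and using scikit-learn’s Cohen’s kappa here is my own choice, not a recommendation from [5]):

```python
# Two coders apply the same codebook to eight excerpts; Cohen's kappa is one
# possible agreement measure to report next to the openly shared coding.
from sklearn.metrics import cohen_kappa_score

coder_1 = ["play", "frustration", "play", "social", "play", "social", "frustration", "play"]
coder_2 = ["play", "frustration", "social", "social", "play", "social", "play", "play"]

print(f"Cohen's kappa = {cohen_kappa_score(coder_1, coder_2):.2f}")
```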

Should we force people to publish data? Mhm. I would tend to say no. I think hard rules are of limited help and risk fostering resentment, especially in a community that is not super aware of why data sharing is important. I do, however, believe that we should make sharing data the explicit norm. If researchers don’t want to publish their data, they should explain why, and why we can still rely on the results.

Besides considerations of anonymity, there is the argument for limiting access to research data and materials due to commercial interests:

Sometimes it is inevitable that we, as researchers, have to use copyrighted materials. With Open Science in mind, it might sometimes be possible to get a license that allows distribution alongside the paper, but this can be expensive. If it is not possible to publish the materials, the licensed product should be correctly sourced, and materials should be shared that demonstrate the product working as described in the study. I.e.: if I license a video game for my study, I should provide a video of the game being played and cite it in a way that lets people acquire it for themselves if needed.

Still, I would argue that in many cases there are open-source options one could use instead. Does your study rely on SPSS for data analysis? Use JASP. Using the PENS to measure player experience? A: Can you tell me where you actually got it from, because I legitimately have no idea where to get it. B: Maybe use one of the publicly available questionnaires, like the UPEQ [1].

In any case, while a license might disallow sharing your materials openly, it should very much be possible to give reviewers limited access for reviewing purposes.

Though again: we should make sharing our materials the norm. If researchers can’t share something, they should explain why that is the case and provide as much information as they can.

This discussion assumes that researchers can’t share something because they are using a third-party product. This is fine, I think. Not perfect, but fine. BUT: if it is not possible to share a crucial part of a paper (not even with the reviewers) because some stakeholder wants to sell whatever the paper is testing, then I would argue: “No! This paper should not be accepted.”

This might be my idealized understanding of the scientific process, but I believe that in this case the paper is not making a contribution to the scientific community. It is just someone angling for a “scientifically proven” sticker for their advert. This, I believe, is antithetical to the idea of research, which is, inherently, about advancing our shared knowledge. It is also rather unfair to the other scientists whose free labor the publication process relies on.

In short: I believe Open Science should be the norm. If you can’t share something, that is okay, but clearly state why. One step to implement this mindset at CHI PLAY, without much effort, would be a formal requirement for a Data/Materials availability statement at the end of every paper: ask every author to write out what they can or cannot share, and why. This might be a small nudge, but given that many researchers might not have thought about Open Science before, it might be all they need.
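Such a statement could be as short as the following (my own wording, purely to illustrate the idea):

“Data and Materials Availability: The anonymized questionnaire data and all analysis scripts are available at [repository link]. The interview recordings are not shared because participants did not consent to their publication.”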

Education

The second step: Read.

I know it is annoying to try to read up on yet another increasingly complex research field, but there are some easy ways to get into it: I would highly recommend Daniël Lakens’ Coursera course as an easily consumable way to quickly improve your understanding of inferential statistics. Also, it is free.

Oh, and probably our paper, which cites a lot of good resources.

The third step: Spread the Word/Talk about it.

Open Science requires a holistic approach to research. I.e., when you plan on making your analysis code public, you should keep that in mind from the moment you first set up the script. Likewise, having to pay open access publishing fees is something to account for when writing the grant proposal. So keeping open science in the back of your head at all times, and working on making it the standard mindset in research, is worthwhile. My recommendation for achieving this: join or create a journal club.

ReproducibiliTea is a great project for doing just that, and chances are there is a chapter at your university already. We created our own chapter at Aalto, and it has proved to be a great resource for learning more about the philosophy of science, emerging problems, and new perspectives.

Conclusion

So what is the takeaway?

Open Science is cool, I guess.

Oh! And thinking about our research. That is cool as well! After all, we might have problems in our field, but we can fix them.

Yeah, I think this is good.

References

[1] Azadvar, A. and Canossa, A. 2018. UPEQ: Ubisoft perceived experience questionnaire: A self-determination evaluation tool for video games. Proceedings of the 13th International Conference on the Foundations of Digital Games (New York, NY, USA, Aug. 2018), 1–7.

[2] Cockburn, A. et al. 2020. Threats of a replication crisis in empirical computer science. Communications of the ACM. 63, 8 (2020), 70–79. DOI:https://doi.org/10.1145/3360311.

[3] Cockburn, A. et al. 2018. HARK No More: On the Preregistration of CHI Experiments. Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems - CHI ’18 (Montreal QC, Canada, 2018), 1–12.

[4] Dodero, G. et al. 2014. Towards Tangible Gamified Co-Design at School: Two Studies in Primary Schools. Proceedings of the First ACM SIGCHI Annual Symposium on Computer-Human Interaction in Play (New York, NY, USA, 2014), 77–86.

[5] McDonald, N. et al. 2019. Reliability and Inter-rater Reliability in Qualitative Research: Norms and Guidelines for CSCW and HCI Practice. Proceedings of the ACM on Human-Computer Interaction. 3, CSCW (Nov. 2019), 72:1–72:23. DOI:https://doi.org/10.1145/3359174.

[6] Scheel, A.M. et al. Why Hypothesis Testers Should Spend Less Time Testing Hypotheses. Perspectives on Psychological Science.

[7] Tennant, J.P. et al. 2016. The academic, economic and societal impacts of Open Access: An evidence-based review. F1000Research. 5, (2016), 1–55. DOI:https://doi.org/10.12688/f1000research.8460.1.

[8] Whitby, M.A. et al. 2019. “One of the baddies all along”: Moments that Challenge a Player’s Perspective. CHI PLAY ’19: Proceedings of the Annual Symposium on Computer-Human Interaction in Play (2019), 339–350. DOI:https://doi.org/10.1145/3311350.3347192.


  1. I was also told to mention that we won the Best Paper Award for it. So, yeah, we did win that. Yay.↩︎

  2. Due to how CHI PLAY works, we uploaded our preprint only once the camera-ready version had been accepted, i.e., past the point where we could have made changes; otherwise, we would have referenced their paper in the released version.↩︎

  3. No, it doesn’t. The problem is the random fluctuation of p-values when there is no effect, which causes the likelihood of a false positive to rise when we perform more tests. See our paper.↩︎

  4. I hope it is not controversial to think that reviewers should double-check analyses.↩︎

Jan B. Vornhagen
PhD Fellow, Digital Design