Reproducibility crisis: How do we know how much science replicates?

Literature on the “reproducibility crisis” is increasing, although not rapidly enough to bring much empirical clarity to the situation. It remains uncertain how much published science is irreproducible and whether that proportion, whatever it may be, constitutes a crisis. And like small children unable to wait for the second marshmallow, some scientists in my Twitter feed seem to have grown tired of attempting to answer these questions via the usual scientific methods; rather, they are declaring in louder and louder voices that there IS a reproducibility crisis, as if they can settle the matter by brute force. They have been reduced in the past few days to, I kid you not, (1) Twitter polls; (2) arguing about whether 0 is a number; and, most egregiously, (3) declaring that “sowing doubt” is akin to being a climate change denier.

Given that the questions at hand have not been tested in anything like the manner of climate science, and so cannot support that kind of consensus, this last tactic is so outrageously beyond the pale that I am giving over my blog to the reproducibility crisis for some time, writing willy-nilly as the mood hits me on topics such as its apparent size, its nature, its causes, its consequences, and the proposed solutions. Keeping in mind that the readers of my blog are usually researchers in communication sciences and disorders, as well as some speech-language pathologists, I will bring the topic home to speech research each time. I promise that although there may be numbers there will be no math; I leave the technical aspects to others, as it is the philosophical and practical aspects of the question that concern me.

Even though I am in no mind to be logical about this at all, let’s start at the beginning (unless you think this is the end, which would not be unreasonable). Is there in fact a consensus that there is a reproducibility crisis? I will leave aside for the moment the fact that there is not even a consensus about what the word “reproducibility” means, or what exactly to call this crisis. Notwithstanding this basic problem with concepts and categories, the evidence for the notion that there is a crisis comes from three lines of data: (1) estimates of what proportion of science can be replicated, that is, whether the original results are confirmed when the methods of a study are reproduced with different but similar participants; (2) surveys of scientists’ opinions about how much science can be reproduced and whether reproducibility is a crisis; and, less frequently, (3) efforts to determine whether the current rate of reproducibility or irreproducibility is a problem for scientific progress itself.

I am going to start with the first point and carry on to the others in later posts so as not to go on too long (because I am aware there is nothing new to be said, really; it is just that it is hard to hear over the shouting about the consensus we are all having). I have no opinion on how much science can be replicated. I laughed when I saw the question posed in a recent Nature poll, “In your opinion, what proportion of published results in your field are reproducible?” (notice that the response alternatives were percentages from 0 to 100% in increments of 10, with no “don’t know” option). The idea that people answered this! For myself, I would simply have no basis for answering the question, and I say this as a person who is as well read in my field as anyone, after 30 years of research and two substantive books. Faced with it, I would have no choice but to abandon the poll because, being a scientist, my first rule is don’t make shit up. It is a ridiculous question to ask anyone who has not set out to answer it specifically. But if I were to set out to answer it, I would have to approach it like any other scientific problem, by asking first: What are the major concepts in the question? How can they be operationalized? How are they typically operationalized? Is there a consensus on those definitions and methods? And then, having solved the basic measurement problems, I would ask what the best methods are to tackle the problem. We are far from settling any of these issues in this domain, and therefore it is patently false to claim that we have a consensus on the answer!

The big questions that strike me as problematic are what counts as a “published result,” more importantly what counts as a “representative sample” of published results, and finally what counts as a “replication.” Without rehashing all the back-and-forth in Bishop’s blog and the argument that many are so familiar with, we know there is a lot of disagreement about the ways these questions might be answered and about what the answer might be. Currently, the estimates on the table for “how much science can be replicated” range from “quite high” (based on back-of-the-envelope calculations), through roughly 30 to 47% (based on actual empirical efforts to replicate weird collections of findings), to, finally, Ioannidis’s (2005) wonderfully trendy conclusion that “most published research findings are false,” based on simulations. I do not know the answer to this question. And even if I did, I would not know how to evaluate it, because I have no idea how much replication is the right amount when it comes to ensuring scientific progress. I will come back to that on another day. But for now my point is this: there is no consensus on how much published science can be replicated, and there is no consensus on how low that number needs to go before we have an actual crisis. Claiming that the consensus is so complete that raising questions about it is akin to heresy is ridiculous and sad. Because there is one thing I do know: scientists get to ask questions. That is what we do. More importantly, we answer them. And we especially do not pretend to have found the answers when we have barely started to look.

I promised to put a little speech therapy into this blog, so here it is. The Open Science Collaboration said, reasonably, that there “is no single standard for evaluating replication success. Here, we evaluated reproducibility using significance and P values, effect sizes, subjective assessments of replication teams, and meta-analysis of effect sizes.” More substantively, even if a research team picks an indicator, there is disagreement about how to use it. Take effect size, for example: it is not clear to me why replication attempts are expected to reproduce the size of the effect observed in the original study, or even how one does that exactly. There is a lot of argument about that nonetheless, which makes it hard to decide whether a finding has been replicated or not. How to determine whether a finding has been confirmed or replicated is not a trivial issue. I grapple with replication of my own work all the time because I develop interventions and I really want to be sure that they work. But even a small randomized controlled trial costs me seven to ten years of effort from getting funds through publication, which explains why I have accomplished only five of these in my career. Confirming my own work is therefore no easy matter. I always hope someone else will replicate one of those trials, but usually anyone with that many resources works on their own pet intervention, not mine. So lately I have been working on a design that makes it easier (not easy!) to test and replicate interventions with small groups of participants. It is called the single-subject randomization design.
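For readers who have not met the design before, here is a minimal sketch of the idea for a single participant, in Python. The scores, the number of sessions, and the simple “assign half the sessions to treatment at random” scheme are all invented for illustration and are not my study or my analysis code; the point is only that the p value comes from comparing the observed treatment–control difference against the differences produced by every assignment the randomization scheme could have generated.

```python
import itertools
import numpy as np

# Hypothetical outcome scores for one child, one score per session (made up).
scores = np.array([12, 15, 11, 18, 20, 14, 22, 19, 16, 21], dtype=float)

# Hypothetical randomization: 5 of the 10 sessions were randomly assigned
# to the experimental condition, the remainder to the control condition.
observed_treatment_sessions = (1, 3, 4, 6, 7)

def mean_difference(treatment_idx):
    """Mean score in treatment sessions minus mean score in control sessions."""
    mask = np.zeros(len(scores), dtype=bool)
    mask[list(treatment_idx)] = True
    return scores[mask].mean() - scores[~mask].mean()

observed = mean_difference(observed_treatment_sessions)

# Reference distribution: the same statistic under every assignment of
# 5 sessions to treatment that this randomization scheme allowed.
all_assignments = itertools.combinations(range(len(scores)), 5)
reference = np.array([mean_difference(a) for a in all_assignments])

# One-sided randomization-test p value: the proportion of permissible
# assignments giving a difference at least as large as the observed one.
p_value = np.mean(reference >= observed)
print(f"observed difference = {observed:.2f}, p = {p_value:.3f}")
```

Because the reference distribution is built only from assignments the design actually permitted, the test’s validity rests on the randomization itself rather than on distributional assumptions, which is what makes the design attractive for small intervention studies.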

Here is some data that I will be submitting for publication soon. We treated six children with Childhood Apraxia of Speech using an approach that involved auditory-motor integration prepractice plus normal intense speech practice (AMI). We expected it to be better than intense speech practice alone, our neutral usual-care control condition (CTL). We also expected it to be better than an alternative treatment that is contra-indicated for children with this diagnosis (PMP). We repeated the experiment exactly, using a single-subject randomization design, over six children and then pooled the p values. All the appropriate controls for internal validity were employed (randomization with concealment, blinding, and so on). The point, from the perspective of this blog, is that there are different kinds of information with which to evaluate the results: the effect sizes, the confidence intervals for the effect sizes, the p values, and the pooled p values. So, from the point of view of the reproducibility project, these are my questions: (1) how many findings will I publish here? (2) how many times did I replicate my own finding(s)?
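The pooling step can be done in more than one way; one common choice for combining independent p values is Fisher’s method, and the sketch below assumes that choice. The six per-child p values are placeholders, not the results of the study described above.

```python
import numpy as np
from scipy import stats

# Hypothetical per-child p values from six single-subject randomization tests
# (placeholders only; not the values from the study described above).
p_values = np.array([0.04, 0.12, 0.03, 0.20, 0.06, 0.09])

# Fisher's method: -2 * sum(ln p) follows a chi-square distribution with
# 2k degrees of freedom under the joint null hypothesis of no effect in any child.
chi_sq = -2.0 * np.sum(np.log(p_values))
pooled_p = stats.chi2.sf(chi_sq, df=2 * len(p_values))
print(f"chi-square = {chi_sq:.2f}, pooled p = {pooled_p:.4f}")

# scipy also provides stats.combine_pvalues(p_values, method="fisher"),
# which performs the same calculation.
```

A small pooled p value says the six replications, taken together, are unlikely under the null, which is a different claim from saying that any one child’s effect was replicated at a particular size; that distinction is exactly the ambiguity about “findings” raised above.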

[Figure: TASC MP Group results]
