Reproducibility: On the Nature of Scientific Consensus

The idea that scientists who question whether (ir)reproducibility is a crisis are like the “merchants of doubt” is argued via analogy with, for example, climate change deniers. It’s a multistep analogy. First, there is an iron-clad consensus among scientists that humans are causing a change in the climate that will have catastrophic consequences. Because the solutions to the problem threaten corporate interests, those big-money interests fund astroturf groups like “Friends of Science” to sow doubt about the scientific consensus in order to derail the implementation of positive policy options. For the analogy on Bishop’s Blog to work, there must first be a consensus among scientists that the publication of irreproducible research is a crisis, a catastrophe even. I am going to talk about this issue of consensus today, although it would be more fun to follow that analogy along and try to figure out whether corporate interests are threatened by more or less scientific credibility, and how the analogy works when it is corporate money that is funding the consensus and not the dissenters! But anyway, on the topic of consensus…

The promoters of the reproducibility crisis have taken to simply stating that there is a consensus, citing most frequently a highly unscientific Nature poll. I know how to create scientific questionnaires (it used to be part of my job in another life before academia), and it is clear that the question “Is there a reproducibility crisis?” with the options “crisis,” “slight crisis” (an oxymoron), and “no crisis” is a push poll. The survey was designed to make it possible for people to claim that “90% of respondents to a recent survey in Nature agreed that there is a reproducibility crisis,” which is how you sell toothpaste, not how you determine whether there is a crisis. On Twitter I have been informed, with no embarrassment, that unscientific polls are justified because they are used to “raise awareness”. The problem comes when polls that are used to create a consensus are also used as proof of that consensus. How does scientific consensus usually come about?

In many areas of science it is not typical for groups of scientists to formally declare a consensus about a scientific question, but when there are public or health policy implications, working groups will create consensus documents, always starting with a rigorous procedure for identifying the working group, the literature or empirical evidence that will be considered, the standards by which that evidence will be judged, and the process by which the consensus will emerge. Ideally it is a dynamic and broad-based exercise. The Intergovernmental Panel on Climate Change is a model in this regard, and it is the rigorous nature of this process that allows us to place our trust in the consensus conclusion even when we are not experts in the area of climate. A less complex and, for us, more comprehensible example is the recent process employed by the CATALISE consortium to propose that Specific Language Impairment be reconceptualised as Developmental Language Disorder. This process meets all the requirements of a rigorous process, with the online Delphi technique an intriguing part of the series of events that led to a set of consensus statements about the identification and classification of developmental language disorders. Ultimately each statement is supported by a rationale from the consortium members, including scientific evidence when available. The consortium itself was broad-based, and the process permitted a full exposition of points of agreement and disagreement and needs for further research. For me, importantly, a logical sequence of events and statements is involved: the assertion that the new term be used was the end of the process, not the beginning of it. The field of speech-language pathology as a whole has responded enthusiastically, even though there are financial disincentives to adopting all of the recommendations in some jurisdictions. Certainly the process of raising awareness of the consensus documents has had no need of push polls or bullying.
One reason that the process was so well received, beyond respect for the actors and the process, is that the empirical support for some of the key ideas seems unassailable. Not everyone agrees on every point, and we are all uncomfortable with the scourge of low-powered studies in speech and language disorders (an inevitable side effect of funder neglect); however, the scientific foundation for the assertion that language impairments are not specific has reached a critical mass, and therefore no one needs to go about beating up any “merchants of doubt” on this one. We trust that in those cases where the new approach is not adopted, it is generally due to factors outside the control of the individual clinician.

The CATALISE process remains extraordinary, however. More typically a consensus emerges in our field almost imperceptibly and without clear rationale. When I was a student in 1975 I was taught that children with “articulation disorders” did not have underlying speech perception deficits and therefore it would be a waste of time to implement any speech perception training procedures (full stop!). When I began to practice I had reason to question this conclusion (some things you really can see with your own eyes), so I drove to the university library (I was working far away in a rural area) and started to look stuff up. Imagine my surprise when I found that the one study cited to support this assertion involved four children who did not receive a single assessment of their speech perception skills (weird but true). Furthermore, there was a long history of studies showing that children with speech sound disorders had difficulties with speech discrimination. I show just a few of these in the chart below (I heard via Twitter that, at the SPA conference just this month in Australia, Lise Baker and her students reported that 83% of all studies that have looked at this question found that children with a speech sound disorder have difficulties with speech perception). So why was there this period, from approximately 1975 to 1995, when it was common knowledge that these kids had no difficulty with speech perception? In fact some textbooks still say this. Where did this mistaken consensus come from?

When I first found out that this mistaken consensus was contrary to the published evidence I was, quite frankly, incandescent with rage! I was young and naïve and I couldn’t believe I had been taught wrong stuff. But interestingly, the changes in what people believed to be true were based on changes in the underlying theory, which itself changes all the time. In the chart below I have put the theories and the studies alongside each other in time. Notice that the McReynolds, Kohn, and Williams (1975) paper, which found poorer speech perception among the children with speech sound disorders, actually concluded that they did not have such difficulties, contrary to the authors’ own data but consistent with the prevailing theory at the time!

History of Speech Perception Research

What we see is that in the fifties and sixties, when it was commonly assumed that higher level language problems were caused by impairments in lower level functions, many studies were conducted to prove this theory, and in fact they found evidence to support it, with some exceptions. In the later sixties and seventies a number of theories were in play that placed strong emphasis on innate mechanisms. There were few if any studies conducted to examine the perceptual abilities of children with speech sound disorders, because everyone just assumed they had to be normal on the basis of the burgeoning field of infant perceptual research showing that neonates could perceive anything (not exactly true, but close enough for people to get a little over-enthusiastic). More recently, emergentist approaches have taken hold, and more sophisticated techniques for testing speech perception have allowed us to determine how children perceive speech and when they will have difficulty perceiving it. The old theories have been proved wrong (not everyone will agree on this, because the ideas about lower level sensory or motor deficits are zombies; the innate feature detector idea, on the other hand, is completely dead). For the most part the evidence is overwhelming, and we have moved on to theories that are considerably more complex and interesting, so much so that I refer you to my book rather than trying to explain them here.

The question is, on the topic of reproducibility, whether it would have been or would be worthwhile for anyone to try to reproduce, let’s say, Kronvall and Diehl (1952) just for kicks? No! That would be a serious waste of time, as my master’s thesis supervisor explained to me in the eighties when he dragged me more-or-less kicking and screaming into a room with a house-sized VAX computer to learn how to synthesize speech (I believe I am the first person to synthesize words with fricatives; it took me over a year). It is hard to assess the clinical impact of all that fuzzy thinking through the period 1975–1995. But somehow, in the long run, we have ended up in a better place. My point is that scientific consensus arises from an odd and sometimes unpredictable mixture of theory and evidence, and it is not always clear what is right and what is wrong until you can look back from a distance. And despite all the fuzziness and error in the process, progress marches on.


Reproducibility crisis: How do we know how much science replicates?

Literature on the “reproducibility crisis” is increasing, although not rapidly enough to bring much empirical clarity to the situation. It remains uncertain how much published science is irreproducible and whether the proportion, whatever it may be, constitutes a crisis. And like small children unable to wait for the second marshmallow, some scientists in my Twitter feed seem to have grown tired of attempting to answer these questions via the usual scientific methods; rather, they are declaring in louder and louder voices that there IS a reproducibility crisis, as if they can settle these questions by brute force. They have been reduced in the past few days to, I kid you not, (1) Twitter polls; (2) arguing about whether 0 is a number; and most egregiously, (3) declaring that “sowing doubt” is akin to being a climate change denier.

Given that the questions at hand have not at all been tested in the manner of climate science to yield such a consensus, this latter tactic is so outrageously beyond the pale that I am giving over my blog to commenting on the reproducibility crisis for some time, writing willy-nilly as the mood hits me on topics such as its apparent size, its nature, its causes, its consequences, and the proposed solutions. Keeping in mind that the readers of my blog are usually researchers in communication sciences and disorders, as well as some speech-language pathologists, I will bring the topic home to speech research each time. I promise that although there may be numbers there will be no math; I leave the technical aspects to others, as it is the philosophical and practical aspects of the question that concern me.

Even though I am in no mind to be logical about this at all, let’s start at the beginning (unless you think this is the end, which would not be unreasonable). Is there in fact a consensus that there is a reproducibility crisis? I will leave aside for the moment the fact that there is not even a consensus about what the word “reproducibility” means or what exactly to call this crisis. Notwithstanding this basic problem with concepts and categories, the evidence for the notion that there is a crisis comes from three lines of data: (1) estimates of what proportion of science can be replicated, that is, if you reproduce the methods of a study with different but similar participants, are the original results confirmed; (2) survey results on scientists’ opinions about how much science can be reproduced and whether reproducibility is a crisis or not; and, less frequently, (3) efforts to determine whether the current rate of reproducibility or irreproducibility is a problem for scientific progress itself.

I am going to start with the first point and carry on to the others in later posts so as not to go on too long (because I am aware there is nothing new to be said, really; it is just that it is hard to hear over the shouting about the consensus we are all having). I have no opinion on how much science can be replicated. I laughed when I saw the question posed in a recent Nature poll, “In your opinion, what proportion of published results in your field are reproducible?” (notice that the response alternatives were percentages from 0 to 100% in increments of 10, with no “don’t know” option). The idea that people answered this! For myself, I would simply have no basis for answering the question. I say this as a person who is as well read in my field as anyone, after 30 years of research and two substantive books. So faced with it, I would have no choice but to abandon the poll because, being a scientist, my first rule is don’t make shit up. It’s a ridiculous question to ask anyone who has not set out to answer it specifically. But if I were to set out to answer it, I would have to approach it like any other scientific problem by asking first: what are the major concepts in the question? How can they be operationalized? How are they typically operationalized? Is there a consensus on those definitions and methods? And then, having solved the basic measurement problems, I would ask what the best methods are to tackle the problem. We are far from settling any of these issues in this domain, and therefore it is patently false to claim that we have a consensus on the answer!

The big questions that strike me as problematic are what counts as a “published result,” more importantly what counts as a “representative sample” of published results, and finally what counts as a “replication.” Without rehashing all the back-and-forth in Bishop’s blog and the argument that many are so familiar with, we know that there is a lot of disagreement about the different ways in which these questions might be answered and what the answer might be. Currently we have estimates on the table for “how much science can be replicated” that range from “quite high” (based on back-of-the-envelope calculations), through 30 to 47%-ish (based on actual empirical efforts to replicate idiosyncratic collections of findings), through, finally, Ioannidis’ (2005) wonderfully trendy conclusion that “most published research findings are false” (based on simulations). I do not know the answer to this question. And even if I did, I wouldn’t know how to evaluate it, because I have no idea how much replication is the right amount when it comes to ensuring scientific progress. I will come back to that on another day. But for now my point is this: there is no consensus on how much published science can be replicated. And there is no consensus on how low that number needs to go before we have an actual crisis. Claiming that there is so much consensus that raising questions about these matters is akin to heresy is ridiculous and sad. Because there is one thing I do know: scientists get to ask questions. That is what we do. More importantly, we answer them. And we especially don’t pretend to have found the answers when we have barely started to look.

I promised to put a little speech therapy into this blog, so here it is. The Open Science Collaboration said, reasonably, that there “is no single standard for evaluating replication success. Here, we evaluated reproducibility using significance and P values, effect sizes, subjective assessments of replication teams, and meta-analysis of effect sizes.” More substantively, even if a research team picks an indicator, there is disagreement about how to use it. Take effect size, for example: it is not clear to me why replication attempts are expected to replicate the size of the effect observed in the original study, or even how one does that exactly. There is a lot of argument about that nonetheless, which makes it hard to decide whether a finding has been replicated or not. How to determine whether a finding has been confirmed or replicated is not a trivial issue. I grapple with replication of my own work all the time because I develop interventions and I really want to be sure that they work. But even a small randomized controlled trial costs me seven to ten years of effort from getting funds through publication, which explains why I have accomplished only five of these in my career. Therefore, confirming my own work is no easy matter. I always hope someone else will replicate one of those trials, but usually if someone has that many resources they work on their own pet intervention, not mine. So lately I have been working on a design that makes it easier (not easy!) to test and replicate interventions on small groups of participants. It is called the single subject randomization design.
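To make the logic of such a design concrete, here is a minimal sketch of a single-subject randomization test. This is a hypothetical illustration only, not the actual analysis from my lab: sessions (or outcome probes) are randomly assigned to conditions, and the observed difference between conditions is compared against the differences produced by alternative random assignments.

```python
import random

def randomization_test(scores_a, scores_b, n_perm=10_000, seed=1):
    """One-sided randomization test for a single participant:
    is the mean score under condition A higher than under condition B
    beyond what random assignment of sessions alone would produce?"""
    n_a = len(scores_a)
    observed = sum(scores_a) / n_a - sum(scores_b) / len(scores_b)
    pooled = list(scores_a) + list(scores_b)
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        # Re-assign the same scores to conditions at random
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
        if diff >= observed:
            at_least_as_extreme += 1
    # Monte Carlo p value; the +1 keeps it strictly above zero
    return (at_least_as_extreme + 1) / (n_perm + 1)

# Hypothetical probe scores for one child under two conditions
p = randomization_test([9, 9, 8, 9], [2, 3, 2, 1])
```

Because the inference rests only on the random assignment actually carried out within one participant, the whole experiment can be repeated with the next child, which is what makes replication tractable.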

Here are some data that I will be submitting for publication soon. We treated six children with Childhood Apraxia of Speech using an approach that involved auditory-motor integration prepractice plus normal intense speech practice (AMI). We expected it to be better than intense speech practice alone, our neutral usual-care control condition (CTL). We also expected it to be better than an alternative treatment that is contra-indicated for kids with this diagnosis (PMP). We repeated the experiment exactly, using a single subject randomization design, across six children and then pooled the p values. All the appropriate controls for internal validity were employed (randomization with concealment, blinding, and so on). The point from the perspective of this blog is that there are different kinds of information with which to evaluate the results: the effect sizes, the confidence intervals for the effect sizes, the p values, and the pooled p values. So, from the point of view of the reproducibility project, these are my questions: (1) how many findings will I publish here? (2) how many times did I replicate my own finding(s)?
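The post does not say which pooling method was used, so as an assumption for illustration, here is Fisher’s method, a common way to combine k independent p values into a single test. Under the null hypothesis, -2 times the sum of the log p values follows a chi-square distribution with 2k degrees of freedom:

```python
import math

def fisher_pooled_p(p_values):
    """Pool k independent p values with Fisher's method.
    Under H0, -2 * sum(ln p) ~ chi-square with 2k degrees of freedom."""
    k = len(p_values)
    x = -2.0 * sum(math.log(p) for p in p_values)
    # For an even number of degrees of freedom (2k), the chi-square
    # survival function has a closed form:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = math.exp(-x / 2.0)
    total = term
    for i in range(1, k):
        term *= (x / 2.0) / i
        total += term
    return total

# Six hypothetical experiments, each only marginally significant
pooled = fisher_pooled_p([0.04] * 6)
```

Six p values of 0.04, none individually compelling, pool to a value far below 0.001, which illustrates the appeal of repeating the same small experiment across several children rather than relying on any single result.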