Reproducibility: Which Levers?

I was reading about health behavior change today and I was reminded that there is a difference between a complicated system and a complex system (D.T. Finegold and colleagues), and it crystallized for me why the confident pronouncements of the reproducibility folks strike me as earnest but often misguided. If you think about it, most laboratory experiments are complicated systems that are meant to be roughly linear: there may be many variables and many people involved in manipulating or measuring those variables, but ultimately those manipulations and measurements should lead to observed changes in the dependent variable and then to a conclusion. By a linear system I mean that these different levels of the experiment are not supposed to contaminate each other. There are strict rules and procedures, context-specific of course, for carrying out the experiment; all the people involved need to be well trained in those procedures and they must follow the rules for the experiment to have integrity. Science itself is another matter altogether. It is a messy, nonlinear, dynamic, complex system from which many good and some astounding results emerge, not because all the parts are perfect, but in spite of all the imperfection and possibly because of it. Shiffrin, Börner and Stigler (2018) have produced a beautiful long read that describes this process of “progress despite irreproducibility.” I will leave it to them to explain it since they do it so well.

I am certain that the funders and the proponents of all the proposals to improve science are completely sincere, but we all know that the road to hell is paved with good intentions. The reason that the best intentions are not going to work well in this case is that the irreproducibility folks are trying to “fix” a complex system by treating it as if it were a complicated problem. Chris Chambers tells a relatively simple tale in which a journal rejects a paper (according to his account) because a negative result was reported honestly, which suggests that a focus on positive results rewards cheating to get those results, and voilà: the solution is to encourage publication without the results. This idea is fleshed out by Nosek et al. (2018) in a grand vision of a “preregistration revolution” which cannot possibly be implemented as imagined or result in the conceived outcomes. All possible objections have been declared to be false (bold print by Chris Chambers) and thus they have no need of my opinion. I am old enough to be starting my last cohort of students, so I have just enough time to watch them get tangled up in it. I am a patient person. I can wait to see what happens (although curiously no objective markers of the success of this revolution have been definitively put forward).

But here’s the thing. When you are predicting the future you can only look to the past. So here are the other things that I read today that lead me to be quite confident that although science will keep improving itself as it always has done, at least some of this current revolution will end up in the dust. First, on the topic of cheating, there is quite a big literature on academic cheating by undergraduate students which is directly relevant to the reproducibility movement. You will not be surprised to learn that (perceived) cheating is contagious. It is hard to know the causal direction; it is probably reciprocal. If a student believes that everyone is cheating, the likelihood that the student will cheat is increased. Students who cheat believe that everyone else is cheating, regardless of the actual rate of cheating. Students and athletes who are intrinsically rather than extrinsically motivated are also less likely to cheat, so it is not a good idea to undermine intrinsic motivation with excessive extrinsic reward systems, especially those that reduce perceived autonomy. Cheating is reduced by “creating a deeply embedded culture of integrity.” Culture is the important word here because most research and most interventions target individuals, but it is culture and systems that need to be changed. Accomplishing a culture of integrity includes (perhaps you will think paradoxically) creating a trusting and supportive atmosphere with reduced competitive pressures while ensuring harsh and predictable consequences for cheating. The reproducibility movement has taken the path of deliberately inflating the statistics on the prevalence of questionable research practices with the goal of manufacturing a crisis, under the mistaken belief that the crisis narrative is necessary to motivate change, when it is more likely that this narrative will actually increase cynicism and mistrust, having exactly the opposite effect.

The second article I read that was serendipitously relevant was about political polarization. Interestingly, it turns out that perceived polarization reduces trust in government, whereas actual polarization between groups is not predictive of trust, political participation and so on. It is very clear to me that the proponents of this movement are deliberately polarizing and have been since the beginning, setting hard scientists against soft, men against women and especially the young against the old (I would point to parts of my twitter feed as proof of this but I don’t need to contaminate your day with that much negativity; suffice to say it is not a trusting and supportive atmosphere). The Pew Research Center shows that despite decades of a “war against science” we scientists remain one of the most trusted groups in society. It is madness to destroy ourselves from within.

A really super interesting event that happened in my twitter feed today was the release of the report detailing the complete failure of the Gates Foundation $600M effort to improve education by waving sticks and carrots over teachers, with the assumption that getting rid of bad teachers was a primary “lever” that when pulled would spit better educated minority students out the other end (seriously, they use the word levers, it cracks me up; talk about mistaking a complex system for a complicated one). Anyway, it didn’t work. The report properly points out that the disappointing results may have occurred because their “theory of action” was wrong. There just wasn’t enough variability in teacher quality even at the outset for all that focus on teacher quality to make much difference, especially since the comparison schools were engaged in continuous improvement in teacher quality as well. But of course the response on twitter today has been focused on teacher quality: many observers figure that the bad teachers foiled the attempt through resistance, of course! The thing is that education is one of those systems in our society that actually works really well, kind of like science. If you start with the assumption that the scientists are the problem and that if you could just get someone to force them to shape up (see the daydream in this blog by Lakens, in which he shows that he knows nothing about professional associations despite his excellence as a statistician)…well, I think we have another case of people with money pulling on levers with no clue what is behind them.

And finally, let’s end with the Toronto Star, an excellent newspaper, which has a really long read (sorry, it’s long but really worth your time) describing a dramatic but successful change in a nursing home for people with dementia. It starts out as a terrible home and becomes a place where you would (sadly but confidently) place your family member. This story is interesting because you start with the sense that everyone must have the worst motives in order for this place to be this bad—care-givers, families, funders, government—and end up realizing that everyone had absolutely the best intentions and cared deeply for the welfare of the patients. The problem was an attempt to manage the risk of error and to place that goal above all others. You will see that the effort to control error from the top down created exactly the hell that the road paved with good intentions must inevitably lead to.

So this is it. I may be wrong, and if I am it will not be the first time. But I do not think that scientists have been wasting their time for the last 30 years, as one young person declared so dramatically in my twitter feed. I don’t think that they will waste the next 30 years either, because they will mostly keep their eye on whatever it is that motivated them to get into this crazy business. Best we support and help each other, and let each other know when we have improved something, but at the same time not get too caught up in trying to control what everyone else is doing. Unless of course you are so disheartened with science that you would rather give it up and join the folks in the expense account department.

Post-script on July 7, 2018: Another paper to add to this grab-bag:

Kaufman, J. C., & Glăveanu, V. P. (2018). The Road to Uncreative Science Is Paved With Good Intentions: Ideas, Implementations, and Uneasy Balances. Perspectives on Psychological Science, 13(4), 457-465. doi:10.1177/1745691617753947

I liked this perspective on science:

“The propulsion model is concerned with how a creative work affects the field. Some types of contributions stay within the existing paradigm. Replications, at the most basic level, aim to reproduce or recreate a past successful creation, whereas redefinitions take a new perspective on existing work. Forward or advance forward incrementations push the field ahead slightly or a great deal, respectively. Forward incrementations anticipate where the field is heading and are often quite successful, whereas advance forward incrementations may be ahead of their time and may be recognized only retrospectively. These categories stay within the existing paradigm; others push the boundaries. Redirections, for example, try to change the way a field is moving and take it in a new direction. Integrations aim to merge two fields, whereas reinitiation contributions seek to entirely reinvent what constitutes the field.”

 


Reproducibility: Solutions (not)

Let’s go back to the topic of climate change, since BishopBlog started this series of blogposts off by suggesting that scientists who question the size of the reproducibility crisis are playing a role akin to climate change deniers, by analogy with Oreskes and Conway’s argument in Merchants of Doubt. While some corporate actors have funded doubting groups in an effort to protect their profits, as I discussed in my previous blogpost, others have capitalized on the climate crisis to advance their own interests. Lyme disease is an interesting case study in which public concern about climate change gets spun into a business opportunity, like this: climate change → increased ticks → increased risk of tick bites → more people with common symptoms including fever and fatigue and headaches → that add up to Chronic Lyme Disease Complex → need for repeated applications of expensive treatments → such as, for example, chelation because of heavy metal toxicities. If I lost you on that last link, well, that’s because you are a scientist. But nonscientists like Canadian Members of Parliament got drawn into this and now a federal framework to manage Lyme Disease is under development because the number of cases almost tripled over the past five years to, get this, not quite 1000 cases (confirmed and probable). The trick here is that if any one of the links seems strong to you, the rest of the links shimmer into focus like the mirage that they are. And before you can blink, individually and collectively, we are hooked into costly treatments that have little evidence of benefit and tenuous links to the supposed cause of the crisis.

The “science in crisis” narrative has a similar structure, with increasingly tenuous links as you work your way along the chain: pressures to publish → questionable research practices → excessive number of false positive findings published → {proposed solution} → {insert grandiose claims for magic outcomes here}. I think that all of us in academia at every level will agree that the pressures to publish are acute. Public funding of universities has declined in the U.K., the U.S. and Canada, and I am sure in many other countries as well. Therefore the competition for students and research dollars is extremely high, and governments have even made what little funding there is contingent upon the attraction of those research dollars. Consequently there is overt pressure on each professor to publish a lot (my annual salary increase is partially dependent upon my publication rate, for example). Furthermore, pressure has been introduced by deliberately creating a gradient of extreme inequality among academics, so that expectations for students and early career researchers are currently unrealistically high. So the first link is solid.

The second link is a hypothesis for which there is some support, although it is shaky in my opinion due to the indirect nature of the evidence. Nonetheless, it is there. Chris Chambers tells this curious story in which, from the age of 22 onward, he is almost comically enraged that top journals will not accept work that is good quality because the outcome was not “important” or “interesting.” And yet there are many lesser tier journals that will accept such work, and many researchers have made a fine career publishing in them until such time as they were lucky enough to happen upon whatever it was that they devoted their career to finding out. The idea that luck, persistence and a lifetime of subject knowledge should determine which papers get into the “top journals” seems right to me. There is a problem when papers get into top journals only because they are momentarily attention-grabbing, but that is another issue. If scientists are tempted to cheat to get their papers into those journals before their time, they have only themselves to blame. One big cause of “accidental” findings that end up published in top or middling journals, however, seems to be low power, which can lead to all kinds of anomalous outcomes that later turn out to be unreliable. Why are so many studies underpowered? First, those pressures to publish play a role, as it is possible to publish many small studies rather than one big one (although curiously it is reported that publication rates have not changed in decades once co-authorship is controlled, even though it seems undeniable that the pressure to publish has increased in recent times). Second, research grants are chronically too small for the proposed projects. And those grants are especially too small for women and in fields of study that are quite frankly gendered. In Canada this can be seen in a study of grant sizes within the Natural Sciences and Engineering Research Council and by comparing the proportionately greater size of cuts to the Social Sciences and Humanities Research Council.

So now we get to the next two links in the chain. I will focus on one of the proposed solutions to the “reproducibility crisis” in this blog and come back to others in future posts. There is a lot of concern about too many false positives published in the literature (I am going to set aside the question of whether this is an actual crisis for the time being and skip to the next link, solutions for that problem). Let’s start with the suggestion that scientists dispense with the standard alpha level of .05 for significance and replace it with p < .005, which was declared recently by a journalist (I hope the journalist and not the scientists in question) to be a raised standard for statistical significance. An alpha level is not a standard. It is a way of indicating where you think the balance should be between Type I and Type II error. But in any case, the proposed solution is essentially a semantic change. If a study yields a p-value between .05 and .005 the researcher can say that the result is “suggestive,” and if it is below .005 the researcher can say that it is significant, according to this proposal. The authors say that further evidence would need to accumulate to support suggestive findings, but of course further evidence would need to accumulate to confirm both the suggestive and the significant findings (it is possible to get small p-values with an underpowered study, and I thought the whole point of this crisis narrative was to get more replications!). However, with this proposal the idea seems to be to encourage studies to have sample sizes about 70% larger than is currently the norm. This cost is said to be offset by the benefits but, as Timothy Bates points out, there is no serious cost-benefit analysis in their paper. And this brings me to the last link. This solution is proposed as a way of reducing false positives markedly, which in turn will increase the likelihood that published findings will be reproducible. And if everyone magically found 70% more research funds this is possibly true. But where is the evidence that the crisis in science, whatever that is, would be solved? It is the magic in the final link that we really need to focus on.
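The 70% figure is easy to check for yourself. Here is a minimal sketch using statsmodels that computes the per-group sample size for a two-sample t-test at the two alpha levels; the effect size (d = 0.5) and 80% power are assumptions I am supplying for illustration, not values taken from the proposal itself.

```python
# Sketch: how much larger must a study be to keep 80% power when alpha drops
# from .05 to .005? The effect size d = 0.5 is an illustrative assumption.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_05 = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                            alternative='two-sided')
n_005 = analysis.solve_power(effect_size=0.5, alpha=0.005, power=0.80,
                             alternative='two-sided')

print(f"n per group at alpha = .05:  {n_05:.0f}")
print(f"n per group at alpha = .005: {n_005:.0f}")
print(f"relative increase: {n_005 / n_05 - 1:.0%}")  # roughly 70%
```

The ratio hovers around 1.7 across a wide range of assumed effect sizes, which is where the 70% figure comes from; the point stands that the benefit is only realized if researchers can actually afford those larger samples.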

I am a health care researcher, so it is a reflex for me to look at a proposed cure and ask two questions: (1) does the cure target the known cause of the problem? (2) is the cure problem-specific or is it a cure-all? Here we have a situation where the causal chain involves a known distal cause (pressure to publish) and a known proximal cause (low power). The proposed solution (rename findings with p between .05 and .005 “suggestive”) does not target either of these causes. It does not help to change the research environment in such a way as to relieve the pressure to publish or to help researchers obtain the resources that would permit properly powered studies (interestingly, the funders of the Open Science Collaborative have enough financial and political power to influence the system of public pensions in the United States; therefore, improving the way that research is funded and increasing job stability for academics are both goals within their means but not, as far as I can see, goals of this project). Quite the opposite, in fact, because this proposal is more likely to increase competition and inequality between scientists than to relieve those pressures, and therefore the benefits that emerge in computer modeling could well be outweighed by the costs in actual application. Secondly, the proposed solution is not “fit for purpose.” It is an arbitrary catch-all solution that is not related to the research goals in any one field of study or research context.

That does not mean that we should do nothing or that there are no ways to improve science. Scientists are creative people and, each in their own ponds, have been solving problems long before these current efforts came into view. However, recent efforts that seem worthwhile to me, and that directly target the issue of power (in study design), recognize the reality that those of us who research typical and atypical development in children are unlikely ever to have the resources to increase our sample sizes by 70%. So, three examples of helpful initiatives:

First, efforts to pull samples together through collaboration are extremely important. One that is fully on board with the reproducibility project is of course the ManyBabies initiative. I think that this one is excellent. It takes place in the context of a field of study in which labs have always been informally interconnected, not only because of shared interests but because of the nature of the training and interpersonal skills that are required to run those studies. Like all fields of research there has been some partisanship (I will come back to this because it is a necessary part of science) but also a lot of collaboration and cross-lab replication of studies in this field for decades now. The effort to formalize the replications and pool data is one I fully support.

Second, there have been ongoing and repeated efforts by statisticians and methodologists to teach researchers how to do simple things that improve their research. Altman sadly died this week. I have a huge collection of his wonderful papers on my hard drive for sharing with colleagues and students who surprise me with questions like “How do I randomize?” The series of papers by Cumming and Finch on effect sizes, along with their helpful spreadsheets, is invaluable (although it is important not to be overly impressed by large effect sizes in underpowered studies!). My most recent favorite paper describes how to chart individual data points, which is really important in a field such as ours in which we so often study small samples of children with rare diagnoses. I have an example of this simple technique elsewhere on my blog, and a generic sketch below. If we are going to end up calling all of our research exploratory and suggestive now (which is where we are headed, and quite frankly a lot of published research in speech-language pathology has been called that all along without ever getting to the next step), let’s at least describe those data in a useful fashion.
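For readers who want to try it, here is a minimal matplotlib sketch of the kind of chart I mean: every child’s score plotted as a point with the group mean overlaid, instead of a bar that hides the spread. The scores in the snippet are hypothetical placeholders, not data from any study, and the plotting choices are mine, not those of the paper I mentioned.

```python
# Sketch: show every individual score plus the group mean, rather than a bar chart.
# The scores below are hypothetical placeholders, not data from any study.
import numpy as np
import matplotlib.pyplot as plt

groups = {
    "Typical": [82, 75, 90, 88, 79, 85],  # hypothetical percent-correct scores
    "SSD": [60, 72, 55, 68, 64, 58],
}

fig, ax = plt.subplots()
rng = np.random.default_rng(0)
for i, (label, scores) in enumerate(groups.items()):
    x = i + rng.uniform(-0.08, 0.08, size=len(scores))  # horizontal jitter
    ax.scatter(x, scores, alpha=0.7)                     # individual children
    ax.hlines(np.mean(scores), i - 0.2, i + 0.2, colors="black")  # group mean

ax.set_xticks(range(len(groups)))
ax.set_xticklabels(groups.keys())
ax.set_ylabel("Percent correct")
plt.show()
```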

Third, if I may say so myself, my own effort to promote the N-of-1 randomized control design is a serious effort to improve the internal validity of single case research for researchers who, for many reasons, will not be able to amass large samples.
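For readers less familiar with this design, here is a minimal generic sketch of the randomization step at its core: treatment conditions are block-randomized across sessions for a single participant, so the order stays balanced but unpredictable. The condition labels and number of blocks are illustrative assumptions, not the protocol from my own papers.

```python
# Sketch: block-randomized session schedule for one participant (N-of-1 design).
# Condition labels and the number of blocks are illustrative assumptions.
import random

def n_of_1_schedule(conditions=("A", "B"), n_blocks=8, seed=None):
    """Each block contains every condition exactly once, in a shuffled order,
    so conditions stay balanced over time while the within-block order
    remains unpredictable."""
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_blocks):
        block = list(conditions)
        rng.shuffle(block)
        schedule.extend(block)
    return schedule

print(n_of_1_schedule(seed=1))  # a balanced but unpredictable order of 16 sessions
```

Because the order is randomized, a simple randomization (permutation) test can then be used to compare outcomes across conditions for that one child, which is where the gain in internal validity comes from.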

In the meantime, for those people suggesting the p < .005 thing, it seems irresponsible to me for any scientist to make a claim such as “reducing the P-value threshold for claims of new discoveries to 0.005 is an actionable step that will immediately improve reproducibility” on the basis of a little bit of computer modeling, some sciencey-looking charts with numbers on them, and not much more thought than that. I come back to the point I made in my first blog on the reproducibility crisis, which is that if we are going to improve science we need to approach the problem like scientists. Science requires clear thinking about theory (causal models), the relationship between theory and reality, and evidence to support all the links in the chain.
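To be clear about what that modeling amounts to, the core calculation is easy to reproduce. The sketch below computes the expected proportion of “significant” findings that are false positives at the two thresholds, under assumed values for the prior probability of a true effect and for power; both values are assumptions I am supplying for illustration, not figures from the paper.

```python
# Sketch: proportion of "significant" results that are false positives, as a
# function of alpha. The prior probability of a true effect (0.1) and the power
# (0.8) are illustrative assumptions, not values from the .005 proposal.
def false_discovery_proportion(alpha, power=0.8, prior_true=0.1):
    true_positives = prior_true * power
    false_positives = (1 - prior_true) * alpha
    return false_positives / (true_positives + false_positives)

for alpha in (0.05, 0.005):
    fdp = false_discovery_proportion(alpha)
    print(f"alpha = {alpha}: ~{fdp:.0%} of significant findings would be false")
```

Note that the calculation holds power constant across the two thresholds, which is exactly the assumption that requires those 70% larger samples; drop that assumption and the advertised benefit shrinks.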

Using SAILS to Assess Speech Perception in Children with SSD

I am very excited to see an Australian replication of the finding that children with a Speech Sound Disorder (SSD) have difficulty with speech perception when tested with a word identification test implemented with recordings of children’s speech. Hearnshaw, Baker, and Munro (2018) created a task modeled on my Speech Assessment and Interactive Learning System (SAILS) program. A different software platform was used to present the stimuli and record the children’s responses. The critical elements of SAILS were otherwise replicated, but there were some significant differences, as shown in the table below.

[Table: Hearnshaw et al. task compared with SAILS]

The most important differences are the younger age of the children and the targeting of phonemes with older expected ages of acquisition. Furthermore, there are 12 stimuli per block and two target words per target phoneme in Hearnshaw versus 10 stimuli per block and one target word per target phoneme in my own assessment studies. In Hearnshaw the practice procedures involved fewer stimuli and less training on the task. Finally, the response mode was more complex in Hearnshaw and the response alternatives do not replicate mine. Therefore this study does not constitute a replication of my own studies, and I might expect lower performance levels compared to those observed for the children tested in my own studies (I say this before setting up the next table; let’s see what happens). Nonetheless, we would all expect that children with SSD would underperform their counterparts with typically developing speech, especially given the close matching on age and receptive vocabulary in both Hearnshaw and my own studies.

[Table: Hearnshaw and SAILS data comparison]

Looking at the data in the above table, the performance of the children with SSD is uniformly lower than that of the typically developing comparison groups. Hearnshaw’s SSD group obtained a lower score overall when compared to the large sample that I reported in 2006 but slightly higher when compared to the small sample that I reported in 2003 (that study was actually Alyssa Ohberg’s undergraduate honours thesis). It is not clear that any of these differences are statistically significant so I plotted them with standard error bars below.

[Figure: Hearnshaw and SAILS comparison, group means with standard error bars]

The chart does reinforce the impression that the differences between diagnostic groups are significant. The differences across studies are less clear. It is possible that the children that Alyssa tested were more severely impaired than all the others (the GFTA is not the same as the DEAP so it is difficult to compare), or, more likely, the best estimate comes from the third study, which has the largest sample size. Nonetheless, the message is clear: typically developing children in this age range will achieve scores above 70% accuracy, whereas children with SSD are more likely to achieve scores below 70%, which suggests that they are largely guessing when making judgements about incorrectly produced exemplars of the target words. Hearnshaw et al. and I both emphasize the within-group variance in perceptual performance by children with SSD. Therefore, it is important to assess these children’s speech perception abilities in order to plan the most suitable intervention.

And with that I am happy to announce that the iPad version of SAILS is now available, with all four modules necessary for comparison with the normative data presented below for three age groups.

[Table: SAILS Norms RBL 2018, three age groups]

Specifically, the modules that are currently available for purchase ($5.49 CAD per module) are as follows:

-“k”: cat (free)

-“l”: lake

-“r”: rat, rope, door

-“s”: Sue, soap, bus

Please see www.dialspeech.com for more information from me and Alex Herbay, who wrote the software, or go directly to the app store: SAILS by Susan Rvachew and Alex Herbay