Reproducibility: Solutions (not)

Let’s go back to the topic of climate change since BishopBlog started this series of blogposts off by suggesting that scientists who question the size of the reproducibility crisis are playing a role akin to climate change deniers, by analogy with Oreskes and Conway’s argument in Merchants of Doubt. While some corporate actors have funded doubting groups in an effort to protect their profits, as I discussed in my previous blogpost, others have capitalized on the climate crisis to advance their own interests. Lyme disease is an interesting case study in which public concern about climate change gets spun into a business opportunity, like this: climate change → increased ticks → increased risk of tick bites → more people with common symptoms including fever and fatigue and headaches → that add up to Chronic Lyme Disease Complex → need for repeated applications of expensive treatments → such as, for example, chelation because heavy metal toxicities. If I lost you on that last link, well, that’s because you are a scientist. But nonscientists like Canadian Members of Parliament got drawn into this, and now a federal framework to manage Lyme Disease is under development because the number of cases almost tripled over the past five years to, get this, not quite 1000 cases (confirmed and probable). The trick here is that if any one of the links seems strong to you, the rest of the links shimmer into focus like the mirage that they are. And before you can blink, individually and collectively, we are hooked into costly treatments that have little evidence of benefit and tenuous links to the supposed cause of the crisis.

The “science in crisis” narrative has a similar structure with increasingly tenuous links as you work your way along the chain: pressures to publish → questionable research practices → excessive number of false positive findings published → {proposed solution} → {insert grandiose claims for magic outcomes here}. I think that all of us in academia at every level will agree that the pressures to publish are acute. Public funding of universities has declined in the U.K., the U.S., and Canada, and I am sure in many other countries as well. Therefore the competition for students and research dollars is intense, and governments have even made what little funding there is contingent upon the attraction of those research dollars. Consequently there is overt pressure on each professor to publish a lot (my annual salary increase is partially dependent upon my publication rate, for example). Furthermore, pressure has been introduced by deliberately creating a gradient of extreme inequality among academics, so that expectations for students and early career researchers are currently unrealistically high. So the first link is solid.

The second link is a hypothesis for which there is some support, although it is shaky in my opinion due to the indirect nature of the evidence. Nonetheless, it is there. Chris Chambers tells a curious story in which, from the age of 22 onward, he is almost comically enraged that top journals will not accept good quality work because the outcome was not “important” or “interesting.” And yet there are many lesser tier journals that will accept such work, and many researchers have made a fine career publishing in them until such time as they were lucky enough to happen upon whatever it was that they devoted their career to finding out. The idea that luck, persistence and a lifetime of subject knowledge should determine which papers get into the “top journals” seems right to me. There is a problem when papers get into top journals only because they are momentarily attention-grabbing, but that is another issue. If scientists are tempted to cheat to get their papers into those journals before their time, they have only themselves to blame. However, one big cause of “accidental” findings that end up published in top or middling journals seems to be low power, which can lead to all kinds of anomalous outcomes that later turn out to be unreliable. Why are so many studies underpowered? First, those pressures to publish play a role, as it is possible to publish many small studies rather than one big one (although curiously it is reported that publication rates have not changed in decades once co-authorship is controlled, even though it seems undeniable that the pressure to publish has increased in recent times). Second, research grants are chronically too small for the proposed projects. And those grants are especially too small for women and for fields of study that are, quite frankly, gendered. In Canada this can be seen in a study of grant sizes within the Natural Sciences and Engineering Research Council and by comparing the proportionately greater cuts to the Social Sciences and Humanities Research Council.

So now we get to the next two links in the chain. I will focus on one of the proposed solutions to the “reproducibility crisis” in this blog and come back to others in future posts. There is a lot of concern about too many false positives published in the literature (I am going to let go the question of whether this is an actual crisis or not for the time being and skip to the next link, solutions for that problem). Let’s start with the suggestion that scientists dispense with the standard alpha level of .05 for significance and replace it with p < .005, which was declared recently by a journalist (I hope the journalist and not the scientists in question) to be a raised standard for statistical significance. An alpha level is not a standard. It is a way of indicating where you think the balance should be between Type I and Type II errors. But in any case, the proposed solution is essentially a semantic change. Under this proposal, if a study yields a p-value between .05 and .005 the researcher can say that the result is “suggestive,” and if it is below .005 the researcher can say that it is significant. The authors say that further evidence would need to accumulate to support suggestive findings, but of course further evidence would need to accumulate to confirm the suggestive and the significant findings alike (it is possible to get small p-values with an underpowered study, and I thought the whole point of this crisis narrative was to get more replications!). However, with this proposal the idea seems to be to encourage studies to have a sample size 70% larger than is currently the norm. This cost is said to be offset by the benefits, but, as Timothy Bates points out, there is no serious cost-benefit analysis in their paper. And this brings me to the last link. This solution is proposed as a way of reducing false positives markedly, which in turn will increase the likelihood that published findings will be reproducible. And if everyone magically found 70% more research funds this is possibly true. But where is the evidence that the crisis in science, whatever that is, would be solved? It is the magic in the final link that we really need to focus on.
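The 70% figure itself is easy to check with a standard power calculation. Here is a minimal sketch in Python, assuming the statsmodels power module is available; the effect size of d = 0.4 is an arbitrary illustrative value, since only the ratio of the two sample sizes matters here:

```python
# How many participants per group does a two-sided, two-sample t-test need at
# 80% power when alpha is moved from .05 to .005? (d = 0.4 is arbitrary.)
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_05 = analysis.solve_power(effect_size=0.4, alpha=0.05, power=0.80)
n_005 = analysis.solve_power(effect_size=0.4, alpha=0.005, power=0.80)
print(f"alpha = .05 : n per group = {n_05:.0f}")
print(f"alpha = .005: n per group = {n_005:.0f}")
print(f"ratio = {n_005 / n_05:.2f}")  # roughly 1.7, i.e. about 70% larger
```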

I am a health care researcher so it is a reflex for me to look at a proposed cure and ask two questions: (1) does the cure target the known cause of the problem? (2) is the cure problem-specific or is it a cure-all? Here we have a situation where the causal chain involves a known distal cause (pressure to publish) and a known proximal cause (low power). The proposed solution (rename findings with p between .005 and .05 as “suggestive”) does not target either of these causes. It does not help to change the research environment in such a way as to relieve the pressure to publish or to help researchers obtain the resources that would permit properly powered studies (interestingly, the funders of the Open Science Collaboration have enough financial and political power to influence the system of public pensions in the United States; therefore, improving the way that research is funded and increasing job stability for academics are both goals within their means but not, as far as I can see, goals of this project). Quite the opposite in fact, because this proposal is more likely to increase competition and inequality between scientists than to relieve those pressures, and therefore the benefits that emerge in computer modeling could well be outweighed by the costs in actual application. Secondly, the proposed solution is not “fit for purpose”. It is an arbitrary catch-all solution that is not related to the research goals in any one field of study or research context.

That does not mean that we should do nothing or that there are no ways to improve science. Scientists are creative people and each in their own ponds have been solving problems long before these current efforts came into view. However, recent efforts that seem worthwhile to me, and that directly target the issue of power (in study design), recognize the reality that those of us who research typical and atypical development in children are not likely to ever have the resources to increase our sample sizes by 70%. So, three examples of helpful initiatives:

First, efforts to pull samples together through collaboration are extremely important. One that is fully on board with the reproducibility project is of course the ManyBabies initiative. I think that this one is excellent. It takes place in the context of a field of study in which labs have always been informally interconnected, not only because of shared interests but because of the nature of the training and interpersonal skills that are required to run those studies. Like all fields of research there has been some partisanship (I will come back to this because it is a necessary part of science) but also a lot of collaboration and cross-lab replication of studies in this field for decades now. The effort to formalize the replications and pool data is one I fully support.

Second, there have been ongoing and repeated efforts by statisticians and methodologists to teach researchers how to do simple things that improve their research. Doug Altman sadly died this week. I have a huge collection of his wonderful papers on my hard drive for sharing with colleagues and students who surprise me with questions like “How do I randomize?” The series of papers by Cumming and Finch on effect sizes, along with helpful spreadsheets, is invaluable (although it is important to not be overly impressed by large effect sizes in underpowered studies!). My most recent favorite paper describes how to chart individual data points, really important in a field such as ours in which we so often study small samples of children with rare diagnoses. I have an example of this simple technique elsewhere on my blog. If we are going to end up calling all of our research exploratory and suggestive now (which is where we are headed, and quite frankly a lot of published research in speech-language pathology has been called that all along without ever getting to the next step), let’s at least describe those data in a useful fashion.
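For readers who want to try this kind of chart, here is a minimal sketch that plots every individual data point as a jittered strip and overlays the group means; all of the scores are invented placeholders, not data from any study mentioned in this post:

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented placeholder scores for two small groups (not data from any real study)
rng = np.random.default_rng(1)
groups = {"TD": rng.normal(80, 8, 12), "SSD": rng.normal(60, 12, 10)}

fig, ax = plt.subplots()
for i, (label, scores) in enumerate(groups.items()):
    x = np.full(len(scores), float(i)) + rng.uniform(-0.08, 0.08, len(scores))  # jitter
    ax.plot(x, scores, "o", alpha=0.6)              # every individual child is visible
    ax.hlines(scores.mean(), i - 0.2, i + 0.2)      # group mean drawn as a short line
ax.set_xticks(range(len(groups)))
ax.set_xticklabels(list(groups))
ax.set_ylabel("Probe score (% correct)")
plt.show()
```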

Third, if I may say so myself, my own effort to promote the N-of-1 randomized control design is a serious effort to improve the internal validity of single case research for researchers who, for many reasons, will not be able to amass large samples.

In the meantime, for those people suggesting the p < .005 thing, it seems irresponsible to me for any scientist to make a claim such as “reducing the P-value threshold for claims of new discoveries to 0.005 is an actionable step that will immediately improve reproducibility” on the basis of a little bit of computer modeling, some sciencey-looking charts with numbers on them, and not much more thought than that. I come back to the point I made in my first blog on the reproducibility crisis: if we are going to improve science we need to approach the problem like scientists. Science requires clear thinking about theory (causal models), the relationship between theory and reality, and evidence to support all the links in the chain.


Using SAILS to Assess Speech Perception in Children with SSD

I am very excited to see an Australian replication of the finding that children with a Speech Sound Disorder (SSD) have difficulty with speech perception when tested with a word identification test implemented with recordings of children’s speech. Hearnshaw, Baker, and Munro (2018) created a task modeled on my Speech Assessment and Interactive Learning (SAILS) program. A different software platform was used to present the stimuli and record the children’s responses. The critical elements of SAILS were otherwise replicated but there were some significant differences as shown in the table below.

[Table: Comparison of the Hearnshaw, Baker, and Munro (2018) task with SAILS]

The most important differences are the younger age of the children and the targeting of phonemes with older expected ages of acquisition. Furthermore, there are 12 stimuli per block and two target words per target phoneme in Hearnshaw versus 10 stimuli per block and one target word per target phoneme in my own assessment studies. In Hearnshaw the practice procedures involved fewer stimuli and less training on the task. Finally, the response mode was more complex in Hearnshaw and the response alternatives do not replicate mine. Therefore this study does not constitute a replication of my own studies, and I might expect lower performance levels compared to those observed in the children tested in my own studies (I say this before setting up the next table; let’s see what happens). Nonetheless, we would all expect that children with SSD would underperform their counterparts with typically developing speech, especially given the close matching on age and receptive vocabulary in Hearnshaw and my own studies.

[Table: Comparison of data from Hearnshaw et al. (2018) and the SAILS studies]

Looking at the data in the above table, the performance of the children with SSD is uniformly lower than that of the typically developing comparison groups. Hearnshaw’s SSD group obtained a lower score overall when compared to the large sample that I reported in 2006 but slightly higher when compared to the small sample that I reported in 2003 (that study was actually Alyssa Ohberg’s undergraduate honours thesis). It is not clear that any of these differences are statistically significant so I plotted them with standard error bars below.

[Figure: Group means from Hearnshaw et al. (2018) and the SAILS studies, with standard error bars]

The chart does reinforce the impression that the differences between diagnostic groups are significant. The picture is less clear for the differences across studies. It is possible that the children that Alyssa tested were more severely impaired than all the others (the GFTA is not the same as the DEAP so it is difficult to compare), or, more likely, the best estimate comes from the third study, which has the largest sample size. Nonetheless, the message is clear that typically developing children in this age range will achieve scores above 70% accurate whereas children with SSD are more likely to achieve scores below 70% accurate, which suggests that they are largely guessing when making judgements about incorrectly produced exemplars of the target words. Hearnshaw et al. and I both emphasize the within-group variance in perceptual performance by children with SSD. Therefore, it is important to assess these children’s speech perception abilities in order to plan the most suitable intervention.

And with that I am happy to announce that the iPad version of SAILS is now available with all four modules necessary to compare to the normative data that is presented below for three age groups.

[Table: SAILS normative data for three age groups (RBL 2018)]

Specifically, the modules that are currently available for purchase ($5.49 CAD per module) are as follows:

- “k”: cat (free)
- “l”: lake
- “r”: rat, rope, door
- “s”: Sue, soap, bus

Please see www.dialspeech.com for more information from me and Alex Herbay who wrote the software, or go directly to the app store: SAILS by Susan Rvachew and Alex Herbay

Reproducibility: On the Nature of Scientific Consensus

The idea that scientists who raise questions about whether (ir)reproducibility is a crisis or not are like the “merchants of doubt” is argued via analogy with, for example, climate change deniers. It’s a multistep analogy. First there is an iron-clad consensus on the part of scientists that humans are causing a change in the climate that will have catastrophic consequences. Because the solutions to the problem threaten corporate interests, those big-money interests fund astroturf groups like “Friends of Science” to sow doubt about the scientific consensus in order to derail the implementation of positive policy options. For the analogy on Bishop’s Blog to work, there must first be a consensus among scientists that the publication of irreproducible research is a crisis, a catastrophe even. I am going to talk about this issue of consensus today, although it would be more fun to follow that analogy along and try to figure out whether corporate interests are threatened by more or less scientific credibility, and how the analogy works when it is corporate money that is funding the consensus and not the dissenters! But anyway, on the topic of consensus…

The promoters of the reproducibility crisis have taken to simply stating that there is a consensus, citing most frequently a highly unscientific Nature poll. I know how to create scientific questionnaires (it used to be part of my job in another life before academia) and it is clear that the question “Is there a reproducibility crisis?” with the options “crisis,” “slight crisis” (an oxymoron) and “no crisis” is a push poll. The survey was designed to make it possible for people to claim “90% of respondents to a recent survey in Nature agreed that there is a reproducibility crisis,” which is how you sell toothpaste, not how you determine whether there is a crisis or not. On Twitter I have been informed, with no embarrassment, that unscientific polls are justified because they are used to “raise awareness”. The problem comes when polls that are used to create a consensus are also used as proof of that consensus. How does scientific consensus usually come about?

In many areas of science it is not typical for groups of scientists to formally declare a consensus about a scientific question, but when there are public or health policy implications, working groups will create consensus documents, always starting with a rigorous procedure for identifying the working group, the literature or empirical evidence that will be considered, the standards by which that evidence will be judged, and the process by which the consensus will emerge. Ideally it is a dynamic and broad-based exercise. The Intergovernmental Panel on Climate Change is a model in this regard and it is the rigorous nature of this process that allows us to place our trust in the consensus conclusion even when we are not experts in the area of climate. A less complex and, for us, more comprehensible example is the recent process employed by the CATALISE consortium to propose that Specific Language Impairment be reconceptualised as Developmental Language Disorder. This process meets all the requirements of a rigorous process, with the online Delphi technique an intriguing part of the series of events that led to a set of consensus statements about the identification and classification of developmental language disorders. Ultimately each statement is supported by a rationale from the consortium members, including scientific evidence when available. The consortium itself was broad-based and the process permitted a full exposition of points of agreement and disagreement and needs for further research. For me, importantly, a logical sequence of events and statements is involved: the assertion that the new term be used was the end of the process, not the beginning of it. The field of speech-language pathology as a whole has responded enthusiastically even though there are financial disincentives to adopting all of the recommendations in some jurisdictions. Certainly the process of raising awareness of the consensus documents has had no need of push polls or bullying. One reason that the process was so well received, beyond respect for the actors and the process, is that the empirical support for some of the key ideas seems unassailable. Not everyone agrees on every point and we are all uncomfortable with the scourge of low powered studies in speech and language disorders (an inevitable side effect of funder neglect); however, the scientific foundation for the assertion that language impairments are not specific has reached a critical mass, and therefore no one needs to go about beating up any “merchants of doubt” on this one. We trust that in those cases where the new approach is not adopted it is generally due to factors outside the control of the individual clinician.

The CATALISE process remains extraordinary however. More typically a consensus emerges in our field almost imperceptibly and without clear rationale. When I was a student in 1975 I was taught that children with “articulation disorders” did not have underlying speech perception deficits and therefore it would be a waste of time to implement any speech perception training procedures (full stop!). When I began to practice I had reason to question this conclusion (some things you really can see with your own eyes), so I drove into the university library (I was working far away in a rural area) and started to look stuff up. Imagine my surprise when I found that the one study cited to support this assertion involved four children who did not receive a single assessment of their speech perception skills (weird but true). Furthermore, there was a long history of studies showing that children with speech sound disorders had difficulties with speech discrimination. I show just a few of these in the chart below (I heard via Twitter that, at the SPA conference just this month in Australia, Elise Baker and her students reported that 83% of all studies that have looked at this question found that children with a speech sound disorder have difficulties with speech perception). So, why was there this period from approximately 1975 through about 1995 when it was common knowledge that these kids had no difficulty with speech perception? In fact some textbooks still say this. Where did this mistaken consensus come from?

When I first found out that this mistaken consensus was contrary to the published evidence I was quite frankly incandescent with rage! I was young and naïve and I couldn’t believe I had been taught wrong stuff. But interestingly, the changes in what people believed to be true were based on changes in the underlying theory, which changes all the time. In the chart below I have put the theories and the studies alongside each other in time. Notice that the McReynolds, Kohn, and Williams (1975) paper, which found poorer speech perception among the SSD kids, actually concluded that they didn’t, contrary to their own data but consistent with the prevailing theory at the time!

[Figure: History of speech perception research in children with SSD, aligned with prevailing theories over time]

What we see is that in the fifties and sixties, when it was commonly assumed that higher level language problems were caused by impairments in lower level functions, many studies were conducted to prove this theory, and in fact they found evidence to support that theory, with some exceptions. In the later sixties and seventies a number of theories were in play that placed strong emphasis on innate mechanisms. There were few if any studies conducted to examine the perceptual abilities of children with speech sound disorders because everyone just assumed they had to be normal, on the basis of the burgeoning field of infant perceptual research showing that neonates could perceive anything (not exactly true, but close enough for people to get a little overenthusiastic). More recently, emergentist approaches have taken hold and more sophisticated techniques for testing speech perception have allowed us to determine how children perceive speech and when they will have difficulty perceiving it. The old theories have been proved wrong (not everyone will agree on this, because the ideas about lower level sensory or motor deficits are zombies; the innate feature detector idea, on the other hand, is completely dead; for the most part the evidence is overwhelming and we have moved on to theories that are considerably more complex and interesting, so much so that I refer you to my book rather than trying to explain them here).

The question is, on the topic of reproducibility, whether it would have been or would be worthwhile for anyone to try to reproduce, let’s say, Kronvall and Diehl (1952) just for kicks? No! That would be a serious waste of time, as my master’s thesis supervisor explained to me in the eighties when he dragged me more-or-less kicking and screaming into a room with a house-sized VAX computer to learn how to synthesize speech (I believe I am the first person to synthesize words with fricatives; it took me over a year). It is hard to assess the clinical impact of all that fuzzy thinking through the period 1975 – 1995. But somehow, in the long run we have ended up in a better place. My point is that scientific consensus arises from an odd and sometimes unpredictable mixture of theory and evidence, and it is not always clear what is right and what is wrong until you can look back from a distance. And despite all the fuzziness and error in the process, progress marches on.

Reproducibility crisis: How do we know how much science replicates?

Literature on the “reproducibility crisis” is increasing, although not rapidly enough to bring much empirical clarity to the situation. It remains uncertain how much published science is irreproducible and whether the proportion, whatever it may be, constitutes a crisis. And like small children, unable to wait for the second marshmallow, some scientists in my Twitter feed seem to have grown tired of attempting to answer these questions via the usual scientific methods; rather, they are declaring in louder and louder voices that there IS a reproducibility crisis, as if they can settle these questions by brute force. They have been reduced in the past few days to, I kid you not, (1) Twitter polls; (2) arguing about whether 0 is a number; and most egregiously, (3) declaring that “sowing doubt” is akin to being a climate change denier.

Given that the questions at hand here have not at all been tested in the manner of climate science to yield such a consensus, this latter tactic is so outrageously beyond the pale that I am giving over my blog to comment on the reproducibility crisis for some time, writing willy-nilly as the mood hits me on topics such as its apparent size, its nature, its causes, its consequences and the proposed solutions. Keeping in mind that the readers of my blog are usually researchers in communication sciences and disorders, as well as some speech-language pathologists, I will bring the topic home to speech research each time. I promise that although there may be numbers there will be no math; I leave the technical aspects to others, as it is the philosophical and practical aspects of the question that concern me.

Even though I am in no mind to be logical about this at all, let’s start at the beginning (unless you think this is the end, which would not be unreasonable). Is there in fact a consensus that there is a reproducibility crisis? I will leave aside for the moment the fact that there is not even a consensus about what the word “reproducibility” means or what exactly to call this crisis. Notwithstanding this basic problem with concepts and categories, the evidence for the notion that there is a crisis comes from three lines of data: (1) estimates of what proportion of science can be replicated, that is, if you reproduce the methods of a study with different but similar participants, are the original results confirmed or replicated; (2) survey results of scientists’ opinions about how much science can be reproduced and whether reproducibility is a crisis or not; and less frequently (3) efforts to determine whether the current rate of reproducibility or irreproducibility is a problem for scientific progress itself.

I am going to start with the first point and carry on to the others in later posts so as not to go on too long (because I am aware there is nothing new to be said really; it is just that it is hard to hear over the shouting about the consensus we are all having). I have no opinion on how much science can be replicated. I laughed when I saw the question posed in a recent Nature poll, “In your opinion, what proportion of published results in your field are reproducible?” (notice that the response alternatives were percentages from 0 to 100% in increments of 10 with no “don’t know” option). The idea that people answered this! For myself I would simply have no basis for answering the question. I say this as a person who is as well read in my field as anyone, after 30 years of research and 2 substantive books. So faced with it, I would have no choice but to abandon the poll because, being a scientist, my first rule is: don’t make shit up. It’s a ridiculous question to ask anyone who has not set out to answer it specifically. But if I were to set out to answer it, I would have to approach it like any other scientific problem by asking first, what are the major concepts in the question? How can they be operationalized? How are they typically operationalized? Is there a consensus on those definitions and methods? And then, having solved the basic measurement problems, I would ask what the best methods are to tackle the problem. We are far from settling any of these issues in this domain and therefore it is patently false to claim that we have a consensus on the answer!

The big questions that strike me as problematic are what counts as a “published result,” more importantly what counts as a “representative sample” of published results, and finally, what counts as a “replication”. Without rehashing all the back-and-forth in Bishop’s blog and the argument that many are so familiar with, we know that there is a lot of disagreement about the different ways in which these questions might be answered and what the answer might be. Currently we have estimates on the table for “how much science can be replicated” that range from “quite high” (based on back-of-the-envelope calculations), through 30 to 47%-ish (based on actual empirical efforts to replicate weird collections of findings), to, finally, Ioannidis’s (2005) wonderfully trendy conclusion that “most published research findings are false” based on simulations. I do not know the answer to this question. And even if I did, I wouldn’t know how to evaluate it because I have no idea how much replication is the right amount when it comes to ensuring scientific progress. I will come back to that on another day. But for now my point is this: there is no consensus on how much published science can be replicated. And there is no consensus on how low that number needs to go before we have an actual crisis. Claiming that there is so much consensus that raising questions about the answers to these questions is akin to heresy is ridiculous and sad. Because there is one thing I do know: scientists get to ask questions. That is what we do. More importantly, we answer them. And we especially don’t pretend to have found the answers when we have barely started to look.

I promised to put a little speech therapy into this blog so here it is. The Open Science Collaboration said, reasonably, that there “is no single standard for evaluating replication success. Here, we evaluated reproducibility using significance and P values, effect sizes, subjective assessments of replication teams, and meta-analysis of effect sizes.” More substantively, even if a research team picks an indicator there is disagreement about how to use it. Take effect size, for example: it is not clear to me why replication attempts are expected to replicate the size of the effect that is observed in the original study, or even how one does that exactly. There is a lot of argument about that nonetheless, which makes it hard to decide whether a finding has been replicated or not. How to determine whether a finding has been confirmed or replicated is not a trivial issue. I grapple with replication of my own work all the time because I develop interventions and I really want to be sure that they work. But even a small randomized controlled trial costs me seven to ten years of effort from getting funds through publication, which explains why I have accomplished only five of these in my career. Therefore, confirming my own work is no easy matter. I always hope someone else will replicate one of those trials, but usually if someone has that many resources, they work on their own pet intervention, not mine. So lately I have been working on a design that makes it easier (not easy!) to test and replicate interventions on small groups of participants. It is called the single subject randomization design.

Here is some data that I will be submitting for publication soon. We treated six children with Childhood Apraxia of Speech, using an approach that involved auditory-motor integration prepractice plus normal intense speech practice (AMI). We expected it to be better than just intense speech practice alone, our neutral usual-care control condition (CTL). We also expected it to be better than an alternative treatment that is contra-indicated for kids with this diagnosis (PMP). We repeated the experiment exactly, using a single subject randomization design, over 6 children and then pooled the p values. All the appropriate controls for internal validity were employed (randomization with concealment, blinding and so on). The point from the perspective of this blog is that there are different kinds of information to evaluate the results: the effect sizes, the confidence intervals for the effect sizes, the p values, and the pooled p values. So, from the point of view of the reproducibility project, these are my questions: (1) how many findings will I publish here? (2) how many times did I replicate my own finding(s)?
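For readers curious about the mechanics of pooling p values across a series of single subject experiments, here is a minimal sketch using Fisher's method, which is one common option (I am not claiming it is the exact procedure used in our trial); the p values are invented placeholders:

```python
# Pool p values from a series of single subject randomization tests.
# Fisher's method is shown as one common option; the values are hypothetical.
from scipy import stats

p_values = [0.03, 0.08, 0.20, 0.04, 0.12, 0.06]   # one per child, invented
chi2, pooled_p = stats.combine_pvalues(p_values, method='fisher')
print(f"Fisher chi-square = {chi2:.2f}, pooled p = {pooled_p:.4f}")
```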

[Figure: TASC MP Group]

Is Acoustic Feedback Effective for Remediating “r” Errors?

I am very pleased to see a third paper published in the speech-language pathology literature using the single-subject randomization design that I have described in two tutorials, the first in 1988 and the second more recently. Tara McAllister Byun used the design to investigate the effectiveness of acoustic biofeedback treatment to remediate persistent “r” errors in 7 children aged 9 to 15 years. She used the single subject randomized alternation design with block randomization, including a few unique elements in her implementation of the design. She and her research team provided one traditional treatment session and one biofeedback treatment session each week for ten weeks. However, the order of the traditional and biofeedback sessions was randomized each week. Interestingly, each session targeted the same items (i.e., “r” was the speech sound target in both treatment conditions): rhotic vowels were tackled first and consonantal “r” was introduced later, in a variety of phonetic contexts. (This procedure is a departure from my own practice in which, for example, Tanya Matthews and I randomly assign different targets to different treatment conditions.) Another innovation is the outcome measure: a probe constructed of untreated “r” words was given at the beginning and end of each session so that change (Mdif) over the session was the outcome measure submitted to statistical analysis (our tutorial explains that the advantage of the SSRD is that a nonparametric randomization test can be used to assess the outcome of the study, yielding a p value). In addition, 3 baseline probes and 3 maintenance probes were collected so that an effect size for overall improvement could be calculated. In this way there are actually 3 time scales for measuring change in this study: (1) change from baseline to maintenance probes; (2) change from baseline to treatment performance as reflected in the probes obtained at the beginning of each session and plotted over time; and (3) change over a session, reflected in the probes given at the beginning and the end of each session. Furthermore, it is possible to compare differences in within-session change for sessions provided with and without acoustic feedback.
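To make the logic of the block-randomized alternation design concrete, here is a minimal sketch of the corresponding randomization test. The within-session change scores (Mdif) are invented placeholders, and the one-sided test statistic (mean biofeedback change minus mean traditional change) is my own simplification, not necessarily the exact statistic used in the published study:

```python
# Exact randomization test for an alternation design with block randomization:
# two sessions per week, order randomized within each week. Values are invented.
from itertools import product

# Each tuple is one week's (biofeedback Mdif, traditional Mdif).
weeks = [(3, 1), (2, 2), (4, 0), (1, 3), (5, 2),
         (2, 1), (3, 3), (4, 1), (0, 2), (3, 0)]

def statistic(pairs):
    bf = [a for a, b in pairs]
    trad = [b for a, b in pairs]
    return sum(bf) / len(bf) - sum(trad) / len(trad)

observed = statistic(weeks)

# Under the null hypothesis, within each week the two change scores could have
# landed in either condition: enumerate all 2**10 possible assignments.
count = total = 0
for flips in product([False, True], repeat=len(weeks)):
    pairs = [(b, a) if flip else (a, b) for (a, b), flip in zip(weeks, flips)]
    total += 1
    if statistic(pairs) >= observed:   # one-sided: biofeedback > traditional
        count += 1

print(f"observed difference = {observed:.2f}, p = {count / total:.3f}")
```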

I was really happy to see the implementation of the design but it is fair to say that the results were a dog’s breakfast, as summarized below:

[Table: Summary of results for each participant in McAllister Byun (2017)]

The table indicates that two participants (Piper, Clara) showed an effect of biofeedback treatment and generalization learning. Both showed rapid change in accuracy overall after treatment was introduced in both conditions and maintained at least some of that improvement after treatment was withdrawn. Garrat and Ian showed identical trajectories in the traditional and biofeedback conditions, with a late rise in accuracy during treatment sessions, large within-session improvements during the latter part of the treatment period, and good maintenance of those gains. Neither boy achieved 60% correct responding at any point in the treatment program, however. Felix, Lucas and Evan demonstrated no change in probe scores across the twenty weeks of the experiment in both conditions. Lucas started at a higher level and therefore his probe performance is more variable: because he actually showed a within-session decline during traditional sessions while showing stable performance within biofeedback sessions, the statistics indicate a treatment effect in favour of acoustic biofeedback, but in fact no actual gains are observed.

So, this is a long description of the results that brings me to two conclusions: (1) the alternation design was the wrong choice for the hypothesis in these experiments; and (2) biofeedback was not effective for these children; even in those cases where it looks like there was an effect, the children were responsive to both biofeedback and the traditional intervention.

In a previous blog, I described the alternation design; however, there is another version of the single subject randomization design that would be more appropriate for Tara’s hypothesis. The thing about acoustic biofeedback is that it is not fundamentally different from traditional speech therapy, involving a similar sequence of events: (i) SLP says a word as an imitative model; (ii) child imitates the word; (iii) SLP provides informative or corrective feedback. In the case of incorrect responses in the traditional condition in Byun’s study, the SLP provided information about articulatory placement and reminded the child that the target involved certain articulatory movements (“make the back part of your tongue go back”). In the case of incorrect responses in the acoustic biofeedback condition, the SLP made reference to the acoustic spectrogram when providing feedback and reminded the child that the target involved certain formant movements (“make the third bump move over”). First, the first two steps are completely overlapping in both conditions and, second, it can be expected that the articulatory cues given in the traditional condition will be remembered and their effects will carry over into the biofeedback sessions. Therefore we can consider the acoustic biofeedback to be an add-on to traditional therapy. We want to know about the value added. Therefore the phase design is more appropriate: in this case, there would be 20 sessions (2 per week over 10 weeks as in Byun’s study), and each session would be planned with the same format: beginning probe (optional), 100 practice trials with feedback, ending probe. The difference is that the starting point for the introduction of acoustic biofeedback would be selected at random. All the sessions that precede the randomly selected start point would be conducted with traditional feedback and all the remainder would be conducted with acoustic biofeedback. The first three would be designated as traditional and the last 3 would be designated as biofeedback for a 26-session protocol as described by Byun. Across the 7 children this would end up looking like a multiple baseline design except that (1) the duration of the baseline phase would be determined by random selection for each child; and (2) the baseline phase is actually the traditional treatment, with the experimental phase testing the value-added benefit of biofeedback. There are three possible categories of outcomes: no change after introduction of the biofeedback, an immediate change, or a late change. As with any single subject design, the change might be in level, trend or variance, and the test statistic can be designed to capture any of those types of changes. The statistical analysis asks whether the obtained test statistic is bigger than all possible results given all of the possible random selections of starting points. Rvachew & Matthews (2017) provides a more complete explanation of the statistical analysis.
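Here is a minimal sketch of the randomization test for such a phase design; the probe scores, the set of admissible start points, and the randomly drawn start point are all invented placeholders:

```python
# Randomization test for a phase design: the biofeedback phase starts at a
# randomly selected session, and the observed mean difference is compared with
# the mean differences produced by every other admissible start point.
probes = [2, 3, 1, 4, 3, 5, 4, 6, 5, 4, 6, 5, 7, 6, 5, 7, 8, 6, 7, 8]  # 20 sessions, invented

# Admissible start points for the biofeedback phase (0-based indices); this
# particular range guarantees at least 3 sessions in each phase.
start_points = range(3, 17)

def mean_diff(scores, start):
    a, b = scores[:start], scores[start:]          # a = traditional, b = biofeedback
    return sum(b) / len(b) - sum(a) / len(a)

actual_start = 10                                  # pretend this index was drawn at random
observed = mean_diff(probes, actual_start)

diffs = [mean_diff(probes, s) for s in start_points]
p = sum(d >= observed for d in diffs) / len(diffs)  # one-sided randomization p value
print(f"observed mean difference = {observed:.2f}, p = {p:.2f}")
```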

I show below an imaginary result for Clara, using the data presented for her in Byun’s paper, as if the traditional treatment came first and then the biofeedback intervention. If we pretend that the randomly selected start point for the biofeedback intervention occurred exactly in the middle of the treatment period, the test statistic is the difference of the M(bf) and the M(trad) scores resulting in -2.308. All other possible random selections of starting points for intervention lead to 19 other possible mean differences, and 18 of them are bigger than the obtained test statistic leading to a p value of 18/20 = .9. In this data set the probe scores are actually bigger in the earlier part of the intervention when the traditional treatment is used and they do not get bigger when the biofeedback is introduced. These are the beginning probe scores obtained by Clara but Byun obtained a significant result in favour of biofeedback by block randomization and by examining change across each session. However, I am not completely sure that the improvements from beginning to ending probes are a positive sign—this result might reflect a failure to maintain gains from the previous session in one or the other condition.

[Figure: Hypothetical data for Clara in a single subject randomization phase design]

There are several reasons to think that both interventions that were used in Byun’s study might result in unsatisfactory generalization and maintenance. We discuss the principles of generalization in relation to theories of motor learning in Developmental Phonological Disorders: Foundations of Clinical Practice. One important principle is that the child needs a well-established representation of the acoustic-phonetic target. All seven of the children in Byun’s study had poor auditory processing skills, but no part of the treatment program addressed phonological processing, phonological knowledge or acoustic-phonetic representations. Second, it is essential to have the tools to monitor and use self-produced feedback (auditory, somatosensory) to evaluate success in achieving the target. Both the traditional and the biofeedback intervention put the child in the position of being dependent upon external feedback. The outcome measure focused attention on improvements from the beginning of the practice session to the end. However, the first principle of motor learning is that practice performance is not an indication of learning. The focus should have been on the sometimes large decrements in probe scores from the end of one session to the beginning of the next. The children had no means of maintaining any of those performance gains. Acoustic feedback may be a powerful means of establishing a new response but it is a counterproductive tool for maintenance and generalization learning.

Reading

McAllister Byun, T. (2017). Efficacy of Visual–Acoustic Biofeedback Intervention for Residual Rhotic Errors: A Single-Subject Randomization Study. Journal of Speech, Language, and Hearing Research, 60(5), 1175-1193. doi:10.1044/2016_JSLHR-S-16-0038

Rvachew, S., & Matthews, T. (2017). Demonstrating treatment efficacy using the single subject randomization design: A tutorial and demonstration. Journal of Communication Disorders, 67, 1-13. doi:10.1016/j.jcomdis.2017.04.003

 

How effective is phonology treatment?

Previously I asked whether it made sense to calculate effect sizes for phonology therapy at the within subject level. In other words, from the clinical point of view, do we really want to know whether the child’s rate of change is bigger during treatment than it was when the child was not being treated? Or, do we want to know if the child’s rate of change is bigger than the average amount of change observed among groups of children who get treated? If children who get treated typically change quite a bit and your client is not changing much at all, that might indicate a course correction (and note please, not a treatment rest!). From this perspective, group level effect sizes might be useful so I am providing raw and standardized effect sizes here from three of my past studies with a discussion to follow.

Rvachew, S., & Nowak, M. (2001). The effect of target selection strategy on sound production learning. Journal of Speech, Language, and Hearing Research, 44, 610-623.

The first data set involves 48 four-year-old children who scored at the second percentile, on average, on the GFTA (with 61 percent consonants correct in conversation). They were randomly assigned to receive treatment for relatively early developing stimulable sound targets (ME group, n=24) or late developing unstimulable sound targets (LL group, n=24). Each received treatment for four sounds over 2 six-week blocks, during twelve 30- to 40-minute treatment sessions. The treatment approach employed traditional articulation therapy procedures. The children did not receive homework or additional speech and language interventions during this 12 week period. Outcome measures included single word naming probes covering all consonants in 3 word positions and percent consonants correct (PCC) in conversation, with 12 to 14 weeks intervening between the pre- and the post-test assessments. The table below shows two kinds of effect sizes for the ME group and the LL group: the raw effect size (raw ES) with the associated confidence interval (CI), which indicates the mean pre- to post-change in percent consonants correct on probes and in conversation; next is the standardized mean difference, Cohen’s d(z); finally, I show the number and percentage of children who did not change (0 and negative change scores). These effect sizes are shown for three outcome measures: single word naming probe scores for unstimulable phonemes, probe scores for stimulable phonemes, and percent consonants correct (PCC) obtained from conversations recorded while the child looked at a wordless picture book with the assessor.

[Table: Effect size blog figure 2]

Some initial conclusions can be drawn from this table. The effect sizes for change in probe scores are all large. However, the group that received treatment for stimulable sounds showed greater improvement for both treated stimulable sounds and untreated unstimulable sounds compared to the group that received treatment for unstimulable sounds. There was almost no change in PCC derived from the conversational samples overall. I can report that 10 children in the ME group and 6 children in the LL group achieved improvements of greater than 5 PCC points, judged to be a “minimally important change” by Thomas-Stonell et al. (2013). However, half the children achieved no change at all in PCC (conversation).
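As an aside, for readers who want to compute the same quantities for their own pre/post data, here is a minimal sketch; the scores are invented placeholders, not data from this study:

```python
import numpy as np
from scipy import stats

# Invented pre- and post-treatment PCC scores for a small group of children
pre = np.array([55., 58., 60., 62., 64., 66., 70., 72.])
post = np.array([60., 59., 66., 70., 63., 74., 78., 80.])

diff = post - pre
raw_es = diff.mean()                                      # mean pre-to-post change
ci = stats.t.interval(0.95, df=len(diff) - 1,
                      loc=raw_es, scale=stats.sem(diff))  # 95% CI for the mean change
d_z = raw_es / diff.std(ddof=1)                           # Cohen's d(z) for paired change
no_change = int(np.sum(diff <= 0))                        # children with 0 or negative change

print(f"raw ES = {raw_es:.1f} points, 95% CI = [{ci[0]:.1f}, {ci[1]:.1f}]")
print(f"Cohen's d(z) = {d_z:.2f}; no change: {no_change}/{len(diff)} children")
```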

Rvachew, S., Nowak, M., & Cloutier, G. (2004). Effect of phonemic perception training on the speech production and phonological awareness skills of children with expressive phonological delay. American Journal of Speech-Language Pathology, 13, 250-263.

The second data set involves 34 four-year-old children who scored at the second percentile, on average, on the GFTA (with approximately 60 percent consonants correct in conversation). All of the children received 16 hour-long speech therapy sessions, once weekly. The treatment that they received was entirely determined by their SLP with regard to target selection and approach to intervention. Ten SLPs provided the interventions, 3 using the Hodson cycles approach, 1 a sensory-motor approach and the remainder using a traditional articulation therapy approach. The RCT element of this study is that the children were randomly assigned to an extra treatment procedure that occurred during the final 15 minutes of each session, concealed from their SLP. Children in the control group (n=17) listened to ebooks and answered questions. Children randomly assigned to the PA group (n=17) played a computer game that targeted phonemic perception and phonological awareness covering 8 phonemes in word initial and then word final position. Although the intervention lasted 4 months, the interval between pre-treatment and post-treatment assessments was 6 months long. The table below shows two kinds of effect sizes for the control group and the PA group: the raw effect size (raw ES) with the associated confidence interval (CI) indicates the mean pre- to post-change in percent consonants correct; next is the standardized mean difference, Cohen’s d(z); finally, I show the number and percentage of children who did not change (0 and negative change scores). These effect sizes are shown for two outcome measures: percent consonants correct (PCC) obtained from conversations recorded while the child looked at a wordless picture book with the assessor; and PCC-difficult, derived from the same conversations but restricted to phonemes that were produced with less than 60% accuracy at intake, in other words, phonemes that were potential treatment targets, specifically /ŋ,k,ɡ,v,ʃ,ʧ,ʤ,θ,ð,s,z,l,ɹ/.

[Table: Effect size blog figure 3]

The sobering finding here is that the control group effect size for potential treatment targets is the smallest, with half the group making no change and the other half making a small change. The effect size for PCC (all) in the control group is more satisfying in that it is better than the minimally important change (i.e., 8% > 5%); 13 children in this group achieved a change of more than 5 points and only 3 made no change at all. The effect sizes are large in the group that received the Speech Perception/PA intervention in addition to their regular SLP program with good results for PCC (all) and PCC-difficult. This table shows that the SLP’s choice of treatment procedures makes a difference to speech accuracy outcomes.
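Because PCC-difficult is not a standard output of most scoring tools, here is a minimal sketch of how PCC and PCC-difficult can be computed once each consonant in a sample has been scored as correct or incorrect; the data structure and the abbreviated set of "difficult" consonants are invented for illustration:

```python
# Each tuple is (target consonant, scored correct?) for one consonant in a
# speech sample; the values are invented placeholders.
scored = [("k", True), ("s", False), ("m", True), ("l", False),
          ("r", False), ("t", True), ("s", True), ("d", True)]

# Potential treatment targets, i.e. consonants below 60% accuracy at intake
# (an abbreviated, hypothetical subset of the English set named above).
difficult = {"s", "z", "l", "r", "k", "g", "v"}

def pcc(scored, subset=None):
    """Percent consonants correct, optionally restricted to a subset of targets."""
    items = [ok for target, ok in scored if subset is None or target in subset]
    return 100 * sum(items) / len(items)

print(f"PCC (all) = {pcc(scored):.0f}%")
print(f"PCC-difficult = {pcc(scored, difficult):.0f}%")
```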

Rvachew, S., & Brosseau-Lapré, F. (2015). A randomized trial of twelve week interventions for the treatment of developmental phonological disorder in francophone children. American Journal of Speech-Language Pathology, 24, 637-658. doi:10.1044/2015_AJSLP-14-0056

The third data set involves data from 64 French-speaking four-year-olds who were randomly assigned to receive either an output-oriented intervention (n = 30) or an input-oriented intervention (n = 34) for remediation of their speech sound disorder. Another 10 children who were not treated also provide effect size data here. The children obtained PCC scores of approximately 70% on the Test Francophone de Phonologie, indicating severe speech sound disorder (consonant accuracy is typically higher in French-speaking children, compared to English). The children received other interventions as well, as described in the research report (home programs and group phonological awareness therapy), with the complete treatment program lasting 12 weeks. The table below shows two kinds of effect sizes for each of the three groups: the raw effect size (raw ES) with the associated confidence interval (CI) indicates the mean pre- to post-change in percent consonants correct; next is the standardized mean difference, Cohen’s d(z); finally, I show the number and percentage of children who did not change (0 and negative change scores). These effect sizes are shown for two outcome measures: percent consonants correct with glides excluded (PCC), obtained from the Test Francophone de Phonologie, a single word naming test; and PCC-difficult, derived from the same test but restricted to phonemes that were produced with less than 60% accuracy at intake, specifically /ʃ,ʒ,l,ʁ/. An outcome measure restricted to phonemes that were absent from the inventory at intake is not possible for this group because French-speaking children with speech sound disorders have good phonetic repertoires for the most part, as their speech errors tend to involve syllable structure (see Brosseau-Lapré and Rvachew, 2014).

[Table: Effect size blog figure 4]

There are two satisfying findings here: first, when we do not treat children with a speech sound disorder, they do not change, and when we do treat them, they do! Second, when children receive an appropriate suite of treatment elements, large changes in PCC can be observed even over an observation interval as short as 12 weeks.

Overall Conclusions

  1. In the introductory blog to this series, I pointed out that Thomas-Stonell and her colleagues had identified a PCC change of 5 points as a “minimally important change”. The data presented here suggest that this goal can be met for most children over a 3 to 6 month period when children are receiving an appropriate intervention. The only case where this minimum standard was not met on average was in Rvachew & Nowak (2001), a study in which a strictly traditional articulation therapy approach was implemented at low intensity with no homework component.
  2. The measure that we are calling PCC-difficult might be more sensitive and more ecologically valid for 3 and 6 month intervals. This is percent consonants correct, restricted to potential treatment targets, that is, those consonants that are produced with less than 60% accuracy at intake. These turn out to be mid- to late-developing, frequently misarticulated phonemes, therefore /ŋ,k,ɡ,v,ʃ,ʧ,ʤ,θ,ð,s,z,l,ɹ/ in English and /ʃ,ʒ,l,ʁ/ in French for these samples of 4-year-old children with severe and moderate-to-severe primary speech sound disorders. My impression is that when providing an appropriate intervention an SLP should expect at least a 10% change in these phonemes whether assessed with a broad-based single word naming probe or in conversation; in fact a 15% change is closer to the average. This does not mean that you should treat the most difficult sounds first! Look carefully at the effect size data from Rvachew and Nowak (2001): when we treated stimulable phonemes we observed a 15% improvement in difficult unstimulable sounds. You can always treat a variety of phonemes from different levels of the phonological hierarchy as described in a previous blog.
  3. Approximately 10% of 4-year-old children with severe and moderate-to-severe primary speech sound disorders do not improve at all over a 3 to 6 month period, given adequate speech therapy. If a child is not improving, the SLP and the parent should be aware that this is a rare event that requires special attention.
  4. In a previous blog I cited some research evidence for the conclusion that patients treated as part of research trials achieve better outcomes than patients treated in a usual care situation. There is some evidence for that in these data. The group in Rvachew, Nowak and Cloutier that received usual care obtained a lower effect size (d=0.45) in comparison to the group that received an extra experimental intervention (d=1.31). In practical terms this difference meant that the group that received the experimental intervention made four times more improvement in the production of difficult sounds than the control group that received usual care.
  5. The variation in effect sizes that is shown in these data indicates that SLP decisions about treatment procedures and service delivery options have implications for success in therapy. What are the characteristics of the interventions that led to relatively large changes in PCC or relatively large standardized effect sizes? (i) Comprehensiveness, that is, the inclusion of intervention procedures that target more than one level of representation, e.g., procedures to improve articulation accuracy and speech perception skills and/or phonological awareness; and (ii) parent involvement, specifically the inclusion of a well-structured and supported home program.

If you see other messages in these data, or have observations from your own practice or research, please write to me in the comments.

 

 

Are effect sizes in research papers useful in SLP practice?

[Figure: Effect size blog figure 1]

Effect sizes are now required in addition to statistical significance reporting in scientific reports. As discussed in a previous blog, effect sizes are useful for research purposes because they can be aggregated across studies to draw conclusions (i.e., in a meta-analysis). However, they are also intended to be useful as an indication of the “practical consequences of the findings for daily life.” Therefore, Gierut, Morrisette, & Dickinson’s paper “Effect Size for Single-Subject Design in Phonological Treatment” was of considerable interest to me when it was published in 2015. They report the distribution of effect sizes for 135 multiple baseline studies, using a pooled standard deviation for the baseline phase of the studies as the denominator and the mean of the treatment phase minus the mean of the baseline phase as the numerator in the equation to calculate the effect size statistic. In these studies, the mean and the variance of probe scores in the baseline phase are restricted to be very small by design, because the treatment targets and generalization probe targets must show close to stable 0% correct performance during the baseline phase. The consequence of this restriction is that the effect size number will be very large even when the raw amount of performance change is not so great. Therefore the figure above shows hypothetical data that yields exactly their average effect size of 3.66 (specifically, working in proportions, [.0857 - .0125]/.02 = 3.66). This effect size is termed a medium effect size in their paper, but I leave it to the reader to decide if a change of not quite 9% accuracy in speech sound production is an acceptable level of change. It may be acceptable because in these studies a treatment effect is operationalized as probe scores (single word naming task) for all the phonemes that were absent from the child’s repertoire at intake. From the research point of view this paper provides very important information: it permits researchers to compare effect sizes and explore variables that account for between-case differences in effect sizes in those cases where the researchers have used a multiple baseline design and treatment intensities similar to those reported in this paper (5 to 19 one-hour sessions typically delivered 3 times per week).
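To make the arithmetic concrete, here is a minimal sketch that reproduces the worked example above, with the values from the hypothetical figure expressed as proportions:

```python
def single_subject_es(baseline_mean, treatment_mean, baseline_sd):
    """Gierut, Morrisette & Dickinson (2015) style effect size:
    (treatment phase mean - baseline phase mean) / (pooled) baseline SD."""
    return (treatment_mean - baseline_mean) / baseline_sd

# Values from the hypothetical figure above, expressed as proportions correct
es = single_subject_es(baseline_mean=0.0125, treatment_mean=0.0857, baseline_sd=0.02)
print(f"effect size = {es:.2f}")   # 3.66, despite a raw gain of only ~7 percentage points
```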

The question I am asking myself is whether the distribution of effect sizes reported in this paper is helpful to clinicians who are concerned with the practical significance of these studies. I ask because I am starting to see manuscripts reporting clinical case studies in which the data are used to claim "large treatment effects" for a single case (using Gierut et al's standard of an effect size of 6.32 or greater). Indeed, in the clinical setting SLPs will be asked to consider whether their clients are making "enough" progress. For example, in Rvachew and Nowak (2001) we asked parents to rate their agreement with the statement "My child's communication skills are improving as fast as can be expected." (This question was on our standard patient satisfaction questionnaire, so in fact we asked every parent this question, not just those in the RCT.) The parent responses in the RCT showed significant between-group differences on this question that aligned with the dramatic differences in child response to the traditional versus complexity approach to target selection tested in that study (e.g., 34% vs. 17% of targets mastered in these groups respectively). It seems to me that when parents ask themselves this question they have multiple frames of reference: not only do they consider the child's communicative competence before and after the introduction of therapy, they consider whether their child would make more or less change with other hypothetical SLPs and other treatment approaches, given that parents actually have choices about these things. Therefore, an effect size that says, in effect, "the child made more progress with treatment than with no treatment" does not really answer the parent's question. However, with a group design it is possible to calculate an effect size that reflects change relative to the average amount of change one might expect, given therapy. To my mind this kind of effect size comes closer to answering the questions about practical significance that a parent or employer might ask.

This still leaves us with the question of what kind of change to describe. It is unfortunate that there are few if any controlled studies that have reported functional measures. I can think of some examples of descriptive studies that reported functional measures, however. First, Campbell (1999) reported that good functional outcomes were achieved when preschoolers with moderate and severe Speech Delay received twice-weekly therapy over a 90- to 120-day period (i.e., on average the children's speech improved from approximately 50% to 75% intelligible, as reported by parents). Second, there are a number of studies reporting ASHA-NOMS (functional communication measures provided by treating SLPs) for children receiving speech and language therapy. However, Thomas-Stonell et al (2007) found that improvement on the ASHA-NOMS was not as sensitive as parental reports of "real life communication change" over a 3- to 6-month interval. Therefore, Thomas-Stonell and her colleagues developed the FOCUS to document parental reports of functional outcomes in a reliable and standardized manner.

Thomas-Stonell et al (2013) report changes in FOCUS scores for 97 preschool-aged children who received an average of 9 hours of SLP service in Canada, comparing change during the waiting period (a 60-day interval) to change during the treatment period (a 90-day interval). FOCUS assessments demonstrated significantly more change during treatment (about 18 FOCUS points on average) than during the wait period (about 6 FOCUS points on average). Then they compared minimally important changes in PCC, the Children's Speech Intelligibility Measure (CSIM), and FOCUS scores for 28 preschool-aged children. The FOCUS measure was significantly correlated with the speech accuracy and intelligibility measures, but there was not perfect agreement among them. For example, 21/28 children obtained a minimally important change of at least 16 points on the FOCUS, but 4 of those children did not show significant change on the PCC/CSIM. In other words, speech accuracy, speech intelligibility and functional improvements are related but not completely aligned; each provides independent information about change over time.

In controlled studies, some version of percent consonants correct is a very common treatment outcome used to assess the efficacy of phonology therapy. Gierut et al (2015) focused specifically on change in those phonemes that are late developing and produced with very low accuracy, if not completely absent from the child's repertoire at intake. This strikes me as a defensible measure of treatment outcome. Regardless of whether one chooses to treat a complex sound, an early developing sound, or a medium-difficulty sound (or one of each, as I demonstrated in a previous blog), presumably the SLP wants to have dramatic effects across the child's phonological system. Evidence that the child is adding new sounds to the repertoire is a good indicator of that kind of change. Alternatively, the SLP might count increases in correct use of all consonants that were potential treatment targets prior to the onset of treatment. Or the SLP could count percent consonants correct for all consonants, because this measure is associated with intelligibility and takes into account the fact that there can be regressions in previously mastered sounds when phonological reorganization is occurring. The number of choices suggests that it would be valuable to have effect size data for a number of possible indicators of change. More to the point, Gierut et al's single-subject effect size implies that almost any change above "no change" is an acceptable level of change in a population that receives intervention because they are stalled without it. I am curious to know if this is a reasonable position to take. In my next blog post I will report effect sizes for these speech accuracy measures taken from my own studies going back to 2001, and I will discuss the clinical significance of the effect sizes that I aggregate. I am going to calculate the effect size for paired mean differences, along with the corresponding confidence intervals, for groups of preschoolers treated in three different studies. I haven't done the calculations yet, so, for those readers who are at all interested in this, you can hold your breath with me.
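
Since I have not run those numbers yet, the snippet below is only a sketch of the kind of calculation I have in mind, assuming one common choice: Cohen's d for the paired mean difference (mean change divided by the standard deviation of the change scores, as described in primers such as Lakens, 2013) with a simple percentile bootstrap confidence interval. The PCC scores are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented pre- and post-treatment PCC scores for one hypothetical group of preschoolers.
pcc_pre  = np.array([45., 52., 38., 60., 47., 55., 41., 50.])
pcc_post = np.array([56., 70., 44., 66., 61., 58., 52., 65.])

def d_paired(pre, post):
    """Cohen's d for the paired mean difference: mean change / SD of the change scores."""
    change = post - pre
    return change.mean() / change.std(ddof=1)

observed = d_paired(pcc_pre, pcc_post)

# Percentile bootstrap CI: resample children with replacement (pre/post pairs stay intact).
boot = []
n = len(pcc_pre)
for _ in range(5000):
    idx = rng.integers(0, n, n)
    boot.append(d_paired(pcc_pre[idx], pcc_post[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"d = {observed:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```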

References

Campbell, T. F. (1999). Functional treatment outcomes in young children with motor speech disorders. In A. Caruso & E. A. Strand (Eds.), Clinical Management of Motor Speech Disorders in Children (pp. 385-395). New York: Thieme Medical Publishers, Inc.

Gierut, J. A., Morrisette, M. L., & Dickinson, S. L. (2015). Effect Size for Single-Subject Design in Phonological Treatment. Journal of Speech, Language, and Hearing Research, 58(5), 1464-1481. doi:10.1044/2015_JSLHR-S-14-0299

Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 1-12. doi:10.3389/fpsyg.2013.00863

Thomas-Stonell, N., McConney-Ellis, S., Oddson, B., Robertson, B., & Rosenbaum, P. (2007). An evaluation of the responsiveness of the pre-kindergarten ASHA NOMS. Canadian Journal of Speech-Language Pathology and Audiology, 31(2), 74-82.

Thomas-Stonell, N., Oddson, B., Robertson, B., & Rosenbaum, P. (2013). Validation of the Focus on the Outcomes of Communication under Six outcome measure. Developmental Medicine and Child Neurology, 55(6), 546-552. doi:10.1111/dmcn.12123

Rvachew, S., & Nowak, M. (2001). The effect of target selection strategy on sound production learning. Journal of Speech, Language, and Hearing Research, 44, 610-623.


Research Engagement with SLPs

I still have days when I miss my former job as a research coordinator in a hospital speech-language department. As a faculty researcher, I try to embed my research in clinical settings as often as I can but it is not easy. Administrators in particular, and speech-language pathologists on occasion, may be leery of the time requirement, and they often worry that the project might shine too bright a light on everyday clinical practices that may not be up to the highest evidence-based standard. I always try to design projects that are mutually beneficial to the research team and the clinical setting. As a potential support to the promise of mutual benefit, I was pleased to read a recent paper in the British Medical Journal, "Does the engagement of clinicians and organizations in research improve healthcare performance: a three-stage review". On the basis of an hourglass-shaped review, using an interpretive synthesis of the literature on the topic, Boaz, Hanney, Jones, and Saper drew the following conclusions:

Some papers reported an association between hospital participation in research and improved patient outcomes. Some of these findings were quite striking, as for example significantly worse survival from ovarian cancer in "non study hospitals" compared with hospitals involved in research trials (my sister-in-law died from this terrible disease this month so I couldn't help but notice this).

A majority of papers reported an association between hospital participation in research and improved processes of healthcare. This includes the adoption of innovative treatments as well as better compliance with best practice guidelines.

Different causal mechanisms may account for these findings when examining impacts at the clinician versus organization level. For example, involvement in a clinical trial may include staff training and other experiences that change clinician attitudes and behaviors. Higher up, participation in the trial may require the organization to acquire new infrastructure or adopt new policies.

The direction of cause and effect may be difficult to discern. Specifically, a hospital that is open to involvement in research may have a higher proportion of research-active staff who have unique skills, specialization or personal characteristics. These characteristics may jointly improve healthcare outcomes in that setting and make those staff more amenable to engagement with research.

This last point resonates well with my experience at the Alberta Children's Hospital in the '80s and '90s. The hospital had a very large SLP department, up to 30 SLPs, permitting considerable specialization among us. Furthermore, as a teaching hospital we had a good network of linkages to the two universities in the province and to a broad array of referral sources. Our working model, which was based on multidisciplinary teams, also supported involvement in research. Currently, in Montreal, I am able to set up research clinics in healthcare and educational settings from time to time, but none of them have the resources that we enjoyed in Alberta three decades ago.

Of course, direct involvement in research is not the only way for SLPs to engage with research evidence. Another paper, published in Research in Developmental Disabilities, used a survey to explore "Knowledge acquisition and research evidence in autism." Carrington et al found that researchers and practitioners had somewhat different perspectives. The researcher group (n=256) and the practitioner group (n=422) identified the sources of information that they used to stay up to date with current information on autism. Researchers were more likely to identify scientific journals and their colleagues, whereas practitioners were more likely to identify conferences/PD workshops and non-academic journals. Respondents also identified sources of information that they thought would help practitioners translate research to practice. Researchers thought that nontechnical summaries and interactions with researchers would be most helpful. Practitioners identified academic journals as the best source of information (although the paper doesn't explain why they were not using these journals as their primary source).

Finally, the most interesting finding for me was that neither group used or suggested social media as a helpful source of information. I thought this was odd because social media is a potential access point to academic journal articles, or to summaries of those articles, as well as a way of interacting directly with scientists.

The authors concluded that knowledge translation requires that practitioners be engaged with research and researchers. For that to happen they suggest that “research should focus on priority areas that meet the needs of the research-user community” and that “attempts to bridge the research-practice gap need to involve greater collaboration between autism researchers and research-users.”

Given that the research shows that the involvement of practitioners in research actually improves care and outcomes for our clients and patients, I would say that it is past time to bring down barriers to researcher-SLP collaboration and bring research right into the clinical setting.

Maternal Responsiveness to Babbling

Over the course of my career the most exciting change in speech-language pathology practice has been the realization that we can have an impact on speech and language development by working with the youngest patients, intervening even before the child "starts to talk". Our effectiveness with these young patients depends upon the growing body of research on the developmental processes that underlie speech development during the first year of life. Now that we know that the emergence of babbling is a learned behavior, influenced by auditory and social inputs, this kind of research has mushroomed, although our knowledge remains constrained because these studies are hugely expensive, technically difficult and time-consuming to conduct. Therefore I was very excited to see a new paper on the topic in JSLHR this month:

Fagan, M. K., & Doveikis, K. N. (2017). Ordinary Interactions Challenge Proposals That Maternal Verbal Responses Shape Infant Vocal Development. Journal of Speech, Language, and Hearing Research, 60(10), 2819-2827. doi:10.1044/2017_JSLHR-S-16-0005

The purpose of this paper was to examine the hypothesis that maternal responses to infant vocalizations are a primary cause of the age-related change in the maturity of infant speech during the period 4 through 10 months of age. This time period encompasses three stages of infant vocal development: (1) the expansion stage, that is, producing vowels and a broad variety of vocalizations that are not speech-like but nonetheless exercise vocal parameters such as pitch, resonance and vocal tract closures; (2) the canonical babbling stage, that is, producing speech-like CV syllables, singly or in repetitive strings; and (3) the integrative stage, that is, producing a mix of babbling and meaningful words. In the laboratory, contingent verbal responses from adults increase the production rate of mature syllables by infants. Fagan and Doveikis asked whether this shaping mechanism, demonstrated in the laboratory, explains the course of infant speech development in natural interactions in real-world settings. They coded five and a quarter hours of natural interactions recorded between mothers and infants in the home environment, from 35 dyads in a cross-sectional study. Their analysis focused on maternal behaviors in the 3-second interval following an infant vocalization, defined as a speech-like vowel or syllable-type utterance. They were specifically interested to know whether maternal vocalizations in this interval would be responsive (prompt, contingent, relevant to the infant's vocal behavior, e.g., affirmations, questions, imitations) or nonresponsive (prompt but not meaningfully related to the infant's vocal behavior, e.g., activity comment, unrelated comment, redirect). This is a summary of their findings:

  • Mothers vocalized 3 times more frequently than infants.
  • One quarter of maternal vocalizations fell within the 3 sec interval after an infant vocalization.
  • About 40% of the prompt maternal vocalizations were responsive and the remainder were nonresponsive, according to their definitions (derived from Bornstein et al., 2008).
  • Within the category of responsive maternal vocalizations, the most common were questions and affirmations.
  • A maternal vocalization of some kind occurred promptly after 85% of all infant utterances.
  • Imitations of the infant utterance (also in the responsive category) occurred after approximately 11% of infant utterances (my estimate from their data).
  • Mothers responded preferentially to speech-like vocalizations but not differentially to CV syllables versus vowel-only syllables. In other words, it did not appear that maternal reinforcement or shaping of mature syllables could account for the emergence and increase in this behavior with infant age.

One reason I like this paper so much is that some of the results accord with data that we are collecting in my lab, in a project coordinated by my doctoral student Pegah Athari, who is showing great skill and patience, having worked her way through 10 hours of recordings from 5 infants in a longitudinal study (3 months of recording from each infant, but covering ages 6 through 14 months overall). The study is designed to explore mimicry specifically, as a responsive utterance that may be particularly powerful (mimicry involves full or partial imitation of the preceding utterance). We want to be able to predict when mimicry will occur and to understand its function. In our study we examine the 2-second intervals that precede and follow each infant utterance. Another important difference is that we record the interactions in the lab but there are no experimental procedures; we arrange the setting and materials to support interactions that are as naturalistic as possible. These are some of our findings:

  • Mothers produced 1.6 times as many utterances as their infants.
  • Mothers said something after the vast majority of the infant’s vocalizations just as observed by Fagan and Doveikis.
  • Instances in which one member of the dyad produced an utterance similar to the other's were rare, but were twice as common in the direction of the mother mimicking the infant (10%) as in the direction of the infant mimicking the mother (5%).
  • Infant mimicry of the mother is significantly (but not completely) contingent on the mother modeling one of the infant's preferred sounds in her utterance (mean contingency coefficient = .34; a sketch of how such a coefficient can be computed follows this list).
  • Maternal mimicry is significantly (but not completely) contingent on perceived meaningfulness of the child’s vocalization (mean contingency coefficient = .35). In other words, it seems that the mother is not specifically responding to the phonetic character of her infant’s speech output; rather, she makes a deliberate attempt to teach meaningful communication throughout early development.
  • The number of utterances that the mother perceives to be meaningful increases with the infant's age, although this is not a hard-and-fast rule because regressions occur when the infant is ill and the canonical babbling ratio declines. Mothers will also respond to nonspeechlike utterances in the precanonical stage as being meaningful (animal noises, kissing and so forth).
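
For readers wondering what a "contingency coefficient" is in this context, here is a toy sketch of one way such a value could be computed, assuming Pearson's contingency coefficient derived from a 2 x 2 table (mother models a preferred sound or not, by infant mimics or not). The counts are invented and this is not Pegah's actual analysis.

```python
from math import sqrt

# Invented 2 x 2 table of counts:
#                               infant mimics   infant does not mimic
# preferred sound modelled            a                 b
# preferred sound not modelled        c                 d
a, b, c, d = 12, 28, 6, 104
n = a + b + c + d

# Chi-square for a 2 x 2 table (no continuity correction).
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Pearson's contingency coefficient: 0 = no association, with an upper bound below 1.
C = sqrt(chi2 / (chi2 + n))
print(round(C, 2))
```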

We want to replicate our findings with another 5 infants before we try to publish our data, but I feel confident that our conclusions will be subtly different from Fagan and Doveikis', despite general agreement with their suggestion that self-motivation factors and access to auditory feedback of the infant's own vocal output play a primary role in infant vocal development. I think that maternal behavior may yet prove to have an important function, however. It is necessary to think about learning mechanisms in which low-frequency random inputs are actually helpful. I have talked about this before on this blog, in a post about the difference between exploration and exploitation in learning. Exploration is a phase during which trial-and-error actions help to define the boundaries of the effective action space and permit discovery of the actions that are most rewarding. Without exploration one might settle on a small repertoire of actions that are moderately rewarding and never discover others that will be needed as one's problems become more complex. Exploitation is the phase during which you use the actions that you have learned to accomplish increasingly complex goals.

The basic idea behind the exploration-exploitation paradox is that long-term learning is supported by using an exploration strategy early in the learning process. Specifically, many studies have shown that more variable responding early in learning is associated with easier learning of difficult skills later in the learning process. For early vocal learning, the expansion stage corresponds to this principle nicely: the infant produces a broad variety of vocalizations (squeals, growls, yells, raspberries, quasiresonant and fully resonant vowels, and combinations called marginal babbles). These varied productions lay the foundations for the production of speech-like syllables during the coming canonical babbling stage. Learning theorists have demonstrated that environmental inputs can support this kind of free exploration. Specifically, a high reinforcement rate will promote a high response rate, but it is important to reinforce variable responses early in the learning process.
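
To make the exploration-exploitation idea concrete, here is a toy simulation, not a model of babbling and not drawn from any of the studies mentioned here: a simple learner that explores widely at first and then exploits what it has learned discovers the most rewarding action, whereas a learner that exploits from the start settles on whatever it happened to try first.

```python
import random

random.seed(0)
true_payoffs = [0.2, 0.5, 0.8]   # three candidate actions; only one is highly rewarding

def run(explore_early, trials=500):
    estimates, counts, total = [0.0] * 3, [0] * 3, 0.0
    for t in range(trials):
        eps = 0.5 if (explore_early and t < 100) else 0.0    # exploration only in the early phase
        if random.random() < eps:
            a = random.randrange(3)                          # explore: try something at random
        else:
            a = max(range(3), key=lambda i: estimates[i])    # exploit: use the best known action
        reward = 1.0 if random.random() < true_payoffs[a] else 0.0
        counts[a] += 1
        estimates[a] += (reward - estimates[a]) / counts[a]  # incremental mean update
        total += reward
    return total

# The early explorer finds the 0.8 action; the pure exploiter stays with its first choice.
print("explores early:", run(True), " never explores:", run(False))
```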

In the context of mother-infant interactions, it may be that mothers reinforce many different kinds of infant vocalizations in the early stages because they are trying to teach words: the infant is not really capable of producing real words, so the mother has to work with what she hears. She does do something after almost every infant utterance, however, so she encourages many different practice trials on the part of the infant. It is also possible (although not proven) that imitative responses on the part of the mother are particularly reinforcing to the infant. In the short excerpt of a "conversation" between a mum and her 11-month-old infant shown here, it can be seen that she responds to every one of the infant's utterances, encouraging a number of variable responses and specifically mimicking those that are most closely aligned with her intentions.

[Transcript excerpt from recording IDV11E03A]

It is likely that when alone in the crib, the infant's vocalizations will be more repetitive, permitting more specific practice of preferred phonetic forms such as "da" (infants are known to babble more when alone than in dyadic interactions, especially when scientists feed back their vocalizations over loudspeakers). The thing is, the infant's goals are not aligned with the mother's. In my view, the most likely explanation for infant vocal learning is self-supervised learning. The infant is motivated to produce specific utterances and finds achievement of those utterances to be intrinsically motivating. What kind of utterances does the infant want to produce? Computer models of this process have settled on two factors: salience and learning progress. That is, the infant enjoys producing sounds that are interesting and that are not yet mastered. The mother's goals are completely different (teach real words) but her behaviors nonetheless serve the infant's goals by: (1) supporting perceptual learning of targets that correspond to the ambient language; (2) encouraging sound play/practice by responding to the infant's attempts with a variety of socially positive behaviors; (3) reinforcing variable productions by modeling a variety of forms and accepting a variety of attempts as approximations of meaningful utterances when possible; and (4) increasing the salience of speech-like utterances through mimicry of these rare utterances. The misalignment of the infant's and the mother's goals is helpful to the process because if the mother were trying to teach the infant specific phonetic forms (CV syllables, for example), the exploration process might be curtailed prematurely and self-motivation mechanisms might be hampered.
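
For the curious, here is a deliberately simplified sketch of what "salience and learning progress" might look like in such a model. It is loosely inspired by intrinsically motivated learning models rather than taken from any particular paper, and the targets, salience values and learning rates are all invented.

```python
# Hypothetical vocal targets: name -> (salience to the infant, learning rate per practice trial)
targets = {"vowel": (0.3, 0.010), "raspberry": (0.6, 0.004), "ba": (0.9, 0.025)}
skill = {name: 0.1 for name in targets}        # current mastery of each target, from 0 to 1
progress = {name: 0.05 for name in targets}    # running estimate of recent learning progress

def practice(name):
    _, rate = targets[name]
    old = skill[name]
    skill[name] = min(1.0, old + rate * (1.0 - old))               # diminishing returns near mastery
    progress[name] = 0.8 * progress[name] + 0.2 * (skill[name] - old)

for trial in range(300):
    # Intrinsic motivation: practise the target with the highest salience-weighted progress,
    # i.e., a sound that is interesting AND still improving.
    choice = max(targets, key=lambda n: targets[n][0] * progress[n])
    practice(choice)

print({name: round(value, 2) for name, value in skill.items()})
```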

What are the clinical implications of these observations? I am not sure yet. I need a lot more data before I can feel confident that I can predict maternal behavior in relation to infant behavior. But in the meantime it strikes me that SLPs engage in a number of parent-teaching practices that assume that responsiveness by the parent is a "good thing", even though it is not certain that parents typically respond to their infant's vocalizations in quite the ways that we expect. Regardless, procedures to encourage vocal play are a valuable part of your toolbox, as described in Chapter 10 of our book:

Rvachew, S., & Brosseau-Lapre, F. (2018). Developmental Phonological Disorders: Foundations of Clinical Practice (Second ed.). San Diego, CA: Plural Publishing, Inc.


Testing Client Response to Alternative Speech Therapies

Buchwald et al published one of the many interesting papers in a recent special issue on motor speech disorders in the Journal of Speech, Language and Hearing Research. In their paper they outline a common approach to speech production, one that is illustrated and discussed in some detail in Chapters 3 and 7 of our book, Developmental Phonological Disorders: Foundations of Clinical Practice. Buchwald et al. apply it in the context of Acquired Apraxia of Speech, however. They distinguish between patients who produce speech errors subsequent to a left hemisphere cerebrovascular accident as a consequence of motor planning difficulties versus phonological planning difficulties. Specifically, in their study there are four such patients, two in each subgroup. Acoustic analysis was used to determine whether their cluster errors arose during phonological planning or in the next stage of speech production, that is, during motor planning. The analysis involves comparing the durations of segments in triads of words like this: /skæmp/ → [skæmp], /skæmp/ → [skæm], /skæm/ → [skæm]. The basic idea is that if segments such as [k] in /sk/ → [k] or [m] in /mp/ → [m] are produced as they would be in a singleton context, then the errors arise during phonological planning; alternatively, if they are produced as they would be in the cluster context, then the deletion errors arise during motor planning. This led the authors to hypothesize that patients with these different error types would respond differently to intervention. So they treated all four patients with the same treatment, described as "repetition-based speech motor learning practice". Consistent with their hypothesis, the two patients with motor planning errors responded to this treatment and the two with phonological planning errors did not, as shown in the table of pre- versus post-treatment results below.

[Table: Buchwald et al pre- versus post-treatment results (corrected)]
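
As I understand it, the logic of the duration comparison boils down to asking whether the surviving segment's duration looks more like that speaker's singleton productions or more like their productions within the cluster. The toy function below sketches that reasoning only; it is not Buchwald et al's acoustic analysis procedure, and the durations are invented.

```python
def classify_deletion_error(produced_dur, singleton_dur, cluster_dur):
    """Toy classification of a cluster-reduction error.

    produced_dur  : duration (ms) of the surviving segment, e.g. [k] in /sk/ -> [k]
    singleton_dur : the speaker's typical duration for that segment as a singleton
    cluster_dur   : the speaker's typical duration for that segment inside the cluster
    """
    if abs(produced_dur - singleton_dur) < abs(produced_dur - cluster_dur):
        return "phonological planning error (segment behaves like a singleton)"
    return "motor planning error (segment still behaves like part of the cluster)"

# Invented durations for illustration only.
print(classify_deletion_error(produced_dur=95, singleton_dur=100, cluster_dur=70))
```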

However, as the authors point out, a significant limitation of this study is that the design is not experimental. Because experimental control was not established either within or across speakers, it is difficult to draw firm conclusions.

I find the paper to be of interest on two accounts nonetheless. Firstly, their hypothesis is exactly the same hypothesis that Tanya Matthews and I posed for children who present with phonological versus motor planning deficits. Secondly, their hypothesis is fully compatible with the application of a single subject randomization design. Therefore it provides me with an opportunity to follow through with my promise from the previous blog, to demonstrate how to set up this design for clinical research.

For her dissertation research, Tanya identified 11 children with severe speech disorders and inconsistent speech sound errors who completed our full experimental paradigm. These children were diagnosed with either a phonological planning disorder or a motor planning disorder using the Syllable Repetition Task and other assessments, as described in our recently published CJSLPA paper, available open access here. Using those procedures, we found that 6 had a motor planning deficit and 5 had a phonological planning deficit.

Then we hypothesized that the children with motor planning disorders would respond to a treatment that targeted speech motor control: much like Buchwald et al., it included repetition practice according to the principles of motor practice during the practice parts of the session, but during prepractice the children were taught to identify the target words and to identify mispronunciations of the target words so that they would be better able to integrate feedback and self-correct during repetition practice. Notice that direct and delayed imitation are important procedures in this approach. We called this the auditory-motor integration (AMI) approach.

For children with phonological planning disorders, we hypothesized that they would respond to a treatment based on principles similar to those suggested by Dodd et al (i.e., see the core vocabulary approach). Specifically, the children are taught to segment the target words into phonemes, associating the phonemes with visual cues. Then we taught the children to chain the phonemes back together into a single word. Finally, during the practice component of each session, we encouraged the children to produce the words, using the visual cues when necessary. An important component of this approach is that auditory-visual models are not provided prior to the child's production attempt; the child is forced to construct the phonological plan independently. We called this the phonological memory & planning (PMP) approach.

We also had a control condition that consisted solely of repetition practice (CON condition).

The big difference between our work and Buchwald et al. is that we tested our hypothesis using a single subject block randomization design, as described in our recent tutorial in the Journal of Communication Disorders. The design was set up so that each of the 11 children experienced all three treatments. We chose 3 treatment targets for each child, randomly assigned the targets to the three treatments, and then randomly assigned the treatments to the three sessions held each week, scheduled to occur on different days, 3 sessions per week for 6 weeks. You can see from the table below that each week counts as one block, so there are 6 blocks of 3 sessions, for 18 sessions in total. The randomization scheme was generated blindly and independently for each child using computer software. The diagram below shows the treatment schedule for one of the children with a motor planning disorder.

[Figure: Block randomization treatment schedule for one child with a motor planning disorder (TASC02)]
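
The tutorial describes the design in full; purely as an illustration, here is a sketch of how such a schedule could be generated in Python. The actual randomization schemes for the study were produced blindly with their own software, and the target names here are placeholders.

```python
import random

random.seed(42)   # each child's schedule was generated independently and blindly in the study

treatments = ["AMI", "PMP", "CON"]

# Step 1: randomly assign the child's three treatment targets to the three conditions.
child_targets = ["target A", "target B", "target C"]   # placeholders, not the real word lists
random.shuffle(child_targets)
target_for = dict(zip(treatments, child_targets))

# Step 2: within each weekly block, randomly order the three treatments across the three
# sessions; six blocks of three sessions gives the 18 sessions in total.
schedule = []
for week in range(1, 7):
    order = random.sample(treatments, k=3)
    for session, condition in enumerate(order, start=1):
        schedule.append((week, session, condition, target_for[condition]))

for row in schedule[:6]:   # show the first two blocks as a check
    print(row)
```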

This design allowed us to compare response to the three treatments within each child using a randomization test. For this child, the randomization test revealed a highly significant difference in favour of the AMI treatment as compared to the PMP treatment, as hypothesized for children with motor planning deficits. I don't want to scoop Tanya's thesis because she will finish it soon, before the end of 2017 I'm sure, but the long and the short of it is that we have very clear results in favour of our hypothesis using this fully experimental design and the statistics that are licensed by it. I hope you will check out our tutorial on the application of this design: we show how flexible and versatile this design can be for addressing many different questions about speech-language practice. There is much exciting work being done in the area of speech motor control, and this is a design that gives researchers and clinicians an opportunity to obtain interpretable results with small samples of children with rare or idiosyncratic profiles.
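
For readers who want to see what a randomization test involves, here is a minimal Monte Carlo sketch under the block randomization scheme described above. The session outcomes are invented, and the real analysis follows the procedures set out in the Journal of Communication Disorders tutorial.

```python
import random

random.seed(7)

# Invented per-session outcomes (e.g., percent correct on probe words) for one hypothetical
# child, listed block by block in the order the conditions actually occurred.
sessions = [
    # (block, condition, outcome)
    (1, "AMI", 30), (1, "PMP", 10), (1, "CON", 12),
    (2, "PMP", 15), (2, "AMI", 38), (2, "CON", 14),
    (3, "AMI", 45), (3, "CON", 18), (3, "PMP", 16),
    (4, "CON", 20), (4, "AMI", 52), (4, "PMP", 22),
    (5, "PMP", 25), (5, "AMI", 60), (5, "CON", 24),
    (6, "AMI", 66), (6, "PMP", 28), (6, "CON", 30),
]

def ami_minus_pmp(assignments):
    ami = [o for c, o in assignments if c == "AMI"]
    pmp = [o for c, o in assignments if c == "PMP"]
    return sum(ami) / len(ami) - sum(pmp) / len(pmp)

observed = ami_minus_pmp([(c, o) for _, c, o in sessions])

# Under the null hypothesis the three condition labels are exchangeable within each weekly
# block, so we re-randomize the labels block by block and recompute the test statistic.
count, n_resamples = 0, 5000
for _ in range(n_resamples):
    shuffled = []
    for block in range(1, 7):
        outcomes = [o for b, _, o in sessions if b == block]
        labels = random.sample(["AMI", "PMP", "CON"], k=3)
        shuffled.extend(zip(labels, outcomes))
    if ami_minus_pmp(shuffled) >= observed:
        count += 1

p_value = (count + 1) / (n_resamples + 1)   # Monte Carlo p-value with the usual +1 correction
print(f"AMI - PMP = {observed:.1f}, one-sided p = {p_value:.3f}")
```

Because the labels are re-randomized only within blocks, the test respects the design constraint that each treatment occurs exactly once per week.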

Reading

Buchwald, A., & Miozzo, M. (2012). Phonological and Motor Errors in Individuals With Acquired Sound Production Impairment. Journal of Speech, Language, and Hearing Research, 55(5), S1573-S1586. doi:10.1044/1092-4388(2012/11-0200)

Rvachew, S., & Matthews, T. (2017). Using the Syllable Repetition Task to Reveal Underlying Speech Processes in Childhood Apraxia of Speech: A Tutorial. Canadian Journal of Speech-Language Pathology and Audiology, 41(1), 106-126.

Rvachew, S., & Matthews, T. (2017). Demonstrating treatment efficacy using the single subject randomization design: A tutorial and demonstration. Journal of Communication Disorders, 67, 1-13. doi:10.1016/j.jcomdis.2017.04.003