Feedback Errors in Speech Therapy

I have been spending hours reviewing video of student SLPs (SSLPs) conducting speech therapy sessions, looking for snippets to take to my upcoming talks at ASHA2018. The students are impressively skilled with a very difficult CAS population, but after this many hours of watching, repeated examples of certain categories of errors pile up in the provision of feedback to children about their attempts to produce the target words, phrases, and sentences. I am going to provide some examples here with commentary.

In no way am I meaning any disrespect to the students, because it is my experience that the average person becomes an idiot when a camera is pointed at them. I recall hearing about studies on the “audience effect” as an undergraduate – the idea is that when your skills are shaky you get worse when someone is watching, but when your skills are excellent an audience actually enhances them. My social psychology prof said this even works for cockroaches! I can’t vouch for that but it certainly works for speech pathologists. I remember one time video-taping a session that was required for a course – I thought it went really well so I gave a copy to the parents and the course instructor. Later when watching it I could see clearly that for the whole half hour the child was trying desperately and without success to tell me that I was calling him by the wrong name (I had mixed him up with his twin brother whom I was also treating). I was oblivious to this during the live session but it was clear on the video.

Anyway, these examples are not reflections on the students’ skill levels overall, but they are examples of common feedback errors that I see in novice and experienced SLPs alike. Interestingly, the clinical educators (CEs) who were supervising these sessions rarely mentioned this aspect of the students’ practice. Readers may find this blog useful as a template for reviewing student practice.

Category 1: No feedback

Child: [repeats 5 different sentences containing the target /s/ cluster words]

SSLP: [Turns to CE.] “What did you get?” [This is followed by 1 minute and 40 seconds of conversation about the child’s level of accuracy and strategies to improve it on the next block of trials.]

SSLP: [Turns back to child.] “You need to sit up. You got 2 out of 5 correct. Now we’re going to count them on my fingers…”

Child: “Do we have to say these?”

Comment on vignette: In this case the SSLP did finally give feedback, but too late for it to be meaningful to the child, and after telling the child off for slouching in her chair! Other variants on this are taking notes about the child’s performance, turning to converse with the child’s parent, or getting caught up in the reinforcement game and forgetting to provide feedback. In CAS interventions it is common to provide feedback on a random schedule or to provide summative feedback after a block of trials. However, the child should be able to predict the block size and have information about whether their performance is generally improving or not. Even if the child does not have a count of the number or percentage of trials correct, the child should know that practice stimuli are getting more difficult, reflecting performance gains. Sometimes we deliberately plan to not provide feedback because we want the child to evaluate his or her own productions, but in these cases the child is told beforehand and is given a means of explicitly making that judgment (e.g., putting a token in a jar). Furthermore, the SSLP would be expected to praise the child for making accurate self-judgments or self-corrections. When the child does not get feedback or cannot track their own progress, they will lose interest in the activity. It is common for SSLPs to change the game, thinking that it is not motivating enough, but there is nothing more motivating than a clear sense of success!

Possible solutions: Video record sessions and ask students to watch for and count the frequency of events in which the child has not received expected feedback. Provide the child with visual guides to track progress, indexed either as correct trials or as the difficulty of the practice materials.

Category 2: Ambiguous feedback

SSLP: “Say [ska].”

Child: “[skak]”

SSLP: “OK, take the fish out.”

Comment on the vignette: In this case it is not clear if the SSLP is accepting the inexact repetition of her model. In our CAS interventions we expect the child to produce the model exactly, because metathesis and other planning errors are common, and therefore I would consider this production to be incorrect. Other examples of ambiguous feedback that I observed frequently were “Good try,” “Nice try,” and similar variants. In these cases the child has not received a clear signal that the “try” was incorrect. Another version of ambiguous feedback is to comment on the child’s behavior rather than the child’s speech accuracy (e.g., “You did it by yourself!”, in which case the “it” is ambiguous to the child and not clearly related to the accuracy of the child’s speech attempt).

Possible solutions: SSLPs really do not like telling children that they have said something incorrectly. Ask students to role play firm and informative feedback. Have the students plan a small number of clear phrases that are acceptable to them as indicators of correct and incorrect responses (e.g., “I didn’t hear your snake sound” may be more acceptable than “No, that’s wrong”). Post written copies of the phrases somewhere in the therapy room so that the SLP can see them. Track the use of vague phrases such as “nice try” and impose a mutually agreed but fun penalty for exceeding a threshold number (buying the next coffee round, for example). This works well if students are peer coaching.

Category 3: Mixed signals

SSLP: “Say [ska].”

Child: “[s:ka]”

SSLP: “Good job! Take the fish out.” [Frown on face].

Comment on the vignette: I am rather prone to this one myself due to strong concentration on next moves! But it is really unhelpful for children with speech and language delays who find the nonverbal message much easier to interpret than the verbal message.

Possible solutions: It would be better if SLP therapy rooms looked more like physiotherapy rooms. It annoys the heck out of me that we can’t get them outfitted with beautiful wall-to-ceiling mirrors. The child and SLP should sit or stand in front of the mirror when working on speech. Many games can be played using ticky tack, reusable stickers, or dry erase pens. The SLP will be more aware of the congruence or incongruence between facial expressions, body language, and verbal signals during the session.

Category 4: Feedback that reinforces the error

SSLP: “Repeat after me, Spatnuck” [this is the name of a rocket ship in nonsense word therapy].

Child: “fatnuck”

SSLP: “I think you said fatnuck with a [f:] instead of a [s:].”

Comment on the vignette: Some SSLPs provide this kind of feedback so frequently that the child hears as many models of the incorrect form as the correct form. This is not helpful! This kind of feedback after the error is not easy for young children to process. To help the child succeed, it would be better to change the difficulty level of the task itself and provide more effective support before the next trial. After attempts, recasting incorrect tries and imitating correct tries can help the child monitor their own attempts at the target.

Possible solutions: Try similar strategies as suggested for ambiguous feedback. Plan appropriate feedback in advance. Plan to say this when the incorrect response is heard: “I didn’t hear the snake sound. Let’s try just the beginning of the word, watch me: sss-pat.” And when “spat” is achieved, plan to say “Good, I heard spat, you get a Spatnuck to put in space.”

Category 5: Confused feedback

SSLP: “Oh! Remember to curl your tongue when you say shadow.”

SSLP: “Oh! You found another pair.”

Child: “It’s shell [sʷɛo].”

SSLP: “Oh! I like the way you rounded your lips. Where is your tongue? Remember to hide your tongue.”

SSLP: “Oh! You remembered where it was. You found another pair.”

Child: “Shoes [sʷuz].”

SSLP: “Oh! I like the way you rounded your tongue.”

Comment on vignette: In this vignette the SSLP is providing feedback about three aspects of the child’s performance: finding pairs when playing memory, rounding lips when attempting “sh” sounds, and in some cases anterior tongue placement when attempting the “sh” sound as well. One aspect of her feedback that is confusing when watching the video is the use of the exclamation “Oh!” Initially it appeared to signal an upcoming correction, but it became so constant that it was not a predictable signal of any kind of feedback. The exclamation had a negative valence to it, but it might precede either a correction or positive feedback. The SSLP also confused her feedback about lips and tongue, and it was not clear whether she was expecting the child to achieve the correct lip gesture, the correct tongue gesture, or both at the same time.

Possible solutions: This can happen when there is too much happening in a session. The CE could help the SSLP restructure the session so that she can focus her attention on one aspect of the child’s behavior at a time, like this: “I want you to name these five pictures. Each time I am going to watch your lips. When you are done you can put the pictures on the table and mix them up for our game later.” If the child rounds the lips each time, switch to focusing on the tongue. When the ten cards are on the table, play memory, modeling the picture names. In this way the three behaviors (rounding lips, retracting the tongue, finding pairs) are separated in time and the SSLP can focus attention on each one with care, providing the appropriate feedback repeatedly during each interval.

Category 6: Confused use of reinforcement materials

SSLP: “Repeat after me, [ska].”

Child: “[θak]”

SSLP: [ska]

Child: “[θak]”

SSLP: “OK, take the fish out.”

SSLP: “Repeat after me, [ska].”

Child: [ska]

SSLP: “There you got it, take the fish out.”

SSLP: “Repeat after me, [ska].”

Child: [ska]

SSLP: “Good, and the last one, [ska].”

Child: [ska]

SSLP: “That’s good, take the fish out.”

Comment on vignette: In this vignette the child cannot tell if he gets a fish for correct answers or wrong answers or any answer. It is even worse if the child has been told that he will get a fish for each correct answer. Sometimes a student will say, “Everything was going fine, we were having fun, and then he just lost it!” When you look at the video, you see exchanges such as the one reproduced here leading up to a tantrum by the child. The SSLP has broken a promise to the child. Children don’t forgive that.

Possible solutions: This one is hard because it is a classic rookie mistake. Experience is the best cure. Reducing the number of tasks that the SSLP must do simultaneously may help. Therefore, in the early sessions the CE might keep track of the child’s correct and incorrect responses for the SSLP and allow her to focus on managing the materials and the child’s behavior. SSLPs would never think of this, but it is possible to let the child manage the reinforcement materials themselves in some cases. One of our favorite vignettes, reprinted on page 463 of DPD2e (Case Study 9-4), involved an error detection activity in which the child could put toy animals in the barn but only when the SSLP said the names of the animals correctly. The child had the toys in his hands throughout the activity. He would not put them in the barn unless the clinician said the words correctly and would get annoyed if she said them wrong, telling her “you have to say cow [kau]!” SSLPs can learn that it is not necessary to control everything.

I offer these examples for students, clinical educators, and speech-language pathologists, and I hope that you will have fun finding these feedback mishaps in your own sessions. If you come up with better strategies to avoid them than those I have suggested here, please share them in the comments.


L2 learning of new phonetic contrasts: How hard is that?

Given my career-long interest in the impact of perceptual knowledge on speech production learning, it was gratifying to read the meta-analysis by Sakai and Moorman (2018) that concluded, “Ultimately, the present meta-analysis was able to show that perception-only training can lead to production gains. This finding is encouraging to L2 instructors and learners.” They found that, when teaching adult L2 learners to perceive a foreign language phonetic contrast, an average effect size of .92 was obtained for gains in perception while an average effect size of .54 was obtained for gains in production accuracy for the same phonetic contrast. They reviewed studies in which no production practice or training was provided and therefore these changes in production accuracy were a direct effect of the perceptual training procedure.

Francoise and I were puzzled by one part of their paper, however. They excluded 12 studies because they claimed that they were unable to obtain the data required for the calculation of effect sizes from the authors. Our prior study on training English speakers to perceive French vowels was excluded on this basis despite the fact that I am right here with the data all neat and tidy in spreadsheets (I guess these things happen, although it is the second time this has happened to me now, so it is getting to be annoying). Nonetheless, we have all the data required to calculate those effect sizes, so I provide them here for each group. Our study manipulated variability in the talkers (multiple versus single talker) and the position of the training vowels on the continuum (far from the category boundary, prototypical location in category space, and close to the category boundary); together with a control condition in which listeners categorized grammatical items rather than vowel tokens, this yields seven conditions: control (CON), single voice prototype (SVP), multiple voice prototype (MVP), single voice far (SVF), multiple voice far (MVF), single voice close (SVC), multiple voice close (MVC). I provide the effect sizes below, calculated as described by Sakai and Moorman. The figures in our paper reflect our statistical analyses, which indicated a reliable effect of the training on perception in the MVF and MVC conditions but, to our disappointment, no reliable effect on production. All groups improved production (including the control group) when acoustic measures were considered, but these acoustic changes were not perceptible to native French listeners; that is, there was no significant effect of time (pre to post training) or condition and no interaction when we submitted listener ratings of the participants’ productions to a repeated measures ANOVA. Nonetheless, some moderate effect sizes are seen below in the SVC and MVC conditions, relative to the CON condition. Two effect sizes are reported for the perception outcomes and the production outcomes: ES(PP), which reflects pre- to post-training changes, and ES(PPC), which reflects the difference between the change observed in the experimental group versus the control group.
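For readers who want to compute comparable numbers from their own data, here is a minimal sketch in Python (my own formulation, not necessarily the exact equations used by Sakai and Moorman) of the two kinds of effect size reported here: ES(PP) as a standardized pre- to post-training gain, and ES(PPC) as the gain in a training condition relative to the gain in the control condition. The scores are hypothetical.

```python
# Sketch of ES(PP) and ES(PPC); scores below are made up for illustration.
import numpy as np

def es_pp(pre, post):
    """Standardized mean gain: (M_post - M_pre) / SD_pre."""
    pre, post = np.asarray(pre, float), np.asarray(post, float)
    return (post.mean() - pre.mean()) / pre.std(ddof=1)

def es_ppc(pre_t, post_t, pre_c, post_c):
    """Gain in the training group minus gain in the control group,
    standardized by the pooled pretest standard deviation."""
    pre_t, post_t = np.asarray(pre_t, float), np.asarray(post_t, float)
    pre_c, post_c = np.asarray(pre_c, float), np.asarray(post_c, float)
    sd_pooled = np.sqrt(((len(pre_t) - 1) * pre_t.var(ddof=1) +
                         (len(pre_c) - 1) * pre_c.var(ddof=1)) /
                        (len(pre_t) + len(pre_c) - 2))
    return ((post_t.mean() - pre_t.mean()) -
            (post_c.mean() - pre_c.mean())) / sd_pooled

# Hypothetical percent-correct identification scores for one training
# condition (e.g., MVF) and the control condition.
mvf_pre, mvf_post = [55, 60, 48, 62, 58], [70, 72, 60, 75, 68]
con_pre, con_post = [56, 59, 50, 61, 57], [58, 60, 52, 63, 59]

print("ES(PP) :", round(es_pp(mvf_pre, mvf_post), 2))
print("ES(PPC):", round(es_ppc(mvf_pre, mvf_post, con_pre, con_post), 2))
```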

[Table: Brosseau-Lapré et al. (Applied Psycholinguistics), perception effect sizes]

This table reports similar findings to those described by Sakai and Moorman in that the ES(PPC) is considerably larger than the ES(PP), although this latter ES is smaller than the mean ES that they found. However, we had not expected all of our conditions to be equally effective. Overall, we concluded that training with multiple talkers was most effective when listeners were presented with a range of vowel stimuli that were far from the category boundary; however, training with a single voice was most effective when listeners were presented with vowel stimuli that were close to the category boundary.

Regarding production outcomes, however, the figures with error bars in our paper really tell the story most clearly, showing no reliable effect of the perception training on production outcomes from the listener’s perspective despite large changes in acoustic parameters for all groups, including the control group, suggesting that the perception testing alone (as conducted pre and post training) has at least a short-term effect on production.

[Table: Brosseau-Lapré et al. (Applied Psycholinguistics), production effect sizes]

With respect to changes in production, these effect size data might suggest an effect of perception training on production in the single voice close and multiple voice close conditions, but overall there were no statistically significant findings for production accuracy, and despite small improvements in the SVC and MVC conditions, the average ratings are not good. This is one issue I always have with meta-analyses: they are concerned with the size of “effects” as measured by d, but the d values do not tell you whether the “effects” in any of the studies so aggregated were actual effects, that is, statistically or even functionally significant. Now, theoretically, if you aggregate a lot of moderate effect sizes from a lot of underpowered studies, they could add up to something, but in this case I think we have a picture of an effect that is very idiosyncratic. We really don’t know why some L2 participants learn these contrasts and some don’t in the perception or production domains. Sakai and Moorman do us a service by exploring some potential sources of heterogeneity in outcomes. It is possible that many training sessions targeting about three contrasts over a three-hour total training period and completed at home may be optimal. Furthermore, in terms of participant characteristics, beginners make more progress than intermediate-level learners. Overall, however, the characteristics of successful versus unsuccessful learners are not clear despite a growing number of studies that examine underlying perceptual and cognitive skills as predictors. Personally, I find it a bit discouraging to read the Sakai and Moorman paper. The authors were quite excited to find a reliable moderate effect size across a growing number of studies. But I know that those effect sizes are associated with a lot of people who cannot produce a foreign language speech sound that sounds even halfway “native-like.” After forty years of work involving quite sophisticated methods for designing stimuli and training regimens, I thought we would be further along. Definitely more work to do on this problem.


Words are where it’s at

It is probable that you have seen at least references to the Sperry, Sperry & Miller (2018) paper in Child Development because it made a big splash in the media–the claim of “challenging” Hart & Risley’s (1992/1995) finding of a “30 million word gap” in language input to children with “poor” versus “professional” parents caused a lot of excitement. You may not have seen the follow-up commentary by Golinkoff, Hoff et al. and then the reply by Sperry et al. I want to say something about vocabulary and phonological processing with these papers as a jumping-off point, which means I have to summarize their debate as efficiently as I can–no easy task because there is a lot going on in those papers, but here is the short version:

  1. Sperry, Sperry & Miller (paradoxically) present a very good review of the literature showing that cross-culturally, language input provided directly TO children (i.e., the words that the child hears) predicts language outcomes. This is true even though there is a lot of variation in how much language is directed to children (versus spoken in the vicinity of children) across different cultures and SES strata: regardless of those differences, it is the child-directed speech that matters to rate of language development. Nonetheless, they make the point that overheard speech is understudied and present data indicating that children in poor families across a variety of different ethnic communities might hear and overhear as much speech as middle-class children. Their study is not at all like the Hart & Risley study and therefore the media claims of a “failure to replicate” are inappropriate and highly misleading.
  2. Golinkoff, Hoff et al. reiterate the argument that they have been making for decades: children do well in school when they have good language skills; language skills are driven by the quantity and quality of child-directed inputs provided. The focus on the 30 million word gap has led to the development of effective parenting practices and interventions, and a de-emphasis on Hart & Risley’s findings would be harmful to children in lower SES families (note that no one is arguing that all poor children receive inadequate inputs or that all racialized children are poor either; these are straw-man arguments).
  3. Sperry et al reply to this comment by saying “Based on the considerable research already cited here and in our study, we assert that it is a mistake to claim that any group has poor language skills simply because their skills are different. Furthermore, we believe that as long as the focus remains on isolated language skills (such as vocabulary) defined by mainstream norms, testing practices, and curricula, nonmainstream children will continue to fail. We believe that low-income, working class, and minority children would be more successful in school if pedagogical practices were more strongly rooted in a strengths-based approach…”

We can all get behind a call for culturally sensitive and fair tests, I am sure. As speech-language pathologists we are very motivated to take a strengths-based approach to assessment as well. It is also important to understand that when mothers are talking with their children, they are not transmitting words alone, but also culture. Richman et al. (1992) describe how middle-class mothers in Boston engaged their infants in “emotionally arousing conversational interactions” whereas Gusii mothers “see themselves as protecting their infants” and focused on soothing interactions that moderated emotional excitement; in this same paper, increased maternal schooling was observed to be associated with increased verbal responsiveness to infants by Mexican mothers when compared to mothers from the same community with less education. Therefore, encouraging a “western” style of mother-infant vocal interaction may well conflict with the maternal role of enculturating her infant to valid social norms that differ from western or mainstream values. The call to respect those cultural norms, reflected in Sperry et al.’s reply, obviously deserves more serious consideration than shown in Golinkoff et al.’s urgent plea to maximize vocabulary size.

Nonetheless, Sperry et al. are engaging in some wishful thinking when they claim that “young children in societies where they are seldom spoken to nonetheless attain linguistic milestones at comparable rates.” In fact, the only evidence they point to in support of this claim pertains to pointing as a form of nonverbal communication. While it is evidently true that culture is a strong determinant of mother-infant interactional style, it makes no sense to argue that differences in the style of interaction and the amount of linguistic input make no difference to language learning. Teaching interactions vary with culture but learning mechanisms do not (unless you are arguing that there are substantial genetic variations in neurolinguistic mechanisms across ethnic groups, and I am absolutely not arguing that; quite the opposite). When Linda Polka and I were studying selective attention in infant speech perception development we talked about speech intake as opposed to speech input. Certainly there may be different ways to engage the infant’s attention, but ultimately the amount and quality of linguistic input that the child actively receives will impact the time course of language development. Immigrant parents may not be aiming for western middle-class outcomes for their children, but when they are, tools to increase vocabulary size in the majority language will be essential.

The other part of Sperry et al.’s argument is that children who are not middle-class speakers of English in North America might have strengths in other aspects of language (storytelling, for example) that must be valued. Vocabulary is deemed to be an isolated skill. This is the part of their argument that I find to be most problematic. Vocabulary is central to all aspects of language learning: phonology and phonological processing, morphology, and syntax, in the oral and written domains. Words are the heart and soul of language and language learning. It is difficult to understand how the child could achieve excellence as a storyteller without a good vocabulary. Furthermore, vocabulary is not learned in isolation from all those other aspects of language, including the social, pragmatic, and cultural. For those children receiving speech-language pathology services, a large vocabulary is protective: if the child for whatever reason has phonological or language processing deficits that make it difficult to learn phonological awareness or decoding skills or morphology or syntax, a large vocabulary can help compensate for those weaknesses. For a speech-language pathologist, a strengths-based perspective may well mean engaging all the people in the child’s environment to build on the child’s vocabularies in the home and school languages as a means of compensating for difficulties in these other areas of language. More typically, what I see is a narrow focus on phonological awareness or morphology or syntax because these skills are weaker and presumably more “important.” But vocabulary is one area where nonprofessionals, paraprofessionals, and other professionals can make a huge difference, and what a difference it makes!

Further to this topic I am adding below an excerpt from our book Rvachew & Brosseau-Lapré (2018) along with the associated Figure. I also recommend papers by Noble and colleagues on the neurocognitive correlates of reading (an effect that I am sure is also mediated by vocabulary size).

“Vocabulary skills may be an area of relative strength for children with DPD and therefore it may seem unnecessary to teach their parents to use dialogic reading techniques to facilitate their child’s vocabulary acquisition. If the child’s speech is completely unintelligible, low average vocabulary skills are not likely to be the SLP’s highest priority and with good reason! However, good vocabulary skills may be a protective factor for children with DPD with respect to literacy outcomes. Rvachew and Grawburg (2006) conducted a cluster analysis based on the speech perception, receptive vocabulary, and phonological awareness test scores of children with DPD. The results are shown graphically in Figure 9–2. In this figure, receptive vocabulary (PPVT–III) standard scores are plotted against speech perception scores (SAILS; /k/, /s/, /l/, and /ɹ/ modules), with different markers for individual children in each cluster. The figure legend shows the mean phonological awareness (PA) test score for each cluster. The normal limits for PPVT performance are between 85 and 115. The lower limit of normal performance on the SAILS test is a score of approximately 70% correct. Clusters 3 and 4 achieved a mean PA test score within normal limits (i.e., a score higher than 15), whereas Clusters 1 and 2 scored below normal limits on average. The figure illustrates that the children who achieved the highest PA test scores had either exceptionally high vocabulary test scores or very good speech perception scores. The cluster with the lowest PPVT–III scores demonstrated the poorest speech perception and phonological awareness performance. These children can be predicted to have future literacy deficits on the basis of poor language skills alone (Peterson, Pennington, Shriberg, & Boada, 2009). The contrast between Clusters 2 and 3 shows that good speech perception performance is the best predictor of PA for children whose vocabulary scores are within the average range. All children with exceptionally high vocabulary skills achieved good PA scores, however, even those who scored below normal limits on the speech perception test. The mechanism for this outcome is revealed by studies that show an association between vocabulary size and language processing efficiency in 2-year-old children that in turn predicts language outcomes in multiple domains over the subsequent 6 years (Marchman & Fernald, 2008). Individual differences in processing efficiency may reflect in part endogenous variations in the functioning of underlying neural mechanisms; however, research with bilingual children shows that the primary influence is the amount of environmental language input. Greater exposure to language input in a given language “deepens language specific, as well as language-general, features of existing representations [leading to a] synergistic interaction between processing skills and vocabulary learning” (Marchman & Fernald, 2008, p. 835). More specifically, a larger vocabulary size provides access to sublexical segmental phonological structure, permitting faster word recognition, word learning, and metalinguistic understanding (Law & Edwards, 2015). From a public health perspective, teaching all parents to maximize their children’s language development is part of the role of the SLP. For children with DPD it is especially important that parents not be so focused on “speech homework” that daily shared reading is set aside.
SLPs can help the parents of children with DPD use shared reading as an opportunity to strengthen their child’s language and literacy skills and provide opportunities for speech practice” (p. 469).

[Figure 9–2: Rvachew & Brosseau-Lapré (2018), receptive vocabulary plotted against speech perception scores by cluster]
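For readers curious about the mechanics, here is a minimal sketch in Python (my own illustration with simulated scores, not the original analysis code) of a cluster analysis of this general kind: cluster children on speech perception, receptive vocabulary, and phonological awareness scores, then plot vocabulary against perception with a different marker for each cluster, as in Figure 9–2.

```python
# Sketch of a cluster analysis and plot in the style of Figure 9-2.
# All scores are simulated; nothing here reproduces the published data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 40                                        # hypothetical sample of children
perception = rng.uniform(40, 95, n)           # SAILS-like percent correct
vocabulary = rng.normal(95, 12, n)            # PPVT-like standard scores
pa = 5 + 0.1 * perception + 0.08 * vocabulary + rng.normal(0, 2, n)

# Standardize the three measures and form four clusters.
X = StandardScaler().fit_transform(np.column_stack([perception, vocabulary, pa]))
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

fig, ax = plt.subplots()
for k, marker in zip(range(4), ["o", "s", "^", "D"]):
    mask = labels == k
    ax.scatter(perception[mask], vocabulary[mask], marker=marker,
               label=f"cluster {k + 1} (mean PA = {pa[mask].mean():.1f})")
ax.set_xlabel("speech perception (% correct)")
ax.set_ylabel("receptive vocabulary (standard score)")
ax.legend()
plt.show()
```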


Conversations with SLPs: Nonword Practice Stimuli

I often answer queries from speech-language pathologists about their patients or more abstract matters of theory or clinical practice and sometimes the conversations are general enough to turn into blog topics. On this occasion I was asked my opinion about a specific paper with the question being generally about the credibility of the results and applicability of the findings to clinical practice:

Gierut, J., Morrisette, M. L., & Ziemer, S. M. (2010). Nonwords and generalization in children with phonological disorders. American Journal of Speech-Language Pathology, 19, 167-177.

In this paper the authors conduct a retrospective review of post-treatment results obtained from 60 children with a moderate-to-severe phonological delay who had been treated in the context of research projects gathered under the umbrella of the “learnability project”. Half of these children had been taught nonwords and the remainder real words, representing phonemes for which the children demonstrated no productive phonological knowledge. The words (both the nonword targets and the real word targets) were taught in association with pictured referents, first in imitation and then in spontaneous production tasks. Generalization to real word targets was probed post-treatment. Note that the phonemes probed included those that were treated and any others that the child did not produce accurately at baseline. The results show an advantage to treated over untreated phonemes that is maintained over a 55-day follow-up interval. Greater generalization was observed for children who received treatment for nonwords compared to those children who received treatment for real words, but only for treated phonemes and only immediately post-treatment, because over time the children who received treatment for real words caught up to the other group.

OK, so what do I think about this paper? Overall, I think that it provides evidence that it is not harmful to use nonwords in treatment, which is a really nice result for researchers. As Gierut et al. explain, nonwords are handy because “they have been incorporated into research as a way of ensuring experimental control within and across children and studies.” They can be designed to target the specific phonological strengths and needs of each child, and it is very unlikely that the family or school personnel will practice them outside of clinic, and therefore it is possible to conclude that change is due to the experimental manipulation. Gierut et al. go one step further, however, and conclude that nonword stimuli might offer an advantage for generalization learning because “the newness of the treated items might reduce interference from known words.” Here I think that the evidence is weaker simply because this is a nonexperimental study. The retrospective nature of the study, and the fact that children were not randomly assigned within a single cohort to be taught with one set of stimuli versus the other while holding other aspects of the design constant, limit the conclusions that one can draw. For example, the authors point out that the children who were treated with nonwords received more treatment sessions than those treated with real words. Therefore, in terms of clinical implications, the study does not offer much guidance to the SLP beyond suggesting that there may be no harm in using nonword stimuli if the SLP has specific reasons for doing so.

We can offer experimental prospective evidence on this topic from my lab, however. It is also limited in that it involves only two children, but they were both treated with a single-subject randomization design that provides excellent internal validity. This study was conducted by my former student Dr. Tanya Matthews with support from Marla Folden, M.Sc., S-LP(C). The interventions were provided by McGill students in speech-language pathology who were completing their final internship. The two children presented with very different profiles: TASC02 had childhood apraxia of speech with an accompanying cognitive delay and ADHD. TASC33 presented with a mild articulation delay and verbal and nonverbal IQ within normal limits.

Both children were treated according to the same protocol: they received 18 treatment sessions, provided 3 per week for 6 weeks. Each week they experienced three different treatment conditions, each randomly assigned to one of the 3 sessions and paired with a unique target, as shown in the table below for the two children. Each session consisted of a prepractice portion and a practice portion. The prepractice was either Mixed Procedures (auditory bombardment, error detection tasks, phonetic placement, segmentation and chaining of segments with the words) or Control (no prepractice). In all three conditions, practice was high-intensity practice employing principles of motor learning.

[Table: real word vs nonword conditions]

Random assignment of condition/target pairs to sessions within weeks permits the use of resampling tests to determine if there are statistically significant differences in outcomes as a function of treatment condition. Outcomes were assessed via imitation probes that were administered at the end of each treatment session to measure generalization to untreated items (same-day probes) and probes that were administered approximately 2 days later (at the beginning of the next treatment session) to measure maintenance of those learning gains (next-day probes). The next table shows the mean probe scores by condition and child, the test statistic (squared mean differences across conditions), and the associated p value for the treatment effect for each child.

[Table: real word vs nonword outcomes]

The data shown in this table reveal no significant results for either child for same-day or next-day probe scores. In other words, there was no advantage to prepractice over no prepractice, and there was no advantage to nonword practice over real word practice.
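For the curious, here is a minimal sketch in Python (with made-up probe scores, not Tanya’s actual analysis code) of the kind of within-week randomization (resampling) test described above: the observed test statistic is compared with its distribution under re-randomization of conditions to sessions within each week.

```python
# Sketch of a within-week randomization test for a single-case design
# with three conditions; the probe scores below are hypothetical.
import itertools
import numpy as np

# Hypothetical same-day probe scores (percent correct), 6 weeks x 3 sessions.
scores = np.array([
    [20, 35, 25],
    [30, 40, 35],
    [25, 45, 30],
    [40, 50, 45],
    [35, 55, 40],
    [45, 60, 50],
], dtype=float)

# Observed assignment of conditions (0, 1, 2) to the sessions of each week.
observed = np.array([
    [0, 1, 2],
    [2, 0, 1],
    [1, 2, 0],
    [0, 2, 1],
    [1, 0, 2],
    [2, 1, 0],
])

def test_statistic(assignment):
    """Sum of squared pairwise differences among condition means."""
    means = [scores[assignment == c].mean() for c in range(3)]
    return sum((a - b) ** 2 for a, b in itertools.combinations(means, 2))

obs_stat = test_statistic(observed)

# Monte Carlo approximation of the randomization distribution:
# re-randomize conditions to sessions independently within each week.
rng = np.random.default_rng(1)
n_resamples = 10_000
count = 0
for _ in range(n_resamples):
    shuffled = np.array([rng.permutation(week) for week in observed])
    if test_statistic(shuffled) >= obs_stat:
        count += 1

p_value = (count + 1) / (n_resamples + 1)
print(f"observed statistic = {obs_stat:.2f}, p = {p_value:.3f}")
```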

We hope to publish some data soon that suggests that the specific type of prepractice might make a difference for certain children. But overall the most important driver of outcomes for children with speech sound disorders seems to be practice and lots of it.

Reproducibility: Which Levers?

I was reading about health behavior change today and I was reminded that there is a difference between a complicated system and a complex system (D.T. Finegold and colleagues) and it crystallized for me why the confident pronouncements of the reproducibility folks strike me as earnest but often misguided. If you think about it, most laboratory experiments are complicated systems that are meant to be roughly linear: There may be a lot of variables and many people involved in the manipulation or measurement of those variables, but ultimately those manipulations and measurements should lead to observed changes in the dependent variable and then there is a conclusion; by linear system I mean that these different levels of the experiment are not supposed to contaminate each other. There are strict rules and procedures, context-specific of course, for carrying out the experiment, and all the people involved need to be well trained in those procedures and they must follow the rules for the experiment to have integrity. Science itself is another matter altogether. It is a messy nonlinear dynamic complex system from which many good and some astounding results emerge, not because all the parts are perfect, but in spite of all the imperfection and possibly because of it. Shiffrin, Börner and Stigler (2018) have produced a beautiful long read that describes this process of “progress despite irreproducibility.” I will leave it to them to explain it since they do it so well.

I am certain that the funders and the proponents of all the proposals to improve science are completely sincere, but we all know that the road to hell is paved with good intentions. The reason that the best intentions are not going to work well in this case is that the irreproducibility folks are trying to “fix” a complex system by treating it as if it is a complicated problem. Chris Chambers tells a relatively simple tale in which a journal rejects a paper (according to his account) because a negative result was reported honestly, which suggests that a focus on positive results rewards cheating to get those results and voilà: the solution is to encourage publication without the results. This idea is fleshed out by Nosek et al. (2018) in a grand vision of a “preregistration revolution” which cannot possibly be implemented as imagined or result in the conceived outcomes. All possible objections have been declared to be false (bold print by Chris Chambers) and thus they have no need of my opinion. I am old enough to be starting my last cohort of students so I have just enough time to watch them get tangled up in it. I am a patient person. I can wait to see what happens (although curiously no objective markers of the success of this revolution have been definitively put forward).

But here’s the thing. When you are predicting the future you can only look to the past. So here are the other things that I read today that lead me to be quite confident that although science will keep improving itself as it always has done, at least some of this current revolution will end up in the dust. First, on the topic of cheating, there is quite a big literature on academic cheating by undergraduate students, which is directly relevant to the reproducibility movement. You will not be surprised to learn that (perceived) cheating is contagious. It is hard to know the causal direction – it is probably reciprocal. If a student believes that everyone is cheating, the likelihood that the student will cheat is increased. Students who cheat believe that everyone else is cheating regardless of the actual rate of cheating. Students and athletes who are intrinsically versus extrinsically motivated are also less likely to cheat, so it is not a good idea to undermine intrinsic motivation with excessive extrinsic reward systems, especially those that reduce perceived autonomy. Cheating is reduced by “creating a deeply embedded culture of integrity.” Culture is the important word here because most research and most interventions target individuals, but it is culture and systems that need to be changed. Accomplishing a culture of integrity includes (perhaps you will think paradoxically) creating a trusting and supportive atmosphere with reduced competitive pressures while ensuring harsh and predictable consequences for cheating. The reproducibility movement has taken the path of deliberately inflating the statistics on the prevalence of questionable research practices with the goal of manufacturing a crisis, under the mistaken belief that the crisis narrative is necessary to motivate change, when it is more likely that this narrative will actually increase cynicism and mistrust, having exactly the opposite effect.

The second article I read that was serendipitously relevant was about political polarization. Interestingly, it turns out that perceived polarization reduces trust in government, whereas actual polarization between groups is not predictive of trust, political participation, and so on. It is very clear to me that the proponents of this movement are deliberately polarizing and have been since the beginning, setting hard scientists against soft, men against women, and especially the young against the old (I would point to parts of my twitter feed as proof of this but I don’t need to contaminate your day with that much negativity; suffice to say it is not a trusting and supportive atmosphere). The Pew Center shows that despite decades of a “war against science” we remain one of the most trusted groups in society. It is madness to destroy ourselves from within.

A really super interesting event that happened in my tweet feed today was the release of the report detailing the complete failure of the Gates Foundation $600M effort to improve education by waving sticks and carrots over teachers with the assumption that getting rid of bad teachers was a primary “lever” that when pulled would spit better-educated minority students out the other end (seriously, they use the word levers, it cracks me up; talk about mistaking a complex system for a complicated one). Anyway, it didn’t work. The report properly points out that the disappointing results may have occurred because their “theory of action” was wrong. There just wasn’t enough variability in teacher quality even at the outset for all that focus on teacher quality to make that much difference, especially since the comparison schools were engaged in continuous improvement in teacher quality as well. But of course the response on twitter today has been focused on teacher quality: many observers figure that the bad teachers foiled the attempt through resistance, of course! The thing is that education is one of those systems in our society that actually works really well, kind of like science. If you start with the assumption that the scientists are the problem and if you could just get someone to force them to shape up (see daydream in this blog by Lakens in which he shows that he knows nothing about professional associations despite his excellence as a statistician)…well, I think we have another case of people with money pulling on levers with no clue what is behind them.

And finally, let’s end with the Toronto Star, an excellent newspaper, which has a really long read (sorry, it’s long but really worth your time) describing a dramatic but successful change in a nursing home for people with dementia. It starts out as a terrible home for people with dementia and becomes a place you would (sadly but confidently) place your family member. This story is interesting because you start with the sense that everyone must have the worst motives in order for this place to be this bad—caregivers, families, funders, government—and end up realizing that everyone had absolutely the best intentions and cared deeply for the welfare of the patients. The problem was an attempt to manage the risk of error and place that goal above all others. You will see that the result of efforts to control error from the top down created the hell that the road paved with good intentions must inevitably create.

So this is it: I may be wrong, and if I am it will not be the first time. But I do not think that scientists have been wasting their time for the last 30 years, as one young person declared so dramatically in my twitter feed. I don’t think that they will waste the next 30 years either, because they will mostly keep their eye on whatever it is that motivated them to get into this crazy business. Best we support and help each other and let each other know when we have improved something, but at the same time not get too caught up in trying to control what everyone else is doing. Unless of course you are so disheartened with science you would rather give it up and join the folks in the expense account department.

Post-script on July 7, 2018: Another paper to add to this grab-bag:

Kaufman, J. C., & Glăveanu, V. P. (2018). The road to uncreative science is paved with good intentions: Ideas, implementations, and uneasy balances. Perspectives on Psychological Science, 13(4), 457-465. doi:10.1177/1745691617753947

I liked this perspective on science:

“The propulsion model is concerned with how a creative work affects the field. Some types of contributions stay within the existing paradigm. Replications, at the most basic level, aim to reproduce or recreate a past successful creation, whereas redefinitions take a new perspective on existing work. Forward or advance forward incrementations push the field ahead slightly or a great deal, respectively. Forward incrementations anticipate where the field is heading and are often quite successful, whereas advance forward incrementations may be ahead of their time and may be recognized only retrospectively. These categories stay within the existing paradigm; others push the boundaries. Redirections, for example, try to change the way a field is moving and take it in a new direction. Integrations aim to merge two fields, whereas reinitiation contributions seek to entirely reinvent what constitutes the field.”


Reproducibility: Solutions (not)

Let’s go back to the topic of climate change, since BishopBlog started this series of blogposts off by suggesting that scientists who question the size of the reproducibility crisis are playing a role akin to climate change deniers, by analogy with Oreskes and Conway’s argument in Merchants of Doubt. While some corporate actors have funded doubting groups in an effort to protect their profits, as I discussed in my previous blogpost, others have capitalized on the climate crisis to advance their own interests. Lyme disease is an interesting case study in which public concern about climate change gets spun into a business opportunity, like this: climate change → increased ticks → increased risk of tick bites → more people with common symptoms including fever and fatigue and headaches → that add up to Chronic Lyme Disease Complex → need for repeated applications of expensive treatments → such as for example chelation because heavy metal toxicities. If I lost you on that last link, well, that’s because you are a scientist. But nonscientists like Canadian Members of Parliament got drawn into this and now a federal framework to manage Lyme Disease is under development because the number of cases almost tripled over the past five years to, get this, not quite 1000 cases (confirmed and probable). The trick here is that if any one of the links seems strong to you the rest of the links shimmer into focus like the mirage that they are. And before you can blink, individually and collectively, we are hooked into costly treatments that have little evidence of benefits and tenuous links to the supposed cause of the crisis.

The “science in crisis” narrative has a similar structure with increasingly tenuous links as you work your way along the chain: pressures to publish → questionable research practices → excessive number of false positive findings published → {proposed solution} → {insert grandiose claims for magic outcomes here}. I think that all of us in academia at every level will agree that the pressures to publish are acute. Public funding of universities has declined in the U.K., the U.S., and Canada, and I am sure in many other countries as well. Therefore the competition for students and research dollars is extremely high, and governments have even made what little funding there is contingent upon the attraction of those research dollars. Consequently, there is overt pressure on each professor to publish a lot (my annual salary increase is partially dependent upon my publication rate, for example). Furthermore, pressure has been introduced by deliberately creating a gradient of extreme inequality among academics, so that expectations for students and early-career researchers are currently unrealistically high. So the first link is solid.

The second link is a hypothesis for which there is some support, although it is shaky in my opinion due to the indirect nature of the evidence. Nonetheless, it is there. Chris Chambers tells this curious story where, at the age of 22 and continuing forward, he is almost comically enraged that top journals will not accept work that is good quality because the outcome was not “important” or “interesting.” And yet there are many lesser-tier journals that will accept such work, and many researchers have made a fine career publishing in them until such time as they were lucky enough to happen upon whatever it was that they devoted their career to finding out. The idea that luck, persistence, and a lifetime of subject knowledge should determine which papers get into the “top journals” seems right to me. There is a problem when papers get into top journals only because they are momentarily attention-grabbing, but that is another issue. If scientists are tempted to cheat to get their papers in those journals before their time, they have only themselves to blame. One big cause of “accidental” findings that end up published in top or middling journals seems to be low power, however, which can lead to all kinds of anomalous outcomes that later turn out to be unreliable. Why are so many studies underpowered? Those pressures to publish play a role, as it is possible to publish many small studies rather than one big one (although curiously it is reported that publication rates have not changed in decades after co-authorship is controlled, even though it seems undeniable that the pressure to publish has increased in recent times). Second, research grants are chronically too small for the proposed projects. And those grants are especially too small for women and in fields of study that are quite frankly gendered. In Canada this can be seen in a study of grant sizes within the Natural Sciences and Engineering Research Council and by comparing the proportionately greater size of cuts to the Social Sciences and Humanities Research Council.

So now we get to the next two links in the chain. I will focus on one of the proposed solutions to the “reproducibility crisis” in this blog and come back to others in future posts. There is a lot of concern about too many false positives published in the literature (I am going to let go the question about whether this is an actual crisis or not for the time being and skip to the next link, solutions for that problem). Let’s start with the suggestion that scientists dispense with the standard alpha level of .05 for significance and replace it with p < .005, which was declared recently by a journalist (I hope the journalist and not the scientists in question) to be a raised standard for statistical significance. An alpha level is not a standard. It is a way of indicating where you think the balance should be between Type I and Type II error. But in any case, the proposed solution is essentially a semantic change. If a study yields a p-value between .05 and .005, the researcher can say that the result is “suggestive,” and if it is below .005 the researcher can say that it is significant, according to this proposal. The authors say that further evidence would need to accumulate to support suggestive findings, but of course further evidence would need to accumulate to confirm the suggestive and the significant findings (it is possible to get small p values with an underpowered study, and I thought the whole point of this crisis narrative was to get more replications!). However, with this proposal the idea seems to be to encourage studies to have a sample size 70% larger than is currently the norm. This cost is said to be offset by the benefits, but, as Timothy Bates points out, there is no serious cost-benefit analysis in their paper. And this brings me to the last link. This solution is proposed as a way of reducing false positives markedly, which in turn will increase the likelihood that published findings will be reproducible. And if everyone magically found 70% more research funds this is possibly true. But where is the evidence that the crisis in science, whatever that is, would be solved? It is the magic in the final link that we really need to focus on.
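To see where the roughly 70% figure comes from, here is a minimal sketch in Python (my own illustration, not taken from the paper) comparing the per-group sample size needed for 80% power at alpha = .05 versus alpha = .005 in a two-sample t-test with a hypothetical medium effect size.

```python
# Sketch of the sample-size cost of moving from alpha = .05 to alpha = .005.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 0.5  # hypothetical "medium" standardized mean difference

n_05 = analysis.solve_power(effect_size=effect_size, alpha=0.05,
                            power=0.80, alternative='two-sided')
n_005 = analysis.solve_power(effect_size=effect_size, alpha=0.005,
                             power=0.80, alternative='two-sided')

print(f"n per group at alpha = .05:  {n_05:.0f}")   # roughly 64
print(f"n per group at alpha = .005: {n_005:.0f}")  # roughly 107
print(f"increase: {100 * (n_005 / n_05 - 1):.0f}%") # roughly 70%
```

For a standardized difference of 0.5, the required n per group rises from roughly 64 to roughly 107, which is where the approximately 70% larger samples come from; the increase is similar for other effect sizes.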

I am a health care researcher so it is a reflex for me to look at a proposed cure and ask two questions (1) does the cure target the known cause of the problem? (2) is the cure problem-specific or is it a cure-all? Here we have a situation where the causal chain involves a known distal cause (pressure to publish) and known proximal cause (low power). The proposed solution (rename findings with p between .05 to .005 suggestive) does not target either of these causes. It does not help to change the research environment in such a way as to relieve the pressure to publish or to help researchers obtain the resources that would permit properly powered studies (interestingly the funders of the Open Science Collaborative have enough financial and political power to influence the system of public pensions in the United States and therefore, improving the way that research is funded and increasing job stability for academics are both goals within their means but not, as far as I can see, goals of this project). Quite the opposite in fact because this proposal is more likely to increase competition and inequality between scientists than to relieve those pressures and therefore the benefits that emerge in computer modeling could well be outweighed by the costs in actual application. Secondly, the proposed solution is not “fit for purpose”. It is an arbitrary catch-all solution that is not related to the research goals in any one field of study or research context.

That does not mean that we should do nothing and that there are no ways to improve science. Scientists are creative people and each in their own ponds have been solving problems long before these current efforts came into view. However, recent efforts that seem worthwhile to me and that directly target the issue of power (in study design) recognize the reality that those of us who research typical and atypical development in children are not likely to ever have resources to increase our sample sizes by 70%. So, three examples of helpful initiatives:

First, efforts to pull samples together through collaboration are extremely important. One that is fully on board with the reproducibility project is of course the ManyBabies initiative. I think that this one is excellent. It takes place in the context of a field of study in which labs have always been informally interconnected, not only because of shared interests but because of the nature of the training and interpersonal skills that are required to run those studies. Like all fields of research, there has been some partisanship (I will come back to this because it is a necessary part of science) but also a lot of collaboration and cross-lab replication of studies in this field for decades now. The effort to formalize the replications and pool data is one I fully support.

Second, there have been ongoing and repeated efforts by statisticians and methodologists to teach researchers how to do simple things that improve their research. Altman sadly died this week. I have a huge collection of his wonderful papers on my hard-drive for sharing with colleagues and students who surprise me with questions like “How to randomize?” The series of papers by Cumming and Finch on effect sizes along with helpful spreadsheets are invaluable (although it is important to not be overly impressed by large effect sizes in underpowered studies!). My most recent favorite paper describes how to chart individual data points, really important in a field such as ours in which we so often study small samples of children with rare diagnoses. I have an example of this simple technique elsewhere on my blog. If we are going to end up calling all of our research exploratory and suggestive now (which is where we are headed, and quite frankly a lot of published research in speech-language pathology has been called that all along without ever getting to the next step), let’s at least describe those data in a useful fashion.
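As an example of the kind of chart I mean, here is a minimal sketch in Python (my own illustration with simulated scores) that plots each child’s pre- and post-treatment data points individually alongside the group mean, rather than hiding the individuals behind a single bar.

```python
# Sketch of charting individual data points with the group mean overlaid.
# The scores are simulated for a hypothetical small sample (n = 8).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
pre = rng.normal(55, 10, size=8)
post = pre + rng.normal(8, 6, size=8)

fig, ax = plt.subplots(figsize=(4, 4))
for y0, y1 in zip(pre, post):
    ax.plot([0, 1], [y0, y1], color="grey", marker="o", alpha=0.6)  # each child
ax.plot([0, 1], [pre.mean(), post.mean()], color="black",
        marker="s", linewidth=2, label="group mean")
ax.set_xticks([0, 1])
ax.set_xticklabels(["pre", "post"])
ax.set_ylabel("score (% correct)")
ax.legend()
plt.tight_layout()
plt.show()
```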

Third, if I may say so myself, my own effort to promote the N-of-1 randomized controlled design is a serious effort to improve the internal validity of single case research for researchers who, for many reasons, will not be able to amass large samples.

In the meantime, for those people suggesting the p < .005 thing, it seems irresponsible to me for any scientist to make a claim such as "reducing the P-value threshold for claims of new discoveries to 0.005 is an actionable step that will immediately improve reproducibility" on the basis of a little bit of computer modeling, some sciencey-looking charts with numbers on them, and not much more thought than that. I come back to the point I made in my first blog on the reproducibility crisis: if we are going to improve science we need to approach the problem like scientists. Science requires clear thinking about theory (causal models), the relationship between theory and reality, and evidence to support all the links in the chain.

Using SAILS to Assess Speech Perception in Children with SSD

I am very excited to see an Australian replication of the finding that children with a Speech Sound Disorder (SSD) have difficulty with speech perception when tested with a word identification test implemented with recordings of children's speech. Hearnshaw, Baker, and Munro (2018) created a task modeled on my Speech Assessment and Interactive Learning System (SAILS) program. A different software platform was used to present the stimuli and record the children's responses. The critical elements of SAILS were otherwise replicated, but there were some significant differences, as shown in the table below.

[Table: comparison of the Hearnshaw et al. (2018) task with SAILS]

The most important differences are the younger age of the children and the targeting of phonemes with older expected ages of acquisition. Furthermore, there are 12 stimuli per block and two target words per target phoneme in Hearnshaw versus 10 stimuli per block and one target word per target phoneme in my own assessment studies. In Hearnshaw the practice procedures involved fewer stimuli and less training on the task. Finally, the response mode was more complex in Hearnshaw and the response alternatives do not replicate mine. Therefore this study does not constitute a replication of my own studies, and I might expect lower performance levels compared to those observed for the children tested in my own studies (I say this before setting up the next table; let's see what happens). Nonetheless, we would all expect that children with SSD would underperform their counterparts with typically developing speech, especially given the close matching on age and receptive vocabulary in both Hearnshaw and my own studies.

[Table: Hearnshaw et al. (2018) and SAILS data comparison]

Looking at the data in the above table, the performance of the children with SSD is uniformly lower than that of the typically developing comparison groups. Hearnshaw's SSD group obtained a lower score overall when compared to the large sample that I reported in 2006 but a slightly higher score when compared to the small sample that I reported in 2003 (that study was actually Alyssa Ohberg's undergraduate honours thesis). It is not clear that any of these differences are statistically significant, so I plotted them with standard error bars below.

[Figure: Hearnshaw et al. (2018) and SAILS comparison with standard error bars]

The chart does reinforce the impression that the differences between diagnostic groups are significant; it is less clear whether the differences across studies are. It is possible that the children that Alyssa tested were more severely impaired than all the others (the GFTA is not the same as the DEAP so it is difficult to compare), or, more likely, the best estimate is the one from the third study, which had the largest sample size. Nonetheless, the message is clear that typically developing children in this age range will achieve scores above 70% accuracy whereas children with SSD are more likely to achieve scores below 70%, which suggests that they are largely guessing when making judgements about incorrectly produced exemplars of the target words. Hearnshaw et al. and I both emphasize the within-group variance in perceptual performance by children with SSD. Therefore, it is important to assess these children's speech perception abilities in order to plan the most suitable intervention.
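For readers who want to build a similar chart, here is a minimal sketch of plotting group means with standard error bars (standard error = SD divided by the square root of n); the means, standard deviations, and sample sizes are invented placeholders rather than the values from the studies above.

```python
# A minimal sketch of plotting group means with standard error bars
# (standard error = SD / sqrt(n)). All numbers are invented placeholders.
import numpy as np
import matplotlib.pyplot as plt

studies = ["Study A", "Study B", "Study C"]
means_ssd = np.array([62.0, 58.0, 65.0])   # hypothetical SSD group means (%)
sds_ssd = np.array([12.0, 15.0, 10.0])
ns_ssd = np.array([14, 21, 45])
means_td = np.array([78.0, 80.0, 82.0])    # hypothetical typically developing means (%)
sds_td = np.array([8.0, 9.0, 7.0])
ns_td = np.array([14, 21, 45])

x = np.arange(len(studies))
fig, ax = plt.subplots(figsize=(5, 4))
ax.errorbar(x - 0.05, means_ssd, yerr=sds_ssd / np.sqrt(ns_ssd),
            fmt="o", capsize=4, label="SSD")
ax.errorbar(x + 0.05, means_td, yerr=sds_td / np.sqrt(ns_td),
            fmt="s", capsize=4, label="Typical")
ax.set_xticks(x)
ax.set_xticklabels(studies)
ax.set_ylabel("Speech perception accuracy (%)")
ax.legend()
plt.tight_layout()
plt.show()
```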

And with that I am happy to announce that the iPad version of SAILS is now available, with all four modules necessary for comparison with the normative data presented below for three age groups.

[Table: SAILS norms for three age groups (RBL 2018)]

Specifically, the modules that are currently available for purchase ($5.49 CAD per module) are as follows:

-“k”: cat (free)

-“l”: lake

-“r”: rat, rope, door

-“s”: Sue, soap, bus

Please see www.dialspeech.com for more information from me and Alex Herbay, who wrote the software, or go directly to the app store: SAILS by Susan Rvachew and Alex Herbay.

Reproducibility: On the Nature of Scientific Consensus

The idea that scientists who raise questions about whether (ir)reproducibility is a crisis or not are like the "merchants of doubt" is argued via analogy with, for example, climate change deniers. It's a multistep analogy. First, there is an iron-clad consensus on the part of scientists that humans are causing a change in the climate that will have catastrophic consequences. Because the solutions to the problem threaten corporate interests, those big-money interests fund astroturf groups like "Friends of Science" to sow doubt about the scientific consensus in order to derail the implementation of positive policy options. For the analogy on Bishop's Blog to work, there must first be a consensus among scientists that the publication of irreproducible research is a crisis, a catastrophe even. I am going to talk about this issue of consensus today, although it would be more fun to follow the analogy along and try to figure out whether corporate interests are threatened by more or less scientific credibility, and how the analogy works when it is corporate money that is funding the consensus and not the dissenters! But anyway, on the topic of consensus…

The promoters of the reproducibility crisis have taken to simply stating that there is a consensus, citing most frequently a highly unscientific Nature poll. I know how to create scientific questionnaires (it used to be part of my job in another life before academia) and it is clear that the question "Is there a reproducibility crisis?" with the options "crisis," "slight crisis" (an oxymoron) and "no crisis" is a push poll. The survey was designed to make it possible for people to claim that "90% of respondents to a recent survey in Nature agreed that there is a reproducibility crisis", which is how you sell toothpaste, not how you determine whether there is a crisis. On Twitter I have been informed, with no embarrassment, that unscientific polls are justified because they are used to "raise awareness". The problem comes when polls that are used to create a consensus are also used as proof of that consensus. How does scientific consensus usually come about?

In many areas of science it is not typical for groups of scientists to formally declare a consensus about a scientific question, but when there are public or health policy implications, working groups will create consensus documents, always starting with a rigorous procedure for identifying the working group, the literature or empirical evidence that will be considered, the standards by which that evidence will be judged, and the process by which the consensus will emerge. Ideally it is a dynamic and broad-based exercise. The Intergovernmental Panel on Climate Change is a model in this regard, and it is the rigorous nature of this process that allows us to place our trust in the consensus conclusion even when we are not experts in the area of climate.

A less complex and, for us, more comprehensible example is the recent process employed by the CATALISE consortium to propose that Specific Language Impairment be reconceptualised as Developmental Language Disorder. This process meets all the requirements of a rigorous process, with the online Delphi technique an intriguing part of the series of events that led to a set of consensus statements about the identification and classification of developmental language disorders. Ultimately each statement is supported by a rationale from the consortium members, including scientific evidence when available. The consortium itself was broad-based and the process permitted a full exposition of points of agreement and disagreement and of needs for further research. For me, importantly, a logical sequence of events and statements is involved: the assertion that the new term be used was the end of the process, not the beginning of it. The field of speech-language pathology as a whole has responded enthusiastically even though there are financial disincentives to adopting all of the recommendations in some jurisdictions. Certainly the process of raising awareness of the consensus documents has had no need of push polls or bullying. One reason that the process was so well received, beyond respect for the actors and the process, is that the empirical support for some of the key ideas seems unassailable. Not everyone agrees on every point, and we are all uncomfortable with the scourge of low-powered studies in speech and language disorders (an inevitable side effect of funder neglect); however, the scientific foundation for the assertion that language impairments are not specific has reached a critical mass, and therefore no-one needs to go about beating up any "merchants of doubt" on this one. We trust that in those cases where the new approach is not adopted it is generally due to factors outside the control of the individual clinician.

The CATALISE process remains extraordinary however. More typically a consensus emerges in our field almost imperceptibly and without clear rationale. When I was a student in 1975 I was taught that children with "articulation disorders" did not have underlying speech perception deficits and therefore it would be a waste of time to implement any speech perception training procedures (full stop!). When I began to practice I had reason to question this conclusion (some things you really can see with your own eyes), so I drove in to the university library (I was working far away in a rural area) and started to look stuff up. Imagine my surprise when I found that the one study cited to support this assertion involved four children who did not receive a single assessment of their speech perception skills (weird but true). Furthermore, there was a long history of studies showing that children with speech sound disorders had difficulties with speech discrimination. I show just a few of these in the chart below (I heard via Twitter that, at the SPA conference just this month in Australia, Elise Baker and her students reported that 83% of all studies that have looked at this question found that children with a speech sound disorder have difficulties with speech perception). So why was there this period, from approximately 1975 through about 1995, when it was common knowledge that these kids had no difficulty with speech perception? In fact some textbooks still say this. Where did this mistaken consensus come from?

When I first found out that this mistaken consensus was contrary to the published evidence I was quite frankly incandescent with rage! I was young and naïve and I couldn't believe I had been taught wrong stuff. But interestingly, the changes in what people believed to be true were based on changes in the underlying theory, which is changing all the time. In the chart below I have put the theories and the studies alongside each other in time. Notice that the McReynolds, Kohn, and Williams (1975) paper, which found poorer speech perception among the SSD kids, actually concluded that they had no deficit, contrary to their own data but consistent with the prevailing theory at the time!

[Chart: history of speech perception research and theory over time]

What we see is that in the fifties and sixties, when it was commonly assumed that higher level language problems were caused by impairments in lower level functions, many studies were conducted to prove this theory, and in fact they found evidence to support it, with some exceptions. In the later sixties and seventies a number of theories were in play that placed strong emphasis on innate mechanisms. There were few if any studies conducted to examine the perceptual abilities of children with speech sound disorders because everyone just assumed they had to be normal, on the basis of the burgeoning field of infant perceptual research showing that neonates could perceive anything (not exactly true, but close enough for people to get a little over-enthusiastic). More recently, emergentist approaches have taken hold, and more sophisticated techniques for testing speech perception have allowed us to determine how children perceive speech and when they will have difficulty perceiving it. The old theories have been proved wrong (not everyone will agree on this, because the ideas about lower level sensory or motor deficits are zombies; the innate feature detector idea, on the other hand, is completely dead). For the most part the evidence is overwhelming and we have moved on to theories that are considerably more complex and interesting, so much so that I refer you to my book rather than trying to explain them here.

The question, on the topic of reproducibility, is whether it would have been (or would be) worthwhile for anyone to try to reproduce, let's say, Kronvall and Diehl (1952), just for kicks. No! That would be a serious waste of time, as my master's thesis supervisor explained to me in the eighties when he dragged me more-or-less kicking and screaming into a room with a house-sized VAX computer to learn how to synthesize speech (I believe I am the first person to have synthesized words with fricatives; it took me over a year). It is hard to assess the clinical impact of all that fuzzy thinking through the period 1975 to 1995. But somehow, in the long run, we have ended up in a better place. My point is that scientific consensus arises from an odd and sometimes unpredictable mixture of theory and evidence, and it is not always clear what is right and what is wrong until you can look back from a distance. And despite all the fuzziness and error in the process, progress marches on.

Reproducibility crisis: How do we know how much science replicates?

Literature on the "reproducibility crisis" is increasing, although not rapidly enough to bring much empirical clarity to the situation. It remains uncertain how much published science is irreproducible and whether that proportion, whatever it may be, constitutes a crisis. And like small children unable to wait for the second marshmallow, some scientists in my Twitter feed seem to have grown tired of attempting to answer these questions via the usual scientific methods; rather, they are declaring in louder and louder voices that there IS a reproducibility crisis, as if they can settle these questions by brute force. They have been reduced in the past few days to, I kid you not, (1) Twitter polls; (2) arguing about whether 0 is a number; and most egregiously, (3) declaring that "sowing doubt" is akin to being a climate change denier.

Given that the questions at hand have not been tested in the manner of climate science so as to yield such a consensus, this latter tactic is so outrageously beyond the pale that I am giving over my blog to commentary on the reproducibility crisis for some time, writing willy-nilly as the mood hits me on topics such as its apparent size, its nature, its causes, its consequences and the proposed solutions. Keeping in mind that the readers of my blog are usually researchers in communication sciences and disorders, as well as some speech-language pathologists, I will bring the topic home to speech research each time. I promise that although there may be numbers there will be no math; I leave the technical aspects to others because it is the philosophical and practical aspects of the question that concern me.

Even though I am in no mind to be logical about this at all, let's start at the beginning (unless you think this is the end, which would not be unreasonable). Is there in fact a consensus that there is a reproducibility crisis? I will leave aside for the moment the fact that there is not even a consensus about what the word "reproducibility" means or what exactly to call this crisis. Notwithstanding this basic problem with concepts and categories, the evidence for the notion that there is a crisis comes from three lines of data: (1) estimates of what proportion of science can be replicated, that is, if you reproduce the methods of a study with different but similar participants, are the original results confirmed or replicated; (2) survey results on scientists' opinions about how much science can be reproduced and whether reproducibility is a crisis or not; and, less frequently, (3) efforts to determine whether the current rate of reproducibility or irreproducibility is a problem for scientific progress itself.

I am going to start with the first point and carry on to the others in later posts so as not to go on too long (because I am aware there is nothing new to be said really; it is just that it is hard to hear over the shouting about the consensus we are all having). I have no opinion on how much science can be replicated. I laughed when I saw the question posed in a recent Nature poll, "In your opinion, what proportion of published results in your field are reproducible?" (notice that the response alternatives were percentages from 0 to 100% in increments of 10 with no "don't know" option). The idea that people answered this! For myself I would simply have no basis for answering the question. I say this as a person who is as well read in my field as anyone, after 30 years of research and 2 substantive books. So, faced with it, I would have no choice but to abandon the poll because, being a scientist, my first rule is don't make shit up. It's a ridiculous question to ask anyone who has not set out to answer it specifically. But if I were to set out to answer it, I would have to approach it like any other scientific problem by asking first: what are the major concepts in the question? How can they be operationalized? How are they typically operationalized? Is there a consensus on those definitions and methods? And then, having solved the basic measurement problems, I would ask what are the best methods to tackle the problem. We are far from settling any of these issues in this domain and therefore it is patently false to claim that we have a consensus on the answer!

The big questions that strike me as problematic are what counts as a "published result," more importantly what counts as a "representative sample" of published results, and finally, what counts as a "replication." Without rehashing all the back-and-forth in Bishop's blog and the argument that many are so familiar with, we know that there is a lot of disagreement about the different ways in which these questions might be answered and what the answer might be. Currently we have estimates on the table for "how much science can be replicated" that range from "quite high" (based on back-of-the-envelope calculations), through 30 to 47%-ish (based on actual empirical efforts to replicate weird collections of findings), through, finally, Ioannidis' (2005) wonderfully trendy conclusion that "most published research findings are false" (based on simulations). I do not know the answer to this question. And even if I did, I wouldn't know how to evaluate it because I have no idea how much replication is the right amount when it comes to ensuring scientific progress. I will come back to that on another day. But for now my point is this: there is no consensus on how much published science can be replicated. And there is no consensus on how low that number needs to go before we have an actual crisis. Claiming that there is so much consensus that raising questions about these answers is akin to heresy is ridiculous and sad. Because there is one thing I do know: scientists get to ask questions. That is what we do. More importantly, we answer them. And we especially don't pretend to have found the answers when we have barely started to look.

I promised to put a little speech therapy into this blog, so here it is. The Open Science Collaboration said, reasonably, that there "is no single standard for evaluating replication success. Here, we evaluated reproducibility using significance and P values, effect sizes, subjective assessments of replication teams, and meta-analysis of effect sizes." More substantively, even if a research team picks an indicator, there is disagreement about how to use it. Take effect size for example: it is not clear to me why replication attempts are expected to replicate the effect size observed in the original study, or even how one does that exactly. There is a lot of argument about that nonetheless, which makes it hard to decide whether a finding has been replicated or not. How to determine whether a finding has been confirmed or replicated is not a trivial issue. I grapple with replication of my own work all the time because I develop interventions and I really want to be sure that they work. But even a small randomized controlled trial costs me seven to ten years of effort from getting funds through publication, which explains why I have accomplished only five of these in my career. Therefore, confirming my own work is no easy matter. I always hope someone else will replicate one of those trials, but usually if someone has that many resources they work on their own pet intervention, not mine. So lately I have been working on a design that makes it easier (not easy!) to test and replicate interventions with small groups of participants. It is called the single subject randomization design.

Here is some data that I will be submitting for publication soon. We treated six children with Childhood Apraxia of Speech using an approach that involved auditory-motor integration prepractice plus normal intense speech practice (AMI). We expected it to be better than intense speech practice alone, our neutral usual-care control condition (CTL). We also expected it to be better than an alternative treatment that is contra-indicated for kids with this diagnosis (PMP). We repeated the experiment exactly, using a single subject randomization design, over 6 children and then pooled the p values. All the appropriate controls for internal validity were employed (randomization with concealment, blinding and so on). The point, from the perspective of this blog, is that there are different kinds of information with which to evaluate the results: the effect sizes, the confidence intervals for the effect sizes, the p values, and the pooled p values. So, from the point of view of the reproducibility project, these are my questions: (1) how many findings will I publish here? (2) how many times did I replicate my own finding(s)?

[Figure: TASC MP group data]
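Since the key analytic move in this little series of experiments is pooling the p values from six independent single-subject randomization tests, here is a minimal sketch of one common way to do that, Edgington's additive method; this may differ from the procedure described in the Rvachew & Matthews (2017) tutorial, and the six p values below are invented for illustration.

```python
# A minimal sketch of pooling p values from a series of single-subject
# randomization tests using Edgington's additive method (one common choice).
# The six p values below are invented for illustration only.
from math import comb, factorial, floor

def additive_pooled_p(p_values):
    """P(sum of k independent uniform p values <= observed sum): Irwin-Hall CDF."""
    k = len(p_values)
    s = sum(p_values)
    total = 0.0
    for j in range(floor(s) + 1):
        total += (-1) ** j * comb(k, j) * (s - j) ** k
    return total / factorial(k)

# Hypothetical p values from six replications of the same small experiment.
p_values = [0.09, 0.12, 0.04, 0.20, 0.07, 0.15]
print(round(additive_pooled_p(p_values), 5))
```

Note that the additive method treats the p values as continuous, whereas randomization-test p values are discrete, so the pooled value should be read as an approximation; the attraction of pooling is that six modest p values from six underpowered replications can jointly be very convincing.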

Is Acoustic Feedback Effective for Remediating “r” Errors?

I am very pleased to see a third paper published in the speech-language pathology literature using the single-subject randomization design that I have described in two tutorials, the first in 1988 and the second more recently. Tara McAllister Byun used the design to investigate the effectiveness of acoustic biofeedback treatment for remediating persistent "r" errors in 7 children aged 9 to 15 years. She used the single subject randomized alternation design with block randomization, including a few unique elements in her implementation of the design. She and her research team provided one traditional treatment session and one biofeedback treatment session each week for ten weeks; however, the order of the traditional and biofeedback sessions was randomized each week. Interestingly, each session targeted the same items (i.e., "r" was the speech sound target in both treatment conditions): rhotic vowels were tackled first and consonantal "r" was introduced later, in a variety of phonetic contexts. (This procedure is a variation from my own practice in which, for example, Tanya Matthews and I randomly assign different targets to different treatment conditions.) Another innovation is the outcome measure: a probe constructed of untreated "r" words was given at the beginning and end of each session so that change over the session (Mdif) was the outcome measure submitted to statistical analysis (our tutorial explains that the advantage of the SSRD is that a nonparametric randomization test can be used to assess the outcome of the study, yielding a p value). In addition, 3 baseline probes and 3 maintenance probes were collected so that an effect size for overall improvement could be calculated. In this way there are actually 3 time scales for measuring change in this study: (1) change from baseline to maintenance probes; (2) change from baseline to treatment performance as reflected in the probes obtained at the beginning of each session and plotted over time; and (3) change over a session, reflected in the probes given at the beginning and the end of each session. Furthermore, it is possible to compare differences in within-session change for sessions provided with and without acoustic feedback.
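To make the logic of the randomization test concrete, here is a minimal sketch of the analysis for a block-randomized alternation design like the one just described: under the null hypothesis, a session's within-session change (Mdif) does not depend on the condition it was assigned to, so the reference distribution is built by swapping the two condition labels within each weekly block in every possible way. The Mdif values below are invented, not Byun's data, and the one-sided direction of the test is my assumption for illustration.

```python
# A minimal sketch of a randomization test for a block-randomized alternation
# design: within each weekly block, one session was randomly assigned to each
# condition, so the reference distribution swaps the two labels within blocks.
# Mdif = change in probe score across a session. Values below are invented.
from itertools import product
from statistics import mean

# (biofeedback Mdif, traditional Mdif) for each of 10 weekly blocks -- hypothetical.
blocks = [(3, 1), (2, 2), (4, 0), (1, -1), (5, 2),
          (0, 1), (3, 2), (2, 0), (4, 1), (1, 1)]

def statistic(pairs):
    bf = [a for a, b in pairs]
    trad = [b for a, b in pairs]
    return mean(bf) - mean(trad)

observed = statistic(blocks)

# Enumerate every possible within-block assignment (2**10 = 1024 of them).
count = 0
total = 0
for swaps in product([False, True], repeat=len(blocks)):
    pairs = [(b, a) if s else (a, b) for (a, b), s in zip(blocks, swaps)]
    total += 1
    if statistic(pairs) >= observed:   # one-sided: biofeedback expected to be better
        count += 1

p_value = count / total
print(round(observed, 3), round(p_value, 4))
```

With ten blocks there are only 1024 possible assignments, so full enumeration is feasible; with many more blocks one would sample assignments at random instead of enumerating them all.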

I was really happy to see the implementation of the design but it is fair to say that the results were a dog’s breakfast, as summarized below:

[Table: summary of Byun (2017) acoustic biofeedback results by participant]

The table indicates that two participants (Piper, Clara) showed an effect of biofeedback treatment and generalization learning. Both showed rapid change in overall accuracy after treatment was introduced, in both conditions, and maintained at least some of that improvement after treatment was withdrawn. Garrat and Ian showed identical trajectories in the traditional and biofeedback conditions, with a late rise in accuracy during treatment sessions, large within-session improvements during the latter part of the treatment period, and good maintenance of those gains; neither boy achieved 60% correct responding at any point in the treatment program, however. Felix, Lucas and Evan demonstrated no change in probe scores across the twenty sessions of the experiment in either condition. Lucas started at a higher level and therefore his probe performance is more variable: because he actually showed a within-session decline during traditional sessions while showing stable performance within biofeedback sessions, the statistics indicate a treatment effect in favour of acoustic biofeedback, but in fact no actual gains were observed.

So, this is a long description of the results that brings me to two conclusions: (1) the alternation design was the wrong choice for the hypothesis in these experiments; and (2) biofeedback was not effective for these children; even in those cases where it looks like there was an effect, the children were responsive to both biofeedback and the traditional intervention.

In a previous blog I described the alternation design; there is another version of the single subject randomization design, however, that would be more appropriate for Tara's hypothesis. The thing about acoustic biofeedback is that it is not fundamentally different from traditional speech therapy, involving a similar sequence of events: (i) the SLP says a word as an imitative model; (ii) the child imitates the word; (iii) the SLP provides informative or corrective feedback. In the case of incorrect responses in the traditional condition in Byun's study, the SLP provided information about articulatory placement and reminded the child that the target involved certain articulatory movements ("make the back part of your tongue go back"). In the case of incorrect responses in the acoustic biofeedback condition, the SLP made reference to the acoustic spectrogram when providing feedback and reminded the child that the target involved certain formant movements ("make the third bump move over"). Firstly, the first two steps are completely overlapping in both conditions and, secondly, it can be expected that the articulatory cues given in the traditional condition will be remembered and their effects will carry over into the biofeedback sessions. Therefore we can consider acoustic biofeedback to be an add-on to traditional therapy: we want to know about the value added. Therefore the phase design is more appropriate. In this case there would be 20 sessions (2 per week over 10 weeks, as in Byun's study), and each session would be planned with the same format: beginning probe (optional), 100 practice trials with feedback, ending probe. The difference is that the starting point for the introduction of acoustic biofeedback would be selected at random. All the sessions that precede the randomly selected start point would be conducted with traditional feedback and all the remainder would be conducted with acoustic biofeedback. The first three would be designated as traditional and the last 3 would be designated as biofeedback, for a 26-session protocol as described by Byun. Across the 7 children this would end up looking like a multiple baseline design except that (1) the duration of the baseline phase would be determined by random selection for each child; and (2) the baseline phase is actually the traditional treatment, with the experimental phase testing the value-added benefit of biofeedback. There are three possible categories of outcomes: no change after introduction of the biofeedback, an immediate change, or a late change. As with any single subject design, the change might be in level, trend or variance, and the test statistic can be designed to capture any of those types of changes. The statistical analysis asks whether the obtained test statistic is bigger than all possible results given all of the possible random selections of starting points. Rvachew & Matthews (2017) provides a more complete explanation of the statistical analysis.

I show below an imaginary result for Clara, using the data presented for her in Byun's paper, as if the traditional treatment came first and then the biofeedback intervention. If we pretend that the randomly selected start point for the biofeedback intervention occurred exactly in the middle of the treatment period, the test statistic is the difference between the M(bf) and the M(trad) scores, resulting in -2.308. All other possible random selections of starting points for the intervention lead to 19 other possible mean differences, and 18 of them are bigger than the obtained test statistic, leading to a p value of 18/20 = .9. In this data set the probe scores are actually bigger in the earlier part of the intervention, when the traditional treatment is used, and they do not get bigger when the biofeedback is introduced. These are the beginning probe scores obtained by Clara, but Byun obtained a significant result in favour of biofeedback by block randomization and by examining change across each session. However, I am not completely sure that the improvements from beginning to ending probes are a positive sign; this result might reflect a failure to maintain gains from the previous session in one or the other condition.

[Figure: hypothetical data for Clara in a single subject randomization phase design]
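For completeness, here is a minimal sketch of the corresponding analysis for the phase design: compute the difference between the biofeedback-phase and traditional-phase means for the actual randomly selected start point, recompute it for every other eligible start point, and take the p value as the proportion of eligible start points yielding a statistic at least as large as the obtained one. The probe scores, the start point, and the constraint of at least three sessions per phase are my own illustrative assumptions; this is not Clara's data and it does not reproduce the 18/20 calculation above.

```python
# A minimal sketch of the randomization test for the phase design described
# above. The 20 probe scores, the actual start point, and the "at least three
# sessions per phase" constraint are invented illustrative assumptions.
from statistics import mean

probes = [20, 25, 30, 35, 30, 40, 45, 40, 50, 45,
          50, 55, 50, 55, 60, 55, 60, 65, 60, 65]   # hypothetical beginning-of-session probes
actual_start = 11   # hypothetical randomly selected first biofeedback session (1-indexed)

def mean_difference(scores, start):
    trad = scores[:start - 1]   # sessions before the start point: traditional feedback
    bf = scores[start - 1:]     # start point onward: acoustic biofeedback
    return mean(bf) - mean(trad)

# Eligible start points keep at least 3 sessions in each phase (sessions 4..18).
eligible_starts = list(range(4, len(probes) - 1))

observed = mean_difference(probes, actual_start)
extreme = sum(1 for s in eligible_starts
              if mean_difference(probes, s) >= observed)   # one-sided test
p_value = extreme / len(eligible_starts)
print(round(observed, 3), round(p_value, 3))
```

One consequence of this design worth noting is that the p value can never be smaller than 1 divided by the number of eligible start points, which is one reason the length of the protocol matters.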

There are several reasons to think that both of the interventions used in Byun's study might result in unsatisfactory generalization and maintenance. We discuss the principles of generalization in relation to theories of motor learning in Developmental Phonological Disorders: Foundations of Clinical Practice. One important principle is that the child needs a well-established representation of the acoustic-phonetic target. All seven of the children in Byun's study had poor auditory processing skills, but no part of the treatment program addressed phonological processing, phonological knowledge or acoustic-phonetic representations. Second, it is essential that the child have the tools to monitor and use self-produced feedback (auditory, somatosensory) to evaluate success in achieving the target. Both the traditional and the biofeedback intervention put the child in the position of being dependent upon external feedback. The outcome measure focused attention on improvements from the beginning of the practice session to the end; the first principle of motor learning, however, is that practice performance is not an indication of learning. The focus should have been on the sometimes large decrements in probe scores from the end of one session to the beginning of the next. The children had no means of maintaining any of those performance gains. Acoustic feedback may be a powerful means of establishing a new response, but it is a counterproductive tool for maintenance and generalization learning.

Reading

McAllister Byun, T. (2017). Efficacy of Visual–Acoustic Biofeedback Intervention for Residual Rhotic Errors: A Single-Subject Randomization Study. Journal of Speech, Language, and Hearing Research, 60(5), 1175-1193. doi:10.1044/2016_JSLHR-S-16-0038

Rvachew, S., & Matthews, T. (2017). Demonstrating treatment efficacy using the single subject randomization design: A tutorial and demonstration. Journal of Communication Disorders, 67, 1-13. doi:10.1016/j.jcomdis.2017.04.003