Are effect sizes in research papers useful in SLP practice?

Effect size blog figure 1Effect sizes are now required in addition to statistical significance reporting in scientific reports. As discussed in a previous blog, effect sizes are useful for research purposes because they can be aggregated across studies to draw conclusions (i.e., as, in a meta-analysis). However, they are also intended to be useful as an indication of the “practical consequences of the findings for daily life.” Therefore, Gierut, Morrisette, & Dickinson’s paper “Effect Size for Single-Subject Design in Phonological Treatment” was of considerable interest to me when it was published in 2015. They report the distribution of effect sizes for 135 multiple baseline studies using a pooled standard deviation for the baseline phase of the studies as the denominator and the mean of the treatment phase minus the mean of the baseline phase as the numerator in the equation to calculate the effect size statistic. In these studies, the mean and the variance of probe scores in the baseline phase is restricted to be very small by design, because the treatment targets and generalization probe targets must show close to stable 0% correct performance during the baseline phase. The consequence of this restriction is that the effect size number will be very large even when the raw amount of performance change is not so great. Therefore the figure above shows hypothetical data that yields exactly their average effect size of 3.66 (specifically, [8.57%-1.25%]/.02 = 3.66). This effect size is termed a medium effect size in their paper but I leave it to the reader to decide if a change of not quite 9% accuracy in speech sound production is an acceptable level of change. It may be because in these studies, a treatment effect is operationalized as probe scores (single word naming task) for all the phonemes that were absent from the child’s repertoire at intake. From the research point of view this paper provides very important information: it permits researchers to compare effect sizes and explore variables that account for between-case differences in effect sizes in those cases where the researchers have used a multiple baseline design and treatment intensities similar to those reported in this paper (5 to 19 one-hour sessions typically delivered 3 times per week).

The question I am asking myself is whether the distribution of effect sizes that is reported in this paper is helpful to clinicians who are concerned with the practical significance of these studies. I ask this because I am starting to see manuscripts reporting clinical case studies in which the data are used to claim “large treatment effects” for a single case (using Gierut et al’s standard of an effect size of 6.32 or greater). Indeed, in the clinical setting SLPs will be asked to consider whether their clients are making “enough” progress. For example, in Rvachew and Nowak (2001) we asked parents to rate their agreement with the statement “My child’s communication skills are improving as fast as can be expected.” (This question was on our standard patient satisfaction questionnaire so in fact, we asked every parent this question, not just the ones in this RCT). But the parent responses in the RCT showed that there were significant between group differences in response to this question that aligned with the dramatic differences in child response to the traditional versus complexity approach to target selection that was tested in that study (e.g., 34% vs. 17% of targets mastered in these groups respectively). It seems to me that when a parent asks themselves this question they have multiple frames of reference: not only do they consider the child’s communicative competence before and after the introduction of therapy, they consider whether their child would make more or less change with other hypothetical SLPs and other treatment approaches, given that parents actually have choices about these things. Therefore, an effect size that says effectively, the child made more progress with treatment compared to no treatment is not really answering the parent’s question. However, with a group design it is possible to calculate an effect size that reflects change relative to the average amount of change one might expect, given therapy. To my mind this kind of effect size comes closer to answering the questions about practical significance that a parent or employer might ask.

This still leaves us with the question of what kind of change to describe. It is unfortunate that there are few if any controlled studies that have reported functional measures. I can think of some examples of descriptive studies that reported functional measures however. First, Campbell (1999) reported that good functional outcomes were achieved when preschoolers with moderate and severe Speech Delay received twice-weekly therapy over a 90- to 120-day period (i.e., on average the children’s speech intelligibility improved from approximately 50% to 75% intelligible as reported by parents). Second, there are a number of studies reporting ASHA-NOMS (functional communication measures provided by treating SLPs) for children receiving speech and language therapy. However, Thomas-Stonell et al (2007) found that improvement on the ASHA-NOMS was not as sensitive as parental reports of “real life communication change” over a 3 to 6 month interval. Therefore, Thomas-Stonell and her colleagues developed the FOCUS to document parental reports of functional outcomes in a reliable and standardized manner.

Thomas-Stonell et al (2013) report changes in FOCUS scores for 97 preschool aged children who received an average of 9 hours of SLP service in Canada, comparing change during the waiting period (60 day interval) to change during the treatment period (90 day interval). FOCUS assessments demonstrated significantly more change during treatment (about 18 FOCUS points on average) than during the wait period (about 6 FOCUS points on average). Then they compared minimally important changes in PCC, the Children’s Speech Intelligibility Measure, and FOCUS scores for 28 preschool aged children. The FOCUS measure was significantly correlated with the speech accuracy and intelligibility measures but there was not perfect agreement among these measures. For example, 21/28 children obtained a minimally important change of at least 16 points on the FOCUS but 4 of those children did not show significant change on PCC/CSIM. In other words speech accuracy, speech intelligibility and functional improvements are related but not completely aligned; each provides independent information about change over time.

In controlled studies, some version of percent consonants correct is a very common treatment outcome that is used  to assess the efficacy of phonology therapy. Gierut et al (2015) focused specifically on change in those phonemes that are late developing and produced with very low accuracy, if not completely absent from the child’s repertoire at intake. This strikes me as a defensible measure of treatment outcome. Regardless of whether one chooses to treat a complex sound, an early developing sound, a medium-difficulty sound (or one of each as I demonstrated in a previous blog), presumably the SLP wants to have dramatic effects across the child’s phonological system. Evidence that the child is adding new sounds to the repertoire is a good indicator of that kind of change. Alternatively the SLP might count increases in correct use of all consonants that were potential treatment targets prior to the onset of treatment. Or, the SLP could count percent consonants correct for all the consonants because this measure is associated with intelligibility and takes into account the fact that there can be regressions in previously mastered sounds when phonological reorganization is occurring. The number of choices suggests that it would be valuable to have effect size data for a number of possible indicators of change. More to the point, Gierut et al’s single subject effect size implies that almost any change above “no change” is an acceptable level of change in a population that receives intervention because they are stalled without it. I am curious to know if this is a reasonable position to take. In my next blog post I will report effect sizes for these speech accuracy measures taken from my own studies going back to 2001. I will also discuss the clinical significance of the effect sizes that I will aggregate. I am going to calculate the effect size for paired mean differences along with the corresponding confidence intervals for groups of preschoolers treated in three different studies. I haven’t done the calculations yet, so, for those readers who are at all interested in this, you can hold your breath with me.


Campbell, T. F. (1999). Functional treatment outcomes in young children with motor speech disorders. In A. Caruso & E. A. Strand (Eds.), Clinical Management of Motor Speech Disorders in Children (pp. 385-395). New York: Thieme Medical Publishers, Inc.

Gierut, J. A., Morrisette, M. L., & Dickinson, S. L. (2015). Effect Size for Single-Subject Design in Phonological Treatment. Journal of Speech, Language, and Hearing Research, 58(5), 1464-1481. doi:10.1044/2015_JSLHR-S-14-0299

Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 1-12. doi:10.3389/fpsyg.2013.00863

Thomas-Stonell, N., McConney-Ellis, S., Oddson, B., Robertson, B., & Rosenbaum, P. (2007). An evaluation of the responsiveness of the pre-kindergarten ASHA NOMS. Canadian Journal of Speech-Language Pathology and Audiology, 31(2), 74-82.

Thomas-Stonell, N., Oddson, B., Robertson, B., & Rosenbaum, P. (2013). Validation of the Focus on the Outcomes of Communication under Six outcome measure. Developmental Medicine and Child Neuroloogy, 55(6), 546-552. doi:10.1111/dmcn.12123

Rvachew, S., & Nowak, M. (2001). The effect of target selection strategy on sound production learning. Journal of Speech, Language, and Hearing Research, 44, 610-623.





Using effect sizes to choose a speech therapy approach

I am quite intrigued by the warning offered by Adrian Simpson in his paper “The misdirection of public policy: comparing and combining standardized effect sizes

The context for the paper is the tendency of public policy makers to rely on meta-analyses to make decisions such as, for example, should we improve teachers’ feedback skills or reduce class sizes as a means of raising student performance? Simpson shows that that meta-analyses (and meta-analyses of the meta-analyses!) are a poor tool for making these apples to oranges comparisons and cannot be relied upon as a source of information when making public policy decisions such as this. He identifies three specific issues with research design that invalidate the combining and comparing of effect sizes. I think that these are good issues to keep in mind when considering effect sizes as a clue to treatment efficacy and a source of information when choosing a speech or language therapy approach.

Recall that an effect size is a standardized mean difference, whereby the difference between means (i.e., the mean outcome of the treatment condition versus the mean outcome of the control condition) is expressed in standard deviation units. The issue is that the standard deviation units, which are supposed to reflect the variation in outcome scores between participants in the intervention trial, actually reflect many different aspects of the research design. Therefore if you compare the effect size of an intervention as obtained in one treatment trial with the effect size for another intervention as obtained in a different treatment trial, you cannot be sure that the difference is due to differences in the relative effectiveness of the two treatments. And yet, SLPs are asking themselves these kinds of questions every day: should I use a traditional articulation therapy approach or a phonological approach? Should I add nonspeech oral motor exercises to my traditional treatment protocol? Is it more efficient to focus on expressive language or receptive language goals? Should I use a parent training approach or direct therapy? And so on. Why is it unsafe to combine and compare effect sizes across studies to make these decisions?

The first issue that Simpson raises is that of comparison groups. Many, although not all, treatment trials compare an experimental intervention to either a ‘no treatment’ control group or a ‘usual care’ condition. The characteristics of the ‘no treatment’ and ‘usual care’ controls are inevitably poorly described if at all. And yet meta-analyses will combine effect sizes across many studies despite having a very poor sense of what the control condition is in the studies that are included in the final estimate of treatment effect. Control group and intervention descriptions can be so paltry that in some cases the experimental treatment of one study may be equivalent to the control condition of another study. The Law et al. (2003) review combined effect sizes for a number of RCTs evaluating phonological interventions. One intervention compared a treatment that was provided in 22 twice-weekly half hours sessions over a four month period to a wait list control (Almost & Rosenbaum, 1998). Another intervention involved monthly 45 minute sessions provided over 8 months, in comparison to a “watchful waiting” control in which many parents “dropped out” of the control condition (Glogowska et al. 2000). Inadequate information was provided about how much intervention the control group children accessed while they waited – almost anything is possible relative to the experimental condition in the Glogowska trial. For example, Yoder et al. (2005) observed that their control group actually accessed more treatment than the kids in their experimental treatment group which maybe explains why they did not obtain a main effect of their intervention (or not, who knows?). The point is that it is hard to know whether a small effect size in comparison to a robust control is more or less impressive than a large effect size in comparison to no treatment at all. Certainly, the comparison is not fair.

The second issue raised concerns range restriction in the population of interest. I realize now that I failed to take this into account when I repeated (in Rvachew & Brosseau-Lapré, 2018) the conclusion that dialogic reading interventions are more effective for low-income children than children with developmental language impairments (Mol et al., 2008). Effect sizes are inflated when the intervention is provided to only a restricted part of the population, and the selection variables are associated with the study outcomes. However, the inflation is greatest for the children near the middle of the distribution and least for children at the tails of the distribution. This fact may explain why effect sizes for vocabulary size after dialogic reading intervention are highest for middle class children (.58, Whitehurst et al. 1988), in the middle for lower class but normally developing children (.33, Lonigan & Whitehurst, 1998), and lowest for children with language impairments (.13, Crain-Thoreson & Dale, 1999). There are other potential explanatory factors in these studies but this issue with restricted range is an important variable that is of obvious importance in treatment trials directed at children with speech and language impairments. The low effect size for dialogic reading obtained by Crain-Thoreson & Dale should not by itself discourage use of dialogic reading with this population.

Finally, measurement validity plays a huge role with longer more valid tests improving effect sizes in comparison to shorter less valid tests. This might be important when comparing the relative effectiveness of therapy for different types of goals. Law et al. (2003) concluded that phonology therapy appeared to be more effective than therapy for syntax goals for example. For some reason the outcome measures in these two groups of studies tend to be very different. Phonology outcomes are typically assessed with picture naming tasks that include 25 to 100 items, with the outcome often expressed as percent consonants correct and therefore at the consonant level there are many items contributing to the test score. Sometimes the phonology outcome measure is created specifically to probe the child’s progress on the specific target of the phonology intervention. In both cases the outcome measure is likely to be a sensitive measure of the outcomes of the intervention. Surprisingly, in Law et al., the outcome of the studies of syntax interventions were quite often omnibus measures of language functioning, such as the Preschool Language Scale, or worse the Reynell Developmental Language Scale, neither test containing many items targeted specifically at the domain of the experimental intervention. When comparing effect sizes across studies, it is crucial to be sure that the outcome measures have equal reliability and validity as measures of the outcomes of interest.

My conclusion is that it is important to not make a fetish of meta-analyses and effect sizes. These kinds of studies provide just one kind of information that should be taken into account when making treatment decisions. Their value is only as good as the underlying research—overall, effect sizes are most trustworthy when they come from the same study or a series of studies involving the exact same independent and dependent variables and the same study population. Given that this is a rare occurrence in speech and language research, there is no real substitute for a deep knowledge of an entire literature on any given subject. Narrative reviews from “experts” (a much maligned concept!) still have a role to play.


Almost, D., & Rosenbaum, P. (1998). Effectiveness of speech intervention for phonological disorders: a randomized controlled trial. Developmental Medicine and Child Neuroloogy, 40, 319-325.

Crain-Thoreson, C., & Dale, P. S. (1999). Enhancing linguistic performance: Parents and teachers as book reading partners for children with language delays. Topics in Early Childhool Special Education, 19, 28-39.

Glogowska, M., Roulstone, S., Enderby, P., & Peters, T. (2000). Randomised controlled trial of community based speech and language therapy in preschool children. British Medical Journal, 321, 923-928.

Law, J., Garrett, Z., & Nye, C. (2003). Speech and language therapy interventions for children with primary speech and language delay or disorder (Cochrane Review). Cochrane Database of Systematic Reviews, Issue 3. Art. No.: CD004110. doi:10.1002/14651858.CD004110.

Lonigan, C. J., & Whitehurst, G. J. (1998). Relative efficacy of a parent teacher involvement in a shared-reading intervention for preschool children from low-income backgrounds. Early Childhood Research Quarterly, 13(2), 263-290.

Mol, S. E., Bus, A. G., de Jong, M. T., & Smeeta, D. J. H. (2008). Added value of dialogic parent-child book readings: A meta-analysis. Early Education and Development, 19, 7-26.

Rvachew, S., & Brosseau-Lapré, F. (2018). Developmental Phonological Disorders: Foundations of Clinical Practice (Second Edition). San Diego, CA: Plural Publishing.

Simpson, A. (2017). The misdirection of public policy: comparing and combining standardised effect sizes. Journal of Education Policy, 1-17. doi:10.1080/02680939.2017.1280183

Whitehurst, G. J., Falco, F., Lonigan, C. J., Fischel, J. E., DeBaryshe, B. D., Valdez-Menchaca, M. C., & Caulfield, M. (1988). Accelerating language development through picture book reading. Developmental Psychology, 24, 552-558.

Yoder, P. J., Camarata, S., & Gardner, E. (2005). Treatment effects on speech intelligibility and length of utterance in children with specific language and intelligibility impairments. Journal of Early Intervention, 28(1), 34-49.