Using effect sizes to choose a speech therapy approach

I am quite intrigued by the warning offered by Adrian Simpson in his paper “The misdirection of public policy: comparing and combining standardized effect sizes

The context for the paper is the tendency of public policy makers to rely on meta-analyses to make decisions such as, for example, should we improve teachers’ feedback skills or reduce class sizes as a means of raising student performance? Simpson shows that that meta-analyses (and meta-analyses of the meta-analyses!) are a poor tool for making these apples to oranges comparisons and cannot be relied upon as a source of information when making public policy decisions such as this. He identifies three specific issues with research design that invalidate the combining and comparing of effect sizes. I think that these are good issues to keep in mind when considering effect sizes as a clue to treatment efficacy and a source of information when choosing a speech or language therapy approach.

Recall that an effect size is a standardized mean difference, whereby the difference between means (i.e., the mean outcome of the treatment condition versus the mean outcome of the control condition) is expressed in standard deviation units. The issue is that the standard deviation units, which are supposed to reflect the variation in outcome scores between participants in the intervention trial, actually reflect many different aspects of the research design. Therefore if you compare the effect size of an intervention as obtained in one treatment trial with the effect size for another intervention as obtained in a different treatment trial, you cannot be sure that the difference is due to differences in the relative effectiveness of the two treatments. And yet, SLPs are asking themselves these kinds of questions every day: should I use a traditional articulation therapy approach or a phonological approach? Should I add nonspeech oral motor exercises to my traditional treatment protocol? Is it more efficient to focus on expressive language or receptive language goals? Should I use a parent training approach or direct therapy? And so on. Why is it unsafe to combine and compare effect sizes across studies to make these decisions?

The first issue that Simpson raises is that of comparison groups. Many, although not all, treatment trials compare an experimental intervention to either a ‘no treatment’ control group or a ‘usual care’ condition. The characteristics of the ‘no treatment’ and ‘usual care’ controls are inevitably poorly described if at all. And yet meta-analyses will combine effect sizes across many studies despite having a very poor sense of what the control condition is in the studies that are included in the final estimate of treatment effect. Control group and intervention descriptions can be so paltry that in some cases the experimental treatment of one study may be equivalent to the control condition of another study. The Law et al. (2003) review combined effect sizes for a number of RCTs evaluating phonological interventions. One intervention compared a treatment that was provided in 22 twice-weekly half hours sessions over a four month period to a wait list control (Almost & Rosenbaum, 1998). Another intervention involved monthly 45 minute sessions provided over 8 months, in comparison to a “watchful waiting” control in which many parents “dropped out” of the control condition (Glogowska et al. 2000). Inadequate information was provided about how much intervention the control group children accessed while they waited – almost anything is possible relative to the experimental condition in the Glogowska trial. For example, Yoder et al. (2005) observed that their control group actually accessed more treatment than the kids in their experimental treatment group which maybe explains why they did not obtain a main effect of their intervention (or not, who knows?). The point is that it is hard to know whether a small effect size in comparison to a robust control is more or less impressive than a large effect size in comparison to no treatment at all. Certainly, the comparison is not fair.

The second issue raised concerns range restriction in the population of interest. I realize now that I failed to take this into account when I repeated (in Rvachew & Brosseau-Lapré, 2018) the conclusion that dialogic reading interventions are more effective for low-income children than children with developmental language impairments (Mol et al., 2008). Effect sizes are inflated when the intervention is provided to only a restricted part of the population, and the selection variables are associated with the study outcomes. However, the inflation is greatest for the children near the middle of the distribution and least for children at the tails of the distribution. This fact may explain why effect sizes for vocabulary size after dialogic reading intervention are highest for middle class children (.58, Whitehurst et al. 1988), in the middle for lower class but normally developing children (.33, Lonigan & Whitehurst, 1998), and lowest for children with language impairments (.13, Crain-Thoreson & Dale, 1999). There are other potential explanatory factors in these studies but this issue with restricted range is an important variable that is of obvious importance in treatment trials directed at children with speech and language impairments. The low effect size for dialogic reading obtained by Crain-Thoreson & Dale should not by itself discourage use of dialogic reading with this population.

Finally, measurement validity plays a huge role with longer more valid tests improving effect sizes in comparison to shorter less valid tests. This might be important when comparing the relative effectiveness of therapy for different types of goals. Law et al. (2003) concluded that phonology therapy appeared to be more effective than therapy for syntax goals for example. For some reason the outcome measures in these two groups of studies tend to be very different. Phonology outcomes are typically assessed with picture naming tasks that include 25 to 100 items, with the outcome often expressed as percent consonants correct and therefore at the consonant level there are many items contributing to the test score. Sometimes the phonology outcome measure is created specifically to probe the child’s progress on the specific target of the phonology intervention. In both cases the outcome measure is likely to be a sensitive measure of the outcomes of the intervention. Surprisingly, in Law et al., the outcome of the studies of syntax interventions were quite often omnibus measures of language functioning, such as the Preschool Language Scale, or worse the Reynell Developmental Language Scale, neither test containing many items targeted specifically at the domain of the experimental intervention. When comparing effect sizes across studies, it is crucial to be sure that the outcome measures have equal reliability and validity as measures of the outcomes of interest.

My conclusion is that it is important to not make a fetish of meta-analyses and effect sizes. These kinds of studies provide just one kind of information that should be taken into account when making treatment decisions. Their value is only as good as the underlying research—overall, effect sizes are most trustworthy when they come from the same study or a series of studies involving the exact same independent and dependent variables and the same study population. Given that this is a rare occurrence in speech and language research, there is no real substitute for a deep knowledge of an entire literature on any given subject. Narrative reviews from “experts” (a much maligned concept!) still have a role to play.


Almost, D., & Rosenbaum, P. (1998). Effectiveness of speech intervention for phonological disorders: a randomized controlled trial. Developmental Medicine and Child Neuroloogy, 40, 319-325.

Crain-Thoreson, C., & Dale, P. S. (1999). Enhancing linguistic performance: Parents and teachers as book reading partners for children with language delays. Topics in Early Childhool Special Education, 19, 28-39.

Glogowska, M., Roulstone, S., Enderby, P., & Peters, T. (2000). Randomised controlled trial of community based speech and language therapy in preschool children. British Medical Journal, 321, 923-928.

Law, J., Garrett, Z., & Nye, C. (2003). Speech and language therapy interventions for children with primary speech and language delay or disorder (Cochrane Review). Cochrane Database of Systematic Reviews, Issue 3. Art. No.: CD004110. doi:10.1002/14651858.CD004110.

Lonigan, C. J., & Whitehurst, G. J. (1998). Relative efficacy of a parent teacher involvement in a shared-reading intervention for preschool children from low-income backgrounds. Early Childhood Research Quarterly, 13(2), 263-290.

Mol, S. E., Bus, A. G., de Jong, M. T., & Smeeta, D. J. H. (2008). Added value of dialogic parent-child book readings: A meta-analysis. Early Education and Development, 19, 7-26.

Rvachew, S., & Brosseau-Lapré, F. (2018). Developmental Phonological Disorders: Foundations of Clinical Practice (Second Edition). San Diego, CA: Plural Publishing.

Simpson, A. (2017). The misdirection of public policy: comparing and combining standardised effect sizes. Journal of Education Policy, 1-17. doi:10.1080/02680939.2017.1280183

Whitehurst, G. J., Falco, F., Lonigan, C. J., Fischel, J. E., DeBaryshe, B. D., Valdez-Menchaca, M. C., & Caulfield, M. (1988). Accelerating language development through picture book reading. Developmental Psychology, 24, 552-558.

Yoder, P. J., Camarata, S., & Gardner, E. (2005). Treatment effects on speech intelligibility and length of utterance in children with specific language and intelligibility impairments. Journal of Early Intervention, 28(1), 34-49.