L2 learning of new phonetic contrasts: How hard is that?

Given my career-long interest in the impact of perceptual knowledge on speech production learning, it was gratifying to read the meta-analysis by Sakai and Moorman (2018), which concluded: “Ultimately, the present meta-analysis was able to show that perception-only training can lead to production gains. This finding is encouraging to L2 instructors and learners.” They found that, when teaching adult L2 learners to perceive a foreign language phonetic contrast, an average effect size of .92 was obtained for gains in perception while an average effect size of .54 was obtained for gains in production accuracy for the same phonetic contrast. They reviewed studies in which no production practice or training was provided; therefore, these changes in production accuracy were a direct effect of the perceptual training procedure.

Françoise and I were puzzled by one part of their paper, however. They excluded 12 studies because, they claimed, they were unable to obtain the data required for the calculation of effect sizes from the authors. Our prior study on training English speakers to perceive French vowels was excluded on this basis despite the fact that I am right here with the data all neat and tidy in spreadsheets (I guess these things happen, although this is the second time that it has happened to me, so it is getting to be annoying). Nonetheless, we have all the data required to calculate those effect sizes, so I provide them here for each group. Our study manipulated variability in the talkers (multiple versus single talker) and the position of the training vowels on the continuum (far from the category boundary, prototypical location in category space, and close to the category boundary), yielding six experimental conditions. Adding a control condition in which listeners categorized grammatical items rather than vowel tokens, we have seven conditions: control (CON), single voice prototype (SVP), multiple voice prototype (MVP), single voice far (SVF), multiple voice far (MVF), single voice close (SVC), multiple voice close (MVC). I provide the effect sizes below, calculated as described by Sakai and Moorman. The figures in our paper reflect our statistical analysis, which indicated a reliable effect of the training on perception in the MVF and MVC conditions but, to our disappointment, no reliable effect on production. All groups improved production (including the control group) when acoustic measures were considered, but these acoustic changes were not perceptible to native French listeners: there was no significant effect of time (pre- to post-training) or condition, and no interaction, when we submitted listener ratings of the participants’ productions to a repeated measures ANOVA. 
Nonetheless, some moderate effect sizes are seen below in the SVC and MVC conditions, relative to the CON condition. Two effect sizes are reported for the perception outcomes and the production outcomes: ES(PP) which reflects pre- to post-training changes and ES(PPC) which reflects the difference between the change observed in the experimental group versus the control group.
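
As a rough illustration of how two effect sizes of this kind can be computed from summary statistics, here is a minimal Python sketch. It assumes one common formulation: ES(PP) as the pre- to post-training mean change divided by the pre-test standard deviation, and ES(PPC) as the experimental group's change minus the control group's change, divided by the pooled pre-test standard deviation. The exact formulas used by Sakai and Moorman (including any small-sample correction) may differ. The function names and the example numbers are hypothetical and are not data from our study.

```python
import math


def es_pp(pre_mean, post_mean, pre_sd):
    """Pre-post effect size for one group (assumed form: change / pre-test SD)."""
    return (post_mean - pre_mean) / pre_sd


def pooled_sd(sd_a, n_a, sd_b, n_b):
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(
        ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    )


def es_ppc(exp_pre, exp_post, exp_pre_sd, n_exp,
           con_pre, con_post, con_pre_sd, n_con):
    """Pre-post-control effect size (assumed form): the difference between the
    experimental and control changes, divided by the pooled pre-test SD."""
    sd = pooled_sd(exp_pre_sd, n_exp, con_pre_sd, n_con)
    return ((exp_post - exp_pre) - (con_post - con_pre)) / sd


# Made-up percent-correct scores for one training group and a control group.
print(es_pp(60.0, 70.0, 10.0))
print(es_ppc(60.0, 70.0, 10.0, 10, 60.0, 62.0, 10.0, 10))
```

Under this formulation, any improvement in the control group (for example, from simply repeating the test) is subtracted out of ES(PPC) but not out of ES(PP).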

[Table: Brosseau-Lapré et al., Applied Psycholinguistics, perception effect sizes (ES)]

This table reports similar findings to those described by Sakai and Moorman in that the ES(PPC) is considerably larger than the ES(PP) although this latter ES is smaller than the mean ES that they found. However, we had not expected all of our conditions to be equally effective. Overall, we concluded that training with multiple talkers was most effective when listeners were presented with a range of vowel stimuli that were far from the category boundary; however, training with a single voice was most effective when listeners were presented with vowel stimuli that were close to the category boundary.

Regarding production outcomes, however, the figures with error bars in our paper really tell the story most clearly, showing no reliable effect of the perception training on production outcomes from the listeners’ perspective despite large changes in acoustic parameters for all groups, including the control group, suggesting that the perception testing alone (as conducted pre- and post-training) has at least a short-term effect on production.

[Table: Brosseau-Lapré et al., Applied Psycholinguistics, production effect sizes (ES)]

With respect to changes in production, these effect size data might suggest an effect of perception training on production in the single voice close and multiple voice close conditions, but overall there were no statistically significant findings for production accuracy and, despite small improvements in the SVC and MVC conditions, the average ratings are not good. This is one issue I always have with meta-analyses: they are concerned with the size of “effects” as measured by d, but the d values do not tell you whether the “effects” in any of the studies so aggregated were actual effects, as in statistically or even functionally significant. Now, theoretically, if you aggregate a lot of moderate effect sizes from a lot of underpowered studies, they could add up to something, but in this case I think we have a picture of an effect that is very idiosyncratic. We really don’t know why some L2 participants learn these contrasts and some don’t in the perception or production domains. Sakai and Moorman do us a service by exploring some potential sources of heterogeneity in outcomes. It is possible that many training sessions, targeting about three contrasts over a three-hour total training period and completed at home, are optimal. Furthermore, in terms of participant characteristics, beginners make more progress than intermediate-level learners. Overall, however, the characteristics of successful versus unsuccessful learners are not clear despite a growing number of studies that examine underlying perceptual and cognitive skills as predictors. Personally, I find it a bit discouraging to read the Sakai and Moorman paper. The authors were quite excited to find a reliable moderate effect size across a growing number of studies. 
But I know that those effect sizes are associated with a lot of people who cannot produce a foreign language speech sound that sounds even halfway “native-like.” After forty years of work involving quite sophisticated methods for designing stimuli and training regimens, I thought we would be further along. Definitely more work to do on this problem.
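
To make the point about aggregating underpowered studies concrete, here is a minimal Python sketch of fixed-effect, inverse-variance pooling. It is an illustration under stated assumptions, not the model used by Sakai and Moorman: the ten identical studies (d = 0.5, 10 participants per group) are invented, the variance formula is the usual large-sample approximation for Cohen's d with two independent groups, and the function names are hypothetical.

```python
import math


def d_variance(d, n1, n2):
    """Approximate sampling variance of Cohen's d for two independent groups."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))


def ci95(d, variance):
    """Normal-approximation 95% confidence interval around an effect size."""
    half_width = 1.96 * math.sqrt(variance)
    return d - half_width, d + half_width


def pool_fixed(effects, variances):
    """Fixed-effect pooled estimate: weight each study by 1 / variance."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    return pooled, 1.0 / sum(weights)


# Ten invented small studies: (d, n in group 1, n in group 2).
studies = [(0.5, 10, 10)] * 10
effects = [d for d, _, _ in studies]
variances = [d_variance(d, n1, n2) for d, n1, n2 in studies]

single_ci = ci95(effects[0], variances[0])
pooled_d, pooled_var = pool_fixed(effects, variances)
pooled_ci = ci95(pooled_d, pooled_var)

print("one study:", single_ci)   # interval spans zero
print("pooled:   ", pooled_ci)   # interval excludes zero
```

In this toy case no single study is reliable on its own, yet the pooled estimate is. The pooled d still says nothing about whether any individual learner ended up with a functionally acceptable production.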



Support for Speech Perception Interventions in Speech Therapy

I am writing a third blog on this strange experimental protocol in which the talker produces a syllable repeatedly and the talker’s speech output is altered in a systematic fashion so that the talker hears him or herself say something that does not correspond to their own articulatory gestures. I am fascinated by these experiments because they are a window onto feedback control, which is essential for a successful speech therapy outcome. Initially, in traditional speech therapy, the SLP provides a lot of external feedback about the child’s articulatory gestures (knowledge of performance feedback) and the correctness of the child’s speech output (knowledge of results feedback). But given that the SLP cannot follow the child around outside the clinic room, eventually the child must learn to use self-generated feedback for speech motor learning to occur. Can children use auditory feedback to change their own speech?

In a previous blog, On Birds and Speech Therapy, I discussed interesting work from Queen’s University suggesting that toddlers do not use feedback control like adults do during speech motor learning. These researchers found that adults will compensate for perturbations of their own speech by adjusting their articulation to get the desired auditory feedback. In contrast, very young children do not compensate in this way. I suggested that this may be because toddlers do not perceive speech with the same degree of precision as adults. This hypothesis was supported by another study in which speakers of French and English did not show the same compensation effect to a perturbation that made their vowels sound like a French vowel. The English talkers did not respond to a perturbation to which they were not perceptually sensitive (see Feedback Control and Speech Therapy Revisited).

Recently, I was delighted to find another study involving children that provides even stronger confirmation that perceptual representations play a key role in the child’s ability to use feedback for speech motor learning. Shiller and Rochon (2014) randomly assigned 5- to 7-year-old children with typical speech to two training conditions: the control group received speech perception training for the /b/-/d/ contrast; the experimental group received speech perception training for the /ɛ/-/æ/ contrast. Prior to and subsequent to this training, both groups experienced the perturbation experiment: both groups repeatedly said “Beb” while their own speech was altered to sound more like “Bab”. Prior to perceptual training, both groups showed a small compensation for this perturbation in the feedback of their own speech. After speech perception training, the experimental group showed twice as much compensation as before, whereas the control group showed no change in the amount of compensation. The results show that children can indeed use feedback for speech motor adaptation; furthermore, this ability improves as perceptual boundaries between phoneme categories become better defined, with age or with training.

The conclusions of the study are very gratifying. Citing my own work on the importance of speech perception training as a strategy to facilitate speech production learning by children with speech sound disorders, the authors conclude:

“The results of the present study complement this work nicely, demonstrating that improvements in children’s auditory perceptual abilities do not simply improve motor performance, but also alter the capacity for auditory-feedback based speech motor learning—a process that is central to the clinical treatment of speech production disorders.” (p. 1314)

No surprise that I like this study a lot!

Extraterrestrials and Speech Therapy

I am somewhat disheartened to stumble across another paper, based on a case study of a single child, advising speech-language pathologists to focus treatment on speech production while ignoring the child’s obvious difficulties with speech perception (McAllister Byun, 2012). The problem that I have with this paper is not that it is a case study (these can be very useful in clinical and research contexts) but that the conclusions are based entirely on phonetic transcriptions of a child’s speech and of the speech stimuli used to assess the child’s perceptual abilities. I believe that this leads to what is probably an erroneous conclusion about the child’s speech production accuracy. This may be alarming to clinical readers since phonetic transcription is your primary tool for describing children’s phonological knowledge. However, in a recent paper, Munson et al. (2010) explain why it is not clear that “alien anthropologists would come up with anything remotely like phonetic transcription to characterize human speech”. Extraterrestrials with a different communicative apparatus may be better placed to realize that phonetic transcription provides a highly biased and often inaccurate picture of what the child is doing when articulating the phonemes that we are testing in our assessments. Munson and colleagues present compelling data to that effect but concede in the conclusions that the extraterrestrials may also come to realize that humans usually don’t have the time or resources to obtain unbiased data via instrumental analyses. However, if we must use phonetic transcription, we must at least be aware of the limitations so that we can avoid the error that is made in McAllister Byun’s case study. For I believe that there is a significant error that will do harm to children in speech therapy unless we understand the points made in Munson’s fantasy about extraterrestrial anthropologists.

McAllister Byun begins by acknowledging that it is now well accepted that speech perception difficulties are associated with speech production errors which is a good thing because I am real tired of devoting research time to proving that over and over again. In fact I got bored with that question a long time ago and went on to the next step – establishing direction of causality. Theoretically, difficulties with speech production accuracy could precede and cause misperception of speech sounds and in fact I was taught that this was so when I studied speech therapy at the University of Alberta in the 1970s. McAllister Byun updates the idea with an intriguing explanation for this hypothetical effect involving the role of the child’s own productions in the population of exemplars that contribute to the child’s perceptual knowledge of the target phoneme. The clinical implications of this hypothesis (if true) are clear; if the child misarticulates /k/ → [t], teach the child to articulate /k/ correctly and any misperception of the contrast will correct itself. On the other hand, misperception of the /k/-/t/ contrast could precede and cause the failure to acquire the appropriate articulatory gestures for accurate production of the /k/ phoneme. I think that this latter hypothesis makes sense because the infant’s speech perception skills begin to develop at least six months in advance of the production of speech-like articulation (in the form of babble) and therefore I think that speech perception typically precedes speech production development although there is a reciprocal relationship in the acquisition of precision in both domains throughout childhood. This hypothesis is also consistent with the DIVA model of speech motor control as described in Shiller, Rvachew & Brosseau-Lapré (2010). 
I have supported this hypothesis with four types of studies: (1) linear structural equation modeling showing good fit to the “perception leads production” hypothesis and poor fit for the alternative (Rvachew & Grawburg, 2006); (2) a longitudinal study showing that perception skills predict growth in articulation accuracy but not the reverse (Rvachew, 2006); (3) single subject experiments showing that treating speech perception increases speech production accuracy (Jamieson & Rvachew, 1992); and (4) randomized controlled trials showing that speech perception and speech production training combined is much more efficient and effective than speech production training alone (for review see Chapter 9, Rvachew & Brosseau-Lapré, 2012). Furthermore, in one of these trials I showed specifically that speech production training did not lead to improved speech perception (Rvachew, 1994). Therefore, I recommend that speech perception and speech production treatment procedures be conducted in parallel, with the “input oriented” activities preceding the “output oriented” activities to a greater or lesser extent depending upon the needs of the child. Should I reconsider these recommendations, based on over 30 years of clinical practice and research findings, after reading McAllister Byun’s paper? Not at all – let’s look at it carefully.

McAllister Byun describes a 4-year-old boy who was given a “provisional diagnosis of CAS…based on the presence of characteristics including atypical prosody, inconsistent errors and vowel errors…” (p. 402). The child fronted velars in syllable onsets (referred to as “strong position”) but not in syllable codas (referred to as “weak position”). This is thought to be an anomaly because implicational relationships dictate that accuracy in the weak position implies accuracy in the strong position. Redford & Diehl (1999) is cited as evidence for greater perceptual prominence of the onset position (making it the strong position). If you read Redford and Diehl however you find that the adults in their study did not find perception of /k/ to be easier in the onset compared to the coda (these relationships were phoneme specific and therefore gross generalizations about positional prominence should not be made). More to the point, the child’s perception of /k/-/t/ was tested using a perceptual test based on same-different judgments of recorded natural speech stimuli. The results revealed equally poor discrimination performance for the /k/-/t/ contrast in onsets and codas. The author concluded that, in this case, production accuracy was “leading” the child’s acquisition of perceptual knowledge of the contrast. The author further concludes that, for this particular case, the deficit in perception could be attributed to “a primary deficit in production” and therefore “motor-oriented therapy may be optimal”. If you believe that speech development is an “either-or” affair where the phoneme contrast is discriminated or not discriminated in the perceptual domain and the target phoneme is produced correctly or not in the articulatory domain, I suppose that this might make sense. However, speech development is a process of gradually acquiring knowledge of multiple phonetic characteristics that are distributed in a continuous fashion across the category. 
Studies of children’s phonetic knowledge of phoneme categories show that it is not a safe assumption that this child had achieved articulatory accuracy for /k/ in the coda position in advance of perceptual knowledge of the /k/-/t/ contrast.

In our book, Françoise and I stress repeatedly that it is not enough to ask if the child perceives any given contrast. Rather, we want to know “how” the child perceives the contrast: “Phonetic categories are an emergent property of the distribution of acoustic information across parametric phonetic space, built up over time as the language learner stores detailed memory traces of experienced words. Each language learner must discover a strategy for abstracting phonetic structure from the input that is adapted to the nature of the input that is received. Assessing the language learner’s perceptual knowledge requires sophisticated tools that reveal the listener’s perceptual strategies for making sense of highly complex and variable input …” (p. 46). The test used by McAllister Byun clearly does not meet this standard; we have no way of knowing which acoustic cues the child was attending to when completing the task. The acoustic cues for perception of /k/ include all the spectral moments (mean, variance, skewness and kurtosis) that can be measured for the stop burst (Forrest et al., 1990) as well as many acoustic characteristics of the formant transitions that tie the release burst to the vowel (Dorman et al., 1977; Nguyen et al., 2009). Adults and children with normally developing speech differentiate /k/ and /t/ in production largely on the basis of the spectral mean. Three different patterns are seen among children with speech disorders: (1) they may not differentiate the phonemes at all (i.e., they have no contrast) or (2) they may produce a covert contrast (their /k/ targets are perceived as [t] even though they are acoustically different from /t/ targets) or (3) they may produce a perceptible /k/-/t/ contrast that is differentiated on the basis of nonstandard cues. Nonstandard cues in the latter two situations may include skewness and kurtosis in the burst; alternatively, the child may ignore the burst and manipulate the slope of the formant frequency transitions. 
Reliance on non-standard cues or cue-weighting strategies in perception may lead to variable performance in perception and production.
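
For readers unfamiliar with spectral moments, here is a minimal Python sketch of one common way to compute them: the power spectrum of the burst is normalized so that it can be treated like a probability distribution over frequency, and the mean, variance, skewness and kurtosis of that distribution are then calculated. The details are assumptions for illustration only (a linear frequency scale, excess kurtosis with 3 subtracted, a made-up three-point spectrum) and are not necessarily the analysis settings used by Forrest et al.; the function name is hypothetical.

```python
import math


def spectral_moments(freqs_hz, power):
    """Return (mean, variance, skewness, excess kurtosis) of a power spectrum
    treated as a probability distribution over frequency."""
    total = sum(power)
    probs = [p / total for p in power]  # normalize so the weights sum to 1

    mean = sum(f * p for f, p in zip(freqs_hz, probs))
    variance = sum((f - mean) ** 2 * p for f, p in zip(freqs_hz, probs))
    sd = math.sqrt(variance)

    # Standardized third and fourth central moments.
    skewness = sum((f - mean) ** 3 * p for f, p in zip(freqs_hz, probs)) / sd ** 3
    kurtosis = sum((f - mean) ** 4 * p for f, p in zip(freqs_hz, probs)) / sd ** 4 - 3.0
    return mean, variance, skewness, kurtosis


# Toy symmetric "burst spectrum": energy centred on 2000 Hz.
toy_freqs = [1000.0, 2000.0, 3000.0]
toy_power = [1.0, 2.0, 1.0]
print(spectral_moments(toy_freqs, toy_power))
```

Two bursts can share the same spectral mean and still differ in skewness or kurtosis, which is one way a contrast can be present acoustically without being heard by the listener.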

How might a child with incomplete knowledge of the acoustic properties of this contrast achieve perceptually accurate production in codas and inaccurate production in onsets?  Using electropalatography, Gibbon & Wood (2002) describe “articulatory drift” whereby placement of an undifferentiated lingual gesture at onset is different from the placement at release, resulting in variable perceptual outcomes for alveolar and velar targets, such that /t/ → [t, k] and /k/ → [t, k]. Gibbon (1999) demonstrated how a child can learn to control the release phase of the gesture to achieve the contrast without fundamentally changing the undifferentiated lingual gesture itself. In this case, the adult listener believes that the child has acquired the contrast productively but the child’s underlying articulatory patterns continue to be immature.

I actually think it makes sense that the child’s own productions might have some sort of downstream effect on the child’s perception of phoneme contrasts. Perhaps McAllister Byun’s case is an example of that, especially given the “provisional diagnosis of CAS” in this case. However, the assessment information provided is inadequate to prove the hypothesis. We do not know which acoustic cues the child attended to when differentiating /k/ from /t/ in the perceptual domain. We do not know the topography of the child’s articulatory gestures when producing the contrast, given that the primary data in the paper is phonetic transcription. In our book, Françoise and I describe cases like McAllister Byun’s who received “motor oriented therapy” and failed to make measurable progress in therapy over three years! My interpretation of this case is that the child was probably attending to the formant frequency transitions in perception, which results in erratic perceptual performance in both onset and coda. Productively, the child may manipulate the timing of the release of the undifferentiated lingual gesture so as to produce [t] in the onset but a perceptually accurate but phonetically inaccurate [k] in the coda. His phonetic knowledge of the contrast is incomplete in the perceptual and articulatory domains in both the onset and the coda. The treatment program needs to address his perceptual, articulatory and phonological knowledge of the /k/ phoneme. SLPs, not having access to EPG and speech synthesizers and other research tools for precisely mapping the child’s phonetic knowledge at all levels of phonological representation, can only guess as to the status of the child’s knowledge in these domains. The safest assumption is that the child’s knowledge is incomplete at all levels and the most prudent course of action is to address all three. Your therapy will be more effective and efficient in the long run.


Dorman, M., M. Studdert-Kennedy, et al. (1977). “Stop-consonant recognition: Release bursts and formant transitions as functionally equivalent, context-dependent cues.” Attention, Perception, & Psychophysics 22(2): 109-122. http://www.springerlink.com/content/8583238315777761/

Forrest, K., G. Weismer, et al. (1990). “Statistical analysis of word-initial /k/ and /t/ produced by normal and phonologically disordered children.” Clinical Linguistics & Phonetics 4(4): 327-340. http://informahealthcare.com/doi/abs/10.3109/02699209008985495

Gibbon, F. E. (1999). “Undifferentiated lingual gestures in children with articulation/phonological disorders.” Journal of Speech, Language, and Hearing Research 42: 382-397. http://bit.ly/Nj4VIf

Gibbon, F. and S. E. Wood (2002). “Articulatory drift in the speech of children with articulation and phonological disorders.” Perceptual and Motor Skills 95: 295-307.

Jamieson, D. G. and S. Rvachew (1992). “Remediation of speech production errors with sound identification training.” Journal of Speech-Language Pathology and Audiology 16: 201-210. [OPEN ACCESS]


McAllister Byun, T. (2012). “Bidirectional perception–production relations in phonological development: evidence from positional neutralization.” Clinical Linguistics & Phonetics 26(5): 397-413.


Nguyen, V. S., E. Castelli, et al. (2009). Vietnamese final stop consonants /p, t, k/ described in terms of formant transition slopes. 2009 International Conference on Asian Language Processing: Recent Advances in Asian Language Processing, IALP 2009. Singapore: 86-90. [OPEN ACCESS]


Munson, B., J. Edwards, et al. (2010). “Deconstructing phonetic transcription: Covert contrast, perceptual bias, and an extraterrestrial view of Vox Humana.” Clinical Linguistics & Phonetics 24: 245-260. http://informahealthcare.com/doi/abs/10.3109/02699200903532524

Redford, M. A. and R. L. Diehl (1999). “The relative perceptual distinctiveness of initial and final consonants in CVC syllables.” The Journal of the Acoustical Society of America 106(3): 1555-1565. http://asadl.org/jasa/resource/1/jasman/v106/i3/p1555_s1

Rvachew, S. (1994). “Speech perception training can facilitate sound production learning.” Journal of Speech and Hearing Research 37: 347-357. http://bit.ly/Qt0Piv

Rvachew, S. (2006). “Longitudinal prediction of implicit phonological awareness skills.” American Journal of Speech-Language Pathology 15: 165-176. http://bit.ly/RMcfMZ

Rvachew, S. and F. Brosseau-Lapré (2012). Developmental Phonological Disorders: Foundations of Clinical Practice. San Diego, CA, Plural Publishing, Inc. http://bit.ly/vIliz2

Rvachew, S. and M. Grawburg (2006). “Correlates of phonological awareness in preschoolers with speech sound disorders.” Journal of Speech, Language, and Hearing Research 49: 74-87. http://bit.ly/RsQ2ER

Shiller, D. M., S. Rvachew, et al. (2010). “Importance of the auditory perceptual target to the achievement of speech production accuracy.” Canadian Journal of Speech-Language Pathology and Audiology 34: 181-192. http://bit.ly/PSlmXk