Measuring speech rhythm through automated analysis of syllabic prominences
Abstract
Perceived fluency refers to the smoothness of speech delivery (Lickley, 2015). A key element contributing to fluency is speech rhythm, defined as the recurrence of perceivable temporal patterns of strong and weak elements over time (Gibbon & Gut, 2001). English is commonly categorized as a stress-timed language, where stressed syllables contrast with unstressed ones, and content words tend to be stressed while function words are reduced. This contrasted pattern at syllabic and lexical levels helps listeners in segmenting speech and focusing on essential information (Cutler, 2015).
Learning English as a foreign language thus implies correctly stressing words for ease of understanding (Isaacs et al., 2017). This can be particularly challenging when the learner’s L1 has a different rhythmic system. For instance, Japanese is characterized by a mora-timed rhythm, where each mora has a regular duration (Mihara & Takami, 2013). Thus, Japanese-English speech (JE) may exhibit different rhythmic patterns compared to Native English speech (NE). Specifically, we hypothesize that JE, compared to NE, demonstrates 1) lower prosodic contrast between syllables within polysyllabic words, and 2) a less pronounced difference between content and function words.
We tested these hypotheses using a 34-hour read-aloud corpus containing 877 JE and 91 NE samples. The JE samples were recordings of 42 Japanese university students with English proficiency levels ranging from CEFR A1 or below to B2, while the NE samples were recordings of 7 professional narrators. We aligned the reference texts to the recordings using the Montreal Forced Aligner (MFA) 3.0 (McAuliffe et al., 2017) and analyzed syllabic prominence with an adapted version of PLSPP (Pauses and Lexical Stress Processing Pipeline, Coulange et al., 2024). This pipeline uses syntactic analysis and speaker-normalized measures of pitch, intensity, and duration of each vowel interval to characterize the accuracy and degree of prominence of syllables in polysyllabic words. We extended the measures to monosyllabic words and compared how content and function words were pronounced. A manual evaluation of word-level alignment precision across the entire corpus showed 92.24% accuracy for NE and 79.67% for JE. Excluding words with an incorrect syllable count (29%) had minimal impact on the results. Thus, we report findings based on the full corpus.
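As a rough illustration of what "speaker-normalized" measures and a stressed/unstressed contrast could look like, the sketch below z-scores one speaker's vowel durations and takes the difference between a stressed and an unstressed syllable. It is not the PLSPP implementation: the function names, the use of z-scores, the difference-of-means contrast, and the duration values are all assumptions made for this example.

```python
from statistics import mean, pstdev

def z_normalize(values):
    """Z-score raw measurements against one speaker's own distribution (assumed normalization)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def syllabic_contrast(word, feature):
    """Mean normalized feature of stressed syllables minus that of unstressed ones.

    Returns None when the word lacks either a stressed or an unstressed syllable.
    This difference-of-means definition is a hypothetical stand-in for the real score.
    """
    stressed = [s[feature] for s in word if s["stressed"]]
    unstressed = [s[feature] for s in word if not s["stressed"]]
    if not stressed or not unstressed:
        return None
    return mean(stressed) - mean(unstressed)

# Hypothetical vowel durations (seconds) for all vowels of one speaker
durations = [0.08, 0.12, 0.06, 0.14]
z = z_normalize(durations)

# A hypothetical two-syllable word built from the 2nd and 3rd vowels above
word = [
    {"stressed": True, "duration": z[1]},
    {"stressed": False, "duration": z[2]},
]
contrast = syllabic_contrast(word, "duration")
```

Under these assumptions, a positive value means the stressed vowel is longer than the unstressed one relative to the speaker's own range; the same pattern could be applied to pitch and intensity.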
Two types of prosodic scores were calculated: Syllabic Contrast Scores between stressed and unstressed syllables within polysyllabic words (Figure 1), and Lexical Contrast Scores between content and function words, including monosyllabic words (Figure 2). These scores were compared according to the speakers' English proficiency levels. Repeated measures ANOVAs with Holm's Sequentially Rejective Bonferroni correction revealed significant tendencies: higher CEFR levels in JE corresponded with higher syllabic and lexical contrast scores, gradually approaching those of NE. Notably, duration was the strongest indicator of lexical contrast, suggesting that lower-level learners are more influenced by the rhythm of their L1.
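For readers unfamiliar with Holm's sequentially rejective Bonferroni correction, the following minimal sketch shows the standard step-down adjustment of a set of p-values. The function name and the input p-values are hypothetical and are not results from this study.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order.

    The smallest p-value is multiplied by m, the next by m - 1, and so on;
    adjusted values are forced to be non-decreasing and are capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw p-values from three pairwise comparisons
raw = [0.01, 0.04, 0.03]
adj = holm_adjust(raw)
significant = [p < 0.05 for p in adj]
```

A comparison is then declared significant when its adjusted p-value falls below the chosen alpha level (0.05 in this sketch).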
This study proposed a method to automatically measure the degree of prosodic contrast between syllables to characterize the influence of L1 rhythm on L2 English. The method proved promising for assessing English speech rhythm across various proficiency levels. Our next step is to conduct stress analysis at the sentence level to investigate how learners emphasize essential information within their utterances.
Main file
Measuring_Speech_Rhythm_through_Automated_Analysis_of_Syllabic_Prominences.pdf (151.49 Ko)
Nakanishi_Leiden_2024.pdf (2.79 Mo)
Origin: Files produced by the author(s)