On mispronunciation analysis of individual foreign speakers using auditory periphery models
KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH. (Centre for Speech Technology)
KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH. (Centre for Speech Technology) ORCID iD: 0000-0002-3323-5311
KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH. (Centre for Speech Technology)
2013 (English). In: Speech Communication, ISSN 0167-6393, Vol. 55, no. 5, pp. 691-706. Article in journal (Refereed), Published
Abstract [en]

In second language (L2) learning, a major difficulty is to discriminate between the acoustic diversity within an L2 phoneme category and that between different categories. We propose a general method for automatic diagnostic assessment of the pronunciation of non-native speakers based on models of the human auditory periphery. Considering each phoneme class separately, the geometric shape similarity between the native auditory domain and the non-native speech domain is measured. The phonemes that deviate the most from the native pronunciation for a set of L2 speakers are detected by comparing the geometric shape similarity measure with that calculated for native speakers on the same phonemes. To evaluate the system, we have tested it with different non-native speaker groups from various language backgrounds. The experimental results are in accordance with linguistic findings and human listeners' ratings, particularly when both the spectral and temporal cues of the speech signal are utilized in the pronunciation analysis.

Place, publisher, year, edition, pages
2013. Vol. 55, no 5, 691-706 p.
Keyword [en]
second language learning, auditory model, distortion measure, perceptual assessment, pronunciation error detection, phoneme
National Category
Signal Processing; Other Computer and Information Science
URN: urn:nbn:se:kth:diva-102319; DOI: 10.1016/j.specom.2013.01.004; ISI: 000318744800008; ScopusID: 2-s2.0-84876245465; OAI: diva2:552323
Swedish Research Council, 80449001

QC 20130614. Updated from submitted to published.

Available from: 2012-09-13. Created: 2012-09-13. Last updated: 2013-06-14. Bibliographically approved.
In thesis
1. Perceptually motivated speech recognition and mispronunciation detection
2012 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This doctoral thesis is the result of a research effort performed in two fields of speech technology, i.e., speech recognition and mispronunciation detection. Although the two areas are clearly distinguishable, the proposed approaches share a common hypothesis based on psychoacoustic processing of speech signals. The conjecture implies that the human auditory periphery provides a relatively good separation of different sound classes. Hence, it is possible to use recent findings from psychoacoustic perception together with mathematical and computational tools to model the auditory sensitivities to small speech signal changes.

The performance of an automatic speech recognition system strongly depends on the representation used for the front-end. If the extracted features do not include all relevant information, the performance of the classification stage is inherently suboptimal. The work described in Papers A, B and C is motivated by the fact that humans perform better at speech recognition than machines, particularly in noisy environments. The goal is to make use of knowledge of human perception in the selection and optimization of speech features for speech recognition. These papers show that maximizing the similarity of the Euclidean geometry of the features to the geometry of the perceptual domain is a powerful tool to select or optimize features. Experiments with a practical speech recognizer confirm the validity of the principle. An approach to improve mel frequency cepstrum coefficients (MFCCs) through offline optimization is also presented. The method has three advantages: i) it is computationally inexpensive, ii) it does not use the auditory model directly, thus avoiding its computational cost, and iii) importantly, it provides better recognition performance than traditional MFCCs for both clean and noisy conditions.

The second task concerns automatic pronunciation error detection. The research, described in Papers D, E and F, is motivated by the observation that almost all native speakers perceive, relatively easily, the acoustic characteristics of their own language when it is produced by speakers of the language. Small variations within a phoneme category, sometimes different for various phonemes, do not significantly change the perception of the language’s own sounds. Several methods are introduced to identify the problematic phonemes for a given speaker, based on similarity measures between the Euclidean space spanned by the acoustic representations of the speech signal and the Euclidean space spanned by an auditory model output. The methods are tested for groups of speakers from different languages and evaluated against a theoretical linguistic study, showing that they can capture many of the problematic phonemes that speakers from each language mispronounce. Finally, a listening test on the same dataset verifies the validity of these methods.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2012. xxi, 79 p.
Trita CSC-A, ISSN 1653-5723; 2012:10
Keyword [en]
feature extraction, feature selection, auditory models, MFCCs, speech recognition, distortion measures, perturbation analysis, psychoacoustics, human perception, sensitivity matrix, pronunciation error detection, phoneme, second language, perceptual assessment
National Category
Computer Science; Signal Processing; Media and Communication Technology; Other Computer and Information Science
urn:nbn:se:kth:diva-102321 (URN); 978-91-7501-468-5 (ISBN)
Public defence
2012-10-05, A2, Östermalmsgatan 26, KTH, Stockholm, 10:00 (English)
European Union FP6-034362 research project ACORNS; Computer-Animated language Teachers (CALATea)

QC 20120914

Available from: 2012-09-14. Created: 2012-09-13. Last updated: 2012-09-14. Bibliographically approved.

Authors
Koniaris, Christos; Salvi, Giampiero; Engwall, Olov