Perceptually motivated speech recognition and mispronunciation detection
KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. (Centre for Speech Technology)
2012 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This doctoral thesis is the result of a research effort performed in two fields of speech technology: speech recognition and mispronunciation detection. Although the two areas are clearly distinguishable, the proposed approaches share a common hypothesis based on psychoacoustic processing of speech signals. The conjecture implies that the human auditory periphery provides a relatively good separation of different sound classes. Hence, it is possible to use recent findings from psychoacoustics together with mathematical and computational tools to model the auditory sensitivity to small changes in the speech signal.

The performance of an automatic speech recognition system strongly depends on the representation used in the front-end. If the extracted features do not include all relevant information, the performance of the classification stage is inherently suboptimal. The work described in Papers A, B and C is motivated by the fact that humans perform better at speech recognition than machines, particularly in noisy environments. The goal is to use knowledge of human perception in the selection and optimization of speech features for speech recognition. These papers show that maximizing the similarity between the Euclidean geometry of the features and the geometry of the perceptual domain is a powerful tool for selecting or optimizing features. Experiments with a practical speech recognizer confirm the validity of the principle. An approach to improving mel-frequency cepstral coefficients (MFCCs) through offline optimization is also presented. The method has three advantages: i) it is computationally inexpensive, ii) it does not use the auditory model directly, thus avoiding its computational cost, and iii) importantly, it provides better recognition performance than traditional MFCCs in both clean and noisy conditions.

The second task concerns automatic pronunciation error detection. The research, described in Papers D, E and F, is motivated by the observation that almost all native speakers perceive, with relative ease, the acoustic characteristics of their own language when it is produced by speakers of the language. Small variations within a phoneme category, sometimes different for different phonemes, do not significantly change the perception of the language's own sounds. Several methods are introduced, based on similarity measures between the Euclidean space spanned by the acoustic representations of the speech signal and the Euclidean space spanned by an auditory model output, to identify the problematic phonemes for a given speaker. The methods are tested on groups of speakers of different native languages and evaluated against a theoretical linguistic study, showing that they capture many of the problematic phonemes that speakers of each language mispronounce. Finally, a listening test on the same dataset verifies the validity of these methods.
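The geometry-matching principle shared by both parts of the thesis can be sketched as follows. This is a minimal illustrative sketch, not the thesis implementation: the auditory model and the sensitivity-matrix machinery are stood in for by a toy "perceptual" mapping, the similarity measure is a plain correlation of pairwise Euclidean distances, and the exhaustive search is only feasible at toy dimensionality. All names here are hypothetical.

```python
import itertools
import numpy as np

def geometry_similarity(feat, perc):
    """Correlation between pairwise Euclidean distances in the feature
    space and in the perceptual (auditory-model output) space."""
    d_feat = np.linalg.norm(feat[:, None, :] - feat[None, :, :], axis=-1)
    d_perc = np.linalg.norm(perc[:, None, :] - perc[None, :, :], axis=-1)
    iu = np.triu_indices(len(feat), k=1)  # unique frame pairs only
    return np.corrcoef(d_feat[iu], d_perc[iu])[0, 1]

def select_features(feat, perc, k):
    """Exhaustively pick the k feature dimensions whose Euclidean
    geometry best matches the perceptual domain (toy-scale search)."""
    best, best_sim = None, -np.inf
    for subset in itertools.combinations(range(feat.shape[1]), k):
        sim = geometry_similarity(feat[:, subset], perc)
        if sim > best_sim:
            best, best_sim = subset, sim
    return best, best_sim

# Synthetic check: the "perceptual" domain is a rotated, slightly noisy
# copy of feature dimensions 0 and 1, so selection should recover them.
rng = np.random.default_rng(0)
feat = rng.standard_normal((60, 5))
rot, _ = np.linalg.qr(rng.standard_normal((2, 2)))  # distance-preserving
perc = feat[:, :2] @ rot + 0.01 * rng.standard_normal((60, 2))
subset, sim = select_features(feat, perc, k=2)
```

Note that no class labels are used anywhere in the selection, which mirrors the label-free property the papers emphasize.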

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2012. xxi, 79 p.
Series
Trita-CSC-A, ISSN 1653-5723; 2012:10
Keyword [en]
feature extraction, feature selection, auditory models, MFCCs, speech recognition, distortion measures, perturbation analysis, psychoacoustics, human perception, sensitivity matrix, pronunciation error detection, phoneme, second language, perceptual assessment
National Category
Computer Science; Signal Processing; Media and Communication Technology; Other Computer and Information Science
Identifiers
URN: urn:nbn:se:kth:diva-102321
ISBN: 978-91-7501-468-5 (print)
OAI: oai:DiVA.org:kth-102321
DiVA: diva2:552336
Public defence
2012-10-05, A2, Östermalmsgatan 26, KTH, Stockholm, 10:00 (English)
Opponent
Supervisors
Projects
European Union FP6-034362 research project ACORNS
Computer-Animated language Teachers (CALATea)
Note

QC 20120914

Available from: 2012-09-14 Created: 2012-09-13 Last updated: 2012-09-14. Bibliographically approved.
List of papers
1. Auditory-model based robust feature selection for speech recognition
2010 (English). In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 127, no. 2, EL73-EL79 p. Article in journal (Refereed), Published
Abstract [en]

 It is shown that robust dimension-reduction of a feature set for speech recognition can be based on a model of the human auditory system. Whereas conventional methods optimize classification performance, the proposed method exploits knowledge implicit in the auditory periphery, inheriting its robustness. Features are selected to maximize the similarity of the Euclidean geometry of the feature domain and the perceptual domain. Recognition experiments using mel-frequency cepstral coefficients (MFCCs) confirm the effectiveness of the approach, which does not require labeled training data. For noisy data the method outperforms commonly used discriminant-analysis based dimension-reduction methods that rely on labeling. The results indicate that selecting MFCCs in their natural order results in subsets with good performance.

Keyword
feature selection, auditory model, sensitivity matrix, speech recognition
National Category
Signal Processing
Identifiers
urn:nbn:se:kth:diva-11467 (URN)
10.1121/1.3284545 (DOI)
000274322200010 ()
2-s2.0-76349109466 (Scopus ID)
Note
QC 20100831. Updated from submitted to published (20100831). Available from: 2009-11-13 Created: 2009-11-13 Last updated: 2017-12-12. Bibliographically approved.
2. Auditory model based optimization of MFCCs improves automatic speech recognition performance
2009 (English). In: INTERSPEECH 2009: 10th Annual Conference of the International Speech Communication Association, 2009, 2943-2946 p. Conference paper, Published paper (Refereed)
Abstract [en]

Using a spectral auditory model along with perturbation based analysis, we develop a new framework to optimize a set of features such that it emulates the behavior of the human auditory system. The optimization is carried out in an off-line manner based on the conjecture that the local geometries of the feature domain and the perceptual auditory domain should be similar. Using this principle, we modify and optimize the static mel frequency cepstral coefficients (MFCCs) without considering any feedback from the speech recognition system. We show that improved recognition performance is obtained for any environmental condition, clean as well as noisy.
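The perturbation analysis this optimization rests on can be illustrated with a finite-difference sensitivity matrix. A sketch under stated assumptions: `auditory_map` is an arbitrary smooth stand-in, not a real spectral auditory model, and the Jacobian check only demonstrates the local-metric idea; how the paper then reshapes the MFCCs offline is not reproduced here.

```python
import numpy as np

def sensitivity_matrix(f, x, eps=1e-6):
    """Finite-difference Jacobian J of a mapping f at the point x.
    J.T @ J acts as a local metric: for a small perturbation d,
    ||f(x + d) - f(x)||^2 ~= d @ (J.T @ J) @ d, so two mappings whose
    metrics agree have locally similar geometries."""
    x = np.asarray(x, dtype=float)
    fx = np.asarray(f(x), dtype=float)
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        xp = x.copy()
        xp[i] += eps                      # perturb one input coordinate
        J[:, i] = (np.asarray(f(xp), dtype=float) - fx) / eps
    return J

# Toy smooth mapping (hypothetical stand-in for an auditory model).
def auditory_map(x):
    return np.array([x[0] ** 2, x[0] * x[1], np.sin(x[1])])

x0 = np.array([1.0, 2.0])
J = sensitivity_matrix(auditory_map, x0)
# Analytic Jacobian at x0, for comparison with the estimate:
J_exact = np.array([[2.0, 0.0], [2.0, 1.0], [0.0, np.cos(2.0)]])
```

Because such a matrix can be computed once per operating point, the matching can indeed be carried out offline, without calling the auditory model at recognition time.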

Keyword
ASR, Auditory model, MFCC
National Category
Signal Processing
Identifiers
urn:nbn:se:kth:diva-11468 (URN)
000276842801277 ()
2-s2.0-70450221097 (Scopus ID)
978-1-61567-692-7 (ISBN)
Conference
10th Annual Conference of the International Speech Communication Association, INTERSPEECH 2009; Brighton; 6 September 2009 - 10 September 2009
Note
QC 20101015. Available from: 2009-11-13 Created: 2009-11-13 Last updated: 2012-09-14. Bibliographically approved.
3. Selecting static and dynamic features using an advanced auditory model for speech recognition
2010 (English). In: Proceedings 2010 IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, 2010, 4590-4593 p. Conference paper, Published paper (Refereed)
Abstract [en]

We describe a method to select features for speech recognition that is based on a quantitative model of the human auditory periphery. The method maximizes the similarity of the geometry of the space spanned by the subset of features and the geometry of the space spanned by the auditory model output. The selection method uses a spectro-temporal auditory model that captures both frequency- and time-domain masking. The selection method is blind to the meaning of speech and does not require annotated speech data. We apply the method to the selection of a subset of features from a conventional set consisting of mel cepstra and their first-order and second-order time derivatives. Although our method uses only knowledge of the human auditory periphery, the experimental results show that it performs significantly better than feature-reduction algorithms based on linear and heteroscedastic discriminant analysis that require training with annotated speech data.

Place, publisher, year, edition, pages
IEEE, 2010
Series
Proceedings of the IEEE international conference on acoustics, speech and signal processing, ISSN 1520-6149
Keyword
feature selection, dimension reduction, auditory model, perception, sensitivity analysis, distortion, speech recognition
National Category
Signal Processing
Identifiers
urn:nbn:se:kth:diva-11469 (URN)
10.1109/ICASSP.2010.5495648 (DOI)
000287096004068 ()
2-s2.0-78049406665 (Scopus ID)
978-1-4244-4296-6 (ISBN)
Conference
2010 IEEE International Conference on Acoustics, Speech, and Signal Processing, March 14–19, 2010, Dallas, Texas, U.S.A.
Note
QC 20110415; winner of the Best Student Paper Award (1st place). Available from: 2009-11-13 Created: 2009-11-13 Last updated: 2012-09-14. Bibliographically approved.
4. Perceptual differentiation modeling explains phoneme mispronunciation by non-native speakers
2011 (English). In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 2011, 5704-5707 p. Conference paper, Published paper (Refereed)
Abstract [en]

One of the difficulties in second language (L2) learning is the weakness in discriminating between the acoustic diversity within an L2 phoneme category and that between different categories. In this paper, we describe a general method to quantitatively measure the perceptual difference between a group of native speakers and individual non-native speakers. Normally, this task requires subjective listening tests and/or a thorough linguistic study. We instead use a fully automated method based on a psychoacoustic auditory model. For a given phoneme class, we measure the similarity between the Euclidean space spanned by the power spectrum of a native speech signal and the Euclidean space spanned by the auditory model output. We do the same for a non-native speech signal. By comparing the two similarity measurements, we find problematic phonemes for a given speaker. To validate our method, we apply it to different groups of non-native speakers of various first language (L1) backgrounds. Our results are verified against theoretical findings from linguistic studies.
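The native-versus-non-native comparison step can be sketched as follows. This is an illustrative toy, not the paper's implementation: real power spectra and auditory-model outputs are replaced by synthetic frames, the similarity measure is a plain correlation of pairwise distances, and the flagging margin is arbitrary.

```python
import numpy as np

def space_similarity(spec, aud):
    """Correlation of pairwise Euclidean distances between the spectral
    representation and the auditory-model output for one phoneme."""
    d_s = np.linalg.norm(spec[:, None, :] - spec[None, :, :], axis=-1)
    d_a = np.linalg.norm(aud[:, None, :] - aud[None, :, :], axis=-1)
    iu = np.triu_indices(len(spec), k=1)
    return np.corrcoef(d_s[iu], d_a[iu])[0, 1]

def problematic_phonemes(native, learner, margin=0.3):
    """Flag phonemes whose learner similarity falls more than `margin`
    below the native similarity. Both arguments map a phoneme label to
    a (spectral_frames, auditory_frames) pair."""
    return [ph for ph in native
            if space_similarity(*native[ph])
               - space_similarity(*learner[ph]) > margin]

# Synthetic check: the learner's /a/ auditory output is unrelated to its
# spectrum (a poorly separated phoneme); /i/ behaves like the native data.
rng = np.random.default_rng(1)
def phoneme(ok):
    spec = rng.standard_normal((30, 4))
    if ok:
        aud = spec + 0.05 * rng.standard_normal((30, 4))
    else:
        aud = rng.standard_normal((30, 4))
    return spec, aud

native = {"a": phoneme(True), "i": phoneme(True)}
learner = {"a": phoneme(False), "i": phoneme(True)}
flagged = problematic_phonemes(native, learner)
```

The point of the construction is that no listening test is needed at detection time; human judgments enter only when validating the flagged phonemes.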

Series
International Conference on Acoustics Speech and Signal Processing ICASSP, ISSN 1520-6149
Keyword
second language learning, auditory model, distortion measure, perceptual differentiation ratio, phoneme
National Category
Other Computer and Information Science Computer Science Signal Processing
Identifiers
urn:nbn:se:kth:diva-39053 (URN)
10.1109/ICASSP.2011.5947655 (DOI)
000296062406103 ()
2-s2.0-80051656916 (Scopus ID)
978-1-4577-0537-3 (ISBN)
Conference
36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011; Prague; 22 May 2011 through 27 May 2011
Note
QC 20111117. Available from: 2011-09-07 Created: 2011-09-07 Last updated: 2012-09-14. Bibliographically approved.
5. On mispronunciation analysis of individual foreign speakers using auditory periphery models
2013 (English). In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 55, no. 5, 691-706 p. Article in journal (Refereed), Published
Abstract [en]

In second language (L2) learning, a major difficulty is to discriminate between the acoustic diversity within an L2 phoneme category and that between different categories. We propose a general method for automatic diagnostic assessment of the pronunciation of nonnative speakers based on models of the human auditory periphery. Considering each phoneme class separately, the geometric shape similarity between the native auditory domain and the non-native speech domain is measured. The phonemes that deviate the most from the native pronunciation for a set of L2 speakers are detected by comparing the geometric shape similarity measure with that calculated for native speakers on the same phonemes. To evaluate the system, we have tested it with different non-native speaker groups from various language backgrounds. The experimental results are in accordance with linguistic findings and human listeners' ratings, particularly when both the spectral and temporal cues of the speech signal are utilized in the pronunciation analysis.

Keyword
second language learning, auditory model, distortion measure, perceptual assessment, pronunciation error detection, phoneme
National Category
Signal Processing Other Computer and Information Science
Identifiers
urn:nbn:se:kth:diva-102319 (URN)
10.1016/j.specom.2013.01.004 (DOI)
000318744800008 ()
2-s2.0-84876245465 (Scopus ID)
Funder
Swedish Research Council, 80449001
Note

QC 20130614. Updated from submitted to published.

Available from: 2012-09-13 Created: 2012-09-13 Last updated: 2017-12-07. Bibliographically approved.
6. Auditory and Dynamic Modeling Paradigms to Detect L2 Mispronunciations
2012 (English). In: 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012, Vol. 1, 2012, 898-901 p. Conference paper, Published paper (Refereed)
Abstract [en]

This paper expands our previous work on automatic pronunciation error detection that exploits knowledge from psychoacoustic auditory models. The new system has two additional important features: auditory and acoustic processing of the temporal cues of the speech signal, and classification feedback from a trained linear dynamic model. We also perform a pronunciation analysis by treating the task as a classification problem. Finally, we evaluate the proposed methods by conducting a listening test on the same speech material and comparing the judgments of the listeners with those of the methods. The automatic analysis based on spectro-temporal cues is shown to have the best agreement with the human evaluation, particularly with that of language teachers, and with previous linguistic studies.

Keyword
L2 pronunciation error, auditory model, linear dynamic model, distortion measure, phoneme
National Category
Signal Processing Other Computer and Information Science Computer Science
Identifiers
urn:nbn:se:kth:diva-102317 (URN)
000320827200225 ()
2-s2.0-84878407679 (Scopus ID)
978-1-62276-759-5 (ISBN)
Conference
13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012; Portland, OR; United States; 9 September 2012 through 13 September 2012
Note

QC 20120914

Available from: 2012-09-13 Created: 2012-09-13 Last updated: 2013-08-22. Bibliographically approved.

Open Access in DiVA

fulltext (PDF, 1228 kB)
By author/editor
Koniaris, Christos
