Selecting static and dynamic features using an advanced auditory model for speech recognition
KTH, School of Electrical Engineering (EES), Sound and Image Processing.
KTH, School of Electrical Engineering (EES), Sound and Image Processing. ORCID iD: 0000-0003-2638-6047
KTH, School of Electrical Engineering (EES), Sound and Image Processing.
2010 (English)
In: Proceedings 2010 IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, 2010, 4590-4593 p.
Conference paper, Published paper (Refereed)
Abstract [en]

We describe a method to select features for speech recognition that is based on a quantitative model of the human auditory periphery. The method maximizes the similarity of the geometry of the space spanned by the subset of features and the geometry of the space spanned by the auditory model output. The selection method uses a spectro-temporal auditory model that captures both frequency- and time-domain masking. The selection method is blind to the meaning of speech and does not require annotated speech data. We apply the method to the selection of a subset of features from a conventional set consisting of mel cepstra and their first-order and second-order time derivatives. Although our method uses only knowledge of the human auditory periphery, the experimental results show that it performs significantly better than feature-reduction algorithms based on linear and heteroscedastic discriminant analysis that require training with annotated speech data.
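
The selection principle stated in the abstract, choosing the feature subset whose geometry best matches that of the auditory model output, can be illustrated with a toy sketch. This is not the authors' algorithm: the spectro-temporal auditory model is replaced by a stand-in array, the similarity measure (correlation of pairwise squared distances) and the greedy search are assumptions made for illustration, and the names `pairwise_sq_dists`, `geometry_similarity` and `greedy_select` are hypothetical.

```python
import numpy as np


def pairwise_sq_dists(x):
    """Squared Euclidean distances between all rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def geometry_similarity(features, auditory):
    """Correlation between pairwise distances in the two spaces.

    Assumed similarity measure for this sketch only.
    """
    upper = np.triu_indices(features.shape[0], k=1)
    d_feat = pairwise_sq_dists(features)[upper]
    d_aud = pairwise_sq_dists(auditory)[upper]
    return np.corrcoef(d_feat, d_aud)[0, 1]


def greedy_select(features, auditory, k):
    """Greedily pick k feature columns that best match the auditory geometry."""
    selected = []
    remaining = list(range(features.shape[1]))
    for _ in range(k):
        best = max(
            remaining,
            key=lambda j: geometry_similarity(features[:, selected + [j]], auditory),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: a stand-in "auditory model output" with three dimensions.
rng = np.random.default_rng(0)
auditory = rng.normal(size=(200, 3))
noise = rng.normal(size=(200, 3))

# Candidate features: columns 0, 2, 4 copy the auditory dimensions,
# columns 1, 3, 5 are unrelated noise.
features = np.empty((200, 6))
features[:, 0::2] = auditory
features[:, 1::2] = noise

selected = greedy_select(features, auditory, k=3)
similarity = geometry_similarity(features[:, selected], auditory)
```

No labels are used anywhere in the sketch, which mirrors the abstract's point that the selection needs no annotated speech data; only the two sets of distances are compared.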

Place, publisher, year, edition, pages
IEEE, 2010. 4590-4593 p.
Series
Proceedings of the IEEE international conference on acoustics, speech and signal processing, ISSN 1520-6149
Keyword [en]
feature selection, dimension reduction, auditory model, perception, sensitivity analysis, distortion, speech recognition
National Category
Signal Processing
Identifiers
URN: urn:nbn:se:kth:diva-11469
DOI: 10.1109/ICASSP.2010.5495648
ISI: 000287096004068
Scopus ID: 2-s2.0-78049406665
ISBN: 978-1-4244-4296-6 (print)
OAI: oai:DiVA.org:kth-11469
DiVA: diva2:276953
Conference
2010 IEEE International Conference on Acoustics, Speech, and Signal Processing, March 14–19, 2010, Dallas, Texas, U.S.A.
Note
QC 20110415; WINNER OF THE BEST STUDENT PAPER AWARD (1st place).
Available from: 2009-11-13 Created: 2009-11-13 Last updated: 2012-09-14 Bibliographically approved
In thesis
1. A study on selecting and optimizing perceptually relevant features for automatic speech recognition
2009 (English) Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

The performance of an automatic speech recognition (ASR) system strongly depends on the representation used for the front-end. If the extracted features do not include all relevant information, the performance of the classification stage is inherently suboptimal. This work is motivated by the fact that humans perform better at speech recognition than machines, particularly for noisy environments. The goal of this thesis is to make use of knowledge of human perception in the selection and optimization of speech features for speech recognition.

Papers A and C show that robust feature selection for speech recognition can be based on models of the human auditory system. These papers show that maximizing the similarity of the Euclidean geometry of the features to the geometry of the perceptual domain is a powerful tool to select features. Whereas conventional methods optimize classification performance, the new feature selection method exploits knowledge implicit in the human auditory system, inheriting its robustness to varying environmental conditions. The proposed algorithm shows how the feature set can be learned from perception alone, by establishing a measure of goodness for a given feature based on a perturbation analysis and distortion criteria derived from psycho-acoustic models. Experiments with a practical speech recognizer confirm the validity of the principle.

In Paper B the perceptually relevant objective criterion is used to define new features. Again the motivation originates in the human peripheral auditory system, which plays a major role in processing the input speech signal before it reaches the central auditory system of the brain, where recognition occurs. While many feature extraction techniques incorporate knowledge of the auditory system, the procedures are usually designed for a specific task, and they lack the most recently gained knowledge of human hearing. Paper B shows an approach to improve mel frequency cepstrum coefficients (MFCCs) through off-line optimization. The method has three advantages: i) it is computationally inexpensive, ii) it does not use the auditory model directly, thus avoiding its computational cost, and iii) importantly, it provides better recognition performance than traditional MFCCs for both clean and noisy conditions.

 

Place, publisher, year, edition, pages
Stockholm: KTH, 2009. xv, 37 p.
Series
Trita-EE, ISSN 1653-5146 ; 2009:049
Identifiers
urn:nbn:se:kth:diva-11470 (URN)
978-91-7415-478-8 (ISBN)
Presentation
2009-11-27, E2, KTH, Lindstedtsvägen 3, Stockholm, 10:00 (English)
Available from: 2009-11-13 Created: 2009-11-13 Last updated: 2010-10-15 Bibliographically approved
2. Perceptually motivated speech recognition and mispronunciation detection
2012 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This doctoral thesis is the result of a research effort performed in two fields of speech technology, i.e., speech recognition and mispronunciation detection. Although the two areas are clearly distinguishable, the proposed approaches share a common hypothesis based on psychoacoustic processing of speech signals. The conjecture implies that the human auditory periphery provides a relatively good separation of different sound classes. Hence, it is possible to use recent findings from psychoacoustic perception together with mathematical and computational tools to model the auditory sensitivities to small speech signal changes.

The performance of an automatic speech recognition system strongly depends on the representation used for the front-end. If the extracted features do not include all relevant information, the performance of the classification stage is inherently suboptimal. The work described in Papers A, B and C is motivated by the fact that humans perform better at speech recognition than machines, particularly for noisy environments. The goal is to make use of knowledge of human perception in the selection and optimization of speech features for speech recognition. These papers show that maximizing the similarity of the Euclidean geometry of the features to the geometry of the perceptual domain is a powerful tool to select or optimize features. Experiments with a practical speech recognizer confirm the validity of the principle. The papers also present an approach to improve mel frequency cepstrum coefficients (MFCCs) through offline optimization. The method has three advantages: i) it is computationally inexpensive, ii) it does not use the auditory model directly, thus avoiding its computational cost, and iii) importantly, it provides better recognition performance than traditional MFCCs for both clean and noisy conditions.

The second task concerns automatic pronunciation error detection. The research, described in Papers D, E and F, is motivated by the observation that almost all native speakers perceive, relatively easily, the acoustic characteristics of their own language when it is produced by speakers of the language. Small variations within a phoneme category, sometimes different for various phonemes, do not significantly change the perception of the language’s own sounds. Several methods are introduced to identify the problematic phonemes for a given speaker, based on similarity measures between the Euclidean space spanned by the acoustic representations of the speech signal and the Euclidean space spanned by an auditory model output. The methods are tested for groups of speakers from different languages and evaluated according to a theoretical linguistic study, showing that they can capture many of the problematic phonemes that speakers from each language mispronounce. Finally, a listening test on the same dataset verifies the validity of these methods.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2012. xxi, 79 p.
Series
Trita CSC-A, ISSN 1653-5723 ; 2012:10
Keyword
feature extraction, feature selection, auditory models, MFCCs, speech recognition, distortion measures, perturbation analysis, psychoacoustics, human perception, sensitivity matrix, pronunciation error detection, phoneme, second language, perceptual assessment
National Category
Computer Science; Signal Processing; Media and Communication Technology; Other Computer and Information Science
Identifiers
urn:nbn:se:kth:diva-102321 (URN)
978-91-7501-468-5 (ISBN)
Public defence
2012-10-05, A2, Östermalmsgatan 26, KTH, Stockholm, 10:00 (English)
Projects
European Union FP6-034362 research project ACORNS
Computer-Animated language Teachers (CALATea)
Note

QC 20120914

Available from: 2012-09-14 Created: 2012-09-13 Last updated: 2012-09-14 Bibliographically approved

Open Access in DiVA

No full text

Other links

Publisher's full text
Scopus
http://ieeexplore.ieee.org/Xplore/login.jsp?url=http%3A%2F%2Fieeexplore.ieee.org%2Fstamp%2Fstamp.jsp%3Farnumber%3D05495648&authDecision=-203

By author/editor
Koniaris, Christos; Chatterjee, Saikat; Kleijn, W. Bastiaan
By organisation
Sound and Image Processing
Signal Processing
