A study on selecting and optimizing perceptually relevant features for automatic speech recognition
Koniaris, Christos. KTH, School of Electrical Engineering (EES).
2009 (English). Licentiate thesis, comprehensive summary (Other academic).
Abstract [en]

The performance of an automatic speech recognition (ASR) system strongly depends on the feature representation used in the front-end. If the extracted features do not include all relevant information, the performance of the classification stage is inherently suboptimal. This work is motivated by the fact that humans perform better at speech recognition than machines, particularly in noisy environments. The goal of this thesis is to make use of knowledge of human perception in the selection and optimization of speech features for speech recognition.

Papers A and C show that robust feature selection for speech recognition can be based on models of the human auditory system. These papers show that maximizing the similarity between the Euclidean geometry of the features and the geometry of the perceptual domain is a powerful tool for selecting features. Whereas conventional methods optimize classification performance, the new feature selection method exploits knowledge implicit in the human auditory system, inheriting its robustness to varying environmental conditions. The proposed algorithm shows how the feature set can be learned from perception alone, by establishing a measure of goodness for a given feature based on a perturbation analysis and on distortion criteria derived from psycho-acoustic models. Experiments with a practical speech recognizer confirm the validity of the principle.
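
As a rough illustration of this selection principle (not the actual algorithm of Papers A and C, which relies on a sensitivity matrix and psycho-acoustic distortion criteria), the Python sketch below greedily picks feature dimensions whose frame-to-frame Euclidean distances correlate best with distances in an auditory-model output; the toy data, the correlation-based score, and all variable names are illustrative stand-ins.

import numpy as np

def pairwise_dists(X):
    # Euclidean distances between all rows (frames) of X.
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def geometry_similarity(feature_subset, perceptual):
    # Proxy "goodness": correlation between frame-to-frame distances in the
    # candidate feature subspace and in the perceptual (auditory-model) domain.
    d_f = pairwise_dists(feature_subset).ravel()
    d_p = pairwise_dists(perceptual).ravel()
    return np.corrcoef(d_f, d_p)[0, 1]

def select_features(feats, perceptual, k):
    # Greedily add the feature dimension that most improves the proxy score.
    selected, remaining = [], list(range(feats.shape[1]))
    for _ in range(k):
        scores = [geometry_similarity(feats[:, selected + [j]], perceptual)
                  for j in remaining]
        selected.append(remaining.pop(int(np.argmax(scores))))
    return selected

# Toy usage with random stand-ins for MFCC frames and auditory-model outputs.
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((200, 13))                  # 200 frames, 13 MFCCs
perceptual = np.tanh(mfcc[:, :5] @ rng.standard_normal((5, 8)))
print(select_features(mfcc, perceptual, k=5))

Note that, as in the thesis, no labeled training data enters this selection loop; only the feature frames and a perceptual-domain representation of the same frames are needed.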

In Paper B, the perceptually relevant objective criterion is used to define new features. Again, the motivation originates in the human peripheral auditory system, which shapes the input speech signal before it reaches the central auditory system of the brain, where recognition takes place. While many feature extraction techniques incorporate knowledge of the auditory system, the procedures are usually designed for a specific task and do not reflect the most recent knowledge of human hearing. Paper B presents an approach to improve mel frequency cepstrum coefficients (MFCCs) through off-line optimization. The method has three advantages: i) it is computationally inexpensive, ii) it does not use the auditory model directly, thus avoiding the model's computational cost, and iii) importantly, it provides better recognition performance than traditional MFCCs in both clean and noisy conditions.
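
A minimal sketch of what such an off-line optimization could look like, assuming (beyond what the abstract states) a simple linear transform of the static MFCCs fitted by stochastic gradient descent so that feature-space distances between frames approximate perceptual-domain distances; the variables, learning rate, and iteration count are assumptions for illustration, not the procedure of Paper B.

import numpy as np

def optimize_transform(mfcc, perceptual, iters=2000, lr=1e-3, seed=0):
    # Fit W so that ||W (x_i - x_j)|| matches the perceptual distance
    # between frames i and j, over randomly sampled frame pairs.
    rng = np.random.default_rng(seed)
    n, d = mfcc.shape
    W = np.eye(d)                            # start from plain MFCCs
    for _ in range(iters):
        i, j = rng.integers(0, n, size=2)    # random frame pair
        x = mfcc[i] - mfcc[j]
        target = np.linalg.norm(perceptual[i] - perceptual[j])
        y = W @ x
        dist = np.linalg.norm(y) + 1e-12
        grad = 2.0 * (dist - target) * np.outer(y / dist, x)  # d/dW of (dist - target)^2
        W -= lr * grad
    return W

# Toy usage; the recognizer would later consume mfcc @ W.T with no auditory
# model in the loop, which is what makes the optimization "off-line".
rng = np.random.default_rng(1)
mfcc = rng.standard_normal((300, 13))
perceptual = np.tanh(mfcc @ rng.standard_normal((13, 20)))
W = optimize_transform(mfcc, perceptual)
optimized_feats = mfcc @ W.T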

 

Place, publisher, year, edition, pages
Stockholm: KTH, 2009. xv, 37 p.
Series
Trita-EE, ISSN 1653-5146; 2009:049
Identifiers
URN: urn:nbn:se:kth:diva-11470
ISBN: 978-91-7415-478-8 (print)
OAI: oai:DiVA.org:kth-11470
DiVA: diva2:276957
Presentation
2009-11-27, E2, KTH, Lindstedtsvägen 3, Stockholm, 10:00 (English)
Available from: 2009-11-13. Created: 2009-11-13. Last updated: 2010-10-15. Bibliographically approved.
List of papers
1. Auditory-model based robust feature selection for speech recognition
2010 (English). In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 127, no. 2, pp. EL73-EL79. Article in journal (Refereed). Published.
Abstract [en]

 It is shown that robust dimension-reduction of a feature set for speech recognition can be based on a model of the human auditory system. Whereas conventional methods optimize classification performance, the proposed method exploits knowledge implicit in the auditory periphery, inheriting its robustness. Features are selected to maximize the similarity of the Euclidean geometry of the feature domain and the perceptual domain. Recognition experiments using mel-frequency cepstral coefficients (MFCCs) confirm the effectiveness of the approach, which does not require labeled training data. For noisy data the method outperforms commonly used discriminant-analysis based dimension-reduction methods that rely on labeling. The results indicate that selecting MFCCs in their natural order results in subsets with good performance.

Keyword
feature selection, auditory model, sensitivity matrix, speech recognition
National Category
Signal Processing
Identifiers
URN: urn:nbn:se:kth:diva-11467
DOI: 10.1121/1.3284545
ISI: 000274322200010
Scopus ID: 2-s2.0-76349109466
Note
QC 20100831. Updated from submitted to published (20100831). Available from: 2009-11-13. Created: 2009-11-13. Last updated: 2017-12-12. Bibliographically approved.
2. Auditory model based optimization of MFCCs improves automatic speech recognition performance
2009 (English). In: INTERSPEECH 2009: 10th Annual Conference of the International Speech Communication Association 2009, 2009, pp. 2943-2946. Conference paper (Refereed). Published.
Abstract [en]

Using a spectral auditory model along with perturbation based analysis, we develop a new framework to optimize a set of features such that it emulates the behavior of the human auditory system. The optimization is carried out in an off-line manner based on the conjecture that the local geometries of the feature domain and the perceptual auditory domain should be similar. Using this principle, we modify and optimize the static mel frequency cepstral coefficients (MFCCs) without considering any feedback from the speech recognition system. We show that improved recognition performance is obtained for any environmental condition, clean as well as noisy.

Keyword
ASR, Auditory model, MFCC
National Category
Signal Processing
Identifiers
URN: urn:nbn:se:kth:diva-11468
ISI: 000276842801277
Scopus ID: 2-s2.0-70450221097
ISBN: 978-1-61567-692-7
Conference
10th Annual Conference of the International Speech Communication Association, INTERSPEECH 2009; Brighton; 6 September 2009 - 10 September 2009
Note
QC 20101015. Available from: 2009-11-13. Created: 2009-11-13. Last updated: 2012-09-14. Bibliographically approved.
3. Selecting static and dynamic features using an advanced auditory model for speech recognition
2010 (English). In: Proceedings 2010 IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, 2010, pp. 4590-4593. Conference paper (Refereed). Published.
Abstract [en]

We describe a method to select features for speech recognition that is based on a quantitative model of the human auditory periphery. The method maximizes the similarity of the geometry of the space spanned by the subset of features and the geometry of the space spanned by the auditory model output. The selection method uses a spectro-temporal auditory model that captures both frequency- and time-domain masking. The selection method is blind to the meaning of speech and does not require annotated speech data. We apply the method to the selection of a subset of features from a conventional set consisting of mel cepstra and their first-order and second-order time derivatives. Although our method uses only knowledge of the human auditory periphery, the experimental results show that it performs significantly better than feature-reduction algorithms based on linear and heteroscedastic discriminant analysis that require training with annotated speech data.
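
For concreteness, the sketch below assembles the stacked static-plus-dynamic feature vector that such a selection would operate on. The delta computation is the standard regression formula; the selection step itself is only indicated in a comment (it could reuse the illustrative greedy selector sketched in the thesis abstract above) and is not the method of this paper.

import numpy as np

def deltas(feats, N=2):
    # First-order time derivatives via the usual regression window of +/- N frames.
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    T = len(feats)
    return sum(n * (padded[N + n:T + N + n] - padded[N - n:T + N - n])
               for n in range(1, N + 1)) / denom

mfcc = np.random.default_rng(2).standard_normal((200, 13))  # toy static MFCCs
d1 = deltas(mfcc)                     # first-order dynamics
d2 = deltas(d1)                       # second-order dynamics
stacked = np.hstack([mfcc, d1, d2])   # 39-dimensional candidate feature set
# A subset of these 39 dimensions would then be chosen by maximizing the
# geometric similarity to the spectro-temporal auditory-model output, e.g.
# select_features(stacked, auditory_output, k) with the earlier toy selector.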

Place, publisher, year, edition, pages
IEEE, 2010
Series
Proceedings of the IEEE international conference on acoustics, speech and signal processing, ISSN 1520-6149
Keyword
feature selection, dimension reduction, auditory model, perception, sensitivity analysis, distortion, speech recognition
National Category
Signal Processing
Identifiers
URN: urn:nbn:se:kth:diva-11469
DOI: 10.1109/ICASSP.2010.5495648
ISI: 000287096004068
Scopus ID: 2-s2.0-78049406665
ISBN: 978-1-4244-4296-6
Conference
2010 IEEE International Conference on Acoustics, Speech, and Signal Processing, March 14–19, 2010, Dallas, Texas, U.S.A.
Note
QC 20110415; winner of the Best Student Paper Award (1st place). Available from: 2009-11-13. Created: 2009-11-13. Last updated: 2012-09-14. Bibliographically approved.

Open Access in DiVA

fulltext (988 kB, application/pdf)
