Accounting for Individual Speaker Properties in Automatic Speech Recognition
KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. (TAL)
2010 (English) Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

In this work, speaker characteristic modeling has been applied in the fields of automatic speech recognition (ASR) and automatic speaker verification (ASV). In ASR, a key problem is that acoustic mismatch between training and test conditions degrades classification performance. In this work, a child exemplifies a speaker not represented in the training data, and methods to reduce the spectral mismatch are devised and evaluated. To reduce the acoustic mismatch, predictive modeling based on spectral speech transformation is applied. Following this approach, a model suitable for a target speaker who is not well represented in the training data is estimated and synthesized by applying vocal tract predictive modeling (VTPM). In this thesis, the traditional static modeling on the utterance level is extended to dynamic modeling. This is accomplished by also operating on sub-utterance units, such as phonemes, phone realizations, sub-phone realizations and sound frames.

Initial experiments show that adaptation of an acoustic model trained on adult speech significantly reduced the word error rate of ASR for children, but not to the level of a model trained on children's speech. Multi-speaker-group training provided an acoustic model that performed recognition for both adults and children within the same model at almost the same accuracy as speaker-group dedicated models, with no added model complexity. In the analysis of the cause of errors, the body height of the child was shown to be correlated with word error rate.

A further result is that the computationally demanding iterative recognition process in standard VTLN can be replaced by synthetically extending the vocal tract length distribution in the training data. A multi-warp model is trained on the extended data and recognition is performed in a single pass. The accuracy is similar to that of the standard technique.

A concluding experiment in ASR shows that the word error rate can be reduced by extending a static vocal tract length compensation parameter into a temporal parameter track. A key component to reach this improvement was provided by a novel joint two-level optimization process. In the process, the track was determined as a composition of a static and a dynamic component, which were simultaneously optimized on the utterance and sub-utterance level respectively. This had the principal advantage of limiting the modulation amplitude of the track to what is realistic for an individual speaker. The recognition error rate was reduced by 10% relative compared with that of a standard utterance-specific estimation technique.

The techniques devised and evaluated can also be applied to other speaker characteristic properties, which exhibit a dynamic nature.

An excursion into ASV led to the proposal of a statistical speaker population model. The model represents an alternative approach for determining the reject/accept threshold in an ASV system instead of the commonly used direct estimation on a set of client and impostor utterances. This is especially valuable in applications where a low false reject or false accept rate is required. In these cases, the number of errors is often too small to estimate a reliable threshold using the direct method. The results are encouraging but need to be verified on a larger database.
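The idea of setting a threshold from a fitted model rather than from counted errors can be illustrated with a minimal sketch. It assumes, purely for illustration, that client scores are Gaussian and that scores below the threshold are rejected; the abstract does not specify the form of the thesis's population model, and the function name `threshold_for_target_frr` is hypothetical.

```python
from statistics import NormalDist


def threshold_for_target_frr(client_scores, target_frr):
    """Return the score threshold giving the target false reject rate.

    Assumption (not from the thesis): client scores follow a normal
    distribution, and a trial is rejected when its score falls below
    the threshold.
    """
    # Fit mean and sample standard deviation to the client scores.
    model = NormalDist.from_samples(client_scores)
    # The threshold is the quantile of the fitted model at the target rate.
    return model.inv_cdf(target_frr)


# 50 evenly spaced illustrative scores; direct counting could not resolve
# a rate finer than 1/50, whereas the fitted model can be evaluated at 0.001.
scores = [i / 10 for i in range(50)]
threshold = threshold_for_target_frr(scores, 0.001)
```

With only 50 client scores, a direct estimate has a resolution of 2%, so the model-based threshold here lies below every observed score.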

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2010. xiv, 43 p.
Series
Trita-CSC-A, ISSN 1653-5723 ; 2010:05
Keyword [en]
MAP, MLLR, VTLN, speaker characteristics, dynamic modeling, child
National Category
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:kth:diva-12258
ISBN: 978-91-7415-605-8 (print)
OAI: oai:DiVA.org:kth-12258
DiVA: diva2:306720
Presentation
2010-04-23, Fantum, KTH, Lindstedtsvägen 24, SE-100 44 STOCKHOLM, SWEDEN, 15:15 (English)
Opponent
Supervisors
Projects
Pf-Star, KOBRA
Note
QC 20110502
Available from: 2010-04-08 Created: 2010-03-30 Last updated: 2011-05-02 Bibliographically approved
List of papers
1. Adaptation and Normalization Experiments in Speech Recognition for 4 to 8 Year old Children.
2005 (English) Conference paper, Published paper (Refereed)
Abstract [en]

An experimental offline investigation of the performance of connected digits recognition was performed on children in the age range four to eight years. Poor performance using adult models was improved significantly by adaptation and vocal tract length normalisation but not to the same level as training on children. Age dependent models were tried with limited advantage. A combined adult and child training corpus maintained the performance for the separately trained categories. Linear frequency compression for vocal tract length normalization was attempted but estimation of the warping factor was sensitive to non-speech segments and background noise. Phoneme-based word modeling outperformed the whole word models, even though the vocabulary only consisted of digits.

Place, publisher, year, edition, pages
Lisboa, 2005
Keyword
MAP, MLLR, VTLN, speech recognition, child
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-12253 (URN)
2-s2.0-33745256261 (Scopus ID)
Conference
Interspeech
Projects
Pf-Star
Note
QC 20110502
Available from: 2010-03-30 Created: 2010-03-30 Last updated: 2011-05-02 Bibliographically approved
2. Vocal tract length compensation in the signal and model domains in child speech recognition
2007 (English) In: Proceedings of Fonetik: TMH-QPSR, 2007, 41-44 p. Conference paper, Published paper (Other academic)
Abstract [en]

In a newly started project, KOBRA, we study methods to reduce the required amount of training data for speech recognition by combining the conventional data-driven training approach with available partial knowledge on speech production, implemented as transformation functions in the acoustic, articulatory and speaker characteristic domains. Initially, we investigate one well-known dependence, the inversely proportional relation between vocal tract length and formant frequencies. In this report, we have replaced the conventional technique of frequency warping the unknown input utterance (VTLN) by transforming the training data instead. This enables phoneme-dependent warping to be performed. In another experiment, we expanded the available training data by duplicating each training utterance into a number of differently warped instances. Training on this expanded corpus results in models, each one representing the whole range of vocal tract length variation. This technique allows every frame of the utterance to be warped differently. The computational load is reduced by an order of magnitude compared to conventional VTLN without noticeable decrease in performance on the task of recognising children's speech using models trained on adult speech.
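The expansion of training data by warped duplicates can be sketched as follows. This is a minimal illustration, assuming a plain linear warp applied to a magnitude spectrum by interpolation; the function names, the warp factors and the choice to operate on spectra rather than on the paper's actual features are assumptions made here, not details from the paper.

```python
import numpy as np


def warp_spectrum(spectrum, alpha):
    """Linearly warp the frequency axis of a magnitude spectrum.

    Output bin k takes the input value at bin k / alpha, so alpha > 1
    moves spectral peaks upward, as for a shorter vocal tract.
    """
    bins = np.arange(len(spectrum), dtype=float)
    # np.interp holds the edge value for positions outside the input range.
    return np.interp(bins / alpha, bins, spectrum)


def expand_training_set(spectra, alphas):
    """Duplicate every spectrum once per warp factor (hypothetical helper)."""
    return [(alpha, warp_spectrum(s, alpha)) for s in spectra for alpha in alphas]


# A single spectral peak at bin 20 stands in for a formant.
peak = np.zeros(64)
peak[20] = 1.0
warped = warp_spectrum(peak, 1.2)
expanded = expand_training_set([peak, peak], [0.9, 1.0, 1.1])
```

Because the inverse proportionality scales all formant frequencies by the same factor, a single multiplicative factor per warped copy is enough in this simplified picture.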

National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-12333 (URN)
Conference
Proceedings of Fonetik, 2007, 30 May - 1 June, Room F2, Lindstedtsvägen 26, KTH, Stockholm
Note

QC 20110502

Available from: 2010-04-09 Created: 2010-04-08 Last updated: 2016-05-23 Bibliographically approved
3. Units for Dynamic Vocal Tract Length Normalization
(English) Manuscript (preprint) (Other academic)
Abstract [en]

A novel method to account for dynamic speaker characteristic properties in a speech recognition system is presented. The estimated trajectory of a property can be constrained to be constant or to have a limited rate-of-change within a phone or a sub-phone state, or be allowed to change between individual speech frames. The constraints are implemented by extending each state in the HMM by a number of property-specific sub-states transformed from the original model. The connections in the transition matrix of the extended model define possible slopes of the trajectory. Constraints on its dynamic range during an utterance are implemented by decomposing the trajectory into a static and a dynamic component. Results are presented on vocal tract length normalization in connected-digit recognition of children's speech using models trained on male adult speech. The word error rate was reduced compared with the conventional utterance-specific warping factor by 10% relative.
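The constrained trajectory search can be illustrated with a small dynamic-programming sketch. It is a simplification, not the paper's HMM formulation: it assumes a precomputed matrix of per-frame scores for each candidate warp factor, limits the slope to `max_step` indices per frame, and limits the dynamic range to `max_dev` indices around a static centre that is optimised jointly. All names and parameters are hypothetical.

```python
import numpy as np


def best_track(scores, max_step=1, max_dev=2):
    """Find the best warp-index track under slope and range constraints.

    scores[t, w] is an assumed per-frame score for warp index w at frame t.
    Returns (total score, static centre index, track as a list of indices).
    Ties between centres are broken in favour of the lowest index.
    """
    n_frames, n_warps = scores.shape
    best = (-np.inf, None, None)
    for center in range(n_warps):  # static component
        lo, hi = max(0, center - max_dev), min(n_warps - 1, center + max_dev)
        delta = np.full(n_warps, -np.inf)
        delta[lo:hi + 1] = scores[0, lo:hi + 1]
        back = np.zeros((n_frames, n_warps), dtype=int)
        for t in range(1, n_frames):  # dynamic component
            new = np.full(n_warps, -np.inf)
            for w in range(lo, hi + 1):
                p_lo, p_hi = max(lo, w - max_step), min(hi, w + max_step)
                prev = p_lo + int(np.argmax(delta[p_lo:p_hi + 1]))
                back[t, w] = prev
                new[w] = delta[prev] + scores[t, w]
            delta = new
        end = int(np.argmax(delta))
        if delta[end] > best[0]:
            track = [end]
            for t in range(n_frames - 1, 0, -1):
                track.append(int(back[t, track[-1]]))
            best = (float(delta[end]), center, track[::-1])
    return best


# Toy scores: each frame prefers one warp index, penalised by distance.
target = [2, 2, 3, 3, 4]
toy = -np.abs(np.arange(7)[None, :] - np.array(target)[:, None]).astype(float)
total, center, track = best_track(toy)
```

Subtracting `center` from `track` gives the dynamic component; the range constraint is what keeps its amplitude bounded for a single speaker.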

Keyword
speech recognition, VTLN, dynamic modelling
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-12242 (URN)
Projects
KOBRA
Note
QC 20110502
Available from: 2010-04-08 Created: 2010-03-30 Last updated: 2011-05-02 Bibliographically approved
4. Characteristics of a Low Reject Mode Speaker Verification System
2002 (English) Conference paper, Published paper (Refereed)
Abstract [en]

The performance of a speaker verification (SV) system is normally determined by the false reject (FRR) and false accept (FAR) rates as averages on a population of test speakers. However, information on the FRR distribution is required when estimating the portion of clients that will suffer from an unacceptably high reject rate. This paper studies this distribution in a population using an SV system operating in low reject mode. Two models of the distribution are proposed and compared with test data. An attempt is also made to tune the decision threshold in order to obtain a desired portion of clients having a reject rate lower than a specified value.

Place, publisher, year, edition, pages
Denver, Colorado, USA, 2002
Keyword
speech recognition, VTLN, dynamic modelling
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-12255 (URN)
Conference
International Conference on Spoken Language Processing
Note
QC 20110502
Available from: 2010-04-08 Created: 2010-03-30 Last updated: 2011-05-02 Bibliographically approved

Open Access in DiVA

fulltext (600 kB), 581 downloads
File information
File name: FULLTEXT01.pdf
File size: 600 kB
Checksum (SHA-512): 1c277b475945fbf6c1ca2716b6cd0c9852633c916c6fb0161dd95ff723720f5c05c53f38efe5676dafde8a6577aa270e6c03293f98702eed3e89aa0b28e14ade
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Elenius, Daniel
By organisation
Speech Communication and Technology
Language Technology (Computational Linguistics)
