Mining Speech Sounds: Machine Learning Methods for Automatic Speech Recognition and Analysis
KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-3323-5311
2006 (English) Doctoral thesis, comprehensive summary (Other scientific)
Abstract [en]

This thesis collects studies on machine learning methods applied to speech technology and speech research problems. The six research papers included in this thesis are organised in three main areas.

The first group of studies was carried out within the European project Synface. The aim was to develop a low-latency phonetic recogniser to drive the articulatory movements of a computer-generated virtual face from the acoustic speech signal. The visual information provided by the face is used as a hearing aid for persons using the telephone.

Paper A compares two solutions to the problem of mapping acoustic to visual information, based on regression and classification techniques. Recurrent Neural Networks are used to perform regression, while Hidden Markov Models are used for the classification task. In the second case, the visual information needed to drive the synthetic face is obtained by interpolation between target values for each acoustic class. The evaluation is based on listening tests with hearing-impaired subjects, where the intelligibility of sentence material is compared in different conditions: audio alone, audio and natural face, and audio and synthetic face driven by the different methods.
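The classification-plus-interpolation idea described above can be sketched as follows; the phoneme names, per-class target values, and function name are invented for illustration and are not taken from the paper.

```python
import numpy as np

# Hypothetical sketch: turn a recogniser's phoneme segmentation into a smooth
# trajectory for one visual synthesis parameter (say, jaw opening) by linear
# interpolation between per-class target values at the segment midpoints.
# The targets below are made up for the example.
targets = {"sil": 0.0, "a": 1.0, "m": 0.1, "o": 0.8}

def trajectory(segments, frame_rate=100):
    """segments: list of (phoneme, start_s, end_s); returns one parameter
    value per frame, interpolated between targets at segment midpoints."""
    end = segments[-1][2]
    times = np.arange(0.0, end, 1.0 / frame_rate)
    mids = np.array([(s + e) / 2 for _, s, e in segments])
    vals = np.array([targets[p] for p, _, _ in segments])
    return np.interp(times, mids, vals)  # clamps before/after first/last mid

traj = trajectory([("sil", 0.0, 0.1), ("m", 0.1, 0.2),
                   ("a", 0.2, 0.4), ("sil", 0.4, 0.5)])
```

The interpolation is what distinguishes this route from the regression one: the classifier only has to decide the class sequence, and the smoothness of the visual parameters is imposed afterwards.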

Paper B analyses the behaviour, in low latency conditions, of a phonetic recogniser based on a hybrid of Recurrent Neural Networks (RNNs) and Hidden Markov Models (HMMs). The focus is on the interaction between the time evolution model learnt by the RNNs and the one imposed by the HMMs.

Paper C investigates the possibility of using the entropy of the posterior probabilities estimated by a phoneme classification neural network, as a feature for phonetic boundary detection. The entropy and its time evolution are analysed with respect to the identity of the phonetic segment and the distance from a reference phonetic boundary.
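The entropy feature studied in Paper C can be illustrated with a small sketch (the numbers and function below are illustrative, not the thesis code): near a phonetic boundary the posteriors tend to be spread over several classes, so the per-frame entropy rises.

```python
import numpy as np

# Sketch: per-frame entropy of a phoneme classifier's posterior probabilities.
def frame_entropy(posteriors):
    """posteriors: (frames, classes) array, each row summing to 1."""
    p = np.clip(posteriors, 1e-12, 1.0)     # avoid log(0)
    return -(p * np.log2(p)).sum(axis=1)    # bits per frame

# a confident mid-segment frame vs. a frame near a boundary between two classes
post = np.array([[0.97, 0.01, 0.01, 0.01],
                 [0.45, 0.45, 0.05, 0.05]])
H = frame_entropy(post)  # H[1] > H[0]
```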

In the second group of studies, the aim was to provide tools for analysing large amounts of speech data in order to study geographical variations in pronunciation (accent analysis).

Paper D and Paper E use Hidden Markov Models and Agglomerative Hierarchical Clustering to analyse a data set of about 100 million data points (5000 speakers, 270 hours of speech recordings). In Paper E, Linear Discriminant Analysis was used to determine the features that most concisely describe the groupings obtained with the clustering procedure.
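The clustering step can be illustrated with a toy sketch: given a symmetric matrix of pairwise distances between per-region phoneme models (the matrix below is made up, and SciPy's `linkage`/`fcluster` stand in for whatever implementation the papers used), agglomerative hierarchical clustering groups the models.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy distance matrix between four hypothetical per-region phoneme models:
# the first two and the last two are close to each other.
D = np.array([[0.0, 0.2, 1.5, 1.6],
              [0.2, 0.0, 1.4, 1.5],
              [1.5, 1.4, 0.0, 0.3],
              [1.6, 1.5, 0.3, 0.0]])

Z = linkage(squareform(D), method="average")      # condensed form, average linkage
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram at 2 clusters
```

Inspecting the full tree `Z` rather than a single cut is what allows the kind of analysis described in Paper E, where groupings at different levels of detail can be examined.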

The third group consists of studies carried out within the international project MILLE (Modelling Language Learning), which aims at investigating and modelling the language acquisition process in infants.

Paper F proposes the use of an incremental form of Model-Based Clustering to describe the unsupervised emergence of phonetic classes in the first stages of language acquisition. The experiments were carried out on child-directed speech expressly collected for the purposes of the project.

Place, publisher, year, edition, pages
Stockholm: KTH, 2006. xix, 87 p.
Series
Trita-CSC-A, ISSN 1653-5723; 2006:12
Keyword [en]
speech, machine learning, data mining, signal processing
National Category
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-4111
ISBN: 91-7178-446-2 (print)
OAI: oai:DiVA.org:kth-4111
DiVA: diva2:10785
Public defence
2006-10-06, F3, Sing Sing, Lindstedtsvägen 26, Stockholm, 13:00
Opponent
Supervisors
Note
QC 20100630. Available from: 2006-09-21. Created: 2006-09-21. Last updated: 2010-06-30. Bibliographically approved.
List of papers
1. Using HMMs and ANNs for mapping acoustic to visual speech
1999 (English) In: TMH-QPSR, Vol. 40, no 1-2, 45-50 p. Article in journal (Other academic). Published
Abstract [en]

In this paper we present two different methods for mapping auditory, telephone-quality speech to visual parameter trajectories, specifying the movements of an animated synthetic face. In the first method, Hidden Markov Models (HMMs) were used to obtain phoneme strings and time labels. These were then transformed by rules into parameter trajectories for visual speech synthesis. In the second method, Artificial Neural Networks (ANNs) were trained to directly map acoustic parameters to synthesis parameters. Speaker-independent HMMs were trained on a phonetically transcribed telephone speech database. Different underlying units of speech were modelled by the HMMs, such as monophones, diphones, triphones, and visemes. The ANNs were trained on male, female, and mixed speakers. The HMM method and the ANN method were evaluated through audio-visual intelligibility tests with ten hearing-impaired persons, and compared to "ideal" articulations (where no recognition was involved), a natural face, and the intelligibility of the audio alone. It was found that the HMM method performs considerably better than the audio-alone condition (54% and 34% keywords correct, respectively), but not as well as the "ideal" articulating artificial face (64%). The intelligibility for the ANN method was 34% keywords correct.

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 1999
National Category
Computer Science
Identifiers
urn:nbn:se:kth:diva-6150 (URN)
Note

QC 20100630. QC 20160211

Available from: 2006-09-21. Created: 2006-09-21. Last updated: 2016-02-11. Bibliographically approved.
2. Dynamic behaviour of connectionist speech recognition with strong latency constraints
2006 (English) In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 48, no 7, 802-818 p. Article in journal (Refereed). Published
Abstract [en]

This paper describes the use of connectionist techniques in phonetic speech recognition with strong latency constraints. The constraints are imposed by the task of deriving the lip movements of a synthetic face in real time from the speech signal, by feeding the phonetic string into an articulatory synthesiser. Particular attention has been paid to analysing the interaction between the time evolution model learnt by the multi-layer perceptrons and the transition model imposed by the Viterbi decoder, in different latency conditions. Two experiments were conducted in which the time dependencies in the language model (LM) were controlled by a parameter. The results show a strong interaction between the three factors involved, namely the neural network topology, the length of time dependencies in the LM and the decoder latency.
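A minimal sketch of what "decoder latency" means here, assuming a truncated-backtracking formulation (toy numbers, not the paper's system): at frame t the decoder commits to the output at frame t - L by backtracking L steps from the currently best partial path, instead of waiting for the end of the utterance.

```python
import numpy as np

def viterbi_low_latency(log_obs, log_trans, L):
    """Frame-synchronous Viterbi with look-ahead L (assumes 1 <= L < T).
    log_obs: (T, S) frame log-likelihoods; log_trans: (S, S) log transition
    probabilities. Returns one committed state per frame."""
    T, S = log_obs.shape
    delta = log_obs[0].copy()                 # best score ending in each state
    backptr = np.zeros((T, S), dtype=int)
    out = []
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # (prev, cur)
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
        if t >= L:                            # commit frame t - L now
            s = int(delta.argmax())
            for k in range(t, t - L, -1):     # backtrack L steps
                s = int(backptr[k, s])
            out.append(s)
    s = int(delta.argmax())                   # flush the last L frames
    tail = [s]
    for k in range(T - 1, T - L, -1):
        s = int(backptr[k, s])
        tail.append(s)
    out.extend(reversed(tail))
    return out

# state 0 favoured in the first two frames, state 1 in the last two
log_obs = np.log(np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]))
log_trans = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
path = viterbi_low_latency(log_obs, log_trans, L=2)
```

With a small L, the early commitment can differ from the globally optimal path, which is exactly the interaction between look-ahead, transition model, and network outputs that the paper analyses.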

Place, publisher, year, edition, pages
Elsevier, 2006
Keyword
speech recognition; neural network; low latency; non-linear dynamics
National Category
Fluid Mechanics and Acoustics; Computer Science; Specific Languages
Identifiers
urn:nbn:se:kth:diva-6151 (URN)
10.1016/j.specom.2005.05.005 (DOI)
000239178600004 ()
2-s2.0-33745001617 (Scopus ID)
Conference
Research Workshop on Non-Linear Speech Processing (NOLISP), Le Croisic, France, May 20-23, 2003
Note

QC 20100630

Available from: 2006-09-21. Created: 2006-09-21. Last updated: 2017-12-14. Bibliographically approved.
3. Segment boundaries in low latency phonetic recognition
2005 (English) In: Nonlinear Analyses and Algorithms for Speech Processing / [ed] Faundez Zanuy M; Janer L; Esposito A; Satue Villar A; Roure J; Espinosa Duro V, 2005, Vol. 3817, 267-276 p. Conference paper, Published paper (Refereed)
Abstract [en]

The segment boundaries produced by the Synface low-latency phoneme recogniser are analysed. The precision in placing the boundaries is an important factor in the Synface system, as the aim is to drive the lip movements of a synthetic face for lip-reading support. The recogniser is based on a hybrid of recurrent neural networks and hidden Markov models. In this paper we analyse how the look-ahead length in the Viterbi-like decoder affects the precision of boundary placement. The properties of the entropy of the posterior probabilities estimated by the neural network are also investigated in relation to the distance of the frame from a phonetic transition.

Series
Lecture Notes in Artificial Intelligence, ISSN 0302-9743; 3817
National Category
Computer Science
Identifiers
urn:nbn:se:kth:diva-6152 (URN)
000235839300023 ()
2-s2.0-33745452923 (Scopus ID)
3-540-31257-9 (ISBN)
Conference
International Conference on Non-Linear Speech Processing, Barcelona, Spain, April 19-22, 2005
Note

QC 20100630

Available from: 2006-09-21. Created: 2006-09-21. Last updated: 2015-08-03. Bibliographically approved.
4. Accent clustering in Swedish using the Bhattacharyya distance
2003 (English) In: Proceedings of the 15th International Congress of Phonetic Sciences (ICPhS), Barcelona, Spain, 2003, 1149-1152 p. Conference paper, Published paper (Refereed)
Abstract [en]

In an attempt to improve automatic speech recognition (ASR) models for Swedish, accent variations were considered. These have proved to be important variables in the statistical distribution of the acoustic features usually employed in ASR. The analysis of feature variability has revealed phenomena that are consistent with what is known from phonetic investigations, suggesting that a considerable part of the information about accents could be derived from those features. A graphical interface has been developed to simplify the visualization of the geographical distributions of these phenomena.
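The Bhattacharyya distance named in the title has a closed form for Gaussian densities; the sketch below uses toy means and covariances, not the paper's accent models.

```python
import numpy as np

def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate Gaussians:
    (1/8)(mu1-mu2)^T S^-1 (mu1-mu2) + (1/2) ln(det S / sqrt(det S1 det S2)),
    with S = (S1 + S2)/2."""
    cov = (cov1 + cov2) / 2.0
    diff = mu1 - mu2
    term1 = diff @ np.linalg.solve(cov, diff) / 8.0
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

mu = np.zeros(2)
I = np.eye(2)
d_same = bhattacharyya_gaussian(mu, I, mu, I)                    # identical models
d_far = bhattacharyya_gaussian(mu, I, np.array([2.0, 0.0]), I)   # shifted mean
```

The measure is zero for identical densities and grows with both mean separation and covariance mismatch, which is what makes it usable as a pairwise distance between per-region acoustic models.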

National Category
Computer Science
Identifiers
urn:nbn:se:kth:diva-6153 (URN)
Note
QC 20100630. Available from: 2006-09-21. Created: 2006-09-21. Last updated: 2010-06-30. Bibliographically approved.
5. Advances in regional accent clustering in Swedish
2005 (English) In: Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), 2005, 2841-2844 p. Conference paper, Published paper (Refereed)
Abstract [en]

The regional pronunciation variation in Swedish is analysed on a large database. Statistics over each phoneme and for each region of Sweden are computed using the EM algorithm in a hidden Markov model framework, to overcome the difficulties of transcribing the whole set of data at the phonetic level. The model representations obtained this way are compared using a distance measure in the space spanned by the model parameters, and hierarchical clustering. The regional variants of each phoneme may group with those of any other phoneme, on the basis of their acoustic properties. The log likelihood of the data given the model is shown to display interesting properties regarding the choice of the number of clusters, given a particular level of detail. Discriminant analysis is used to find the parameters that most contribute to the separation between groups, adding an interpretative value to the discussion. Finally, a number of examples are given of phenomena revealed by examining the clustering tree.

National Category
Computer Science
Identifiers
urn:nbn:se:kth:diva-6154 (URN)2-s2.0-33745225165 (Scopus ID)
Conference
Interspeech 2005 - Eurospeech, Lisbon, Portugal, September 4-8, 2005
Note
QC 20100630. Available from: 2006-09-21. Created: 2006-09-21. Last updated: 2010-06-30. Bibliographically approved.
6. Ecological language acquisition via incremental model-based clustering
2005 (English) In: Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), Springer, 2005, 1181-1184 p. Conference paper, Published paper (Refereed)
Abstract [en]

We analyse the behaviour of Incremental Model-Based Clustering on child-directed speech data, and suggest a possible use of this method to describe the acquisition of phonetic classes by an infant. The effects of two factors are analysed, namely the number of coefficients describing the speech signal and the frame length of the incremental clustering procedure. The results show that, although the number of predicted clusters varies in different conditions, the classifications obtained are essentially consistent. Different classifications were compared using the variation of information measure.
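The variation of information mentioned above can be computed from the label sequences alone; a small sketch with toy labels (not the paper's data):

```python
import numpy as np
from collections import Counter

def variation_of_information(a, b):
    """VI(A, B) = H(A|B) + H(B|A) = 2 H(A,B) - H(A) - H(B), in bits.
    a, b: two cluster label sequences over the same items."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    H = lambda c: -sum((k / n) * np.log2(k / n) for k in c.values())
    return 2 * H(pab) - H(pa) - H(pb)

vi_same = variation_of_information([0, 0, 1, 1], [1, 1, 0, 0])  # relabelled copy
vi_diff = variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])  # independent split
```

The measure is zero for clusterings that agree up to relabelling, which is why it suits the comparison in the paper, where the number and identity of clusters differ across conditions.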

Place, publisher, year, edition, pages
Springer, 2005
Keyword
speech technology, language acquisition, embodied cognition, voice mapping, grounding meaning, unsupervised learning
National Category
Computer Science; Specific Languages
Identifiers
urn:nbn:se:kth:diva-6155 (URN)
2-s2.0-33745190282 (Scopus ID)
Conference
Interspeech 2005 - Eurospeech, Lisbon, Portugal, September 4-8, 2005
Note

QC 20100630. QC 20160211

Available from: 2006-09-21. Created: 2006-09-21. Last updated: 2016-06-01. Bibliographically approved.

Open Access in DiVA

fulltext: FULLTEXT01.pdf (2189 kB, application/pdf)

Authority records

Salvi, Giampiero

