Dynamic behaviour of connectionist speech recognition with strong latency constraints
KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.ORCID iD: 0000-0002-3323-5311
2006 (English). In: Speech Communication, ISSN 0167-6393, Vol. 48, no 7, p. 802-818. Article in journal (Refereed), Published
Abstract [en]

This paper describes the use of connectionist techniques in phonetic speech recognition with strong latency constraints. The constraints are imposed by the task of deriving the lip movements of a synthetic face in real time from the speech signal, by feeding the phonetic string into an articulatory synthesiser. Particular attention has been paid to analysing the interaction between the time evolution model learnt by the multi-layer perceptrons and the transition model imposed by the Viterbi decoder, in different latency conditions. Two experiments were conducted in which the time dependencies in the language model (LM) were controlled by a parameter. The results show a strong interaction between the three factors involved, namely the neural network topology, the length of time dependencies in the LM and the decoder latency.
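The trade-off between decoder latency and the transition model can be illustrated with a fixed look-ahead Viterbi decoder. The following is a minimal sketch, not the decoder used in the paper: the function names, the uniform initial state prior, and the use of log posteriors directly as frame scores are all assumptions made for illustration. The decision for frame t - lookahead is taken as soon as frame t has been observed, by backtracking from the currently best state.

```python
import math


def _backtrack(delta, backptrs, steps):
    """Follow back-pointers `steps` frames back from the currently best state."""
    state = max(range(len(delta)), key=lambda s: delta[s])
    for k in range(steps):
        state = backptrs[len(backptrs) - 1 - k][state]
    return state


def fixed_lag_viterbi(log_post, log_trans, lookahead):
    """Hypothetical fixed look-ahead Viterbi decoder.

    log_post  : per-frame lists of log scores, one per state (e.g. network outputs)
    log_trans : log_trans[i][j] is the log transition score from state i to j
    lookahead : number of future frames available before a frame is decided
    Returns one state index per frame. A uniform initial prior is assumed.
    """
    if not log_post:
        return []
    n_states = len(log_trans)
    delta, backptrs, decoded = [], [], []
    for t, frame in enumerate(log_post):
        if t == 0:
            delta = list(frame)
        else:
            new_delta, ptr = [], []
            for j in range(n_states):
                # Best predecessor of state j given the scores so far.
                best_i = max(range(n_states),
                             key=lambda i: delta[i] + log_trans[i][j])
                new_delta.append(delta[best_i] + log_trans[best_i][j] + frame[j])
                ptr.append(best_i)
            delta = new_delta
            backptrs.append(ptr)
        if t >= lookahead:
            # Commit the decision for frame t - lookahead.
            decoded.append(_backtrack(delta, backptrs, lookahead))
    # End of utterance: flush the frames that are still undecided.
    for lag in range(min(lookahead, len(log_post)) - 1, -1, -1):
        decoded.append(_backtrack(delta, backptrs, lag))
    return decoded
```

With lookahead set to 0 the decoder commits to the locally best state at every frame, so the transition model can only influence decisions through past frames. With lookahead at least as long as the utterance, the result coincides with standard Viterbi decoding.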

Place, publisher, year, edition, pages
Elsevier, 2006. Vol. 48, no 7, 802-818 p.
Keyword [en]
speech recognition; neural network; low latency; non-linear dynamics
National Category
Fluid Mechanics and Acoustics; Computer Science; Specific Languages
URN: urn:nbn:se:kth:diva-6151
DOI: 10.1016/j.specom.2005.05.005
ISI: 000239178600004
ScopusID: 2-s2.0-33745001617
OAI: diva2:10780
Research Workshop on Non-Linear Speech Processing (NOLISP), Le Croisic, France, May 20-23, 2003

QC 20100630

Available from: 2006-09-21; Created: 2006-09-21; Last updated: 2013-09-12; Bibliographically approved
In thesis
1. Mining Speech Sounds: Machine Learning Methods for Automatic Speech Recognition and Analysis
2006 (English). Doctoral thesis, comprehensive summary (Other scientific)
Abstract [en]

This thesis collects studies on machine learning methods applied to speech technology and speech research problems. The six research papers included in this thesis are organised in three main areas.

The first group of studies was carried out within the European project Synface. The aim was to develop a low-latency phonetic recogniser to drive the articulatory movements of a computer-generated virtual face from the acoustic speech signal. The visual information provided by the face is used as a hearing aid for persons using the telephone.

Paper A compares two solutions to the problem of mapping acoustic to visual information that are based on regression and classification techniques. Recurrent Neural Networks are used to perform regression while Hidden Markov Models are used for the classification task. In the second case the visual information needed to drive the synthetic face is obtained by interpolation between target values for each acoustic class. The evaluation is based on listening tests with hearing-impaired subjects, where the intelligibility of sentence material is compared in different conditions: audio alone, audio and natural face, audio and synthetic face driven by the different methods.

Paper B analyses the behaviour, in low latency conditions, of a phonetic recogniser based on a hybrid of Recurrent Neural Networks (RNNs) and Hidden Markov Models (HMMs). The focus is on the interaction between the time evolution model learnt by the RNNs and the one imposed by the HMMs.

Paper C investigates the possibility of using the entropy of the posterior probabilities estimated by a phoneme classification neural network as a feature for phonetic boundary detection. The entropy and its time evolution are analysed with respect to the identity of the phonetic segment and the distance from a reference phonetic boundary.
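As a rough illustration of the feature described for Paper C, the sketch below computes the per-frame entropy of a posterior distribution and marks local entropy maxima as boundary candidates. The peak-picking rule and all function names are assumptions for illustration only; the thesis abstract states that the entropy is analysed as a feature, not that this particular detector is used.

```python
import math


def frame_entropy(posteriors):
    """Entropy in bits of one frame of class posteriors (zero terms are skipped)."""
    return -sum(p * math.log2(p) for p in posteriors if p > 0.0)


def entropy_track(frames):
    """Entropy for each frame of a sequence of posterior vectors."""
    return [frame_entropy(f) for f in frames]


def local_maxima(values):
    """Indices of interior frames whose value is a local peak.

    Hypothetical boundary-candidate rule: uncertainty about the phonetic
    class is expected to rise near a transition between two segments.
    """
    return [t for t in range(1, len(values) - 1)
            if values[t] > values[t - 1] and values[t] >= values[t + 1]]
```

A frame where the network is certain gives an entropy of 0 bits, while a frame split evenly between two classes gives 1 bit, so transitions between confidently classified segments tend to show up as peaks in the entropy track.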

In the second group of studies, the aim was to provide tools for analysing large amounts of speech data in order to study geographical variations in pronunciation (accent analysis).

Paper D and Paper E use Hidden Markov Models and Agglomerative Hierarchical Clustering to analyse a data set of about 100 million data points (5000 speakers, 270 hours of speech recordings). In Paper E, Linear Discriminant Analysis was used to determine the features that most concisely describe the groupings obtained with the clustering procedure.
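For readers unfamiliar with agglomerative hierarchical clustering, a minimal sketch follows. It assumes Euclidean distance and average linkage, neither of which is stated in the abstract, and the function names are hypothetical. It uses a naive search that is only practical for small inputs, unlike the data set described above.

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def average_linkage(points, c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns a list of clusters, each a sorted list of point indices.
    Average linkage is an assumption made for this sketch.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(points, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]
```

Each merge step joins the pair of clusters with the smallest linkage distance, so recording the sequence of merges (omitted here for brevity) would give the usual dendrogram.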

The third group comprises studies carried out during the international project MILLE (Modelling Language Learning), which aims at investigating and modelling the language acquisition process in infants.

Paper F proposes the use of an incremental form of Model Based Clustering to describe the unsupervised emergence of phonetic classes in the first stages of language acquisition. The experiments were carried out on child-directed speech expressly collected for the purposes of the project.

Place, publisher, year, edition, pages
Stockholm: KTH, 2006. xix, 87 p.
Trita-CSC-A, ISSN 1653-5723 ; 2006:12
speech, machine learning, data mining, signal processing
National Category
Computer Science
URN: urn:nbn:se:kth:diva-4111
ISBN: 91-7178-446-2
Public defence
2006-10-06, F3, Sing Sing, Lindstedtsvägen 26, Stockholm, 13:00
QC 20100630
Available from: 2006-09-21; Created: 2006-09-21; Last updated: 2010-06-30; Bibliographically approved

Author
Salvi, Giampiero