Human Audio-Visual Consonant Recognition Analyzed with Three Bimodal Integration Models
2009 (English). In: INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC, 2009, 812-815 p. Conference paper (Refereed)
With A-V recordings, ten normal-hearing people took recognition tests at different signal-to-noise ratios (SNRs). The AV recognition results are predicted by the fuzzy logical model of perception (FLMP) and the post-labelling integration model (POSTL). We also applied hidden Markov models (HMMs) and multi-stream HMMs (MSHMMs) for the recognition. As expected, all the models agree qualitatively with the results in that the benefit gained from the visual signal is larger at lower acoustic SNRs. However, the FLMP severely overestimates the AV integration result, while the POSTL model underestimates it. Our automatic speech recognizers integrated the audio and visual streams efficiently. The visual automatic speech recognizer could be adjusted to correspond to human visual performance. The MSHMMs combine the audio and visual streams efficiently, but the audio automatic speech recognizer must be further improved to allow precise quantitative comparisons with human audio-visual performance.
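As a rough illustration of two of the integration schemes named in the abstract, the sketch below assumes the commonly cited multiplicative form of the FLMP (per-class audio and visual support values are multiplied and renormalized) and a weighted log-likelihood combination for a multi-stream HMM state. The function names, the convention that the two stream weights sum to one, and all numbers are illustrative assumptions, not values or code from the paper.

```python
def flmp_predict(audio_support, visual_support):
    """Assumed FLMP rule: multiply per-class unimodal support, then normalize."""
    if len(audio_support) != len(visual_support):
        raise ValueError("both modalities must cover the same response classes")
    products = [a * v for a, v in zip(audio_support, visual_support)]
    total = sum(products)
    if total == 0:
        raise ValueError("no class has support in both modalities")
    return [p / total for p in products]


def multistream_log_likelihood(log_audio, log_visual, audio_weight):
    """Assumed MSHMM combination: weighted sum of per-stream log-likelihoods.

    audio_weight lies in [0, 1]; the visual stream gets the remainder.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must be in [0, 1]")
    return audio_weight * log_audio + (1.0 - audio_weight) * log_visual


# Hypothetical support values for three consonant classes (made up for illustration).
audio = [0.6, 0.3, 0.1]
visual = [0.5, 0.1, 0.4]
probs = flmp_predict(audio, visual)

# Hypothetical per-stream log-likelihoods for one HMM state and frame.
combined = multistream_log_likelihood(-2.0, -4.0, audio_weight=0.75)
```

In a setup like this, the audio stream weight would typically be lowered as the acoustic SNR drops, so that the visual stream contributes more when the audio is less reliable.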
Place, publisher, year, edition, pages
BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC , 2009. 812-815 p.
Audio-visual recognition, Fuzzy Logical Model of Perception, Post-Labelling Model, Hidden Markov Models, Multi-Stream Hidden Markov Models
Computer and Information Science; General Language Studies and Linguistics
Identifiers: URN: urn:nbn:se:kth:diva-29880; ISI: 000276842800203; ScopusID: 2-s2.0-70450192523; ISBN: 978-1-61567-692-7; OAI: oai:DiVA.org:kth-29880; DiVA: diva2:399103
10th Annual Conference of the International Speech Communication Association
QC 20110221; 2011-02-21; 2011-02-17; 2011-11-15; Bibliographically approved