  • 1.
    Agelfors, Eva
    et al.
    KTH, Superseded Departments, Speech Transmission and Music Acoustics.
    Beskow, Jonas
    Dahlquist, M
    Granström, Björn
    Lundeberg, M
    Salvi, Giampiero
    Spens, K-E
    Öhman, Tobias
    Two methods for Visual Parameter Extraction in the Teleface Project, 1999. In: Proceedings of Fonetik, Gothenburg, Sweden, 1999. Conference paper (Other academic)
  • 2.
    Agelfors, Eva
    et al.
    KTH, Superseded Departments, Speech, Music and Hearing.
    Beskow, Jonas
    KTH, Superseded Departments, Speech, Music and Hearing.
    Dahlquist, Martin
    KTH, Superseded Departments, Speech, Music and Hearing.
    Granström, Björn
    KTH, Superseded Departments, Speech, Music and Hearing.
    Lundeberg, Magnus
    Salvi, Giampiero
    KTH, Superseded Departments, Speech, Music and Hearing.
    Spens, Karl-Erik
    KTH, Superseded Departments, Speech, Music and Hearing.
    Öhman, Tobias
    A synthetic face as a lip-reading support for hearing impaired telephone users - problems and positive results, 1999. In: European audiology in 1999: proceeding of the 4th European Conference in Audiology, Oulu, Finland, June 6-10, 1999, 1999. Conference paper (Refereed)
  • 3.
    Agelfors, Eva
    et al.
    KTH, Superseded Departments, Speech, Music and Hearing.
    Beskow, Jonas
    KTH, Superseded Departments, Speech, Music and Hearing.
    Granström, Björn
    KTH, Superseded Departments, Speech, Music and Hearing.
    Lundeberg, Magnus
    KTH, Superseded Departments, Speech, Music and Hearing.
    Salvi, Giampiero
    KTH, Superseded Departments, Speech, Music and Hearing.
    Spens, Karl-Erik
    KTH, Superseded Departments, Speech, Music and Hearing.
    Öhman, Tobias
    KTH, Superseded Departments, Speech, Music and Hearing.
    Synthetic visual speech driven from auditory speech, 1999. In: Proceedings of Audio-Visual Speech Processing (AVSP'99), 1999. Conference paper (Refereed)
    Abstract [en]

    We have developed two different methods for using auditory, telephone speech to drive the movements of a synthetic face. In the first method, Hidden Markov Models (HMMs) were trained on a phonetically transcribed telephone speech database. The output of the HMMs was then fed into a rule-based visual speech synthesizer as a string of phonemes together with time labels. In the second method, Artificial Neural Networks (ANNs) were trained on the same database to map acoustic parameters directly to facial control parameters. These target parameter trajectories were generated by using phoneme strings from a database as input to the visual speech synthesis. The two methods were evaluated through audiovisual intelligibility tests with ten hearing impaired persons, and compared to “ideal” articulations (where no recognition was involved), a natural face, and to the intelligibility of the audio alone. It was found that the HMM method performs considerably better than the audio alone condition (54% and 34% keywords correct respectively), but not as well as the “ideal” articulating artificial face (64%). The intelligibility for the ANN method was 34% keywords correct.
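
    As an illustration of the kind of rule-based visual synthesis stage mentioned above (a minimal sketch, not the authors' implementation), the code below turns a time-labelled phoneme string into one facial control parameter trajectory by interpolating between per-phoneme target values; the phoneme set, target values and frame rate are invented for the example.

```python
"""Sketch only: phoneme string + time labels -> facial parameter trajectory."""
import numpy as np

# Hypothetical per-phoneme targets for one control parameter (e.g. lip opening).
TARGETS = {"sil": 0.0, "a": 0.9, "m": 0.05, "o": 0.6}

def trajectory(segments, frame_rate=100):
    """segments: list of (phoneme, start_s, end_s); returns frame times and values."""
    times, values = [], []
    for phone, start, end in segments:
        times.append(0.5 * (start + end))        # place the target at the segment centre
        values.append(TARGETS.get(phone, 0.0))
    frames = np.arange(0.0, segments[-1][2], 1.0 / frame_rate)
    return frames, np.interp(frames, times, values)  # linear interpolation between targets

frames, lip_opening = trajectory([("sil", 0.0, 0.1), ("m", 0.1, 0.2),
                                  ("a", 0.2, 0.45), ("sil", 0.45, 0.6)])
print(lip_opening[:5])
```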

  • 4.
    Agelfors, Eva
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Karlsson, Inger
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Kewley, Jo
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Thomas, Neil
    User evaluation of the SYNFACE talking head telephone, 2006. In: Computers Helping People With Special Needs, Proceedings / [ed] Miesenberger, K; Klaus, J; Zagler, W; Karshmer, A, 2006, Vol. 4061, p. 579-586. Conference paper (Refereed)
    Abstract [en]

    The talking-head telephone, Synface, is a lip-reading support for people with hearing impairment. It has been tested by 49 users with varying degrees of hearing impairment in the UK and Sweden, in lab and home environments. Synface was found to give support to the users, especially in perceiving numbers and addresses, and to be an enjoyable way to communicate. A majority deemed Synface to be a useful product.

  • 5.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    SynFace Phone Recognizer for Swedish Wideband and Narrowband Speech, 2008. In: Proceedings of The second Swedish Language Technology Conference (SLTC), Stockholm, Sweden, 2008, p. 3-6. Conference paper (Other academic)
    Abstract [en]

    In this paper, we present new results and comparisons of the real-time lip-synchronized talking head SynFace on different Swedish databases and bandwidths. The work involves training SynFace on narrow-band telephone speech from the Swedish SpeechDat, and on the narrow-band and wide-band Speecon corpus. Auditory perceptual tests are being established for SynFace as an audio-visual hearing support for the hearing-impaired. Preliminary results show high recognition accuracy compared to other languages.

  • 6.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Öster, Anne-Marie
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    van Son, Nic
    Viataal, Nijmegen, The Netherlands.
    Ormel, Ellen
    Viataal, Nijmegen, The Netherlands.
    Herzke, Tobias
    HörTech gGmbH, Germany.
    Studies on Using the SynFace Talking Head for the Hearing Impaired, 2009. In: Proceedings of Fonetik'09: The XXIIth Swedish Phonetics Conference, June 10-12, 2009 / [ed] Peter Branderud, Hartmut Traunmüller, Stockholm: Stockholm University, 2009, p. 140-143. Conference paper (Other academic)
    Abstract [en]

    SynFace is a lip-synchronized talking agent which is optimized as a visual reading support for the hearing impaired. In this paper we present the large scale hearing impaired user studies carried out for three languages in the Hearing at Home project. The user tests focus on measuring the gain in Speech Reception Threshold in Noise and the effort scaling when using SynFace by hearing impaired people, where groups of hearing impaired subjects with different impairment levels from mild to severe and cochlear implants are tested. Preliminary analysis of the results does not show significant gain in SRT or in effort scaling. But looking at large cross-subject variability in both tests, it is clear that many subjects benefit from SynFace especially with speech with stereo babble.

  • 7.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Öster, Ann-Marie
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    van Son, Nic
    Ormel, Ellen
    Virtual Speech Reading Support for Hard of Hearing in a Domestic Multi-Media Setting, 2009. In: INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC, 2009, p. 1443-1446. Conference paper (Refereed)
    Abstract [en]

    In this paper we present recent results on the development of the SynFace lip synchronized talking head towards multilinguality, varying signal conditions and noise robustness in the Hearing at Home project. We then describe the large scale hearing impaired user studies carried out for three languages. The user tests focus on measuring the gain in Speech Reception Threshold in Noise when using SynFace, and on measuring the effort scaling when using SynFace by hearing impaired people. Preliminary analysis of the results does not show significant gain in SRT or in effort scaling. But looking at inter-subject variability, it is clear that many subjects benefit from SynFace especially with speech with stereo babble noise.

  • 8.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Using Imitation to learn Infant-Adult Acoustic Mappings, 2011. In: 12th Annual Conference Of The International Speech Communication Association 2011 (INTERSPEECH 2011), Vols 1-5, ISCA, 2011, p. 772-775. Conference paper (Refereed)
    Abstract [en]

    This paper discusses a model which conceptually demonstrates how infants could learn the normalization between infant-adult acoustics. The model proposes that the mapping can be inferred from the topological correspondences between the adult and infant acoustic spaces, that are clustered separately in an unsupervised manner. The model requires feedback from the adult in order to select the right topology for clustering, which is a crucial aspect of the model. The feedback is in terms of an overall rating of the imitation effort by the infant, rather than a frame-by-frame correspondence. Using synthetic, but continuous speech data, we demonstrate that clusters, which have a good topological correspondence, are perceived to be similar by a phonetically trained listener.

  • 9.
    Beskow, Jonas
    et al.
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Nordqvist, Peter
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Al Moubayed, Samer
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Herzke, Tobias
    Schulz, Arne
    Hearing at Home: Communication support in home environments for hearing impaired persons, 2008. In: INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC, 2008, p. 2203-2206. Conference paper (Refereed)
    Abstract [en]

    The Hearing at Home (HaH) project focuses on the needs of hearing-impaired people in home environments. The project is researching and developing an innovative media-center solution for hearing support, with several integrated features that support perception of speech and audio, such as individual loudness amplification, noise reduction, audio classification and event detection, and the possibility to display an animated talking head providing real-time speechreading support. In this paper we provide a brief project overview and then describe some recent results related to the audio classifier and the talking head. As the talking head expects clean speech input, an audio classifier has been developed for the task of classifying audio signals as clean speech, speech in noise or other. The mean accuracy of the classifier was 82%. The talking head (based on technology from the SynFace project) has been adapted for German, and a small speech-in-noise intelligibility experiment was conducted where sentence recognition rates increased from 3% to 17% when the talking head was present.

  • 10.
    Beskow, Jonas
    et al.
    KTH, Superseded Departments, Speech, Music and Hearing.
    Karlsson, Inger
    KTH, Superseded Departments, Speech, Music and Hearing.
    Kewley, J
    Salvi, Giampiero
    KTH, Superseded Departments, Speech, Music and Hearing.
    SYNFACE - A talking head telephone for the hearing-impaired, 2004. In: COMPUTERS HELPING PEOPLE WITH SPECIAL NEEDS: PROCEEDINGS / [ed] Miesenberger, K; Klaus, J; Zagler, W; Burger, D, BERLIN: SPRINGER, 2004, Vol. 3118, p. 1178-1185. Conference paper (Refereed)
    Abstract [en]

    SYNFACE is a telephone aid for hearing-impaired people that shows the lip movements of the speaker at the other telephone synchronised with the speech. The SYNFACE system consists of a speech recogniser that recognises the incoming speech and a synthetic talking head. The output from the recogniser is used to control the articulatory movements of the synthetic head. SYNFACE prototype systems exist for three languages: Dutch, English and Swedish and the first user trials have just started.

  • 11.
    Beskow, Jonas
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Al Moubayed, Samer
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    SynFace: Verbal and Non-verbal Face Animation from Audio, 2009. In: Proceedings of The International Conference on Auditory-Visual Speech Processing AVSP'09 / [ed] Barry-John Theobald, Richard Harvey, Norwich, England, 2009. Conference paper (Refereed)
    Abstract [en]

    We give an overview of SynFace, a speech-driven face animation system originally developed for the needs of hard-of-hearing users of the telephone. For the 2009 LIPS challenge, SynFace includes not only articulatory motion but also non-verbal motion of gaze, eyebrows and head, triggered by detection of acoustic correlates of prominence and cues for interaction control. In perceptual evaluations, both verbal and non-verbal movements have been found to have a positive impact on word recognition scores.

  • 12. Castellana, Antonella
    et al.
    Selamtzis, Andreas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Carullo, Alessio
    Astolfi, Arianna
    Cepstral and entropy analyses in vowels excerpted from continuous speech of dysphonic and control speakers, 2017. In: Proceedings of the Annual Conference of the International Speech Communication Association, Interspeech 2017 / [ed] ISCA, International Speech Communication Association, 2017, Vol. 2017, p. 1814-1818. Conference paper (Refereed)
    Abstract [en]

    There is a growing interest in Cepstral and Entropy analyses of voice samples for defining a vocal health indicator, due to their reliability in investigating both regular and irregular voice signals. The purpose of this study is to determine whether the Cepstral Peak Prominence Smoothed (CPPS) and Sample Entropy (SampEn) could differentiate dysphonic speakers from normal speakers in vowels excerpted from readings, and to compare their discrimination power. Results are reported for 33 patients and 31 controls, who read a standardized phonetically balanced passage while wearing a head-mounted microphone. Vowels were excerpted from the recordings using Automatic Speech Recognition and, after obtaining a measure for each vowel, individual distributions and their descriptive statistics were considered for CPPS and SampEn. The Receiver Operating Curve analysis revealed that the mean of the distributions was the parameter with the highest discrimination power for both CPPS and SampEn. CPPS showed a higher diagnostic precision than SampEn, exhibiting an Area Under Curve (AUC) of 0.85 compared to 0.72. A negative correlation between the parameters was found (Spearman's ρ = -0.61), with higher SampEn corresponding to lower CPPS. The automatic method used in this study could provide support to voice monitoring in the clinic and during individuals' daily activities.
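
    For readers unfamiliar with the evaluation used here, the sketch below (not the study's code; the data are random placeholders) shows how the discrimination power of two per-speaker measures, such as the means of the CPPS and SampEn distributions, can be compared via ROC analysis with scikit-learn.

```python
"""Sketch only: comparing two per-speaker measures with ROC/AUC."""
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = np.r_[np.ones(33), np.zeros(31)]              # 33 patients, 31 controls
cpps_mean = rng.normal(loc=np.where(labels, 10.0, 14.0), scale=2.0)
sampen_mean = rng.normal(loc=np.where(labels, 1.2, 0.9), scale=0.3)

# Lower CPPS is assumed to indicate dysphonia, so flip its sign for a "higher = positive" score.
print("AUC (CPPS mean):  ", roc_auc_score(labels, -cpps_mean))
print("AUC (SampEn mean):", roc_auc_score(labels, sampen_mean))
```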

  • 13.
    Fahlström Myrman, Arvid
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Partitioning of Posteriorgrams using Siamese Models for Unsupervised Acoustic Modelling, 2017. In: Grounding Language Understanding, 2017. Conference paper (Refereed)
  • 14. Johansen, Finn Tore
    et al.
    Warakagoda, Narada
    Lindberg, Borge
    Lehtinen, Gunnar
    Kacic, Zdravko
    Zgank, Andrei
    Elenius, Kjell
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    The COST 249 SpeechDat multilingual reference recogniser, 2000. Conference paper (Refereed)
    Abstract [en]

    The COST 249 SpeechDat reference recogniser is a fully automatic, language-independent training procedure for building a phonetic recogniser. It relies on the HTK toolkit and a SpeechDat(II) compatible database. The recogniser is designed to serve as a reference system in multilingual recognition research. This paper documents version 0.93 of the reference recogniser and presents results on small-vocabulary recognition for seven languages.

  • 16.
    Karlsson, Inger
    et al.
    KTH, Superseded Departments, Speech, Music and Hearing.
    Faulkner, Andrew
    Salvi, Giampiero
    KTH, Superseded Departments, Speech, Music and Hearing.
    SYNFACE - a talking face telephone, 2003. In: Proceedings of EUROSPEECH 2003, 2003, p. 1297-1300. Conference paper (Refereed)
    Abstract [en]

    The SYNFACE project has as its primary goal to make it easier for hearing-impaired people to use an ordinary telephone. This will be achieved by using a talking face connected to the telephone. The incoming speech signal will govern the speech movements of the talking face, hence the talking face will provide lip-reading support for the user. The project will define the visual speech information that supports lip-reading, and develop techniques to derive this information from the acoustic speech signal in near real time for three different languages: Dutch, English and Swedish. This requires the development of automatic speech recognition methods that detect information in the acoustic signal that correlates with the speech movements. This information will govern the speech movements in a synthetic face and synchronise them with the acoustic speech signal. A prototype system is being constructed. The prototype contains results achieved so far in SYNFACE. This system will be tested and evaluated for the three languages by hearing-impaired users. SYNFACE is an IST project (IST-2001-33327) with partners from the Netherlands, the UK and Sweden. SYNFACE builds on experiences gained in the Swedish Teleface project.

  • 17.
    Koniaris, Christos
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Auditory and Dynamic Modeling Paradigms to Detect L2 Mispronunciations, 2012. In: 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012, Vol 1, 2012, p. 898-901. Conference paper (Refereed)
    Abstract [en]

    This paper expands our previous work on automatic pronunciation error detection that exploits knowledge from psychoacoustic auditory models. The new system has two additional important features, i.e., auditory and acoustic processing of the temporal cues of the speech signal, and classification feedback from a trained linear dynamic model. We also perform a pronunciation analysis by considering the task as a classification problem. Finally, we evaluate the proposed methods by conducting a listening test on the same speech material and comparing the judgment of the listeners with that of the methods. The automatic analysis based on spectro-temporal cues is shown to have the best agreement with the human evaluation, particularly with that of language teachers, and with previous plenary linguistic studies.

  • 18.
    Koniaris, Christos
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    On the Benefit of Using Auditory Modeling for Diagnostic Evaluation of Pronunciations, 2012. In: International Symposium on Automatic Detection of Errors in Pronunciation Training (IS ADEPT), Stockholm, Sweden, June 6-8, 2012 / [ed] Olov Engwall, 2012, p. 59-64. Conference paper (Refereed)
    Abstract [en]

    In this paper we demonstrate that a psychoacoustic model-based distance measure performs better than a speech signal distance measure in assessing the pronunciation of individual foreign speakers. The experiments show that the perceptually based method performs not only quantitatively better than a speech spectrum-based method, but also qualitatively better, hence showing that auditory information is beneficial in the task of pronunciation error detection. We first present the general approach of the method, which uses the dissimilarity between the native perceptual domain and the non-native speech power spectrum domain. The problematic phonemes for a given non-native speaker are determined by the degree of disparity between the dissimilarity measure for the non-native and a group of native speakers. The two methods compared here are applied to different groups of non-native speakers of various language backgrounds and validated against a theoretical linguistic study.

  • 19.
    Koniaris, Christos
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    On mispronunciation analysis of individual foreign speakers using auditory periphery models, 2013. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 55, no 5, p. 691-706. Article in journal (Refereed)
    Abstract [en]

    In second language (L2) learning, a major difficulty is to discriminate between the acoustic diversity within an L2 phoneme category and that between different categories. We propose a general method for automatic diagnostic assessment of the pronunciation of nonnative speakers based on models of the human auditory periphery. Considering each phoneme class separately, the geometric shape similarity between the native auditory domain and the non-native speech domain is measured. The phonemes that deviate the most from the native pronunciation for a set of L2 speakers are detected by comparing the geometric shape similarity measure with that calculated for native speakers on the same phonemes. To evaluate the system, we have tested it with different non-native speaker groups from various language backgrounds. The experimental results are in accordance with linguistic findings and human listeners' ratings, particularly when both the spectral and temporal cues of the speech signal are utilized in the pronunciation analysis.

  • 20. Krunic, Verica
    et al.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Bernardino, Alexandre
    Montesano, Luis
    Santos-Victor, José
    Affordance based word-to-meaning association, 2009. In: ICRA: 2009 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, VDE Verlag GmbH, 2009, p. 4138-4143. Conference paper (Refereed)
    Abstract [en]

    This paper presents a method to associate meanings to words in manipulation tasks. We base our model on an affordance network, i.e., a mapping between robot actions, robot perceptions and the perceived effects of these actions upon objects. We extend the affordance model to incorporate words. Using verbal descriptions of a task, the model uses temporal co-occurrence to create links between speech utterances and the involved objects, actions and effects. We show that the robot is able to form useful word-to-meaning associations, even without considering grammatical structure in the learning process and in the presence of recognition errors. These word-to-meaning associations are embedded in the robot's own understanding of its actions. Thus they can be directly used to instruct the robot to perform tasks and also allow context to be incorporated in the speech recognition task.

  • 21. Krunic, Verica
    et al.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Bernardino, Alexandre
    Montesano, Luis
    Santos-Victor, José
    Associating word descriptions to learned manipulation task models, 2008. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Nice, France, 2008. Conference paper (Refereed)
    Abstract [en]

    This paper presents a method to associate meanings to words in manipulation tasks. We base our model on an affordance network, i.e., a mapping between robot actions, robot perceptions and the perceived effects of these actions upon objects. This knowledge is acquired by the robot in an unsupervised way by self-interaction with the environment. When a human user is involved in the process and describes a particular task, the robot can form associations between the (co-occurrence of) speech utterances and the involved objects, actions and effects. We extend the affordance model to incorporate a simple description of speech as a set of words. We show that, across many experiences, the robot is able to form useful word-to-meaning associations, even without considering grammatical structure in the learning process and in the presence of recognition errors. Word-to-meaning associations are then used to instruct the robot to perform tasks and also allow context to be incorporated in the speech recognition task.

  • 22.
    Kumar Dhaka, Akash
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Sparse Autoencoder Based Semi-Supervised Learning for Phone Classification with Limited Annotations, 2017. In: Grounding Language Understanding, 2017. Conference paper (Refereed)
  • 23. Lindberg, Borge
    et al.
    Johansen, Finn Tore
    Warakagoda, Narada
    Lehtinen, Gunnar
    Kacic, Zdravko
    Zgank, Andrei
    Elenius, Kjell
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A noise robust multilingual reference recogniser based on SpeechDat(II), 2000. Conference paper (Refereed)
    Abstract [en]

    An important aspect of noise robustness of automatic speech recognisers (ASR) is the proper handling of non-speech acoustic events. The present paper describes further improvements of an already existing reference recogniser towards achieving this kind of robustness. The reference recogniser applied is the COST 249 SpeechDat reference recogniser, which is a fully automatic, language-independent training procedure for building a phonetic recogniser (http://www.telenor.no/fou/prosjekter/taletek/refrec). The reference recogniser relies on the HTK toolkit and a SpeechDat(II) compatible database, and is designed to serve as a reference system in multilingual speech recognition research. The paper describes version 0.96 of the reference recogniser, which takes into account labelled non-speech acoustic events during training and provides robustness against these during testing. Results are presented on small and medium vocabulary recognition for six languages.

  • 24. Lindblom, Björn
    et al.
    Diehl, Randy
    Park, Sang-Hoon
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    (Re)use of place features in voiced stop systems: Role of phonetic constraints, 2008. In: Proceedings of Fonetik 2008, University of Gothenburg, 2008, p. 5-8. Conference paper (Other academic)
    Abstract [en]

    Computational experiments focused on place of articulation in voiced stops were designed to generate ‘optimal’ inventories of CV syllables from a larger set of ‘possible CV:s’ in the presence of independently and numerically defined articulatory, perceptual and developmental constraints. Across vowel contexts the most salient places were retroflex, palatal and uvular. This was evident from acoustic measurements and perceptual data. Simulation results using the criterion of perceptual contrast alone failed to produce systems with the typologically widely attested set [b] [d] [g], whereas using articulatory cost as the sole criterion produced inventories in which bilabial, dental/alveolar and velar onsets formed the core. Neither perceptual contrast, nor articulatory cost, (nor the two combined), produced a consistent re-use of place features (‘phonemic coding’). Only systems constrained by ‘target learning’ exhibited a strong recombination of place features.

  • 25. Lindblom, Björn
    et al.
    Diehl, Randy
    Park, Sang-Hoon
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Sound systems are shaped by their users: The recombination of phonetic substance, 2011. In: Where Do Phonological Features Come From?: Cognitive, physical and developmental bases of distinctive speech categories / [ed] Clements, G. N.; Ridouane, R., John Benjamins Publishing Company, 2011, p. 67-97. Chapter in book (Other academic)
    Abstract [en]

    Computational experiments were run using an optimization criterion based on independently motivated definitions of perceptual contrast, articulatory cost and learning cost. The question: If stop+vowel inventories are seen as adaptations to perceptual, articulatory and developmental constraints what would they be like? Simulations successfully predicted typologically widely observed place preferences and the re-use of place features (‘phonemic coding’) in voiced stop inventories. These results demonstrate the feasibility of user-based accounts of phonological facts and indicate the nature of the constraints that over time might shape the formation of both the formal structure and the intrinsic content of sound patterns. While phonetic factors are commonly invoked to account for substantive aspects of phonology, their explanatory scope is here also extended to a fundamental attribute of its formal organization: the combinatorial re-use of phonetic content.

  • 26.
    Lopes, José
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Abad, A.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Batista, F.
    Meena, Raveesh
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Trancoso, I.
    Detecting Repetitions in Spoken Dialogue Systems Using Phonetic Distances, 2015. In: INTERSPEECH-2015, 2015, p. 1805-1809. Conference paper (Refereed)
    Abstract [en]

    Repetitions in Spoken Dialogue Systems can be a symptom of problematic communication. Such repetitions are often due to speech recognition errors, which in turn makes it harder to use the output of the speech recognizer to detect repetitions. In this paper, we combine the alignment score obtained using phonetic distances with dialogue-related features to improve repetition detection. To evaluate the proposed method we compare several alignment techniques, from edit distance to DTW-based distance, previously used in Spoken-Term detection tasks. We also compare two different methods to compute the phonetic distance: the first one using the phoneme sequence, and the second one using the distance between the phone posterior vectors. Two different datasets were used in this evaluation: a bus-schedule information system (in English) and a call routing system (in Swedish). The results show that approaches using phoneme distances outperform approaches using Levenshtein distances between ASR outputs for repetition detection.
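
    As a minimal illustration of the simplest alignment technique mentioned in the abstract (a sketch, not the paper's implementation), the code below computes a length-normalised Levenshtein distance between two phoneme sequences; the phoneme strings and the idea of reading a small score as a likely repetition are illustrative assumptions.

```python
"""Sketch only: normalised edit distance between two phoneme sequences."""
def levenshtein(a, b):
    """Edit distance between two sequences of phoneme symbols."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        curr = [i]
        for j, pb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (pa != pb)))   # substitution
        prev = curr
    return prev[-1]

hyp1 = ["s", "e", "k", "s"]      # ASR phonemes for the first user turn (made up)
hyp2 = ["s", "E", "k", "s"]      # ASR phonemes for the possibly repeated turn (made up)
dist = levenshtein(hyp1, hyp2) / max(len(hyp1), len(hyp2))
print("normalised phonetic distance:", dist)   # a small value suggests a repetition
```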

  • 27.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Semi-supervised methods for exploring the acoustics of simple productive feedback, 2013. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 55, no 3, p. 451-469. Article in journal (Refereed)
    Abstract [en]

    This paper proposes methods for exploring acoustic correlates to feedback functions. A sub-language of Swedish, simple productive feedback, is introduced to facilitate investigations of the functional contributions of base tokens, phonological operations and prosody. The function of feedback is to convey the listeners' attention, understanding and affective states. In order to handle the large number of possible affective states, the current study starts by performing a listening experiment where humans annotated the functional similarity of feedback tokens with different prosodic realizations. By selecting a set of stimuli that had different prosodic distances from a reference token, it was possible to compute a generalised functional distance measure. The resulting generalised functional distance measure was shown to be correlated with prosodic distance, but the correlations varied as a function of base tokens and phonological operations. In a subsequent listening test, a small representative sample of feedback tokens was rated for understanding, agreement, interest, surprise and certainty. These ratings were found to explain a significant proportion of the generalised functional distance. By combining the acoustic analysis with an explorative visualisation of the prosody, we have established a map between human perception of similarity between feedback tokens, their measured distance in acoustic space, and the link to the perception of the function of feedback tokens with varying realisations.

  • 28.
    Oertel, Catharine
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A Gaze-based Method for Relating Group Involvement to Individual Engagement in Multimodal Multiparty Dialogue, 2013. In: ICMI 2013 - Proceedings of the 2013 ACM International Conference on Multimodal Interaction, Association for Computing Machinery (ACM), 2013, p. 99-106. Conference paper (Refereed)
    Abstract [en]

    This paper is concerned with modelling individual engagement and group involvement as well as their relationship in an eight-party, multimodal corpus. We propose a number of features (presence, entropy, symmetry and maxgaze) that summarise different aspects of eye-gaze patterns and allow us to describe individual as well as group behaviour in time. We use these features to define similarities between the subjects and we compare this information with the engagement rankings the subjects expressed at the end of each interaction about themselves and the other participants. We analyse how these features relate to four classes of group involvement and we build a classifier that is able to distinguish between those classes with 71% accuracy.

  • 29.
    Oertel, Catharine
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Götze, Jana
    KTH, School of Computer Science and Communication (CSC), Theoretical Computer Science, TCS.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    The KTH Games Corpora: How to Catch a Werewolf, 2013. In: IVA 2013 Workshop Multimodal Corpora: Beyond Audio and Video: MMC 2013, 2013. Conference paper (Refereed)
  • 30. Pieropan, Alessandro
    et al.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Pauwels, Karl
    Kjellström, Hedvig
    A dataset of human manipulation actions, 2014. In: ICRA 2014 Workshop on Autonomous Grasping and Manipulation: An Open Challenge, 2014, Hong Kong, China, 2014. Conference paper (Refereed)
    Abstract [en]

    We present a data set of human activities that includes both visual data (RGB-D video and six Degrees Of Freedom (DOF) object pose estimation) and acoustic data. Our vision is that robots need to merge information from multiple perceptional modalities to operate robustly and autonomously in an unstructured environment.

  • 31.
    Pieropan, Alessandro
    et al.
    KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Autonomous Systems, CAS.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Pauwels, Karl
    Universidad de Granada, Spain.
    Kjellström, Hedvig
    KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Autonomous Systems, CAS.
    Audio-Visual Classification and Detection of Human Manipulation Actions, 2014. In: 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2014), IEEE conference proceedings, 2014, p. 3045-3052. Conference paper (Refereed)
    Abstract [en]

    Humans are able to merge information from multiple perceptional modalities and formulate a coherent representation of the world. Our thesis is that robots need to do the same in order to operate robustly and autonomously in an unstructured environment. It has also been shown in several fields that multiple sources of information can complement each other, overcoming the limitations of a single perceptual modality. Hence, in this paper we introduce a data set of actions that includes both visual data (RGB-D video and 6DOF object pose estimation) and acoustic data. We also propose a method for recognizing and segmenting actions from continuous audio-visual data. The proposed method is employed for extensive evaluation of the descriptive power of the two modalities, and we discuss how they can be used jointly to infer a coherent interpretation of the recorded action.

  • 32.
    Salvi, Giampiero
    KTH, Superseded Departments, Speech, Music and Hearing.
    Accent clustering in Swedish using the Bhattacharyya distance, 2003. In: Proceedings of the 15th International Congress of Phonetic Sciences (ICPhS), Barcelona, Spain, 2003, p. 1149-1152. Conference paper (Refereed)
    Abstract [en]

    In an attempt to improve automatic speech recognition (ASR) models for Swedish, accent variations were considered. These have proved to be important variables in the statistical distribution of the acoustic features usually employed in ASR. The analysis of feature variability has revealed phenomena that are consistent with what is known from phonetic investigations, suggesting that a consistent part of the information about accents could be derived from those features. A graphical interface has been developed to simplify the visualization of the geographical distributions of these phenomena.
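
    For reference, the Bhattacharyya distance named in the title can be computed for two Gaussian densities N(mu1, cov1) and N(mu2, cov2) as in the sketch below; this is the standard textbook formula, not code from the paper, and the example means and covariances are arbitrary.

```python
"""Sketch only: Bhattacharyya distance between two multivariate Gaussians."""
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

mu_a, cov_a = np.array([0.0, 0.0]), np.eye(2)            # toy accent model A
mu_b, cov_b = np.array([1.0, 0.5]), 1.5 * np.eye(2)      # toy accent model B
print(bhattacharyya(mu_a, cov_a, mu_b, cov_b))
```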

  • 33.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Advances in regional accent clustering in Swedish, 2005. In: Proceedings of European Conference on Speech Communication and Technology (Eurospeech), 2005, p. 2841-2844. Conference paper (Refereed)
    Abstract [en]

    The regional pronunciation variation in Swedish is analysed on a large database. Statistics over each phoneme and for each region of Sweden are computed using the EM algorithm in a hidden Markov model framework to overcome the difficulties of transcribing the whole set of data at the phonetic level. The model representations obtained this way are compared using a distance measure in the space spanned by the model parameters, and hierarchical clustering. The regional variants of each phoneme may group with those of any other phoneme, on the basis of their acoustic properties. The log likelihood of the data given the model is shown to display interesting properties regarding the choice of number of clusters, given a particular level of detail. Discriminative analysis is used to find the parameters that most contribute to the separation between groups, adding an interpretative value to the discussion. Finally a number of examples are given on some of the phenomena that are revealed by examining the clustering tree.

  • 34.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    An Analysis of Shallow and Deep Representations of Speech Based on Unsupervised Classification of Isolated Words, 2016. In: Recent Advances in Nonlinear Speech Processing, Springer, 2016, Vol. 48, p. 151-157. Conference paper (Refereed)
    Abstract [en]

    We analyse the properties of shallow and deep representations of speech. Mel frequency cepstral coefficients (MFCC) are compared to representations learned by a four layer Deep Belief Network (DBN) in terms of discriminative power and invariance to irrelevant factors such as speaker identity or gender. To avoid the influence of supervised statistical modelling, an unsupervised isolated word classification task is used for the comparison. The deep representations are also obtained with unsupervised training (no back-propagation pass is performed). The results show that DBN features provide a more concise clustering and higher match between clusters and word categories in terms of adjusted Rand score. Some of the confusions present with the MFCC features are, however, retained even with the DBN features.
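
    The adjusted Rand score used for the comparison can be computed with scikit-learn as in this toy sketch (not the paper's experiment; the label vectors are placeholders standing in for the two feature sets' clusterings).

```python
"""Sketch only: scoring cluster-to-word-category match with the adjusted Rand index."""
from sklearn.metrics import adjusted_rand_score

true_words   = ["one", "one", "two", "two", "three", "three"]
mfcc_cluster = [0, 1, 1, 1, 2, 2]      # e.g. clusters found on MFCC features (toy)
dbn_cluster  = [0, 0, 1, 1, 2, 2]      # e.g. clusters found on DBN features (toy)

print("ARI (MFCC):", adjusted_rand_score(true_words, mfcc_cluster))
print("ARI (DBN): ", adjusted_rand_score(true_words, dbn_cluster))
```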

  • 35.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Biologically Inspired Methods for Automatic Speech Understanding, 2013. In: Biologically Inspired Cognitive Architectures 2012 / [ed] Chella, A; Pirrone, R; Sorbello, R; Johannsdottir, KR, Springer, 2013, Vol. 196, p. 283-286. Conference paper (Refereed)
    Abstract [en]

    Automatic Speech Recognition (ASR) and Understanding (ASU) systems heavily rely on machine learning techniques to solve the problem of mapping spoken utterances into words and meanings. The statistical methods employed, however, greatly deviate from the processes involved in human language acquisition in a number of key aspects. Although ASR and ASU have recently reached a level of accuracy that is sufficient for some practical applications, there are still severe limitations due, for example, to the amount of training data required and the lack of generalization of the resulting models. In our opinion, there is a need for a paradigm shift and speech technology should address some of the challenges that humans face when learning a first language and that are currently ignored by the ASR and ASU methods. In this paper, we point out some of the aspects that could lead to more robust and flexible models, and we describe some of the research we and other researchers have performed in the area.

  • 36.
    Salvi, Giampiero
    KTH, Superseded Departments, Speech, Music and Hearing.
    Developing acoustic models for automatic speech recognition in Swedish, 1999. In: The European Student Journal of Language and Speech, Vol. 1. Article in journal (Refereed)
    Abstract [en]

    This thesis is concerned with automatic continuous speech recognition using trainable systems. The aim of this work is to build acoustic models for spoken Swedish. This is done employing hidden Markov models and using the SpeechDat database to train their parameters. Acoustic modeling has been worked out at a phonetic level, allowing general speech recognition applications, even though a simplified task (digits and natural number recognition) has been considered for model evaluation. Different kinds of phone models have been tested, including context independent models and two variations of context dependent models. Furthermore many experiments have been done with bigram language models to tune some of the system parameters. System performance over various speaker subsets with different sex, age and dialect has also been examined. Results are compared to previous similar studies showing a remarkable improvement.

  • 37.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Dynamic behaviour of connectionist speech recognition with strong latency constraints, 2006. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 48, no 7, p. 802-818. Article in journal (Refereed)
    Abstract [en]

    This paper describes the use of connectionist techniques in phonetic speech recognition with strong latency constraints. The constraints are imposed by the task of deriving the lip movements of a synthetic face in real time from the speech signal, by feeding the phonetic string into an articulatory synthesiser. Particular attention has been paid to analysing the interaction between the time evolution model learnt by the multi-layer perceptrons and the transition model imposed by the Viterbi decoder, in different latency conditions. Two experiments were conducted in which the time dependencies in the language model (LM) were controlled by a parameter. The results show a strong interaction between the three factors involved, namely the neural network topology, the length of time dependencies in the LM and the decoder latency.

  • 38.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Ecological language acquisition via incremental model-based clustering, 2005. In: Proceedings of European Conference on Speech Communication and Technology (Eurospeech), Springer, 2005, p. 1181-1184. Conference paper (Refereed)
    Abstract [en]

    We analyse the behaviour of Incremental Model-Based Clustering on child-directed speech data, and suggest a possible use of this method to describe the acquisition of phonetic classes by an infant. The effects of two factors are analysed, namely the number of coefficients describing the speech signal, and the frame length of the incremental clustering procedure. The results show that, although the number of predicted clusters varies in different conditions, the classifications obtained are essentially consistent. Different classifications were compared using the variation of information measure.
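
    The variation of information used to compare classifications can be computed from entropies and mutual information; the sketch below (not the paper's code, with toy label vectors) shows one way to do it with scikit-learn.

```python
"""Sketch only: variation of information VI(A, B) = H(A) + H(B) - 2 I(A; B), in nats."""
from sklearn.metrics import mutual_info_score

def variation_of_information(labels_a, labels_b):
    h_a = mutual_info_score(labels_a, labels_a)   # mutual information with itself = entropy
    h_b = mutual_info_score(labels_b, labels_b)
    return h_a + h_b - 2.0 * mutual_info_score(labels_a, labels_b)

clustering_1 = [0, 0, 1, 1, 2, 2]    # toy classification from one condition
clustering_2 = [0, 0, 0, 1, 1, 1]    # toy classification from another condition
print(variation_of_information(clustering_1, clustering_2))
```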

  • 39.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Mining Speech Sounds: Machine Learning Methods for Automatic Speech Recognition and Analysis, 2006. Doctoral thesis, comprehensive summary (Other scientific)
    Abstract [en]

    This thesis collects studies on machine learning methods applied to speech technology and speech research problems. The six research papers included in this thesis are organised in three main areas.

    The first group of studies was carried out within the European project Synface. The aim was to develop a low latency phonetic recogniser to drive the articulatory movements of a computer generated virtual face from the acoustic speech signal. The visual information provided by the face is used as a hearing aid for persons using the telephone.

    Paper A compares two solutions to the problem of mapping acoustic to visual information that are based on regression and classification techniques. Recurrent Neural Networks are used to perform regression while Hidden Markov Models are used for the classification task. In the second case the visual information needed to drive the synthetic face is obtained by interpolation between target values for each acoustic class. The evaluation is based on listening tests with hearing impaired subjects where the intelligibility of sentence material is compared in different conditions: audio alone, audio and natural face, audio and synthetic face driven by the different methods.

    Paper B analyses the behaviour, in low latency conditions, of a phonetic recogniser based on a hybrid of Recurrent Neural Networks (RNNs) and Hidden Markov Models (HMMs). The focus is on the interaction between the time evolution model learnt by the RNNs and the one imposed by the HMMs.

    Paper C investigates the possibility of using the entropy of the posterior probabilities estimated by a phoneme classification neural network, as a feature for phonetic boundary detection. The entropy and its time evolution are analysed with respect to the identity of the phonetic segment and the distance from a reference phonetic boundary.

    In the second group of studies, the aim was to provide tools for analysing large amount of speech data in order to study geographical variations in pronunciation (accent analysis).

    Paper D and Paper E use Hidden Markov Models and Agglomerative Hierarchical Clustering to analyse a data set of about 100 million data points (5000 speakers, 270 hours of speech recordings). In Paper E, Linear Discriminant Analysis was used to determine the features that most concisely describe the groupings obtained with the clustering procedure.

    The third group comprises studies carried out within the international project MILLE (Modelling Language Learning), which aims at investigating and modelling the language acquisition process in infants.

    Paper F proposes the use of an incremental form of Model-Based Clustering to describe the unsupervised emergence of phonetic classes in the first stages of language acquisition. The experiments were carried out on child-directed speech expressly collected for the purposes of the project.

  • 40.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Segment boundaries in low latency phonetic recognition2005In: NONLINEAR ANALYSES AND ALGORITHMS FOR SPEECH PROCESSING / [ed] Faundez Zanuy M; Janer L; Esposito A; Satue Villar A; Roure J; Espinosa Duro V, 2005, Vol. 3817, p. 267-276Conference paper (Refereed)
    Abstract [en]

    The segment boundaries produced by the Synface low latency phoneme recogniser are analysed. The precision in placing the boundaries is an important factor in the Synface system, as the aim is to drive the lip movements of a synthetic face for lip-reading support. The recogniser is based on a hybrid of recurrent neural networks and hidden Markov models. In this paper we analyse how the look-ahead length in the Viterbi-like decoder affects the precision of boundary placement. The properties of the entropy of the posterior probabilities estimated by the neural network are also investigated in relation to the distance of the frame from a phonetic transition.

  • 41.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Segment boundary detection via class entropy measurements in connectionist phoneme recognition2006In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 48, no 12, p. 1666-1676Article in journal (Refereed)
    Abstract [en]

    This article investigates the possibility of using the class entropy of the output of a connectionist phoneme recogniser to predict time boundaries between phonetic classes. The rationale is that, since entropy is a measure of uncertainty, its value should increase in the proximity of a transition between two segments that are well modelled (known) by the recognition network. The advantage of this measure is its simplicity, as the posterior probabilities of each class are readily available in connectionist phoneme recognition. The entropy and a number of measures based on its differentiation are used in isolation and in combination. The decision methods for predicting the boundaries range from simple thresholds to a neural-network-based procedure. The different methods are compared with respect to their precision, measured as the ratio between the number C of predicted boundaries within 10 or 20 ms of the reference and the total number of predicted boundaries, and recall, measured as the ratio between C and the total number of reference boundaries.
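    As a rough illustration of the idea in the abstract above, the sketch below computes the per-frame class entropy of a matrix of posterior probabilities and applies the simplest decision method, a fixed threshold; the variable names and toy posteriors are assumptions, not the article's data or code.

        import numpy as np

        def class_entropy(posteriors):
            """Per-frame entropy (in nats) of a (frames x classes) posterior matrix."""
            p = np.clip(posteriors, 1e-12, 1.0)          # avoid log(0)
            return -(p * np.log(p)).sum(axis=1)

        def detect_boundaries(posteriors, threshold=0.9):
            """Frame indices where the entropy crosses the threshold upwards."""
            above = class_entropy(posteriors) > threshold
            return np.where(above[1:] & ~above[:-1])[0] + 1

        # Toy 3-class posteriors that become uncertain around the transition at frame 2.
        post = np.array([[0.90, 0.05, 0.05],
                         [0.90, 0.05, 0.05],
                         [0.50, 0.40, 0.10],
                         [0.34, 0.33, 0.33],
                         [0.05, 0.90, 0.05]])
        print(detect_boundaries(post))   # -> [2]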

  • 42.
    Salvi, Giampiero
    KTH, Superseded Departments, Speech, Music and Hearing.
    Truncation error and dynamics in very low latency phonetic recognition2003In: Proceedings of Non Linear Speech Processing (NOLISP), 2003Conference paper (Refereed)
    Abstract [en]

    The truncation error for a two-pass decoder is analyzed in a problem of phonetic speech recognition for very demanding latency constraints (look-ahead length < 100 ms) and for applications where successive refinements of the hypotheses are not allowed. This is done empirically in the framework of hybrid MLP/HMM models. The ability of recurrent MLPs, as a posteriori probability estimators, to model time variations is also considered, and its interaction with the dynamic modeling in the decoding phase is shown in the simulations.

  • 43.
    Salvi, Giampiero
    KTH, Superseded Departments, Speech, Music and Hearing.
    Using accent information in ASR models for Swedish2003In: Proceedings of INTERSPEECH'2003, 2003, p. 2677-2680Conference paper (Refereed)
    Abstract [en]

    In this study accent information is used in an attempt to improve acoustic models for automatic speech recognition (ASR). First, accent-dependent Gaussian models were trained independently. The Bhattacharyya distance was then used in conjunction with agglomerative hierarchical clustering to define optimal strategies for merging those models. The resulting allophonic classes were analyzed and compared with the phonetic literature. Finally, accent "aware" models were built, in which the parametric complexity for each phoneme corresponds to the degree of variability across accent areas and to the amount of training data available for it. These models were compared to models with the same, but evenly spread, overall complexity, showing in some cases a slight improvement in recognition accuracy.
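    The merging strategy described above can be illustrated with a small sketch (not the paper's implementation): compute Bhattacharyya distances between hypothetical diagonal-covariance, accent-dependent Gaussians for one phoneme and feed them to agglomerative hierarchical clustering; the random means below stand in for trained models.

        import numpy as np
        from scipy.cluster.hierarchy import linkage, fcluster
        from scipy.spatial.distance import squareform

        def bhattacharyya_gauss(mu1, var1, mu2, var2):
            """Bhattacharyya distance between two diagonal-covariance Gaussians."""
            var_avg = (var1 + var2) / 2.0
            term_mean = 0.125 * np.sum((mu1 - mu2) ** 2 / var_avg)
            term_cov = 0.5 * (np.sum(np.log(var_avg))
                              - 0.5 * (np.sum(np.log(var1)) + np.sum(np.log(var2))))
            return term_mean + term_cov

        rng = np.random.default_rng(0)
        means = rng.normal(size=(5, 13))     # 5 accent areas, 13 cepstral features
        variances = np.ones((5, 13))

        dist = np.zeros((5, 5))
        for i in range(5):
            for j in range(i + 1, 5):
                dist[i, j] = dist[j, i] = bhattacharyya_gauss(means[i], variances[i],
                                                              means[j], variances[j])

        merge_tree = linkage(squareform(dist), method='average')
        print(fcluster(merge_tree, t=2, criterion='maxclust'))   # two allophonic groups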

  • 44.
    Salvi, Giampiero
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Al Moubayed, Samer
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Spoken Language Identification using Frame Based Entropy Measures2011In: TMH-QPSR, ISSN 1104-5787, Vol. 51, no 1, p. 69-72Article in journal (Other academic)
    Abstract [en]

    This paper presents a real-time method for Spoken Language Identification based on the entropy of the posterior probabilities of language-specific phoneme recognisers. Entropy-based discriminant functions computed on short speech segments are used to compare the model fit to a specific set of observations, and language identification is performed as a model selection task. The experiments, performed on a closed set of four Germanic languages using the SpeechDat telephone speech recordings, give 95% accuracy for 10-second-long speech utterances and 99% accuracy for 20-second-long utterances.
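    A minimal sketch of the model-selection idea, under assumed interfaces: each language-specific recogniser yields a matrix of posterior probabilities for the same audio, and the language whose posteriors have the lowest average entropy (best model fit) is selected. The toy posteriors below are invented.

        import numpy as np

        def mean_entropy(posteriors):
            """Average per-frame entropy of a (frames x phonemes) posterior matrix."""
            p = np.clip(posteriors, 1e-12, 1.0)
            return float((-(p * np.log(p)).sum(axis=1)).mean())

        def identify_language(posteriors_per_language):
            """Map language -> posterior matrix; return the language with lowest entropy."""
            scores = {lang: mean_entropy(post)
                      for lang, post in posteriors_per_language.items()}
            return min(scores, key=scores.get)

        # Toy case: the Swedish recogniser is confident, the German one is not.
        obs = {'sv': np.array([[0.90, 0.05, 0.05], [0.85, 0.10, 0.05]]),
               'de': np.array([[0.40, 0.30, 0.30], [0.35, 0.35, 0.30]])}
        print(identify_language(obs))   # -> 'sv'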

  • 45.
    Salvi, Giampiero
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Al Moubayed, Samer
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    SynFace - Speech-Driven Facial Animation for Virtual Speech-Reading Support2009In: Eurasip Journal on Audio, Speech, and Music Processing, ISSN 1687-4714, Vol. 2009, p. 191940-Article in journal (Refereed)
    Abstract [en]

    This paper describes SynFace, a supportive technology that aims at enhancing audio-based spoken communication in adverse acoustic conditions by providing the missing visual information in the form of an animated talking head. Firstly, we describe the system architecture, consisting of a 3D animated face model controlled from the speech input by a specifically optimised phonetic recogniser. Secondly, we report on speech intelligibility experiments with focus on multilinguality and robustness to audio quality. The system, already available for Swedish, English, and Flemish, was optimised for German and for Swedish wide-band speech quality available in TV, radio, and Internet communication. Lastly, the paper covers experiments with nonverbal motions driven from the speech signal. It is shown that turn-taking gestures can be used to affect the flow of human-human dialogues. We have focused specifically on two categories of cues that may be extracted from the acoustic signal: prominence/emphasis and interactional cues (turn-taking/back-channelling).

  • 46.
    Salvi, Giampiero
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Montesano, Luis
    Bernardino, Alexandre
    Santos-Victor, José
    Language Bootstrapping: Learning Word Meanings From Perception-Action Association2012In: IEEE transactions on systems, man and cybernetics. Part B. Cybernetics, ISSN 1083-4419, E-ISSN 1941-0492, Vol. 42, no 3, p. 660-671Article in journal (Refereed)
    Abstract [en]

    We address the problem of bootstrapping language acquisition for an artificial system, similarly to what is observed in experiments with human infants. Our method works by associating meanings to words in manipulation tasks, as a robot interacts with objects and listens to verbal descriptions of the interactions. The model is based on an affordance network, i.e., a mapping between robot actions, robot perceptions and the perceived effects of these actions upon objects. We extend the affordance model to incorporate spoken words, which allows us to ground the verbal symbols to the execution of actions and the perception of the environment. The model takes verbal descriptions of a task as input and uses temporal co-occurrence to create links between speech utterances and the involved objects, actions and effects. We show that the robot is able to form useful word-to-meaning associations, even without considering grammatical structure in the learning process and in the presence of recognition errors. These word-to-meaning associations are embedded in the robot's own understanding of its actions. Thus, they can be used directly to instruct the robot to perform tasks and also make it possible to incorporate context in the speech recognition task. We believe that the encouraging results obtained with our approach may afford robots the capacity to acquire language descriptors in their operating environment, as well as shed some light on how this challenging process develops in human infants.
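    The temporal co-occurrence step can be illustrated with a toy sketch (not the paper's affordance-network model; the episodes and symbols are invented): count how often each word is heard while a given action, object or effect symbol is perceived, and keep the strongest normalised associations.

        from collections import defaultdict

        def learn_associations(episodes, min_score=0.6):
            """episodes: list of (spoken_words, perceived_symbols) pairs."""
            word_counts = defaultdict(int)
            cooc = defaultdict(lambda: defaultdict(int))
            for words, symbols in episodes:
                for w in words:
                    word_counts[w] += 1
                    for s in symbols:
                        cooc[w][s] += 1
            # keep symbols with high P(symbol | word)
            return {w: {s: c / word_counts[w] for s, c in sym.items()
                        if c / word_counts[w] >= min_score}
                    for w, sym in cooc.items()}

        episodes = [
            (["robot", "grasps", "the", "ball"], ["action:grasp", "object:ball", "effect:moved"]),
            (["grasps", "the", "cube"],          ["action:grasp", "object:cube", "effect:moved"]),
            (["taps", "the", "ball"],            ["action:tap",   "object:ball", "effect:rolled"]),
        ]
        print(learn_associations(episodes)["grasps"])   # keeps action:grasp and effect:moved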

  • 47.
    Salvi, Giampiero
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Tesser, Fabio
    Zovato, Enrico
    Cosi, Piero
    Analisi Gerarchica degli Inviluppi Spettrali Differenziali di una Voce Emotiva [Hierarchical Analysis of the Differential Spectral Envelopes of an Emotional Voice]2011Conference paper (Refereed)
  • 48.
    Salvi, Giampiero
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Tesser, Fabio
    Zovato, Enrico
    Cosi, Piero
    Cluster Analysis of Differential Spectral Envelopes on Emotional Speech2010In: 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-4, 2010, p. 322-325Conference paper (Refereed)
    Abstract [en]

    This paper reports on the analysis of the spectral variation of emotional speech. Spectral envelopes of time-aligned speech frames are compared between emotionally neutral and active utterances. Statistics are computed over the resulting differential spectral envelopes for each phoneme. Finally, these statistics are classified using agglomerative hierarchical clustering and a measure of dissimilarity between statistical distributions, and the resulting clusters are analysed. The results show that there are systematic changes in spectral envelopes when going from neutral to sad or happy speech, and those changes depend on the valence of the emotional content (negative, positive) as well as on the phonetic properties of the sounds such as voicing and place of articulation.
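    As a rough sketch of what a differential spectral envelope is (details invented, not the paper's processing chain): take the difference of log-magnitude spectra between time-aligned neutral and emotional frames; statistics of such differences are then computed per phoneme and clustered.

        import numpy as np

        def log_spectrum(frame, n_fft=512):
            """Log-magnitude spectrum of one windowed speech frame."""
            spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=n_fft))
            return np.log(spec + 1e-12)

        def differential_envelope(neutral_frame, emotional_frame):
            """Per-bin spectral change when going from neutral to emotional speech."""
            return log_spectrum(emotional_frame) - log_spectrum(neutral_frame)

        # Toy frames standing in for time-aligned neutral/emotional audio.
        rng = np.random.default_rng(1)
        neutral = rng.normal(size=400)
        emotional = 1.5 * neutral + 0.1 * rng.normal(size=400)
        diff = differential_envelope(neutral, emotional)
        print(diff.shape, float(diff.mean()))   # per-phoneme statistics of these are clustered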

  • 49.
    Salvi, Giampiero
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Vanhainen, Niklas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    The WaveSurfer Automatic Speech Recognition Plugin2014In: LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2014, p. 3067-3071Conference paper (Refereed)
    Abstract [en]

    This paper presents a plugin that adds automatic speech recognition (ASR) functionality to the WaveSurfer sound manipulation and visualisation program. The plugin allows the user to run continuous speech recognition on spoken utterances, or to align an already available orthographic transcription to the spoken material. The plugin is distributed as free software and is based on free resources, namely the Julius speech recognition engine and a number of freely available ASR resources for different languages. Among these are the acoustic and language models we have created for Swedish using the NST database.

  • 50. Saponaro, G.
    et al.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Bernardino, A.
    Robot anticipation of human intentions through continuous gesture recognition2013In: Proceedings of the 2013 International Conference on Collaboration Technologies and Systems, CTS 2013, IEEE , 2013, p. 218-225Conference paper (Refereed)
    Abstract [en]

    In this paper, we propose a method to recognize human body movements and combine it with the contextual knowledge of human-robot collaboration scenarios provided by an object affordances framework that associates actions with their effects and the objects involved in them. The aim is to equip humanoid robots with action prediction capabilities, allowing them to anticipate effects as soon as a human partner starts performing a physical action, thus enabling interactions between human and robot to be fast and natural. We consider simple actions that characterize a human-robot collaboration scenario with objects being manipulated on a table: inspired by automatic speech recognition techniques, we train a statistical gesture model in order to recognize those physical gestures in real time. Analogies and differences between the two domains are discussed, highlighting the requirements of an automatic gesture recognizer for robots in order to perform robustly and in real time.
