kth.se Publications
Showing 1-50 of 86 hits.
  • 1.
    Abdelnour, Jerome
    et al.
    NECOTIS Dept. of Electrical and Computer Engineering, Sherbrooke University, Canada.
    Rouat, Jean
    NECOTIS Dept. of Electrical and Computer Engineering, Sherbrooke University, Canada.
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Speech, Music and Hearing, TMH. Department of Electronic Systems, Norwegian University of Science and Technology, Norway.
    NAAQA: A Neural Architecture for Acoustic Question Answering (2022). In: IEEE Transactions on Pattern Analysis and Machine Intelligence, ISSN 0162-8828, E-ISSN 1939-3539, pp. 1-12. Journal article (Refereed)
    Download full text (pdf)
  • 2. Adiban, M.
    et al.
    Safari, A.
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Speech, Music and Hearing, TMH.
    Step-GAN: A one-class anomaly detection model with applications to power system security (2021). In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, Institute of Electrical and Electronics Engineers (IEEE), 2021, pp. 2605-2609. Conference paper (Refereed)
    Abstract [en]

    Smart grid systems (SGSs), and in particular power systems, play a vital role in today's urban life. The security of these grids is now threatened by adversaries that use false data injection (FDI) to breach the availability, integrity, or confidentiality principles of the system. We propose a novel structure for the multi-generator generative adversarial network (GAN) to address the challenges of detecting adversarial attacks. We modify the GAN objective function and the training procedure for the malicious anomaly detection task. The model requires only normal operation data for training, making it cheaper to deploy and robust against unseen attacks. Moreover, the model operates on the raw input data, eliminating the need for feature extraction. We show that the model reduces the well-known mode collapse problem of GAN-based systems, has low computational complexity, and outperforms the baseline system (OCAN) by about 55% in terms of accuracy on a freely available cyber attack dataset.

  • 3. Adiban, Mohammad
    et al.
    Siniscalchi, Marco
    Stefanov, Kalin
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Speech, Music and Hearing, TMH, Speech Communication. Norwegian University of Science and Technology, Trondheim, Norway.
    Hierarchical Residual Learning Based Vector Quantized Variational Autoencoder for Image Reconstruction and Generation (2022). In: The 33rd British Machine Vision Conference Proceedings, 2022. Conference paper (Refereed)
    Abstract [en]

    We propose a multi-layer variational autoencoder method, which we call HR-VQVAE, that learns hierarchical discrete representations of the data. By utilizing a novel objective function, each layer in HR-VQVAE learns a discrete representation of the residual from previous layers through a vector quantized encoder. Furthermore, the representations at each layer are hierarchically linked to those at previous layers. We evaluate our method on the tasks of image reconstruction and generation. Experimental results demonstrate that the discrete representations learned by HR-VQVAE enable the decoder to reconstruct high-quality images with less distortion than the baseline methods, namely VQVAE and VQVAE-2. HR-VQVAE can also generate high-quality and diverse images that outperform state-of-the-art generative models, providing further verification of the efficiency of the learned representations. The hierarchical nature of HR-VQVAE i) reduces the decoding search time, making the method particularly suitable for high-load tasks, and ii) allows the codebook size to be increased without incurring the codebook collapse problem.

    Download full text (pdf)
  • 4.
    Adiban, Mohammad
    et al.
    NTNU, Dept. of Electronic Systems, Trondheim, Norway; Monash University, Dept. of Human Centred Computing, Melbourne, Australia.
    Siniscalchi, Sabato Marco
    NTNU, Dept. of Electronic Systems, Trondheim, Norway.
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Speech, Music and Hearing, TMH. NTNU, Dept. of Electronic Systems, Trondheim, Norway.
    A step-by-step training method for multi generator GANs with application to anomaly detection and cybersecurity (2023). In: Neurocomputing, ISSN 0925-2312, E-ISSN 1872-8286, Vol. 537, pp. 296-308. Journal article (Refereed)
    Abstract [en]

    Cyber attacks and anomaly detection are problems where the data is often highly unbalanced towards normal observations. Furthermore, the anomalies observed in real applications may be significantly different from the ones contained in the training data. It is, therefore, desirable to study methods that are able to detect anomalies only based on the distribution of the normal data. To address this problem, we propose a novel objective function for generative adversarial networks (GANs), referred to as STEP-GAN. STEP-GAN simulates the distribution of possible anomalies by learning a modified version of the distribution of the task-specific normal data. It leverages multiple generators in a step-by-step interaction with a discriminator in order to capture different modes in the data distribution. The discriminator is optimized to distinguish not only between normal data and anomalies but also between the different generators, thus encouraging each generator to model a different mode in the distribution. This considerably reduces the well-known mode collapse problem in GAN models. We tested our method in the areas of power systems and network traffic control systems (NTCSs) using two publicly available, highly imbalanced datasets: the ICS (Industrial Control System) security dataset and UNSW-NB15, respectively. In both application domains, STEP-GAN outperforms the state-of-the-art systems as well as the two baseline systems we implemented as a comparison. In order to assess the generality of our model, additional experiments were carried out on seven real-world numerical datasets for anomaly detection in a variety of domains. In all datasets, the number of normal samples is significantly larger than that of abnormal samples. Experimental results show that STEP-GAN outperforms several semi-supervised methods while being competitive with supervised methods.

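    A minimal PyTorch sketch of the multi-generator idea described in the abstract above: the discriminator outputs K+1 classes (normal data plus one class per generator), so each generator is pushed towards a different mode of the data distribution. All sizes, architectures and names (gens, disc, latent_dim) are illustrative assumptions, not the authors' implementation.

        import torch
        import torch.nn as nn

        K, latent_dim, data_dim = 3, 16, 32   # illustrative sizes

        # K small generators; the discriminator emits K+1 logits:
        # class 0 = real (normal) data, class k = samples from generator k.
        gens = nn.ModuleList(
            nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
            for _ in range(K)
        )
        disc = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, K + 1))
        ce = nn.CrossEntropyLoss()
        d_opt = torch.optim.Adam(disc.parameters(), lr=1e-4)
        g_opt = torch.optim.Adam(gens.parameters(), lr=1e-4)

        def d_step(real):
            # The discriminator separates normal data (label 0) from every generator (label k).
            d_opt.zero_grad()
            loss = ce(disc(real), torch.zeros(len(real), dtype=torch.long))
            for k, g in enumerate(gens, start=1):
                fake = g(torch.randn(len(real), latent_dim)).detach()
                loss = loss + ce(disc(fake), torch.full((len(real),), k, dtype=torch.long))
            loss.backward()
            d_opt.step()

        def g_step(n):
            # Each generator tries to pass as normal data (label 0).
            g_opt.zero_grad()
            loss = sum(ce(disc(g(torch.randn(n, latent_dim))),
                          torch.zeros(n, dtype=torch.long)) for g in gens)
            loss.backward()
            g_opt.step()

        # At test time, 1 - P(class 0 | x) can serve as an anomaly score.
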
  • 5.
    Adiban, Mohammad
    et al.
    Norwegian University of Science and Technology, Trondheim, Norway; Monash University, Melbourne, Australia.
    Stefanov, Kalin
    Monash University, Melbourne, Australia.
    Siniscalchi, Sabato Marco
    Norwegian University of Science and Technology, Trondheim, Norway.
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Speech, Music and Hearing, TMH. Norwegian University of Science and Technology, Trondheim, Norway.
    Hierarchical Residual Learning Based Vector Quantized Variational Autoencoder for Image Reconstruction and Generation (2022). In: BMVC 2022 - 33rd British Machine Vision Conference Proceedings, British Machine Vision Association (BMVA), 2022. Conference paper (Refereed)
    Abstract [en]

    We propose a multi-layer variational autoencoder method, which we call HR-VQVAE, that learns hierarchical discrete representations of the data. By utilizing a novel objective function, each layer in HR-VQVAE learns a discrete representation of the residual from previous layers through a vector quantized encoder. Furthermore, the representations at each layer are hierarchically linked to those at previous layers. We evaluate our method on the tasks of image reconstruction and generation. Experimental results demonstrate that the discrete representations learned by HR-VQVAE enable the decoder to reconstruct high-quality images with less distortion than the baseline methods, namely VQVAE and VQVAE-2. HR-VQVAE can also generate high-quality and diverse images that outperform state-of-the-art generative models, providing further verification of the efficiency of the learned representations. The hierarchical nature of HR-VQVAE i) reduces the decoding search time, making the method particularly suitable for high-load tasks, and ii) allows the codebook size to be increased without incurring the codebook collapse problem.

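    A toy sketch of the hierarchical residual quantization step described in the abstract above, assuming random codebooks as stand-ins for the learned ones (in the paper they are trained jointly with the encoder and decoder through the proposed objective); hr_quantize and all sizes are illustrative names, not the authors' code.

        import numpy as np

        rng = np.random.default_rng(0)
        D = 8                                                     # latent dimensionality
        codebooks = [rng.normal(size=(16, D)) for _ in range(3)]  # random stand-ins

        def hr_quantize(z, codebooks):
            # Layer 1 quantizes z itself; every later layer quantizes the residual
            # that the previous layers left unexplained (the core HR-VQVAE idea).
            recon, indices = np.zeros_like(z), []
            for cb in codebooks:
                residual = z - recon
                idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
                indices.append(idx)
                recon = recon + cb[idx]
            return indices, recon

        z = rng.normal(size=D)
        idx, z_hat = hr_quantize(z, codebooks)
        print(idx, np.linalg.norm(z - z_hat))   # error shrinks as layers are added
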
  • 6.
    Agelfors, Eva
    et al.
    KTH, Former Departments (before 2005), Speech Transmission and Music Acoustics.
    Beskow, Jonas
    Dahlquist, M
    Granström, Björn
    Lundeberg, M
    Salvi, Giampiero
    Spens, K-E
    Öhman, Tobias
    Two methods for Visual Parameter Extraction in the Teleface Project (1999). In: Proceedings of Fonetik, Gothenburg, Sweden, 1999. Conference paper (Other academic)
  • 7.
    Agelfors, Eva
    et al.
    KTH, Former Departments (before 2005), Speech, Music and Hearing.
    Beskow, Jonas
    KTH, Former Departments (before 2005), Speech, Music and Hearing.
    Dahlquist, Martin
    KTH, Former Departments (before 2005), Speech, Music and Hearing.
    Granström, Björn
    KTH, Former Departments (before 2005), Speech, Music and Hearing.
    Lundeberg, Magnus
    Salvi, Giampiero
    KTH, Former Departments (before 2005), Speech, Music and Hearing.
    Spens, Karl-Erik
    KTH, Former Departments (before 2005), Speech, Music and Hearing.
    Öhman, Tobias
    A synthetic face as a lip-reading support for hearing impaired telephone users - problems and positive results (1999). In: European audiology in 1999: proceedings of the 4th European Conference in Audiology, Oulu, Finland, June 6-10, 1999, 1999. Conference paper (Refereed)
  • 8.
    Agelfors, Eva
    et al.
    KTH, Former Departments (before 2005), Speech, Music and Hearing.
    Beskow, Jonas
    KTH, Former Departments (before 2005), Speech, Music and Hearing.
    Granström, Björn
    KTH, Former Departments (before 2005), Speech, Music and Hearing.
    Lundeberg, Magnus
    KTH, Former Departments (before 2005), Speech, Music and Hearing.
    Salvi, Giampiero
    KTH, Former Departments (before 2005), Speech, Music and Hearing.
    Spens, Karl-Erik
    KTH, Former Departments (before 2005), Speech, Music and Hearing.
    Öhman, Tobias
    KTH, Former Departments (before 2005), Speech, Music and Hearing.
    Synthetic visual speech driven from auditory speech (1999). In: Proceedings of Audio-Visual Speech Processing (AVSP'99), 1999. Conference paper (Refereed)
    Abstract [en]

    We have developed two different methods for using auditory, telephone speech to drive the movements of a synthetic face. In the first method, Hidden Markov Models (HMMs) were trained on a phonetically transcribed telephone speech database. The output of the HMMs was then fed into a rule-based visual speech synthesizer as a string of phonemes together with time labels. In the second method, Artificial Neural Networks (ANNs) were trained on the same database to map acoustic parameters directly to facial control parameters. These target parameter trajectories were generated by using phoneme strings from a database as input to the visual speech synthesis. The two methods were evaluated through audiovisual intelligibility tests with ten hearing impaired persons, and compared to “ideal” articulations (where no recognition was involved), a natural face, and to the intelligibility of the audio alone. It was found that the HMM method performs considerably better than the audio-alone condition (54% and 34% keywords correct, respectively), but not as well as the “ideal” articulating artificial face (64%). The intelligibility for the ANN method was 34% keywords correct.

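    A minimal sketch of the second (ANN) method described above: a small regression network mapping acoustic parameters directly to facial control parameters. The dimensions, architecture and the name train_step are illustrative assumptions; in the original work the training targets were trajectories produced by the rule-based visual synthesizer.

        import torch
        import torch.nn as nn

        n_acoustic, n_facial = 16, 10   # illustrative feature dimensions
        net = nn.Sequential(nn.Linear(n_acoustic, 64), nn.Tanh(), nn.Linear(64, n_facial))
        opt = torch.optim.Adam(net.parameters(), lr=1e-3)
        mse = nn.MSELoss()

        def train_step(acoustic_frames, facial_targets):
            # facial_targets: control-parameter trajectories generated by the
            # rule-based visual synthesizer from phoneme strings.
            opt.zero_grad()
            loss = mse(net(acoustic_frames), facial_targets)
            loss.backward()
            opt.step()
            return loss.item()

        print(train_step(torch.randn(32, n_acoustic), torch.randn(32, n_facial)))
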
  • 9.
    Agelfors, Eva
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Karlsson, Inger
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Kewley, Jo
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Thomas, Neil
    User evaluation of the SYNFACE talking head telephone (2006). In: Computers Helping People With Special Needs, Proceedings / [ed] Miesenberger, K; Klaus, J; Zagler, W; Karshmer, A, 2006, Vol. 4061, pp. 579-586. Conference paper (Refereed)
    Abstract [en]

    The talking-head telephone, Synface, is a lip-reading support for people with hearing impairment. It has been tested by 49 users with varying degrees of hearing impairment in the UK and Sweden, in lab and home environments. Synface was found to give support to the users, especially in perceiving numbers and addresses, and to be an enjoyable way to communicate. A majority deemed Synface to be a useful product.

  • 10.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    SynFace Phone Recognizer for Swedish Wideband and Narrowband Speech (2008). In: Proceedings of the Second Swedish Language Technology Conference (SLTC), Stockholm, Sweden, 2008, pp. 3-6. Conference paper (Other academic)
    Abstract [en]

    In this paper, we present new results and comparisons of the real-time lip-synchronized talking head SynFace on different Swedish databases and bandwidths. The work involves training SynFace on narrow-band telephone speech from the Swedish SpeechDat, and on the narrow-band and wide-band Speecon corpus. Auditory perceptual tests are being established for SynFace as an audio-visual hearing support for the hearing-impaired. Preliminary results show high recognition accuracy compared to other languages.

  • 11.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Öster, Anne-Marie
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    van Son, Nic
    Viataal, Nijmegen, The Netherlands.
    Ormel, Ellen
    Viataal, Nijmegen, The Netherlands.
    Herzke, Tobias
    HörTech gGmbH, Germany.
    Studies on Using the SynFace Talking Head for the Hearing Impaired (2009). In: Proceedings of Fonetik'09: The XXIIth Swedish Phonetics Conference, June 10-12, 2009 / [ed] Peter Branderud, Hartmut Traunmüller, Stockholm: Stockholm University, 2009, pp. 140-143. Conference paper (Other academic)
    Abstract [en]

    SynFace is a lip-synchronized talking agent which is optimized as a visual reading support for the hearing impaired. In this paper we present the large scale hearing impaired user studies carried out for three languages in the Hearing at Home project. The user tests focus on measuring the gain in Speech Reception Threshold in Noise and the effort scaling when using SynFace by hearing impaired people, where groups of hearing impaired subjects with different impairment levels from mild to severe and cochlear implants are tested. Preliminary analysis of the results does not show significant gain in SRT or in effort scaling. But looking at the large cross-subject variability in both tests, it is clear that many subjects benefit from SynFace, especially with speech in stereo babble noise.

  • 12.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Öster, Ann-Marie
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    van Son, Nic
    Ormel, Ellen
    Virtual Speech Reading Support for Hard of Hearing in a Domestic Multi-Media Setting (2009). In: INTERSPEECH 2009: 10th Annual Conference of the International Speech Communication Association 2009, Baixas: ISCA - International Speech Communication Association, 2009, pp. 1443-1446. Conference paper (Refereed)
    Abstract [en]

    In this paper we present recent results on the development of the SynFace lip synchronized talking head towards multilinguality, varying signal conditions and noise robustness in the Hearing at Home project. We then describe the large scale hearing impaired user studies carried out for three languages. The user tests focus on measuring the gain in Speech Reception Threshold in Noise when using SynFace, and on measuring the effort scaling when using SynFace by hearing impaired people. Preliminary analysis of the results does not show significant gain in SRT or in effort scaling. But looking at inter-subject variability, it is clear that many subjects benefit from SynFace, especially with speech in stereo babble noise.

  • 13.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Using Imitation to learn Infant-Adult Acoustic Mappings (2011). In: 12th Annual Conference of the International Speech Communication Association 2011 (INTERSPEECH 2011), Vols 1-5, ISCA, 2011, pp. 772-775. Conference paper (Refereed)
    Abstract [en]

    This paper discusses a model which conceptually demonstrates how infants could learn the normalization between infant-adult acoustics. The model proposes that the mapping can be inferred from the topological correspondences between the adult and infant acoustic spaces, which are clustered separately in an unsupervised manner. The model requires feedback from the adult in order to select the right topology for clustering, which is a crucial aspect of the model. The feedback is in terms of an overall rating of the imitation effort by the infant, rather than a frame-by-frame correspondence. Using synthetic, but continuous speech data, we demonstrate that clusters which have a good topological correspondence are perceived to be similar by a phonetically trained listener.

  • 14.
    Beskow, Jonas
    et al.
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Nordqvist, Peter
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Al Moubayed, Samer
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Herzke, Tobias
    Schulz, Arne
    Hearing at Home: Communication support in home environments for hearing impaired persons (2008). In: INTERSPEECH 2008: 9th Annual Conference of the International Speech Communication Association 2008, Baixas: ISCA - International Speech Communication Association, 2008, pp. 2203-2206. Conference paper (Refereed)
    Abstract [en]

    The Hearing at Home (HaH) project focuses on the needs of hearing-impaired people in home environments. The project is researching and developing an innovative media-center solution for hearing support, with several integrated features that support perception of speech and audio, such as individual loudness amplification, noise reduction, audio classification and event detection, and the possibility to display an animated talking head providing real-time speechreading support. In this paper we provide a brief project overview and then describe some recent results related to the audio classifier and the talking head. As the talking head expects clean speech input, an audio classifier has been developed for the task of classifying audio signals as clean speech, speech in noise or other. The mean accuracy of the classifier was 82%. The talking head (based on technology from the SynFace project) has been adapted for German, and a small speech-in-noise intelligibility experiment was conducted where sentence recognition rates increased from 3% to 17% when the talking head was present.

  • 15.
    Beskow, Jonas
    et al.
    KTH, Former Departments (before 2005), Speech, Music and Hearing.
    Karlsson, Inger
    KTH, Former Departments (before 2005), Speech, Music and Hearing.
    Kewley, J
    Salvi, Giampiero
    KTH, Former Departments (before 2005), Speech, Music and Hearing.
    SYNFACE - A talking head telephone for the hearing-impaired (2004). In: Computers Helping People With Special Needs: Proceedings / [ed] Miesenberger, K; Klaus, J; Zagler, W; Burger, D, Berlin: Springer, 2004, Vol. 3118, pp. 1178-1185. Conference paper (Refereed)
    Abstract [en]

    SYNFACE is a telephone aid for hearing-impaired people that shows the lip movements of the speaker at the other end of the line, synchronised with the speech. The SYNFACE system consists of a speech recogniser that recognises the incoming speech and a synthetic talking head. The output from the recogniser is used to control the articulatory movements of the synthetic head. SYNFACE prototype systems exist for three languages: Dutch, English and Swedish, and the first user trials have just started.

  • 16.
    Beskow, Jonas
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Speech, Music and Hearing, TMH.
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Speech, Music and Hearing, TMH.
    Al Moubayed, Samer
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Speech, Music and Hearing, TMH.
    SynFace - Verbal and Non-verbal Face Animation from Audio (2009). In: Auditory-Visual Speech Processing 2009, AVSP 2009, The International Society for Computers and Their Applications (ISCA), 2009. Conference paper (Refereed)
    Abstract [en]

    We give an overview of SynFace, a speech-driven face animation system originally developed for the needs of hard-of-hearing users of the telephone. For the 2009 LIPS challenge, SynFace includes not only articulatory motion but also non-verbal motion of gaze, eyebrows and head, triggered by detection of acoustic correlates of prominence and cues for interaction control. In perceptual evaluations, both verbal and non-verbal movements have been found to have positive impact on word recognition scores.

  • 17.
    Beskow, Jonas
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Al Moubayed, Samer
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    SynFace: Verbal and Non-verbal Face Animation from Audio (2009). In: Proceedings of The International Conference on Auditory-Visual Speech Processing AVSP'09 / [ed] Barry-John Theobald, Richard Harvey, Norwich, England, 2009. Conference paper (Refereed)
    Abstract [en]

    We give an overview of SynFace, a speech-driven face animation system originally developed for the needs of hard-of-hearing users of the telephone. For the 2009 LIPS challenge, SynFace includes not only articulatory motion but also non-verbal motion of gaze, eyebrows and head, triggered by detection of acoustic correlates of prominence and cues for interaction control. In perceptual evaluations, both verbal and non-verbal movements have been found to have positive impact on word recognition scores.

  • 18.
    Cao, Xinwei
    et al.
    Department of Electronic Systems, NTNU, Norway.
    Fan, Zijian
    Department of Electronic Systems, NTNU, Norway.
    Svendsen, Torbjørn
    Department of Electronic Systems, NTNU, Norway.
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Speech, Music and Hearing, TMH. Department of Electronic Systems, NTNU, Norway.
    An Analysis of Goodness of Pronunciation for Child Speech (2023). In: Interspeech 2023, International Speech Communication Association, 2023, pp. 4613-4617. Conference paper (Refereed)
    Abstract [en]

    In this paper, we study the use of goodness of pronunciation (GOP) on child speech. We first compare the distributions of GOP scores on several open datasets representing various dimensions of speech variability. We show that the GOP distribution over CMU Kids, corresponding to young age, has a larger spread than those on datasets representing other dimensions, i.e., accent, dialect, spontaneity and environmental conditions. We hypothesize that the increased variability of pronunciation at a young age may impair the use of traditional mispronunciation detection methods for children. To support this hypothesis, we perform simulated mispronunciation experiments both for children and adults using different variants of the GOP algorithm. We also compare the results to real-case mispronunciations for native children, showing that GOP is less effective for child speech than for adult speech.

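    For reference, one common formulation of the GOP score, assuming frame-level phone posteriors from the acoustic model are available; the paper compares several GOP variants, and this sketch shows only the basic duration-normalized log-posterior-ratio form (gop and all sizes are illustrative).

        import numpy as np

        def gop(posteriors, phone_idx):
            # posteriors: (T, n_phones) frame-level phone posteriors for the
            # frames aligned to the canonical phone; phone_idx: its index.
            logp = np.log(posteriors + 1e-10)
            canonical = logp[:, phone_idx].mean()
            best = logp.max(axis=1).mean()
            # 0 means the canonical phone wins every frame; more negative = worse.
            return canonical - best

        # Toy segment: 5 frames, 4 phones, canonical phone index 2.
        P = np.array([[0.1, 0.1, 0.7, 0.1]] * 5)
        print(gop(P, 2))   # 0.0: phone 2 is the top-scoring phone in every frame
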
  • 19. Castellana, Antonella
    et al.
    Selamtzis, Andreas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Carullo, Alessio
    Astolfi, Arianna
    Cepstral and entropy analyses in vowels excerpted from continuous speech of dysphonic and control speakers (2017). In: Proceedings of the Annual Conference of the International Speech Communication Association, Interspeech 2017 / [ed] ISCA, International Speech Communication Association, 2017, Vol. 2017, pp. 1814-1818. Conference paper (Refereed)
    Abstract [en]

    There is a growing interest in Cepstral and Entropy analyses of voice samples for defining a vocal health indicator, due to their reliability in investigating both regular and irregular voice signals. The purpose of this study is to determine whether the Cepstral Peak Prominence Smoothed (CPPS) and Sample Entropy (SampEn) could differentiate dysphonic speakers from normal speakers in vowels excerpted from readings, and to compare their discrimination power. Results are reported for 33 patients and 31 controls, who read a standardized phonetically balanced passage while wearing a head-mounted microphone. Vowels were excerpted from recordings using Automatic Speech Recognition and, after obtaining a measure for each vowel, individual distributions and their descriptive statistics were considered for CPPS and SampEn. The Receiver Operating Characteristic analysis revealed that the mean of the distributions was the parameter with the highest discrimination power for both CPPS and SampEn. CPPS showed a higher diagnostic precision than SampEn, exhibiting an Area Under Curve (AUC) of 0.85 compared to 0.72. A negative correlation between the parameters was found (Spearman's ρ = -0.61), with higher SampEn corresponding to lower CPPS. The automatic method used in this study could provide support to voice monitoring in the clinic and during individuals' daily activities.

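    A self-contained sketch of Sample Entropy in one common formulation (Chebyshev distance, tolerance given as a fraction of the signal's standard deviation); the parameters m and r are illustrative, not necessarily those used in the study.

        import numpy as np

        def sampen(x, m=2, r=0.2):
            # Sample entropy: -ln(A/B), where B counts template pairs of length m
            # within tolerance (Chebyshev distance) and A the same for length m+1.
            x = np.asarray(x, dtype=float)
            tol = r * x.std()
            def matches(mm):
                t = np.array([x[i:i + mm] for i in range(len(x) - mm + 1)])
                c = 0
                for i in range(len(t)):
                    d = np.abs(t - t[i]).max(axis=1)
                    c += int((d <= tol).sum()) - 1   # exclude the self-match
                return c
            B, A = matches(m), matches(m + 1)
            return -np.log(A / B) if A > 0 and B > 0 else np.inf

        print(sampen(np.sin(np.linspace(0, 20, 200))))             # regular: low
        print(sampen(np.random.default_rng(0).normal(size=200)))   # irregular: higher
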
  • 20.
    Fahlström Myrman, Arvid
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Partitioning of Posteriorgrams using Siamese Models for Unsupervised Acoustic Modelling (2017). In: Grounding Language Understanding, 2017. Conference paper (Refereed)
    Download full text (pdf)
  • 21. Getman, Yaroslav
    et al.
    Al-Ghezi, Ragheb
    Voskoboinik, Katja
    Grósz, Tamás
    Kurimo, Mikko
    Salvi, Giampiero
    Svendsen, Torbjørn
    Strömbergsson, Sofia
    wav2vec2-based Speech Rating System for Children with Speech Sound Disorder (2022). Conference paper (Refereed)
    Download full text (pdf)
  • 22.
    Getman, Yaroslav
    et al.
    Aalto University, Dept. of Information and Communications Engineering, Espoo 02150, Finland.
    Phan, Nhan
    Aalto University, Dept. of Information and Communications Engineering, Espoo 02150, Finland.
    Al-Ghezi, Ragheb
    Aalto University, Dept. of Information and Communications Engineering, Espoo 02150, Finland.
    Voskoboinik, Ekaterina
    Aalto University, Dept. of Information and Communications Engineering, Espoo 02150, Finland.
    Singh, Mittul
    Aalto University, Dept. of Information and Communications Engineering, Espoo 02150, Finland; Silo AI, Helsinki 00180, Finland.
    Grósz, Tamás
    Aalto University, Dept. of Information and Communications Engineering, Espoo 02150, Finland.
    Kurimo, Mikko
    Aalto University, Dept. of Information and Communications Engineering, Espoo 02150, Finland.
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Speech, Music and Hearing, TMH. Norwegian University of Science and Technology, Dept. of Signal Processing, N-7034 Trondheim, Norway; KTH Royal Institute of Technology, EECS, S-11428 Stockholm, Sweden.
    Svendsen, Torbjørn
    Norwegian University of Science and Technology, Dept. of Signal Processing, N-7034 Trondheim, Norway.
    Strömbergsson, Sofia
    Karolinska Institutet, Dept. of Clinical Science, Intervention and Technology, S-14152 Huddinge, Sweden.
    Smolander, Anna
    Tampere University, Faculty of Social Sciences, Logopedics, Welfare Sciences, Tampere 33100, Finland.
    Ylinen, Sari
    Tampere University, Faculty of Social Sciences, Logopedics, Welfare Sciences, Tampere 33100, Finland.
    Developing an AI-Assisted Low-Resource Spoken Language Learning App for Children (2023). In: IEEE Access, E-ISSN 2169-3536, Vol. 11, pp. 86025-86037. Journal article (Refereed)
    Abstract [en]

    Computer-assisted Language Learning (CALL) is a rapidly developing area accelerated by advancements in the field of AI. A well-designed and reliable CALL system allows students to practice language skills, like pronunciation, any time outside of the classroom. Furthermore, gamification via mobile applications has shown encouraging results on learning outcomes and motivates young users to practice more and perceive language learning as a positive experience. In this work, we adapt the latest speech recognition technology to be a part of an online pronunciation training system for small children. As part of our gamified mobile application, our models will assess the pronunciation quality of young Swedish children diagnosed with Speech Sound Disorder, and participating in speech therapy. Additionally, the models provide feedback to young non-native children learning to pronounce Swedish and Finnish words. Our experiments revealed that these new models fit into an online game as they function as speech recognizers and pronunciation evaluators simultaneously. To make our systems more trustworthy and explainable, we investigated whether the combination of modern input attribution algorithms and time-aligned transcripts can explain the decisions made by the models, give us insights into how the models work and provide a tool to develop more reliable solutions.

  • 23. Johansen, Finn Tore
    et al.
    Warakagoda, Narada
    Lindberg, Borge
    Lehtinen, Gunnar
    Kacic, Zdravko
    Zgank, Andrei
    Elenius, Kjell
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    The COST 249 SpeechDat multilingual reference recogniser (2000). Conference paper (Refereed)
    Abstract [en]

    The COST 249 SpeechDat reference recogniser is a fully automatic, language-independent training procedure for building a phonetic recogniser. It relies on the HTK toolkit and a SpeechDat(II) compatible database. The recogniser is designed to serve as a reference system in multilingual recognition research. This paper documents version 0.93 of the reference recogniser and presents results on small-vocabulary recognition for seven languages.

  • 25.
    Karlsson, Inger
    et al.
    KTH, Former Departments (before 2005), Speech, Music and Hearing.
    Faulkner, Andrew
    Salvi, Giampiero
    KTH, Former Departments (before 2005), Speech, Music and Hearing.
    SYNFACE - a talking face telephone (2003). In: Proceedings of EUROSPEECH 2003, 2003, pp. 1297-1300. Conference paper (Refereed)
    Abstract [en]

    The SYNFACE project has as its primary goal to make it easier for hearing-impaired people to use an ordinary telephone. This will be achieved by using a talking face connected to the telephone. The incoming speech signal will govern the speech movements of the talking face; hence the talking face will provide lip-reading support for the user. The project will define the visual speech information that supports lip-reading, and develop techniques to derive this information from the acoustic speech signal in near real time for three different languages: Dutch, English and Swedish. This requires the development of automatic speech recognition methods that detect information in the acoustic signal that correlates with the speech movements. This information will govern the speech movements in a synthetic face and synchronise them with the acoustic speech signal. A prototype system is being constructed. The prototype contains results achieved so far in SYNFACE. This system will be tested and evaluated for the three languages by hearing-impaired users. SYNFACE is an IST project (IST-2001-33327) with partners from the Netherlands, UK and Sweden. SYNFACE builds on experiences gained in the Swedish Teleface project.

  • 26.
    Koniaris, Christos
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Auditory and Dynamic Modeling Paradigms to Detect L2 Mispronunciations (2012). In: 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012, Vol. 1, 2012, pp. 898-901. Conference paper (Refereed)
    Abstract [en]

    This paper expands our previous work on automatic pronunciation error detection that exploits knowledge from psychoacoustic auditory models. The new system has two additional important features, i.e., auditory and acoustic processing of the temporal cues of the speech signal, and classification feedback from a trained linear dynamic model. We also perform a pronunciation analysis by considering the task as a classification problem. Finally, we evaluate the proposed methods by conducting a listening test on the same speech material and comparing the judgment of the listeners with that of the methods. The automatic analysis based on spectro-temporal cues is shown to have the best agreement with the human evaluation, particularly with that of language teachers, and with previous plenary linguistic studies.

  • 27.
    Koniaris, Christos
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Engwall, Olov
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Speech, Music and Hearing, TMH.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    On the Benefit of Using Auditory Modeling for Diagnostic Evaluation of Pronunciations (2012). In: International Symposium on Automatic Detection of Errors in Pronunciation Training (IS ADEPT), Stockholm, Sweden, June 6-8, 2012 / [ed] Olov Engwall, 2012, pp. 59-64. Conference paper (Refereed)
    Abstract [en]

    In this paper we demonstrate that a psychoacoustic model-based distance measure performs better than a speech signal distance measure in assessing the pronunciation of individual foreign speakers. The experiments show that the perceptually based method performs not only quantitatively better than a speech spectrum-based method, but also qualitatively better, hence showing that auditory information is beneficial in the task of pronunciation error detection. We first present the general approach of the method, which uses the dissimilarity between the native perceptual domain and the non-native speech power spectrum domain. The problematic phonemes for a given non-native speaker are determined by the degree of disparity between the dissimilarity measure for the non-native and a group of native speakers. The two methods compared here are applied to different groups of non-native speakers of various language backgrounds and validated against a theoretical linguistic study.

  • 28.
    Koniaris, Christos
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    On mispronunciation analysis of individual foreign speakers using auditory periphery models (2013). In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 55, no. 5, pp. 691-706. Journal article (Refereed)
    Abstract [en]

    In second language (L2) learning, a major difficulty is to discriminate between the acoustic diversity within an L2 phoneme category and that between different categories. We propose a general method for automatic diagnostic assessment of the pronunciation of nonnative speakers based on models of the human auditory periphery. Considering each phoneme class separately, the geometric shape similarity between the native auditory domain and the non-native speech domain is measured. The phonemes that deviate the most from the native pronunciation for a set of L2 speakers are detected by comparing the geometric shape similarity measure with that calculated for native speakers on the same phonemes. To evaluate the system, we have tested it with different non-native speaker groups from various language backgrounds. The experimental results are in accordance with linguistic findings and human listeners' ratings, particularly when both the spectral and temporal cues of the speech signal are utilized in the pronunciation analysis.

  • 29. Krunic, Verica
    et al.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Bernardino, Alexandre
    Montesano, Luis
    Santos-Victor, José
    Affordance based word-to-meaning association (2009). In: ICRA: 2009 IEEE International Conference on Robotics and Automation, VDE Verlag GmbH, 2009, pp. 4138-4143. Conference paper (Refereed)
    Abstract [en]

    This paper presents a method to associate meanings to words in manipulation tasks. We base our model on an affordance network, i.e., a mapping between robot actions, robot perceptions and the perceived effects of these actions upon objects. We extend the affordance model to incorporate words. Using verbal descriptions of a task, the model uses temporal co-occurrence to create links between speech utterances and the involved objects, actions and effects. We show that the robot is able to form useful word-to-meaning associations, even without considering grammatical structure in the learning process and in the presence of recognition errors. These word-to-meaning associations are embedded in the robot's own understanding of its actions. Thus they can be directly used to instruct the robot to perform tasks, and also make it possible to incorporate context in the speech recognition task.

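    The model itself is a Bayesian affordance network; the toy sketch below illustrates only the temporal co-occurrence intuition from the abstract, with hypothetical data and invented names (experiences, associations).

        from collections import Counter, defaultdict

        # Each experience pairs recognized words with an observed
        # (action, object, effect) triple from the affordance network.
        experiences = [
            ("the robot grasps the ball".split(), ("grasp", "ball", "moved")),
            ("the robot taps the box".split(),    ("tap", "box", "still")),
            ("grasps the ball again".split(),     ("grasp", "ball", "moved")),
        ]

        cooc = defaultdict(Counter)
        seen = Counter()
        for words, meaning in experiences:
            for w in set(words):              # pure co-occurrence, no grammar
                cooc[w][meaning] += 1
                seen[w] += 1

        def associations(word):
            # Relative co-occurrence frequency as a crude P(meaning | word).
            return {m: c / seen[word] for m, c in cooc[word].items()}

        print(associations("grasps"))   # dominated by ('grasp', 'ball', 'moved')
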
  • 30. Krunic, Verica
    et al.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Bernardino, Alexandre
    Montesano, Luis
    Santos-Victor, José
    Associating word descriptions to learned manipulation task models (2008). In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Nice, France, 2008. Conference paper (Refereed)
    Abstract [en]

    This paper presents a method to associate meanings to words in manipulation tasks. We base our model on an affordance network, i.e., a mapping between robot actions, robot perceptions and the perceived effects of these actions upon objects. This knowledge is acquired by the robot in an unsupervised way by self-interaction with the environment. When a human user is involved in the process and describes a particular task, the robot can form associations between the (co-occurrence of) speech utterances and the involved objects, actions and effects. We extend the affordance model to incorporate a simple description of speech as a set of words. We show that, across many experiences, the robot is able to form useful word-to-meaning associations, even without considering grammatical structure in the learning process and in the presence of recognition errors. Word-to-meaning associations are then used to instruct the robot to perform tasks, and also make it possible to incorporate context in the speech recognition task.

  • 31.
    Kumar Dhaka, Akash
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Sparse Autoencoder Based Semi-Supervised Learning for Phone Classification with Limited Annotations (2017). In: Grounding Language Understanding, 2017. Conference paper (Refereed)
    Download full text (pdf)
  • 32. Lindberg, Borge
    et al.
    Johansen, Finn Tore
    Warakagoda, Narada
    Lehtinen, Gunnar
    Kacic, Zdravko
    Zgank, Andrei
    Elenius, Kjell
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    A noise robust multilingual reference recogniser based on SpeechDat(II) (2000). Conference paper (Refereed)
    Abstract [en]

    An important aspect of noise robustness of automatic speech recognisers (ASR) is the proper handling of non-speech acoustic events. The present paper describes further improvements of an already existing reference recogniser towards achieving such kind of robustness. The reference recogniser applied is the COST 249 SpeechDat reference recogniser, which is a fully automatic, language-independent training procedure for building a phonetic recogniser (http://www.telenor.no/fou/prosjekter/taletek/refrec). The reference recogniser relies on the HTK toolkit and a SpeechDat(II) compatible database, and is designed to serve as a reference system in multilingual speech recognition research. The paper describes version 0.96 of the reference recogniser, which takes into account labelled non-speech acoustic events during training and provides robustness against these during testing. Results are presented on small and medium vocabulary recognition for six languages.

  • 33. Lindblom, Björn
    et al.
    Diehl, Randy
    Park, Sang-Hoon
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    (Re)use of place features in voiced stop systems: Role of phonetic constraints (2008). In: Proceedings of Fonetik 2008, University of Gothenburg, 2008, pp. 5-8. Conference paper (Other academic)
    Abstract [en]

    Computational experiments focused on place of articulation in voiced stops were designed to generate ‘optimal’ inventories of CV syllables from a larger set of ‘possible CV:s’ in the presence of independently and numerically defined articulatory, perceptual and developmental constraints. Across vowel contexts the most salient places were retroflex, palatal and uvular. This was evident from acoustic measurements and perceptual data. Simulation results using the criterion of perceptual contrast alone failed to produce systems with the typologically widely attested set [b] [d] [g], whereas using articulatory cost as the sole criterion produced inventories in which bilabial, dental/alveolar and velar onsets formed the core. Neither perceptual contrast, nor articulatory cost, (nor the two combined), produced a consistent re-use of place features (‘phonemic coding’). Only systems constrained by ‘target learning’ exhibited a strong recombination of place features.

  • 34. Lindblom, Björn
    et al.
    Diehl, Randy
    Park, Sang-Hoon
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Sound systems are shaped by their users: The recombination of phonetic substance (2011). In: Where Do Phonological Features Come From?: Cognitive, physical and developmental bases of distinctive speech categories / [ed] Clements, G. N.; Ridouane, R., John Benjamins Publishing Company, 2011, pp. 67-97. Book chapter (Other academic)
    Abstract [en]

    Computational experiments were run using an optimization criterion based on independently motivated definitions of perceptual contrast, articulatory cost and learning cost. The question: If stop+vowel inventories are seen as adaptations to perceptual, articulatory and developmental constraints what would they be like? Simulations successfully predicted typologically widely observed place preferences and the re-use of place features (‘phonemic coding’) in voiced stop inventories. These results demonstrate the feasibility of user-based accounts of phonological facts and indicate the nature of the constraints that over time might shape the formation of both the formal structure and the intrinsic content of sound patterns. While phonetic factors are commonly invoked to account for substantive aspects of phonology, their explanatory scope is here also extended to a fundamental attribute of its formal organization: the combinatorial re-use of phonetic content.

  • 35.
    Lopes, José
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Abad, A.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Batista, F.
    Meena, Raveesh
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Trancoso, I.
    Detecting Repetitions in Spoken Dialogue Systems Using Phonetic Distances (2015). In: INTERSPEECH-2015, 2015, pp. 1805-1809. Conference paper (Refereed)
    Abstract [en]

    Repetitions in Spoken Dialogue Systems can be a symptom of problematic communication. Such repetitions are often due to speech recognition errors, which in turn makes it harder to use the output of the speech recognizer to detect repetitions. In this paper, we combine the alignment score obtained using phonetic distances with dialogue-related features to improve repetition detection. To evaluate the method proposed we compare several alignment techniques, from edit distance to DTW-based distance, previously used in Spoken-Term detection tasks. We also compare two different methods to compute the phonetic distance: the first one using the phoneme sequence, and the second one using the distance between the phone posterior vectors. Two different datasets were used in this evaluation: a bus-schedule information system (in English) and a call routing system (in Swedish). The results show that approaches using phoneme distances outperform approaches using Levenshtein distances between ASR outputs for repetition detection.

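    A minimal sketch of the phoneme-sequence alignment idea from the abstract: a plain Levenshtein distance, length-normalized and thresholded to flag likely repetitions. The threshold and helper names are illustrative; the paper also studies DTW-based distances over phone posterior vectors.

        def edit_distance(a, b):
            # Plain Levenshtein distance between two phoneme sequences.
            dp = list(range(len(b) + 1))
            for i, pa in enumerate(a, 1):
                prev, dp[0] = dp[0], i
                for j, pb in enumerate(b, 1):
                    prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                             dp[j - 1] + 1,      # insertion
                                             prev + (pa != pb))  # substitution
            return dp[-1]

        def is_repetition(phones_a, phones_b, threshold=0.3):
            # Small normalized phonetic distance suggests the user repeated
            # (roughly) the same utterance.
            d = edit_distance(phones_a, phones_b) / max(len(phones_a), len(phones_b))
            return d <= threshold

        print(is_repetition("s t o k h o l m".split(), "s t o k o l m".split()))  # True
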
  • 36.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Semi-supervised methods for exploring the acoustics of simple productive feedback (2013). In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 55, no. 3, pp. 451-469. Journal article (Refereed)
    Abstract [en]

    This paper proposes methods for exploring acoustic correlates to feedback functions. A sub-language of Swedish, simple productive feedback, is introduced to facilitate investigations of the functional contributions of base tokens, phonological operations and prosody. The function of feedback is to convey the listeners' attention, understanding and affective states. In order to handle the large number of possible affective states, the current study starts by performing a listening experiment where humans annotated the functional similarity of feedback tokens with different prosodic realizations. By selecting a set of stimuli that had different prosodic distances from a reference token, it was possible to compute a generalised functional distance measure. The resulting generalised functional distance measure was shown to be correlated to prosodic distance, but the correlations varied as a function of base tokens and phonological operations. In a subsequent listening test, a small representative sample of feedback tokens was rated for understanding, agreement, interest, surprise and certainty. These ratings were found to explain a significant proportion of the generalised functional distance. By combining the acoustic analysis with an explorative visualisation of the prosody, we have established a map between human perception of similarity between feedback tokens, their measured distance in acoustic space, and the link to the perception of the function of feedback tokens with varying realisations.

  • 37.
    Oertel, Catharine
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    A Gaze-based Method for Relating Group Involvement to Individual Engagement in Multimodal Multiparty Dialogue (2013). In: ICMI 2013 - Proceedings of the 2013 ACM International Conference on Multimodal Interaction, Association for Computing Machinery (ACM), 2013, pp. 99-106. Conference paper (Refereed)
    Abstract [en]

    This paper is concerned with modelling individual engagement and group involvement, as well as their relationship, in an eight-party, multimodal corpus. We propose a number of features (presence, entropy, symmetry and maxgaze) that summarise different aspects of eye-gaze patterns and allow us to describe individual as well as group behaviour in time. We use these features to define similarities between the subjects, and we compare this information with the engagement rankings the subjects expressed at the end of each interaction about themselves and the other participants. We analyse how these features relate to four classes of group involvement and we build a classifier that is able to distinguish between those classes with 71% accuracy.

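    One plausible reading of the entropy feature named above, sketched as the Shannon entropy of the gaze-target distribution within a time window; the exact feature definition in the paper may differ, and gaze_entropy is an invented name.

        import numpy as np
        from collections import Counter

        def gaze_entropy(targets):
            # Low when everyone looks at the same person, high when gaze
            # is spread evenly across participants.
            counts = np.array(list(Counter(targets).values()), dtype=float)
            p = counts / counts.sum()
            return float(-(p * np.log2(p)).sum())

        print(gaze_entropy(["A", "A", "A", "B"]))   # ~0.81 bits, concentrated
        print(gaze_entropy(["A", "B", "C", "D"]))   # 2.0 bits, evenly spread
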
  • 38.
    Oertel, Catharine
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Götze, Jana
    KTH, School of Computer Science and Communication (CSC), Theoretical Computer Science, TCS.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    The KTH Games Corpora: How to Catch a Werewolf (2013). In: IVA 2013 Workshop Multimodal Corpora: Beyond Audio and Video: MMC 2013, 2013. Conference paper (Refereed)
  • 39. Pieropan, Alessandro
    et al.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Pauwels, Karl
    Kjellström, Hedvig
    A dataset of human manipulation actions (2014). In: ICRA 2014 Workshop on Autonomous Grasping and Manipulation: An Open Challenge, Hong Kong, China, 2014. Conference paper (Refereed)
    Abstract [en]

    We present a data set of human activities that includes both visual data (RGB-D video and six-degrees-of-freedom (6-DOF) object pose estimation) and acoustic data. Our vision is that robots need to merge information from multiple perceptual modalities to operate robustly and autonomously in an unstructured environment.

  • 40.
    Pieropan, Alessandro
    et al.
    KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Autonomous Systems, CAS.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Pauwels, Karl
    Universidad de Granada, Spain.
    Kjellström, Hedvig
    KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Autonomous Systems, CAS.
    Audio-Visual Classification and Detection of Human Manipulation Actions (2014). In: 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2014), IEEE conference proceedings, 2014, pp. 3045-3052. Conference paper (Refereed)
    Abstract [en]

    Humans are able to merge information from multiple perceptual modalities and formulate a coherent representation of the world. Our thesis is that robots need to do the same in order to operate robustly and autonomously in an unstructured environment. It has also been shown in several fields that multiple sources of information can complement each other, overcoming the limitations of a single perceptual modality. Hence, in this paper we introduce a data set of actions that includes both visual data (RGB-D video and 6-DOF object pose estimation) and acoustic data. We also propose a method for recognizing and segmenting actions from continuous audio-visual data. The proposed method is employed for an extensive evaluation of the descriptive power of the two modalities, and we discuss how they can be used jointly to infer a coherent interpretation of the recorded action.

    Download full text (pdf)
    fulltext
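
    One way the two modalities could be used jointly, as discussed in the abstract above, is decision-level fusion of per-class classifier scores. The sketch below shows a common weighted log-score scheme; the class names, scores and weight are invented, and this is not necessarily the combination method used in the paper.

    import numpy as np

    def fuse(audio_logp, video_logp, w=0.5):
        """Weighted combination of per-class log-probabilities from two modalities."""
        return w * audio_logp + (1.0 - w) * video_logp

    classes = ["pour", "stir", "chop"]          # hypothetical action classes
    audio_logp = np.log([0.2, 0.7, 0.1])        # audio favours "stir"
    video_logp = np.log([0.6, 0.3, 0.1])        # video favours "pour"

    fused = fuse(audio_logp, video_logp, w=0.4)  # weight audio 0.4, video 0.6
    print("fused decision:", classes[int(np.argmax(fused))])
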
  • 41.
    Rugayan, Janine
    et al.
    Department of Electronic Systems, NTNU, Norway.
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Speech, Music and Hearing, TMH. Department of Electronic Systems, NTNU, Norway.
    Svendsen, Torbjørn
    Department of Electronic Systems, NTNU, Norway.
    Perceptual and Task-Oriented Assessment of a Semantic Metric for ASR Evaluation (2023). In: Interspeech 2023, International Speech Communication Association, 2023, pp. 2158-2162. Conference paper (Refereed)
    Abstract [en]

    Automatic speech recognition (ASR) systems have become a vital part of our everyday lives through their many applications. However, as far as the field has advanced, the most common evaluation method for ASR systems remains word error rate (WER). WER gives no information on the severity of errors, which strongly impacts practical performance. We therefore examine a semantic metric called Aligned Semantic Distance (ASD) against WER and demonstrate its advantage over WER in two respects. First, we conduct a survey asking participants to score pairs of reference texts and ASR transcriptions. A correlation analysis shows that ASD correlates more strongly with the human evaluation scores than WER does. We also explore the feasibility of predicting human perception using ASD. Second, we demonstrate that ASD is more effective than WER as an indicator of performance on downstream NLP tasks such as named entity recognition and sentiment classification.
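
    The severity problem with WER described above can be shown in a few lines: the sketch below implements standard word-level edit-distance WER and compares two hypotheses with identical WER but very different practical consequences. ASD itself is not reproduced here, and the example sentences are invented.

    def wer(ref, hyp):
        """Word error rate: word-level edit distance divided by reference length."""
        r, h = ref.split(), hyp.split()
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i
        for j in range(len(h) + 1):
            d[0][j] = j
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                cost = 0 if r[i - 1] == h[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution
        return d[len(r)][len(h)] / len(r)

    ref = "transfer two hundred dollars to my savings account"
    print(wer(ref, "transfer few hundred dollars to my savings account"))   # severe error
    print(wer(ref, "transfer two hundred dollars to my savings accounts"))  # harmless error
    # Both hypotheses score WER = 0.125, yet only the first changes the meaning.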

  • 42. Rugayan, Janine
    et al.
    Svendsen, Torbjørn
    Salvi, Giampiero
    NTNU.
    Semantically Meaningful Metrics for Norwegian ASR Systems (2022). Conference paper (Refereed)
    Download full text (pdf)
    fulltext
  • 43. Sabzi Shahrebabak, Abdolreza
    et al.
    Siniscalchi, Sabato Marco
    Salvi, Giampiero
    Svendsen, Torbjørn
    A DNN Based Speech Enhancement Approach to Noise Robust Acoustic-to-Articulatory Inversion (2021). Conference paper (Refereed)
    Download full text (pdf)
    fulltext
  • 44.
    Salvi, Giampiero
    KTH, Former Departments (pre-2005), Speech, Music and Hearing.
    Accent clustering in Swedish using the Bhattacharyya distance (2003). In: Proceedings of the 15th International Congress of Phonetic Sciences (ICPhS), Barcelona, Spain, 2003, pp. 1149-1152. Conference paper (Refereed)
    Abstract [en]

    In an attempt to improve automatic speech recognition (ASR) models for Swedish, accent variations were considered. These have proved to be important variables in the statistical distribution of the acoustic features usually employed in ASR. The analysis of feature variability has revealed phenomena that are consistent with what is known from phonetic investigations, suggesting that a considerable part of the information about accents could be derived from those features. A graphical interface has been developed to simplify the visualisation of the geographical distributions of these phenomena.
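
    For reference, the Bhattacharyya distance named in the title has a closed form for two multivariate Gaussians, sketched below on toy data. The dimensionality and parameter values are invented; the paper applies the measure to Gaussian models of phoneme acoustics per region.

    import numpy as np

    def bhattacharyya(mu1, S1, mu2, S2):
        """Bhattacharyya distance between N(mu1, S1) and N(mu2, S2):
        (1/8) dmu' S^-1 dmu + (1/2) ln(det S / sqrt(det S1 det S2)),
        with S = (S1 + S2) / 2 and dmu = mu1 - mu2.
        """
        S = 0.5 * (S1 + S2)
        dmu = mu1 - mu2
        term1 = 0.125 * dmu @ np.linalg.solve(S, dmu)
        term2 = 0.5 * np.log(np.linalg.det(S)
                             / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
        return term1 + term2

    # Toy 2-D example: two accent-specific Gaussians for one phoneme.
    mu_a, S_a = np.array([0.0, 0.0]), np.eye(2)
    mu_b, S_b = np.array([1.0, 0.5]), 1.5 * np.eye(2)
    print(f"D_B = {bhattacharyya(mu_a, S_a, mu_b, S_b):.3f}")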

  • 45.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Advances in regional accent clustering in Swedish (2005). In: Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), 2005, pp. 2841-2844. Conference paper (Refereed)
    Abstract [en]

    The regional pronunciation variation in Swedish is analysed on a large database. Statistics over each phoneme and for each region of Sweden are computed using the EM algorithm in a hidden Markov model framework, to overcome the difficulty of transcribing the whole data set at the phonetic level. The model representations obtained this way are compared using a distance measure in the space spanned by the model parameters, and hierarchical clustering. The regional variants of each phoneme may group with those of any other phoneme on the basis of their acoustic properties. The log likelihood of the data given the model is shown to display interesting properties regarding the choice of the number of clusters, given a particular level of detail. Discriminant analysis is used to find the parameters that contribute most to the separation between groups, adding interpretative value to the discussion. Finally, a number of examples are given of phenomena that are revealed by examining the clustering tree.
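
    The clustering step described above can be sketched with standard tools: agglomerative (hierarchical) clustering on a symmetric matrix of distances between per-region phoneme models. The random distance matrix and the average-linkage choice below are illustrative assumptions, not the paper's actual data or settings.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    rng = np.random.default_rng(1)
    n = 6                                   # e.g. six regional variants of a phoneme
    D = rng.uniform(1.0, 10.0, size=(n, n))
    D = 0.5 * (D + D.T)                     # distances must be symmetric
    np.fill_diagonal(D, 0.0)

    Z = linkage(squareform(D), method="average")    # agglomerative clustering
    labels = fcluster(Z, t=2, criterion="maxclust")
    print("cluster assignment per region:", labels)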

  • 46.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    An Analysis of Shallow and Deep Representations of Speech Based on Unsupervised Classification of Isolated Words (2016). In: Recent Advances in Nonlinear Speech Processing, Springer, 2016, Vol. 48, pp. 151-157. Conference paper (Refereed)
    Abstract [en]

    We analyse the properties of shallow and deep representations of speech. Mel frequency cepstral coefficients (MFCC) are compared to representations learned by a four-layer Deep Belief Network (DBN) in terms of discriminative power and invariance to irrelevant factors such as speaker identity or gender. To avoid the influence of supervised statistical modelling, an unsupervised isolated word classification task is used for the comparison. The deep representations are also obtained with unsupervised training (no back-propagation pass is performed). The results show that DBN features provide a more concise clustering and a higher match between clusters and word categories in terms of adjusted Rand score. Some of the confusions present with the MFCC features are, however, retained even with the DBN features.
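
    The evaluation criterion mentioned above, the adjusted Rand score, measures the agreement between an unsupervised clustering of word tokens and the true word identities. Below is a minimal sketch with invented feature vectors and k-means standing in for the clustering method; it only illustrates how the score is computed.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    rng = np.random.default_rng(2)
    # 30 isolated-word tokens: 10 word types x 3 tokens each, with invented
    # fixed-length feature vectors (standing in for pooled MFCC/DBN features).
    true_words = np.repeat(np.arange(10), 3)
    features = rng.normal(size=(30, 8)) + true_words[:, None]

    clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(features)
    print(f"adjusted Rand score = {adjusted_rand_score(true_words, clusters):.2f}")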

  • 47.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Biologically Inspired Methods for Automatic Speech Understanding (2013). In: Biologically Inspired Cognitive Architectures 2012 / [ed] Chella, A; Pirrone, R; Sorbello, R; Johannsdottir, KR, Springer, 2013, Vol. 196, pp. 283-286. Conference paper (Refereed)
    Abstract [en]

    Automatic Speech Recognition (ASR) and Understanding (ASU) systems heavily rely on machine learning techniques to solve the problem of mapping spoken utterances into words and meanings. The statistical methods employed, however, greatly deviate from the processes involved in human language acquisition in a number of key aspects. Although ASR and ASU have recently reached a level of accuracy that is sufficient for some practical applications, there are still severe limitations due, for example, to the amount of training data required and the lack of generalization of the resulting models. In our opinion, there is a need for a paradigm shift and speech technology should address some of the challenges that humans face when learning a first language and that are currently ignored by the ASR and ASU methods. In this paper, we point out some of the aspects that could lead to more robust and flexible models, and we describe some of the research we and other researchers have performed in the area.

  • 48.
    Salvi, Giampiero
    KTH, Former Departments (pre-2005), Speech, Music and Hearing.
    Developing acoustic models for automatic speech recognition in Swedish (1999). In: The European Student Journal of Language and Speech, Vol. 1. Journal article (Refereed)
    Abstract [en]

    This thesis is concerned with automatic continuous speech recognition using trainable systems. The aim of the work is to build acoustic models for spoken Swedish. This is done employing hidden Markov models, whose parameters are trained on the SpeechDat database. Acoustic modelling has been carried out at the phonetic level, allowing general speech recognition applications, even though a simplified task (digit and natural number recognition) was considered for model evaluation. Different kinds of phone models have been tested, including context-independent models and two variations of context-dependent models. Furthermore, many experiments have been done with bigram language models to tune some of the system parameters. System performance over various speaker subsets with different sex, age and dialect has also been examined. Results are compared to previous similar studies, showing a remarkable improvement.
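
    As a side note on the bigram language models mentioned above, the sketch below estimates maximum-likelihood bigram probabilities from a toy number-word corpus. Smoothing and the actual SpeechDat vocabulary are omitted; everything here is an illustrative assumption.

    from collections import Counter

    corpus = [["one", "two", "three"], ["two", "two", "five"], ["one", "five"]]

    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent              # sentence-start symbol
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))

    def p_bigram(w_prev, w):
        """P(w | w_prev) by maximum likelihood (no smoothing)."""
        return bigrams[(w_prev, w)] / unigrams[w_prev]

    print(f"P(two | one) = {p_bigram('one', 'two'):.2f}")   # 1/2 in this toy corpus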

  • 49.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Dynamic behaviour of connectionist speech recognition with strong latency constraints (2006). In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 48, no. 7, pp. 802-818. Journal article (Refereed)
    Abstract [en]

    This paper describes the use of connectionist techniques in phonetic speech recognition with strong latency constraints. The constraints are imposed by the task of deriving the lip movements of a synthetic face in real time from the speech signal, by feeding the phonetic string into an articulatory synthesiser. Particular attention has been paid to analysing the interaction between the time evolution model learnt by the multi-layer perceptrons and the transition model imposed by the Viterbi decoder, in different latency conditions. Two experiments were conducted in which the time dependencies in the language model (LM) were controlled by a parameter. The results show a strong interaction between the three factors involved, namely the neural network topology, the length of time dependencies in the LM and the decoder latency.

    Download full text (pdf)
    dynamicbehaviour
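
    The latency constraint discussed above can be made concrete with a decoder that commits its output a fixed number of frames behind the current frame, backtracking only over that window. The sketch below is a generic illustration of this idea with toy scores and a uniform transition model; it is not the paper's decoder.

    import numpy as np

    def viterbi_low_latency(log_emit, log_trans, latency):
        """Viterbi decoding that outputs the state at frame t - latency once
        frame t has been processed (partial backtrace of fixed depth).

        log_emit: (T, N) log emission scores; log_trans: (N, N) log transitions.
        """
        T, N = log_emit.shape
        delta = log_emit[0].copy()               # best path score per state
        backptr = np.zeros((T, N), dtype=int)
        out = []
        for t in range(1, T):
            scores = delta[:, None] + log_trans  # scores[i, j]: from state i to j
            backptr[t] = np.argmax(scores, axis=0)
            delta = scores[backptr[t], np.arange(N)] + log_emit[t]
            if t >= latency:                     # commit frame t - latency
                s = int(np.argmax(delta))
                for k in range(t, t - latency, -1):
                    s = backptr[k, s]            # walk back 'latency' steps
                out.append(s)
        return out

    rng = np.random.default_rng(3)
    emissions = np.log(rng.dirichlet(np.ones(3), size=20))  # 20 frames, 3 states
    transitions = np.log(np.full((3, 3), 1.0 / 3.0))        # uniform transitions
    print(viterbi_low_latency(emissions, transitions, latency=5))
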
  • 50.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Ecological language acquisition via incremental model-based clustering (2005). In: Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), Springer, 2005, pp. 1181-1184. Conference paper (Refereed)
    Abstract [en]

    We analyse the behaviour of incremental model-based clustering on child-directed speech data, and suggest a possible use of this method to describe the acquisition of phonetic classes by an infant. The effects of two factors are analysed, namely the number of coefficients describing the speech signal and the frame length of the incremental clustering procedure. The results show that, although the number of predicted clusters varies under different conditions, the classifications obtained are essentially consistent. Different classifications were compared using the variation of information measure.
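
    The comparison measure used above, the variation of information between two clusterings X and Y, is VI(X, Y) = H(X) + H(Y) - 2 I(X; Y). A minimal sketch with toy label sequences (the data are invented):

    import numpy as np
    from sklearn.metrics import mutual_info_score

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-np.sum(p * np.log(p)))

    def variation_of_information(a, b):
        """VI(a, b) = H(a) + H(b) - 2 I(a; b), in nats."""
        return entropy(a) + entropy(b) - 2.0 * mutual_info_score(a, b)

    a = [0, 0, 1, 1, 2, 2]    # clustering obtained under one condition
    b = [0, 0, 1, 1, 1, 2]    # clustering obtained under another condition
    print(f"VI = {variation_of_information(a, b):.3f} nats")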
