Publications (10 of 98)
Malisz, Z., Berthelsen, H., Beskow, J. & Gustafson, J. (2017). Controlling prominence realisation in parametric DNN-based speech synthesis. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2017. Paper presented at 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, Stockholm, Sweden, 20 August 2017 through 24 August 2017 (pp. 1079-1083). International Speech Communication Association, 2017.
Controlling prominence realisation in parametric DNN-based speech synthesis
2017 (English) In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, International Speech Communication Association, 2017, Vol. 2017, p. 1079-1083. Conference paper, Published paper (Refereed)
Abstract [en]

This work aims to improve text-to-speech synthesis for Wikipedia by advancing and implementing models of prosodic prominence. We propose a new system architecture with explicit prominence modeling and test the first component of the architecture. We automatically extract a phonetic feature related to prominence from the speech signal in the ARCTIC corpus. We then modify the label files and train an experimental TTS system based on the feature using Merlin, a statistical parametric DNN-based engine. Test sentences with contrastive prominence at the word level are synthesised, and separate listening tests evaluating a) the level of prominence control in the generated speech and b) naturalness are conducted. Our results show that the prominence feature-enhanced system successfully places prominence on the appropriate words and increases perceived naturalness relative to the baseline.
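To make the label modification step concrete, below is a minimal sketch (not the authors' code; the tuple-based label representation and the "/P:" field are assumptions for illustration) of how a per-word prominence value could be appended to the context labels that a DNN-based engine such as Merlin trains on.

```python
# Illustrative sketch: appending a per-word prominence value to HTS-style label
# lines so that an acoustic model can condition on it. Label format and
# prominence scores are assumed inputs, not Merlin's actual interface.

def augment_labels(label_lines, word_prominence):
    """Append a prominence feature to each label line.

    label_lines: list of (start, end, word, context_label) tuples.
    word_prominence: dict mapping a word index to a float in [0, 1],
                     e.g. derived automatically from the speech signal.
    """
    augmented = []
    for idx, (start, end, word, context) in enumerate(label_lines):
        prom = word_prominence.get(idx, 0.0)
        # Encode prominence as an extra context field, e.g. "/P:0.80"
        augmented.append(f"{start} {end} {context}/P:{prom:.2f}")
    return augmented


if __name__ == "__main__":
    labels = [(0, 450000, "move", "m-uw+v"), (450000, 900000, "left", "l-eh+f")]
    print("\n".join(augment_labels(labels, {1: 0.8})))  # prominence on "left"
```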

Place, publisher, year, edition, pages
International Speech Communication Association, 2017
Series
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, ISSN 2308-457X
Keyword
Deep neural networks, Prosodic prominence, Speech synthesis
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-222092 (URN), 10.21437/Interspeech.2017-1355 (DOI), 2-s2.0-85039164235 (Scopus ID)
Conference
18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, Stockholm, Sweden, 20 August 2017 through 24 August 2017
Note

QC 20180131

Available from: 2018-01-31. Created: 2018-01-31. Last updated: 2018-01-31. Bibliographically approved.
Szekely, E., Mendelson, J. & Gustafson, J. (2017). Synthesising uncertainty: The interplay of vocal effort and hesitation disfluencies. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. Paper presented at 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, Stockholm, Sweden, 20 August 2017 through 24 August 2017 (pp. 804-808). International Speech Communication Association, 2017.
Synthesising uncertainty: The interplay of vocal effort and hesitation disfluencies
2017 (English) In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association, 2017, Vol. 2017, p. 804-808. Conference paper (Refereed)
Abstract [en]

As synthetic voices become more flexible, and conversational systems gain more potential to adapt to the environmental and social situation, it needs to be examined how different modifications to the synthetic speech interact with each other and how their specific combinations influence perception. This work investigates how the vocal effort of the synthetic speech, together with added disfluencies, affects listeners' perception of the degree of uncertainty in an utterance. We introduce a DNN voice built entirely from spontaneous conversational speech data that is capable of producing a continuum of vocal efforts, prolongations and filled pauses with a corpus-based method. Results of a listener evaluation indicate that decreased vocal effort, filled pauses and prolongation of function words increase the degree of perceived uncertainty of conversational utterances expressing the speaker's beliefs. We demonstrate that the effects of these three cues are not merely additive, but that interaction effects, in particular between the two types of disfluencies and between vocal effort and prolongations, need to be considered when aiming to communicate a specific level of uncertainty. The implications of these findings are relevant for adaptive and incremental conversational systems that use expressive speech synthesis and aspire to communicate the attitude of uncertainty.
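As a rough illustration of how such cues could be combined before synthesis, the sketch below defines a hypothetical utterance specification (the class, field names and the uncertainty-to-cue mapping are assumptions, not the paper's system) that bundles a vocal-effort level with inserted prolongations and filled pauses.

```python
# Illustrative sketch of an utterance specification combining vocal effort with
# hesitation disfluencies. The heuristic mapping is a toy example; the paper's
# point is precisely that these cues interact non-additively in perception.

from dataclasses import dataclass, field

@dataclass
class UtteranceSpec:
    words: list
    vocal_effort: float = 1.0                          # 1.0 = normal, lower = softer
    prolonged: set = field(default_factory=set)        # indices of prolonged words
    filled_pauses: dict = field(default_factory=dict)  # index -> "eh"/"ehm"

def express_uncertainty(words, level):
    """Map an uncertainty level in [0, 1] to synthesis cues (toy heuristic)."""
    spec = UtteranceSpec(words=list(words))
    spec.vocal_effort = 1.0 - 0.4 * level              # softer voice when unsure
    if level > 0.3:
        # prolong the first function word, if any
        for i, w in enumerate(words):
            if w.lower() in {"the", "a", "i", "it", "is"}:
                spec.prolonged.add(i)
                break
    if level > 0.6:
        spec.filled_pauses[0] = "ehm"                  # filled pause at the start
    return spec

print(express_uncertainty("I think it is the red one".split(), 0.8))
```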

Place, publisher, year, edition, pages
International Speech Communication Association, 2017
Series
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, ISSN 2308-457X
Keyword
Conversational Systems, Disfluencies, Speech Synthesis, Uncertainty, Vocal Effort
National Category
Communication Studies
Identifiers
urn:nbn:se:kth:diva-220749 (URN), 10.21437/Interspeech.2017-1507 (DOI), 2-s2.0-85039172286 (Scopus ID)
Conference
18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, Stockholm, Sweden, 20 August 2017 through 24 August 2017
Note

QC 20180105

Available from: 2018-01-05. Created: 2018-01-05. Last updated: 2018-01-05. Bibliographically approved.
Oertel, C., Jonell, P., Haddad, K. E., Szekely, E. & Gustafson, J. (2017). Using crowd-sourcing for the design of listening agents: Challenges and opportunities. In: ISIAA 2017 - Proceedings of the 1st ACM SIGCHI International Workshop on Investigating Social Interactions with Artificial Agents, Co-located with ICMI 2017. Paper presented at 1st ACM SIGCHI International Workshop on Investigating Social Interactions with Artificial Agents, ISIAA 2017, Glasgow, United Kingdom, 13 November 2017 (pp. 37-38). Association for Computing Machinery (ACM).
Using crowd-sourcing for the design of listening agents: Challenges and opportunities
2017 (English) In: ISIAA 2017 - Proceedings of the 1st ACM SIGCHI International Workshop on Investigating Social Interactions with Artificial Agents, Co-located with ICMI 2017, Association for Computing Machinery (ACM), 2017, p. 37-38. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper we describe how audio-visual corpus recordings collected with crowd-sourcing techniques can be used for the audio-visual synthesis of attitudinal non-verbal feedback expressions for virtual agents. We discuss the limitations of this approach as well as where we see the opportunities for this technology.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2017
Keyword
Artificial listener, Listening agent, Multimodal behaviour generation
National Category
Interaction Technologies
Identifiers
urn:nbn:se:kth:diva-222507 (URN), 10.1145/3139491.3139499 (DOI), 2-s2.0-85041230172 (Scopus ID), 9781450355582 (ISBN)
Conference
1st ACM SIGCHI International Workshop on Investigating Social Interactions with Artificial Agents, ISIAA 2017, Glasgow, United Kingdom, 13 November 2017
Note

QC 20180212

Available from: 2018-02-12. Created: 2018-02-12. Last updated: 2018-02-12. Bibliographically approved.
Johansson, M., Hori, T., Skantze, G., Hothker, A. & Gustafson, J. (2016). Making Turn-Taking Decisions for an Active Listening Robot for Memory Training. In: Social Robotics (ICSR 2016). Paper presented at 8th International Conference on Social Robotics (ICSR), NOV 01-03, 2016, Kansas City, MO (pp. 940-949). Springer.
Making Turn-Taking Decisions for an Active Listening Robot for Memory Training
2016 (English) In: Social Robotics (ICSR 2016), Springer, 2016, p. 940-949. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper we present a dialogue system and response model that allows a robot to act as an active listener, encouraging users to tell the robot about their travel memories. The response model makes a combined decision about when to respond and what type of response to give, in order to elicit more elaborate descriptions from the user and avoid non-sequitur responses. The model was trained on human-robot dialogue data collected in a Wizard-of-Oz setting, and evaluated in a fully autonomous version of the same dialogue system. Compared to a baseline system, users perceived the dialogue system with the trained model to be a significantly better listener. The trained model also resulted in dialogues with significantly fewer mistakes, a larger proportion of user speech and fewer interruptions.
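The sketch below illustrates the general idea of such a combined decision (the feature names and response categories are illustrative assumptions, not the trained model from the paper): a single classifier chooses between "no response" and several response types whenever a pause is detected.

```python
# Minimal sketch: one classifier jointly decides whether to respond at a detected
# pause and, if so, which response type to give, so that staying silent competes
# directly with the response categories. Features and labels are toy values.

from sklearn.tree import DecisionTreeClassifier

# Toy training rows: [pause_length_s, user_speech_rate_syll_per_s, words_in_last_turn]
X = [[0.2, 4.0, 12], [1.1, 3.0, 3], [0.8, 2.5, 7], [1.5, 2.0, 2]]
y = ["no_response", "follow_up_question", "backchannel", "new_question"]

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(model.predict([[1.0, 2.8, 4]]))   # combined when/what decision at a pause
```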

Place, publisher, year, edition, pages
Springer, 2016
Series
Lecture Notes in Artificial Intelligence, ISSN 0302-9743; 9979
Keyword
Turn-taking, Active listening, Social robotics, Memory training
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-200064 (URN), 10.1007/978-3-319-47437-3_92 (DOI), 000389816500092 (), 2-s2.0-84992499074 (Scopus ID), 978-3-319-47437-3 (ISBN), 978-3-319-47436-6 (ISBN)
Conference
8th International Conference on Social Robotics (ICSR), NOV 01-03, 2016, Kansas City, MO
Note

QC 20170125

Available from: 2017-01-25. Created: 2017-01-20. Last updated: 2018-01-13. Bibliographically approved.
Edlund, J., Tånnander, C. & Gustafson, J. (2015). Audience response system-based assessment for analysis-by-synthesis. In: Proc. of ICPhS 2015. Paper presented at ICPhS 2015. ICPhS.
Audience response system-based assessment for analysis-by-synthesis
2015 (English) In: Proc. of ICPhS 2015, ICPhS, 2015. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
ICPhS, 2015
National Category
Computer Sciences, Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-180399 (URN)
Conference
ICPhS 2015
Note

QC 20160317

Available from: 2016-01-13. Created: 2016-01-13. Last updated: 2018-01-10. Bibliographically approved.
Meena, R., David Lopes, J., Skantze, G. & Gustafson, J. (2015). Automatic Detection of Miscommunication in Spoken Dialogue Systems. In: Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). Paper presented at the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL) (pp. 354-363).
Automatic Detection of Miscommunication in Spoken Dialogue Systems
2015 (English) In: Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), 2015, p. 354-363. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper, we present a data-driven approach for detecting instances of miscommunication in dialogue system interactions. A range of generic features, both automatically extractable and manually annotated, was used to train two models for online detection and one for offline analysis. Online detection could be used to raise the error awareness of the system, whereas offline detection could be used by a system designer to identify potential flaws in the dialogue design. In experimental evaluations on system logs from three different dialogue systems that vary in their dialogue strategy, the proposed models performed substantially better than the majority-class baseline models.
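A minimal sketch of this kind of setup is shown below (the features, data and learner are invented for illustration; the paper's feature set and models are not reproduced here): a binary miscommunication detector trained on per-exchange features and compared against a majority-class baseline.

```python
# Illustrative sketch: per-exchange features -> binary miscommunication label,
# evaluated against a majority-class baseline. All values are toy data.

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Each row: [asr_confidence, user_repeated_previous_turn, system_reprompted]
X = [[0.92, 0, 0], [0.35, 1, 1], [0.80, 0, 0], [0.40, 1, 0],
     [0.88, 0, 0], [0.30, 1, 1], [0.75, 0, 1], [0.55, 1, 0]]
y = [0, 1, 0, 1, 0, 1, 0, 1]   # 1 = miscommunication in this exchange

baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=4)
model = cross_val_score(LogisticRegression(), X, y, cv=4)
print(f"majority baseline: {baseline.mean():.2f}  model: {model.mean():.2f}")
```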

National Category
Computer Sciences, Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-180406 (URN), 2-s2.0-84988311476 (Scopus ID)
Conference
16th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)
Note

QC 20160120

Available from: 2016-01-13. Created: 2016-01-13. Last updated: 2018-01-10. Bibliographically approved.
Oertel, C., Funes, K., Gustafson, J. & Odobez, J.-M. (2015). Deciphering the Silent Participant: On the Use of Audio-Visual Cues for the Classification of Listener Categories in Group Discussions. In: Proceedings of ICMI 2015. Paper presented at ICMI 2015. ACM Digital Library.
Deciphering the Silent Participant: On the Use of Audio-Visual Cues for the Classification of Listener Categories in Group Discussions
2015 (English) In: Proceedings of ICMI 2015, ACM Digital Library, 2015. Conference paper, Published paper (Refereed)
Abstract [en]

Estimating a silent participant's degree of engagement and their role within a group discussion can be challenging, as there are no speech-related cues available at the given time. Having this information available, however, can provide important insights into the dynamics of the group as a whole. In this paper, we study the classification of listeners into several categories (attentive listener, side participant and bystander). We devised a thin-sliced perception test in which subjects were asked to assess listener roles and engagement levels in 15-second video clips taken from a corpus of group interviews. The results show that humans are usually able to assess silent participant roles. Using these annotations, we could identify, from a set of multimodal low-level features such as past speaking activity, backchannels (both visual and verbal) and gaze patterns, the features that are able to distinguish between the different listener categories. Moreover, the results show that many of the audio-visual effects observed on listeners in dyadic interactions also hold for multi-party interactions. A preliminary classifier achieves an accuracy of 64%.
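For illustration, the sketch below shows how low-level cues of this kind could be aggregated over a 15-second slice into a feature vector (the event representation and feature names are assumptions, not the paper's exact feature set).

```python
# Illustrative sketch: aggregating audio-visual cues over a 15-second slice into
# the kind of low-level feature vector used to characterise a silent participant.

def slice_features(events, start, window=15.0):
    """events: list of (time_s, kind) with kind in
    {'gaze_at_speaker', 'nod', 'verbal_backchannel', 'speech'}."""
    in_win = [e for e in events if start <= e[0] < start + window]
    last_speech = max((t for t, k in events if k == "speech" and t < start), default=None)
    return {
        "gaze_at_speaker_events": sum(k == "gaze_at_speaker" for _, k in in_win),
        "visual_backchannels": sum(k == "nod" for _, k in in_win),
        "verbal_backchannels": sum(k == "verbal_backchannel" for _, k in in_win),
        "time_since_last_spoke": None if last_speech is None else start - last_speech,
    }

events = [(2.0, "speech"), (21.0, "gaze_at_speaker"), (24.5, "nod"), (27.0, "gaze_at_speaker")]
print(slice_features(events, start=20.0))
```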

Place, publisher, year, edition, pages
ACM Digital Library, 2015
National Category
Computer Sciences, Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-180426 (URN), 10.1145/2818346.2820759 (DOI), 000380609500018 (), 2-s2.0-84959309012 (Scopus ID), 978-1-4503-3912-4 (ISBN)
Conference
ICMI 2015
Note

QC 20160121

Available from: 2016-01-13. Created: 2016-01-13. Last updated: 2018-01-10. Bibliographically approved.
Lopes, J., Salvi, G., Skantze, G., Abad, A., Gustafson, J., Batista, F., . . . Trancoso, I. (2015). Detecting Repetitions in Spoken Dialogue Systems Using Phonetic Distances. In: INTERSPEECH-2015. Paper presented at INTERSPEECH-2015, Dresden, Germany (pp. 1805-1809).
Detecting Repetitions in Spoken Dialogue Systems Using Phonetic Distances
2015 (English) In: INTERSPEECH-2015, 2015, p. 1805-1809. Conference paper, Published paper (Refereed)
Abstract [en]

Repetitions in spoken dialogue systems can be a symptom of problematic communication. Such repetitions are often due to speech recognition errors, which in turn make it harder to use the output of the speech recognizer to detect repetitions. In this paper, we combine the alignment score obtained using phonetic distances with dialogue-related features to improve repetition detection. To evaluate the proposed method, we compare several alignment techniques, from edit distance to DTW-based distance, previously used in spoken-term detection tasks. We also compare two different methods to compute the phonetic distance: the first using the phoneme sequence, and the second using the distance between the phone posterior vectors. Two different datasets were used in this evaluation: a bus-schedule information system (in English) and a call routing system (in Swedish). The results show that approaches using phoneme distances outperform approaches using Levenshtein distances between ASR outputs for repetition detection.
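The sketch below illustrates the simplest variant of this idea (an illustrative example, not the paper's implementation): a plain Levenshtein distance over phoneme sequences, normalised by utterance length, used to flag a likely repetition.

```python
# Illustrative sketch: flag a likely user repetition by computing a normalised
# edit distance between the phoneme sequences of two consecutive ASR hypotheses.

def edit_distance(a, b):
    """Plain Levenshtein distance between two sequences."""
    d = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, y in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (x != y))
    return d[len(b)]

def is_repetition(phones_prev, phones_curr, threshold=0.35):
    dist = edit_distance(phones_prev, phones_curr)
    norm = dist / max(len(phones_prev), len(phones_curr), 1)
    return norm <= threshold

# Two consecutive hypotheses that differ at the word level can still align
# closely at the phone level.
prev = "s ih k s t iy w ah n b iy".split()
curr = "s ih k s t iy n w ah n b iy".split()
print(is_repetition(prev, curr))   # True: small normalised phonetic distance
```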

National Category
Computer Sciences, Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-180405 (URN), 000380581600375 (), 2-s2.0-84959138120 (Scopus ID), 978-1-5108-1790-6 (ISBN)
Conference
INTERSPEECH-2015, Dresden, Germany
Note

QC 20160216

Available from: 2016-01-13. Created: 2016-01-13. Last updated: 2018-01-10. Bibliographically approved.
Bollepalli, B., Urbain, J., Raitio, T., Gustafson, J. & Cakmak, H. (2014). A comparative evaluation of vocoding techniques for HMM-based laughter synthesis. Paper presented at IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), MAY 04-09, 2014, Florence, ITALY (pp. 255-259).
A comparative evaluation of vocoding techniques for HMM-based laughter synthesis
2014 (English) Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents an experimental comparison of various leading vocoders for the application of HMM-based laughter synthesis. Four vocoders, commonly used in HMM-based speech synthesis, are used in copy-synthesis and HMM-based synthesis of both male and female laughter. Subjective evaluations are conducted to assess the performance of the vocoders. The results show that all vocoders perform relatively well in copy-synthesis. In HMM-based laughter synthesis using original phonetic transcriptions, all synthesized laughter voices were significantly lower in quality than in copy-synthesis, indicating a challenging task and room for improvement. Interestingly, two vocoders using rather simple and robust excitation modeling performed the best, indicating that robustness in speech parameter extraction and simple parameter representation in statistical modeling are key factors in successful laughter synthesis.

Series
International Conference on Acoustics Speech and Signal Processing ICASSP, ISSN 1520-6149
Keyword
Laughter synthesis, vocoder, mel-cepstrum, STRAIGHT, DSM, GlottHMM, HTS, HMM
National Category
Fluid Mechanics and Acoustics
Identifiers
urn:nbn:se:kth:diva-158336 (URN), 10.1109/ICASSP.2014.6853597 (DOI), 000343655300052 (), 2-s2.0-84905269196 (Scopus ID), 978-1-4799-2893-4 (ISBN), 978-147992892-7 (ISBN)
Conference
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), MAY 04-09, 2014, Florence, ITALY
Note

QC 20150123

Available from: 2015-01-23. Created: 2015-01-07. Last updated: 2016-12-12. Bibliographically approved.
Johansson, M., Skantze, G. & Gustafson, J. (2014). Comparison of human-human and human-robot turn-taking behaviour in multi-party situated interaction. In: UM3I '14: Proceedings of the 2014 workshop on Understanding and Modeling Multiparty, Multimodal Interactions. Paper presented at the International Workshop on Understanding and Modeling Multiparty, Multimodal Interactions, at ICMI 2014, Bogazici University, Istanbul, Turkey, November 12-16th, 2014 (pp. 21-26). Istanbul, Turkey.
Comparison of human-human and human-robot turn-taking behaviour in multi-party situated interaction
2014 (English) In: UM3I '14: Proceedings of the 2014 workshop on Understanding and Modeling Multiparty, Multimodal Interactions, Istanbul, Turkey, 2014, p. 21-26. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper, we present an experiment where two human subjects are given a team-building task to solve together with a robot. The setting requires that the speakers' attention is partly directed towards objects on the table between them, as well as to each other, in order to coordinate turn-taking. The symmetrical setup allows us to compare human-human and human-robot turn-taking behaviour in the same interactional setting. The analysis centres around the interlocutors' attention (as measured by head pose) and gap length between turns, depending on the pragmatic function of the utterances.
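As a small illustration of the gap-length part of such an analysis (the turn representation is an assumption, not the corpus format used in the paper), the sketch below computes the gap between consecutive turns, with overlap as a negative value, grouped by who takes the next turn.

```python
# Illustrative sketch: gap (or overlap, as a negative value) between consecutive
# turns, grouped by the next speaker. This is the kind of measure that can be
# compared across human-human and human-robot speaker changes.

from collections import defaultdict

def gaps_by_next_speaker(turns):
    """turns: list of (start_s, end_s, speaker) sorted by start time."""
    gaps = defaultdict(list)
    for (_, end, _), (nxt_start, _, nxt_spk) in zip(turns, turns[1:]):
        gaps[nxt_spk].append(nxt_start - end)
    return dict(gaps)

turns = [(0.0, 2.1, "A"), (2.6, 4.0, "B"), (4.1, 6.0, "robot"), (6.8, 8.0, "A")]
print(gaps_by_next_speaker(turns))   # gap after each speaker change, per next speaker
```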

Place, publisher, year, edition, pages
Istanbul, Turkey, 2014
National Category
Computer Sciences, Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-158170 (URN), 10.1145/2666242.2666249 (DOI), 2-s2.0-84919383902 (Scopus ID), 978-1-4503-0652-2 (ISBN)
Conference
International Workshop on Understanding and Modeling Multiparty, Multimodal Interactions, at ICMI 2014, Bogazici University, Istanbul, Turkey. November 12-16th, 2014
Note

QC 20150417.

Available from: 2014-12-30. Created: 2014-12-30. Last updated: 2018-01-11. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0002-0397-6442