Publications (10 of 114)
Székely, É., Henter, G. E. & Gustafson, J. (2019). Casting to corpus: Segmenting and selecting spontaneous dialogue for TTS with a CNN-LSTM speaker-dependent breath detector. In: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Paper presented at 44th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 12-17, 2019, Brighton, England (pp. 6925-6929). IEEE
Casting to corpus: Segmenting and selecting spontaneous dialogue for TTS with a CNN-LSTM speaker-dependent breath detector
2019 (English). In: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, p. 6925-6929. Conference paper, Published paper (Refereed)
Abstract [en]

This paper considers utilising breaths to create improved spontaneous-speech corpora for conversational text-to-speech from found audio recordings such as dialogue podcasts. Breaths are of interest since they relate to prosody and speech planning and are independent of language and transcription. Specifically, we propose a semisupervised approach where a fraction of coarsely annotated data is used to train a convolutional and recurrent speaker-specific breath detector operating on spectrograms and zero-crossing rate. The classifier output is used to find target-speaker breath groups (audio segments delineated by breaths) and subsequently select those that constitute clean utterances appropriate for a synthesis corpus. An application to 11 hours of raw podcast audio extracts 1969 utterances (106 minutes), 87% of which are clean and correctly segmented. This outperforms a baseline that performs integrated VAD and speaker attribution without accounting for breaths.
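The abstract describes the detector architecture only at a high level: convolutional and recurrent layers operating on spectrograms and zero-crossing rate, trained from a small amount of coarsely annotated speaker-specific data. As a rough, hypothetical illustration of that idea (not the authors' implementation), the PyTorch sketch below wires up one possible CNN-LSTM frame classifier; the layer sizes, feature layout, and the 80-mel-plus-ZCR input are assumptions.

```python
# Hypothetical sketch of a CNN-LSTM breath detector, NOT the authors' code.
# Assumed input: per-frame mel-spectrogram features plus zero-crossing rate,
# shaped (batch, time, n_mels + 1); output: per-frame breath probability.
import torch
import torch.nn as nn

class BreathDetector(nn.Module):
    def __init__(self, n_feats: int = 81, cnn_channels: int = 32,
                 lstm_hidden: int = 64):
        super().__init__()
        # 1-D convolutions over time capture local spectral patterns of breaths.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_feats, cnn_channels, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(cnn_channels, cnn_channels, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # A bidirectional LSTM adds longer-range context around each frame.
        self.lstm = nn.LSTM(cnn_channels, lstm_hidden, batch_first=True,
                            bidirectional=True)
        # Frame-wise binary classification: breath vs. non-breath.
        self.out = nn.Linear(2 * lstm_hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_feats) -> Conv1d expects (batch, n_feats, time)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.lstm(h)
        return torch.sigmoid(self.out(h)).squeeze(-1)  # (batch, time)

if __name__ == "__main__":
    # Assumed feature layout: 80 mel bands + 1 zero-crossing-rate value per frame.
    model = BreathDetector(n_feats=81)
    frames = torch.randn(4, 500, 81)   # 4 clips, 500 frames each
    breath_prob = model(frames)        # per-frame breath probabilities
    print(breath_prob.shape)           # torch.Size([4, 500])
```

Frames classified as breaths could then delimit breath groups, from which clean single-speaker utterances are selected for the synthesis corpus, as the abstract describes.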

Place, publisher, year, edition, pages
IEEE, 2019
Series
International Conference on Acoustics Speech and Signal Processing ICASSP, ISSN 1520-6149
Keywords
Spontaneous speech, found data, speech synthesis corpora, breath detection, computational paralinguistics
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-261049 (URN), 10.1109/ICASSP.2019.8683846 (DOI), 000482554007032 (), 2-s2.0-85069442973 (Scopus ID), 978-1-4799-8131-1 (ISBN)
Conference
44th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), MAY 12-17, 2019, Brighton, ENGLAND
Note

QC 20191002

Available from: 2019-10-02 Created: 2019-10-02 Last updated: 2019-10-02. Bibliographically approved
Székely, É., Henter, G. E., Beskow, J. & Gustafson, J. (2019). How to train your fillers: uh and um in spontaneous speech synthesis. Paper presented at The 10th ISCA Speech Synthesis Workshop.
How to train your fillers: uh and um in spontaneous speech synthesis
2019 (English). Conference paper, Published paper (Refereed)
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-261693 (URN)
Conference
The 10th ISCA Speech Synthesis Workshop
Note

QC 20191011

Available from: 2019-10-10 Created: 2019-10-10 Last updated: 2019-10-11. Bibliographically approved
Skantze, G., Gustafson, J. & Beskow, J. (2019). Multimodal Conversational Interaction with Robots. In: Sharon Oviatt, Björn Schuller, Philip R. Cohen, Daniel Sonntag, Gerasimos Potamianos, Antonio Krüger (Eds.), The Handbook of Multimodal-Multisensor Interfaces, Volume 3: Language Processing, Software, Commercialization, and Emerging Directions. ACM Press
Multimodal Conversational Interaction with Robots
2019 (English). In: The Handbook of Multimodal-Multisensor Interfaces, Volume 3: Language Processing, Software, Commercialization, and Emerging Directions / [ed] Sharon Oviatt, Björn Schuller, Philip R. Cohen, Daniel Sonntag, Gerasimos Potamianos, Antonio Krüger, ACM Press, 2019. Chapter in book (Refereed)
Place, publisher, year, edition, pages
ACM Press, 2019
National Category
Human Computer Interaction
Identifiers
urn:nbn:se:kth:diva-254650 (URN), 9781970001723 (ISBN)
Note

QC 20190821

Available from: 2019-07-02 Created: 2019-07-02 Last updated: 2019-08-21. Bibliographically approved
Székely, É., Henter, G. E., Beskow, J. & Gustafson, J. (2019). Off the cuff: Exploring extemporaneous speech delivery with TTS. Paper presented at Interspeech.
Off the cuff: Exploring extemporaneous speech delivery with TTS
2019 (English). Conference paper, Published paper (Refereed)
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-261691 (URN)
Conference
Interspeech
Note

QC 20191011

Available from: 2019-10-10 Created: 2019-10-10 Last updated: 2019-10-11. Bibliographically approved
Székely, É., Henter, G. E., Beskow, J. & Gustafson, J. (2019). Off the cuff: Exploring extemporaneous speech delivery with TTS. Paper presented at The 20th Annual Conference of the International Speech Communication Association (INTERSPEECH 2019), Graz, Austria, September 15-19, 2019 (pp. 3687-3688).
Off the cuff: Exploring extemporaneous speech delivery with TTS
2019 (English). Conference paper, Published paper (Refereed)
National Category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-260957 (URN)
Conference
The 20th Annual Conference of the International Speech Communication Association (INTERSPEECH 2019), Graz, Austria, September 15-19, 2019
Note

QC 20191113

Available from: 2019-09-30 Created: 2019-09-30 Last updated: 2019-11-13. Bibliographically approved
Székely, É., Henter, G. E., Beskow, J. & Gustafson, J. (2019). Spontaneous conversational speech synthesis from found data. Paper presented at Interspeech.
Spontaneous conversational speech synthesis from found data
2019 (English). Conference paper, Published paper (Refereed)
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-261689 (URN)
Conference
Interspeech
Note

QC 20191011

Available from: 2019-10-10 Created: 2019-10-10 Last updated: 2019-10-11. Bibliographically approved
Kontogiorgos, D., Abelho Pereira, A. T., Andersson, O., Koivisto, M., Gonzalez Rabal, E., Vartiainen, V. & Gustafson, J. (2019). The effects of anthropomorphism and non-verbal social behaviour in virtual assistants. In: IVA 2019 - Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents. Paper presented at the 19th ACM International Conference on Intelligent Virtual Agents (IVA 2019), Paris, France, 2-5 July 2019 (pp. 133-140). Association for Computing Machinery (ACM)
The effects of anthropomorphism and non-verbal social behaviour in virtual assistants
2019 (English). In: IVA 2019 - Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, Association for Computing Machinery (ACM), 2019, p. 133-140. Conference paper, Published paper (Refereed)
Abstract [en]

The adoption of virtual assistants is growing at a rapid pace. However, these assistants are not optimised to simulate key social aspects of human conversational environments. Humans are intellectually biased toward social activity when facing anthropomorphic agents or when presented with subtle social cues. In this paper, we test whether humans respond the same way to assistants in guided tasks, when in different forms of embodiment and social behaviour. In a within-subject study (N=30), we asked subjects to engage in dialogue with a smart speaker and a social robot. We observed shifting of interactive behaviour, as shown in behavioural and subjective measures. Our findings indicate that it is not always favourable for agents to be anthropomorphised or to communicate with nonverbal cues. We found a trade-off between task performance and perceived sociability when controlling for anthropomorphism and social behaviour.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2019
Keywords
Conversational artificial intelligence, Empirical studies, Human-computer interaction, Smart speakers, Social robots
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-262610 (URN), 10.1145/3308532.3329466 (DOI), 2-s2.0-85069732636 (Scopus ID), 9781450366724 (ISBN)
Conference
19th ACM International Conference on Intelligent Virtual Agents (IVA 2019), Paris, France, 2-5 July 2019
Note

QC 20191024

Available from: 2019-10-24 Created: 2019-10-24 Last updated: 2019-10-24. Bibliographically approved
Kontogiorgos, D., Skantze, G., Abelho Pereira, A. T. & Gustafson, J. (2019). The Effects of Embodiment and Social Eye-Gaze in Conversational Agents. In: Proceedings of the 41st Annual Conference of the Cognitive Science Society (CogSci). Paper presented at the 41st Annual Meeting of the Cognitive Science Society (CogSci), Montreal, July 24-27, 2019.
The Effects of Embodiment and Social Eye-Gaze in Conversational Agents
2019 (English). In: Proceedings of the 41st Annual Conference of the Cognitive Science Society (CogSci), 2019. Conference paper, Published paper (Refereed)
Abstract [en]

The adoption of conversational agents is growing at a rapid pace. Agents, however, are not optimised to simulate key social aspects of situated human conversational environments. Humans are intellectually biased towards social activity when facing more anthropomorphic agents or when presented with subtle social cues. In this work, we explore the effects of simulating anthropomorphism and social eye-gaze in three conversational agents. We tested whether subjects’ visual attention would be similar to agents in different forms of embodiment and social eye-gaze. In a within-subject situated interaction study (N=30), we asked subjects to engage in task-oriented dialogue with a smart speaker and two variations of a social robot. We observed shifting of interactive behaviour by human users, as shown in differences in behavioural and objective measures. With a trade-off in task performance, social facilitation is higher with more anthropomorphic social agents when performing the same task.

National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-255126 (URN)
Conference
41st Annual Meeting of the Cognitive Science Society (CogSci), Montreal, July 24-27, 2019
Note

QC 20190722

Available from: 2019-07-21 Created: 2019-07-21 Last updated: 2019-07-22. Bibliographically approved
Kontogiorgos, D., Abelho Pereira, A. T. & Gustafson, J. (2019). The Trade-off between Interaction Time and Social Facilitation with Collaborative Social Robots. In: The Challenges of Working on Social Robots that Collaborate with People. Paper presented at the ACM CHI Conference on Human Factors in Computing Systems, May 4-9, Glasgow, UK.
The Trade-off between Interaction Time and Social Facilitation with Collaborative Social Robots
2019 (English). In: The Challenges of Working on Social Robots that Collaborate with People, 2019. Conference paper, Published paper (Refereed)
Abstract [en]

The adoption of social robots and conversational agents is growing at a rapid pace. These agents, however, are still not optimised to simulate key social aspects of situated human conversational environments. Humans are intellectually biased towards social activity when facing more anthropomorphic agents or when presented with subtle social cues. In this paper, we discuss the effects of simulating anthropomorphism and non-verbal social behaviour in social robots and its implications for human-robot collaborative guided tasks. Our results indicate that it is not always favourable for agents to be anthropomorphised or to communicate with nonverbal behaviour. We found a clear trade-off between interaction time and social facilitation when controlling for anthropomorphism and social behaviour.

Keywords
social robots, smart speakers
National Category
Human Computer Interaction
Research subject
Human-computer Interaction
Identifiers
urn:nbn:se:kth:diva-251651 (URN)
Conference
The ACM CHI Conference on Human Factors in Computing Systems, May 4-9, Glasgow, UK
Available from: 2019-05-16 Created: 2019-05-16 Last updated: 2019-05-17. Bibliographically approved
Sibirtseva, E., Kontogiorgos, D., Nykvist, O., Karaoguz, H., Leite, I., Gustafson, J. & Kragic, D. (2018). A Comparison of Visualisation Methods for Disambiguating Verbal Requests in Human-Robot Interaction. In: 2018 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). Paper presented at RO-MAN 2018.
A Comparison of Visualisation Methods for Disambiguating Verbal Requests in Human-Robot Interaction
2018 (English). In: 2018 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), 2018. Conference paper, Published paper (Refereed)
Abstract [en]

Picking up objects requested by a human user is a common task in human-robot interaction. When multiple objects match the user's verbal description, the robot needs to clarify which object the user is referring to before executing the action. Previous research has focused on perceiving the user's multimodal behaviour to complement verbal commands or on minimising the number of follow-up questions to reduce task time. In this paper, we propose a system for reference disambiguation based on visualisation and compare three methods to disambiguate natural language instructions. In a controlled experiment with a YuMi robot, we investigated real-time augmentations of the workspace in three conditions - head-mounted display, projector, and a monitor as the baseline - using objective measures such as time and accuracy, and subjective measures like engagement, immersion, and display interference. Significant differences were found in accuracy and engagement between the conditions, but no differences were found in task time. Despite the higher error rates in the head-mounted display condition, participants found that modality more engaging than the other two, but overall showed preference for the projector condition over the monitor and head-mounted display conditions.

National Category
Human Computer Interaction
Identifiers
urn:nbn:se:kth:diva-235548 (URN), 10.1109/ROMAN.2018.8525554 (DOI), 978-1-5386-7981-4 (ISBN)
Conference
RO-MAN 2018
Note

QC 20181207

Available from: 2018-09-29 Created: 2018-09-29 Last updated: 2018-12-07. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0002-0397-6442
