Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-3513-4132
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-1175-840X
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-0397-6442
Show others and affiliations
2023 (English). In: Interspeech 2023, International Speech Communication Association, 2023, p. 5481-5485. Conference paper, Published paper (Refereed)
Abstract [en]

Turn-taking is a fundamental aspect of human communication, where speakers convey their intention to either hold or yield their turn through prosodic cues. Using the recently proposed Voice Activity Projection model, we propose an automatic evaluation approach to measure these aspects for conversational speech synthesis. We investigate the ability of three commercial and two open-source Text-To-Speech (TTS) systems to generate turn-taking cues over simulated turns. By varying the stimuli, or controlling the prosody, we analyze the models' performance. We show that while commercial TTS systems largely provide appropriate cues, they often produce ambiguous signals, and that further improvements are possible. TTS systems trained on read or spontaneous speech produce strong turn-hold but weak turn-yield cues. We argue that this approach, which focuses on functional aspects of interaction, provides a useful addition to other important speech metrics, such as intelligibility and naturalness.
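The evaluation idea summarized above can be sketched as follows. Note that this is an illustrative sketch only, not the paper's actual implementation: the function names, the probability format (frame-wise P(other speaker takes the next turn), as a Voice Activity Projection-style model might provide), and the thresholds are all assumptions for illustration.

```python
import numpy as np

def turn_cue_score(p_next_other, frame_rate=50, window_s=0.6):
    """Score a turn-taking cue from frame-wise probabilities.

    p_next_other: array of P(other speaker takes the next turn), one
    value per frame, covering the region after the synthesized
    utterance ends. Returns a value in [0, 1]: high = clear turn-yield
    cue, low = clear turn-hold cue, near 0.5 = ambiguous.
    (Frame rate and window length are illustrative assumptions.)
    """
    n_frames = int(frame_rate * window_s)
    return float(np.mean(p_next_other[:n_frames]))

def classify_cue(score, margin=0.1):
    """Map a score to hold / yield / ambiguous (thresholds are illustrative)."""
    if score > 0.5 + margin:
        return "yield"
    if score < 0.5 - margin:
        return "hold"
    return "ambiguous"

# Example: a synthetic utterance whose ending strongly signals a yield.
probs = np.full(100, 0.8)
print(classify_cue(turn_cue_score(probs)))  # → yield
```

Under this framing, a TTS system's turn-hold and turn-yield cues can be scored automatically over many simulated turns, without human listening tests.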

Place, publisher, year, edition, pages
International Speech Communication Association, 2023, p. 5481-5485
Keywords [en]
human-computer interaction, text-to-speech, turn-taking
National Category
Language Technology (Computational Linguistics); Computer Sciences; General Language Studies and Linguistics
Identifiers
URN: urn:nbn:se:kth:diva-337873
DOI: 10.21437/Interspeech.2023-2064
Scopus ID: 2-s2.0-85171597862
OAI: oai:DiVA.org:kth-337873
DiVA id: diva2:1803872
Conference
24th Conference of the International Speech Communication Association, Interspeech 2023, Dublin, Ireland, 20-24 Aug 2023
Note

QC 20231010

Available from: 2023-10-10. Created: 2023-10-10. Last updated: 2023-10-10. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text; Scopus

Authority records

Ekstedt, Erik; Wang, Siyang; Székely, Éva; Gustafsson, Joakim; Skantze, Gabriel
