Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-3513-4132
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-1175-840X
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-0397-6442
2023 (English). In: Interspeech 2023, International Speech Communication Association, 2023, pp. 5481-5485. Conference paper, Published paper (Refereed)
Abstract [en]

Turn-taking is a fundamental aspect of human communication in which speakers convey their intention to either hold or yield their turn through prosodic cues. Using the recently proposed Voice Activity Projection model, we propose an automatic evaluation approach to measure these aspects for conversational speech synthesis. We investigate the ability of three commercial and two open-source Text-To-Speech (TTS) systems to generate turn-taking cues over simulated turns. By varying the stimuli or controlling the prosody, we analyze the models' performance. We show that while commercial TTS systems largely provide appropriate cues, they often produce ambiguous signals, and that further improvements are possible. TTS systems trained on read or spontaneous speech produce strong turn-hold but weak turn-yield cues. We argue that this approach, which focuses on functional aspects of interaction, provides a useful addition to other important speech metrics, such as intelligibility and naturalness.
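
The evaluation idea described in the abstract lends itself to a simple scoring loop. The sketch below is a minimal illustration under assumed interfaces, not the authors' implementation: a VAP-style predictor is treated as a callable that returns per-frame hold/yield probabilities for a synthesized utterance, and utterances intended as holds (e.g. text cut off mid-sentence) or yields (complete turns) are scored by whether the dominant predicted cue matches the intent. All names here (predict_hold_shift, turn_cue_accuracy, dummy_vap) are hypothetical placeholders.

# Hypothetical sketch of a VAP-style turn-taking evaluation for TTS output.
# The model interface assumed below is a placeholder, not the authors' code.
import numpy as np


def predict_hold_shift(vap_model, waveform, tail_frames=10):
    """Return (p_hold, p_yield) estimated near the end of an utterance.

    Assumes `vap_model(waveform)` returns an array of shape (frames, 2) with
    per-frame probabilities that the current speaker keeps the turn (hold)
    versus the listener takes it (yield).
    """
    probs = np.asarray(vap_model(waveform))
    tail = probs[-tail_frames:]                  # average over the final frames
    p_hold, p_yield = tail.mean(axis=0)
    return float(p_hold), float(p_yield)


def turn_cue_accuracy(vap_model, utterances):
    """Fraction of synthesized utterances whose dominant cue matches the intent.

    `utterances` is a list of (waveform, label) pairs, where label is 'hold'
    (e.g. the text ends mid-sentence) or 'yield' (a complete turn).
    """
    correct = 0
    for waveform, label in utterances:
        p_hold, p_yield = predict_hold_shift(vap_model, waveform)
        predicted = "hold" if p_hold > p_yield else "yield"
        correct += int(predicted == label)
    return correct / max(len(utterances), 1)


if __name__ == "__main__":
    # Stand-in model producing random per-frame probabilities,
    # included only to make the call pattern runnable end to end.
    rng = np.random.default_rng(0)

    def dummy_vap(waveform):
        p = rng.random((50, 1))
        return np.hstack([p, 1.0 - p])

    fake_audio = np.zeros(16000)                 # 1 s of silence as placeholder input
    print(turn_cue_accuracy(dummy_vap, [(fake_audio, "hold"), (fake_audio, "yield")]))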

Place, publisher, year, edition, pages
International Speech Communication Association, 2023. pp. 5481-5485
Keywords [en]
human-computer interaction, text-to-speech, turn-taking
HSV category
Identifiers
URN: urn:nbn:se:kth:diva-337873
DOI: 10.21437/Interspeech.2023-2064
Scopus ID: 2-s2.0-85171597862
OAI: oai:DiVA.org:kth-337873
DiVA, id: diva2:1803872
Conference
24th International Speech Communication Association, Interspeech 2023, Dublin, Ireland, Aug 20 2023 - Aug 24 2023
Note

QC 20231010

Available from: 2023-10-10 Created: 2023-10-10 Last updated: 2023-10-10 Bibliographically approved

Open Access in DiVA

Full text is not available in DiVA

Other links

Publisher's full text
Scopus

Person

Ekstedt, Erik
Wang, Siyang
Székely, Éva
Gustafsson, Joakim
Skantze, Gabriel
