Situating speech synthesis: Investigating contextual factors in the evaluation of conversational TTS
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0001-9537-8505
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-0292-1164
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-0397-6442
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-1175-840X
2023 (English). In: Proceedings of the 12th ISCA Speech Synthesis Workshop (SSW), Grenoble, France, August 26–28, 2023, International Speech Communication Association, 2023, p. 69-74. Conference paper, Published paper (Refereed)
Abstract [en]

Speech synthesis evaluation methods have lagged behind the development of TTS systems, with single-sentence read-speech MOS naturalness evaluation on crowdsourcing platforms being the industry standard. For TTS to be applied successfully in social contexts, evaluation methods need to be socially embedded in the situation where they will be deployed. Due to the time and cost constraints of conducting an in-person interaction evaluation for TTS, we examine the effect of introducing situational context and preceding sentence context to participants in a subjective listening experiment. We conduct a suitability evaluation for a robot game guide that explains game rules to participants using two synthesized spontaneous voices: an instruction-specific and a general spontaneous voice. Results indicate that the inclusion of context influences user ratings, highlighting the need for context-aware evaluations. However, the type of context did not significantly affect the results.

Place, publisher, year, edition, pages
International Speech Communication Association, 2023, p. 69-74
Keywords [en]
speech synthesis, text to speech, evaluation, social, context
National Category
Natural Language Processing
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-365096
DOI: 10.21437/SSW.2023-11
OAI: oai:DiVA.org:kth-365096
DiVA, id: diva2:1972177
Conference
12th ISCA Speech Synthesis Workshop (SSW), Grenoble, France, August 26–28, 2023
Projects
VR-2019-05003, VR-2020-02396, P20-0298
Funder
Swedish Research Council, 2019-05003
Swedish Research Council, 2020-02396
Riksbankens Jubileumsfond, P20-0298
Note

QC 20250701

Available from: 2025-06-18. Created: 2025-06-18. Last updated: 2025-07-01. Bibliographically approved.

Authority records

Lameris, Harm; Kirkland, Ambika; Gustafsson, Joakim; Székely, Éva