kth.sePublications KTH
Operational message
There are currently operational disruptions. Troubleshooting is in progress.
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Assessing the impact of contextual framing on subjective TTS quality
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.ORCID iD: 0000-0001-9327-9482
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. Swedish Agency for Accessible Media (MTM), Sweden.ORCID iD: 0000-0002-9659-1532
University of Helsinki, Finland.
Bielefeld University, Germany.
2024 (English)In: Interspeech 2024, International Speech Communication Association , 2024, p. 1205-1209Conference paper, Published paper (Refereed)
Abstract [en]

Text-To-Speech (TTS) evaluations are habitually carried out without contextual and situational framing. Since humans adapt their speaking style to situation specific communicative needs, such evaluations may not generalize across situations. Without clearly defined framing, it is even unclear in which situations evaluation results hold at all. We test the hypothesized impact of framing on TTS evaluation in a crowdsourced MOS evaluation of four TTS voices, systematically varying (a) the intended TTS task (domestic humanoid robot, child's voice replacement, fiction audio books and long and information-rich texts) and (b) the framing of that task. The results show that framing differentiated MOS responses, with individual TTS performance varying significantly across tasks and framings. This corroborates the assumption that decontextualized MOS evaluations do not generalize, and suggests that TTS evaluations should not be reported without the type of framing that was employed, if any.

Place, publisher, year, edition, pages
International Speech Communication Association , 2024. p. 1205-1209
Keywords [en]
evaluation, framing, methodology, MOS
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:kth:diva-358870DOI: 10.21437/Interspeech.2024-781ISI: 001331850101070Scopus ID: 2-s2.0-85214812427OAI: oai:DiVA.org:kth-358870DiVA, id: diva2:1930523
Conference
25th Interspeech Conferece 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250127

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-12-08Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full textScopus

Authority records

Edlund, JensTånnander, Christina

Search in DiVA

By author/editor
Edlund, JensTånnander, Christina
By organisation
Speech, Music and Hearing, TMH
Natural Language Processing

Search outside of DiVA

GoogleGoogle Scholar

doi
urn-nbn

Altmetric score

doi
urn-nbn
Total: 90 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf