Beyond the listening test: An interactive approach to TTS Evaluation
KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
2017 (English)
In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association, 2017, Vol. 2017, p. 249-253
Conference paper (Refereed)
Abstract [en]

Traditionally, subjective text-to-speech (TTS) evaluation is performed through audio-only listening tests, where participants evaluate unrelated, context-free utterances. The ecological validity of these tests is questionable, as they do not represent real-world end-use scenarios. In this paper, we examine a novel approach to TTS evaluation in an imagined end-use, via a complex interaction with an avatar. Six voice conditions were tested: natural speech, unit selection, and parametric synthesis, each in neutral and expressive realizations. Results were compared to a traditional audio-only evaluation baseline. Participants in both studies rated the voices for naturalness and expressivity. The baseline study showed canonical results for naturalness: natural speech scored highest, followed by unit selection, then parametric synthesis. Expressivity was clearly distinguishable in all conditions. In the avatar interaction study, participants rated naturalness in the same order as the baseline, though with a smaller effect size; expressivity was not distinguishable. Further, no significant correlations were found between cognitive or affective responses and any voice condition. This highlights two primary challenges in designing more valid TTS evaluations: first, in real-world use cases involving interaction, listeners generally interact with a single voice, making comparative analysis unfeasible; second, in complex interactions, the context and content may confound perception of voice quality.

Place, publisher, year, edition, pages
International Speech Communication Association, 2017. Vol. 2017, p. 249-253
Series
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, ISSN 2308-457X
Keyword [en]
Expressive speech synthesis, Human-computer interaction, Interactive virtual agents, Listening tests, Statistical parametric speech synthesis, Subjective evaluation, TTS evaluation, Unit selection, User experience, Voice interface design
National Category
Human Computer Interaction
Identifiers
URN: urn:nbn:se:kth:diva-220753
DOI: 10.21437/Interspeech.2017-1438
Scopus ID: 2-s2.0-85039163349
OAI: oai:DiVA.org:kth-220753
DiVA, id: diva2:1170929
Conference
18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, Stockholm, Sweden, 20 August 2017 through 24 August 2017
Note

QC 20180105

Available from: 2018-01-05 Created: 2018-01-05 Last updated: 2018-01-13
Bibliographically approved

Author/editor
Mendelson, Joseph
