The use of variable length stimuli for assessing segmental distortion in TTS evaluation
Sigmedia Lab, School of Engineering, Trinity College Dublin, Ireland.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0001-9327-9482
Sigmedia Lab, School of Engineering, Trinity College Dublin, Ireland; University of Helsinki, Helsinki, Finland.
Sigmedia Lab, School of Engineering, Trinity College Dublin, Ireland.
2026 (English). In: Computer speech & language (Print), ISSN 0885-2308, E-ISSN 1095-8363, Vol. 97, article id 101894. Article in journal (Refereed), Published
Abstract [en]

This paper presents the use of variable-length stimuli for assessing segmental distortion in Text-to-Speech synthesizers. The design builds on stimulus accumulation, a well-established principle in psychophysics. Stimulus length is varied logarithmically, in accordance with the Weber–Fechner law. Listener opinion is collected with a two-alternative forced-choice (2AFC) task, a binary format that sidesteps the vagueness of the term “naturalness”. Stimulus length did not reliably affect participants’ accuracy in the task, but the concentration of voiceless obstruents had a significant effect: participants were consistently more accurate in identifying WaveNet stimuli as machine-made when the phrases were obstruent-rich. These findings show that the deviation in obstruents reported in WaveNet voices is perceivable by human listeners. The subjective listening test shows trends similar to those of Mean Opinion Score evaluation, suggesting that the design may be useful to the wider Text-to-Speech evaluation community.
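
As a rough illustration of the design described in the abstract, the sketch below generates stimulus lengths with a constant ratio between successive steps (equal spacing on a logarithmic scale) and tallies per-condition accuracy from binary forced-choice responses. The function names, the example lengths, and the condition labels are hypothetical. The abstract does not state the actual lengths, the number of steps, or the analysis used in the study.

```python
from collections import defaultdict


def log_spaced_lengths(shortest, longest, steps):
    """Return `steps` lengths from `shortest` to `longest`, evenly spaced
    on a log scale (each length is a constant multiple of the previous one).
    The units and values are assumptions, not figures from the paper."""
    if steps < 2 or shortest <= 0 or longest <= shortest:
        raise ValueError("need steps >= 2 and 0 < shortest < longest")
    ratio = (longest / shortest) ** (1 / (steps - 1))
    return [shortest * ratio ** i for i in range(steps)]


def accuracy_by_condition(trials):
    """Compute the proportion of correct responses per condition.

    `trials` is an iterable of (condition, is_correct) pairs, one per
    two-alternative forced-choice response. With two alternatives,
    chance performance is 0.5.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for condition, is_correct in trials:
        total[condition] += 1
        correct[condition] += int(is_correct)
    return {c: correct[c] / total[c] for c in total}


# Hypothetical usage: four length levels and two phrase types.
lengths = log_spaced_lengths(1.0, 8.0, 4)
responses = [
    ("obstruent-rich", True),
    ("obstruent-rich", True),
    ("obstruent-poor", True),
    ("obstruent-poor", False),
]
accuracy = accuracy_by_condition(responses)
```

Comparing each condition's accuracy against the 0.5 chance level is one simple way to read such data. Whether a difference is reliable would need a proper statistical test, which this sketch does not attempt.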

Place, publisher, year, edition, pages
Elsevier BV, 2026. Vol. 97, article id 101894
Keywords [en]
Naturalness, Neural TTS, Obstruents, Segmental evaluation, Sonorants, Text-to-speech evaluation
National Category
Psychology
Identifiers
URN: urn:nbn:se:kth:diva-373143
DOI: 10.1016/j.csl.2025.101894
ISI: 001607689800001
Scopus ID: 2-s2.0-105020921824
OAI: oai:DiVA.org:kth-373143
DiVA, id: diva2:2015477
Note

QC 20251121

Available from: 2025-11-21 Created: 2025-11-21 Last updated: 2025-11-21 Bibliographically approved

Open Access in DiVA

No full text in DiVA

Authority records

Edlund, Jens
