This paper presents the use of variable-length stimuli for assessing segmental distortion in Text-to-Speech synthesizers. The design is based on the well-established psychophysical principle of stimulus accumulation. Stimulus length is varied logarithmically, in accordance with the Weber–Fechner law. Listener opinion is collected with a two-alternative forced-choice task, which sidesteps the vagueness of the term “naturalness”. Stimulus length did not reliably affect participants’ accuracy in the task, but the concentration of voiceless obstruents had a significant effect: participants were consistently more accurate in identifying WaveNet stimuli as machine-made when the phrases were obstruent-rich. These findings indicate that the deviation in obstruents reported in WaveNet voices is perceivable by human listeners. The results of the listening test show trends similar to those of Mean Opinion Score evaluation, suggesting that the design may be useful to the wider Text-to-Speech evaluation community.