Stress manipulation in text-to-speech synthesis using speaking rate categories
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology. ORCID iD: 0000-0002-9659-1532
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0001-9327-9482
2021 (English). In: Proceedings of Fonetik 2021, Centre for Languages and Literature, Lund University / [ed] Anna Hjortdal and Mikael Roll, Lund, 2021, Vol. 56, p. 17-22. Conference paper, Published paper (Other academic)
Abstract [en]

The challenge of controlling prosody in text-to-speech (TTS) systems is as old as TTS itself. The problem is not just to know what the desired stress or intonation patterns are, nor is it limited to knowing how to control specific speech parameters (e.g. durations, amplitude and fundamental frequency). We also need to know the precise speech parameter settings that correspond to a certain stress or intonation pattern over entire utterances.

We propose that the powerful TTS models afforded by deep neural networks (DNNs), combined with the fact that speech parameters are often correlated and vary in orchestration, allow us to solve at least some stress and intonation problems by influencing a single easy-to-control parameter, rather than exercising detailed control over many parameters.

The paper presents a straightforward method of guiding word durations without recording training material specifically for this purpose. The resulting TTS engine is used to produce sentences containing Swedish words that are unstressed in their most common function, but stressed in another common function. The sentences are designed so that it is clear to a listener that the second function is the intended one. In these cases, TTS engines often fail and produce an unstressed version.

A group of 20 listeners compared samples that the TTS produced without guidance with samples where it was instructed to slow down the test words. The listeners almost unanimously preferred the latter version. This supports the notion that, due to the orchestrated variation of speech characteristics and the strength of modern DNN models, we can provide prosodic guidance to DNN-based TTS systems without having to control every characteristic in detail.
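
The idea of guiding word durations through speaking rate categories can be illustrated with a minimal sketch. The abstract does not say how the categories are derived or how they are passed to the TTS engine, so everything below is an assumption made for illustration: the characters-per-second rate measure, the quantile-based binning, the `<rN>` tag syntax, the example word durations, and all function names are hypothetical and are not taken from the paper.

```python
from bisect import bisect_right
from statistics import quantiles


def rate_boundaries(rates, n_categories=3):
    """Cut points that split the observed rates into equally populated bins."""
    return quantiles(rates, n=n_categories)


def rate_category(rate, boundaries):
    """Map a rate to a category index: 0 is the slowest, len(boundaries) the fastest."""
    return bisect_right(boundaries, rate)


def annotate(words, categories):
    """Attach a (hypothetical) rate tag to every word, e.g. 'han<r0>'."""
    return " ".join(f"{w}<r{c}>" for w, c in zip(words, categories))


def slow_down(words, categories, target_index, slow_category=0):
    """Return a copy of the categories with one word forced into the slow category."""
    guided = list(categories)
    guided[target_index] = slow_category
    return guided


# Hypothetical word-level alignments from existing recordings: (word, seconds).
aligned = [("det", 0.12), ("var", 0.15), ("han", 0.30), ("som", 0.14), ("kom", 0.35)]
words = [w for w, _ in aligned]

# Assumed rate proxy: characters per second (phones per second would also work).
rates = [len(w) / d for w, d in aligned]
bounds = rate_boundaries(rates, n_categories=3)
cats = [rate_category(r, bounds) for r in rates]

# Training text is labelled with the categories observed in the recordings.
training_line = annotate(words, cats)

# At synthesis time, a test word is requested in the slowest category.
guided_line = annotate(words, slow_down(words, cats, target_index=0))
print(training_line)
print(guided_line)
```

In this sketch the labels come from recordings that already exist, which mirrors the abstract's point that no material has to be recorded specifically for this purpose; the model would learn what each category sounds like and the slow category can then be requested for a word that should be stressed.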

Place, publisher, year, edition, pages
Lund, 2021. Vol. 56, p. 17-22
National Category
Other Engineering and Technologies
Research subject
Speech and Music Communication
Identifiers
URN: urn:nbn:se:kth:diva-304363
OAI: oai:DiVA.org:kth-304363
DiVA, id: diva2:1608077
Conference
Fonetik 2021, Date 8-9 June 2021
Note

QC 20211216

Available from: 2021-11-02. Created: 2021-11-02. Last updated: 2025-02-10. Bibliographically approved

Open Access in DiVA

No full text in DiVA

Authority records

Tånnander, Christina; Edlund, Jens
