Syllable duration as a proxy to latent prosodic features
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-9659-1532
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-4628-3769
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0001-9327-9482
2022 (English). In: Proceedings of Speech Prosody 2022, Lisbon, Portugal: International Speech Communication Association, 2022, p. 220-224. Conference paper, Published paper (Refereed)
Abstract [en]

Recent advances in deep-learning have pushed text-to-speech synthesis (TTS) very close to human speech. In deep-learning, latent features refer to features that are hidden from us; notwithstanding, we may meaningfully observe their effects. Analogously, latent prosodic features refer to the exact features that constitute e.g. prominence that are unknown to us, although we know (some of) the functions of prominence and (some of) its acoustic correlates. Deep-learned speech models capture prosody well, but leave us with little control and few insights. Previously, we explored average syllable duration on word level - a simple and accessible metric - as a proxy for prominence: in Swedish TTS, where verb particles and numerals tend to receive too little prominence, these were nudged towards lengthening while allowing the TTS models to otherwise operate freely. Listener panels overwhelmingly preferred the nudged versions to the unmodified TTS. In this paper, we analyse utterances from the modified TTS. The analysis shows that duration-nudging of relevant words changes the following features in an observable manner: duration is predictably lengthened, word-initial glottalization occurs, and the general intonation pattern changes. This supports the view of latent prosodic features that can be reflected in deep-learned models and accessed by proxy.
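As a rough illustration of the metric named in the abstract, average syllable duration on word level can be read as a word's total duration divided by its syllable count. The sketch below is an assumption-laden toy, not the authors' implementation: the `Word` class, the `nudge_target` helper, the scaling factor, and the example durations are all hypothetical and chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    duration_s: float  # total word duration in seconds (hypothetical values below)
    n_syllables: int   # number of syllables in the word


def avg_syllable_duration(word: Word) -> float:
    """Average syllable duration on word level: word duration / syllable count."""
    if word.n_syllables <= 0:
        raise ValueError("a word needs at least one syllable")
    return word.duration_s / word.n_syllables


def nudge_target(word: Word, factor: float = 1.2) -> float:
    """Hypothetical lengthening target: scale the average syllable duration.

    The factor is an illustrative assumption, not a value from the paper.
    """
    return avg_syllable_duration(word) * factor


# Illustrative Swedish examples: a verb particle and a numeral.
particle = Word(text="upp", duration_s=0.18, n_syllables=1)
numeral = Word(text="fyrtio", duration_s=0.40, n_syllables=2)

for w in (particle, numeral):
    print(w.text, avg_syllable_duration(w), nudge_target(w))
```

In this reading, only the duration target of the selected words changes; everything else is left to the TTS model, which is consistent with the abstract's description of nudging while "allowing the TTS models to otherwise operate freely".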

Place, publisher, year, edition, pages
Lisbon, Portugal: International Speech Communication Association, 2022. p. 220-224
National Category
Other Humanities not elsewhere specified
Research subject
Speech and Music Communication
Identifiers
URN: urn:nbn:se:kth:diva-314984
DOI: 10.21437/SpeechProsody.2022-45
Scopus ID: 2-s2.0-85166333598
OAI: oai:DiVA.org:kth-314984
DiVA, id: diva2:1677198
Conference
Speech Prosody 2022, 23-26 May 2022, Lisbon, Portugal
Note

QC 20220628

Available from: 2022-06-27. Created: 2022-06-27. Last updated: 2024-08-28. Bibliographically approved.

Open Access in DiVA

No full text in DiVA


Authority records

Tånnander, Christina; House, David; Edlund, Jens
