Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-9659-1532
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-1886-681X
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-1399-6604
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0001-9327-9482
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, p. 2815-2819. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce continuous phonological features as input to TTS with the dual objective of more precise control over phonological aspects and better potential for exploration of latent features in TTS models for speech science purposes. In our framework, the TTS is conditioned on continuous values between 0.0 and 1.0, where each phoneme has a specified position on each feature axis. We chose 11 features to represent US English and trained a voice with Matcha-TTS. Effectiveness was assessed by investigating two selected features in two ways: through a categorical perception experiment confirming the expected alignment of feature positions and phoneme perception, and through analysis of acoustic correlates confirming a gradual, monotonic change of acoustic features consistent with changes in the phonemic input features.
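
The conditioning scheme described in the abstract, where each phoneme has a position between 0.0 and 1.0 on each feature axis, can be illustrated with a minimal Python sketch. The abstract does not list the 11 features or their values, so the feature names, the phoneme positions, and the helper `interpolate` below are hypothetical and serve only to show how intermediate inputs between two phonemes might be generated; this is not the authors' implementation and does not use the Matcha-TTS API.

```python
# Hypothetical feature axes. The paper uses 11 features for US English,
# which are not named in the abstract; these three are placeholders.
FEATURES = ("voicing", "nasality", "height")

# Hypothetical phoneme positions on each axis, all within [0.0, 1.0].
PHONEME_FEATURES = {
    "s": (0.0, 0.0, 0.5),
    "z": (1.0, 0.0, 0.5),
    "n": (1.0, 1.0, 0.5),
}


def interpolate(src: str, dst: str, steps: int) -> list[tuple[float, ...]]:
    """Return `steps` feature vectors moving linearly from src to dst.

    The first vector equals the src phoneme's position and the last
    equals the dst phoneme's position.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    a = PHONEME_FEATURES[src]
    b = PHONEME_FEATURES[dst]
    continuum = []
    for i in range(steps):
        t = i / (steps - 1)  # fraction of the way from src to dst
        continuum.append(tuple(x + (y - x) * t for x, y in zip(a, b)))
    return continuum


# A five-step continuum from /s/ to /z/ changes only the voicing axis here.
s_to_z = interpolate("s", "z", 5)
```

Each vector in such a continuum could be used as model input in place of a discrete phoneme symbol, which is the kind of stimulus a categorical perception experiment or an analysis of acoustic correlates would need.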

Place, publisher, year, edition, pages
International Speech Communication Association, 2024. p. 2815-2819
Keywords [en]
analysis-by-synthesis, controllable text-to-speech synthesis, phonological features
National Category
Natural Language Processing; Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-358877
DOI: 10.21437/Interspeech.2024-1565
Scopus ID: 2-s2.0-85214785956
OAI: oai:DiVA.org:kth-358877
DiVA, id: diva2:1930530
Conference
25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024
Note

QC 20250128

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-01-28 Bibliographically approved

Open Access in DiVA

No full text in DiVA

Authority records

Tånnander, Christina; Mehta, Shivam; Beskow, Jonas; Edlund, Jens
