KTH Publications (kth.se)
Tånnander, Christina, Doctoral student. ORCID iD: orcid.org/0000-0002-9659-1532
Publications (10 of 27)
Tånnander, C., House, D., Beskow, J. & Edlund, J. (2025). Intrasentential English in Swedish TTS: perceived English-accentedness. In: Interspeech 2025. Paper presented at the 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 1638-1642). International Speech Communication Association
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, pp. 1638-1642. Conference paper, Published paper (Refereed)
Abstract [en]

English names and expressions are frequently inserted into Swedish text. Humans intuitively adjust the degree of English pronunciation of such insertions. This work aims at a Swedish text-to-speech synthesis (TTS) capable of similar controlled adaptation. We focus on two key aspects: (1) the development of a TTS system with controllable degrees of perceived English-accentedness (PEA); and (2) the exploration of human preferences related to PEA. We trained a Swedish TTS voice on Swedish and English sentences with a conditioning parameter for language (English-accentedness, EA) on a scale from 0 to 1, and estimated a psychometric mapping of the perceived effect of EA to a perceptual scale (PEA) through perception tests. PEA was then used in Best-Worst listening tests presenting English insertions with varying PEA. The results confirm the effectiveness of the training and the PEA scale, and that listener preferences change with different insertions.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
controllable TTS, mixed language, read speech
National subject category
Language Processing and Computational Linguistics
Identifiers
urn:nbn:se:kth:diva-372797 (URN); 10.21437/Interspeech.2025-762 (DOI); 2-s2.0-105020040227 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251118

Available from: 2025-11-18. Created: 2025-11-18. Last updated: 2025-11-18. Bibliographically approved
Edlund, J., Tånnander, C., Le Maguer, S. & Wagner, P. (2024). Assessing the impact of contextual framing on subjective TTS quality. In: Interspeech 2024. Paper presented at the 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 1205-1209). International Speech Communication Association
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, pp. 1205-1209. Conference paper, Published paper (Refereed)
Abstract [en]

Text-To-Speech (TTS) evaluations are habitually carried out without contextual and situational framing. Since humans adapt their speaking style to situation specific communicative needs, such evaluations may not generalize across situations. Without clearly defined framing, it is even unclear in which situations evaluation results hold at all. We test the hypothesized impact of framing on TTS evaluation in a crowdsourced MOS evaluation of four TTS voices, systematically varying (a) the intended TTS task (domestic humanoid robot, child's voice replacement, fiction audio books and long and information-rich texts) and (b) the framing of that task. The results show that framing differentiated MOS responses, with individual TTS performance varying significantly across tasks and framings. This corroborates the assumption that decontextualized MOS evaluations do not generalize, and suggests that TTS evaluations should not be reported without the type of framing that was employed, if any.

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
evaluation, framing, methodology, MOS
National subject category
Language Processing and Computational Linguistics
Identifiers
urn:nbn:se:kth:diva-358870 (URN); 10.21437/Interspeech.2024-781 (DOI); 001331850101070 (); 2-s2.0-85214812427 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250127

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-12-08. Bibliographically approved
Tånnander, C., Mehta, S., Beskow, J. & Edlund, J. (2024). Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis. In: Interspeech 2024. Paper presented at the 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 2815-2819). International Speech Communication Association
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, pp. 2815-2819. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce continuous phonological features as input to TTS with the dual objective of more precise control over phonological aspects and better potential for exploration of latent features in TTS models for speech science purposes. In our framework, the TTS is conditioned on continuous values between 0.0 and 1.0, where each phoneme has a specified position on each feature axis. We chose 11 features to represent US English and trained a voice with Matcha-TTS. Effectiveness was assessed by investigating two selected features in two ways: through a categorical perception experiment confirming the expected alignment of feature positions and phoneme perception, and through analysis of acoustic correlates confirming a gradual, monotonic change of acoustic features consistent with changes in the phonemic input features.

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
analysis-by-synthesis, controllable text-to-speech synthesis, phonological features
National subject category
Language Processing and Computational Linguistics; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-358877 (URN); 10.21437/Interspeech.2024-1565 (DOI); 001331850102192 (); 2-s2.0-85214785956 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250128

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-12-08. Bibliographically approved
Tånnander, C., O'Regan, J., House, D., Edlund, J. & Beskow, J. (2024). Prosodic characteristics of English-accented Swedish neural TTS. In: Proceedings of Speech Prosody 2024. Paper presented at Speech Prosody 2024, Leiden, The Netherlands, 2-5 July 2024 (pp. 1035-1039). Leiden, The Netherlands: International Speech Communication Association
2024 (English). In: Proceedings of Speech Prosody 2024, Leiden, The Netherlands: International Speech Communication Association, 2024, pp. 1035-1039. Conference paper, Published paper (Refereed)
Abstract [en]

Neural text-to-speech synthesis (TTS) captures prosodic features strikingly well, notwithstanding the lack of prosodic labels in training or synthesis. We trained a voice on a single Swedish speaker reading in Swedish and English. The resulting TTS allows us to control the degree of English-accentedness in Swedish sentences. English-accented Swedish commonly exhibits well-known prosodic characteristics such as erroneous tonal accents and understated or missed durational differences. TTS quality was verified in three ways. Automatic speech recognition resulted in low errors, verifying intelligibility. Automatic language classification had Swedish as the majority choice, while the likelihood of English increased with our targeted degree of English-accentedness. Finally, a rank of perceived English-accentedness acquired through pairwise comparisons by 20 human listeners demonstrated a strong correlation with the targeted English-accentedness.

We report on phonetic and prosodic analyses of the accented TTS. In addition to the anticipated segmental differences, the analyses revealed temporal and prominence-related variations coherent with Swedish spoken by English speakers, such as missing Swedish stress patterns and overly reduced unstressed syllables. With this work, we aim to glean insights into speech prosody from the latent prosodic features of neural TTS models. In addition, it will help implement speech phenomena such as code switching in TTS.

Place, publisher, year, edition, pages
Leiden, The Netherlands: International Speech Communication Association, 2024
Keywords
foreign-accented text-to-speech synthesis, neural text-to-speech synthesis, latent prosodic features
National subject category
Humanities and the Arts; Comparative Language Studies and General Linguistics
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-349946 (URN); 10.21437/SpeechProsody.2024-209 (DOI); 2-s2.0-105008058763 (Scopus ID)
Conference
Speech Prosody 2024, Leiden, The Netherlands, 2-5 July 2024
Project
Deep learning based speech synthesis for reading aloud of lengthy and information rich texts in Swedish (2018-02427); Språkbanken Tal (2017-00626)
Funder
Vinnova, 2018-02427
Note

QC 20240705

Available from: 2024-07-03. Created: 2024-07-03. Last updated: 2025-07-01. Bibliographically approved
Tånnander, C., Edlund, J. & Gustafsson, J. (2024). Revisiting Three Text-to-Speech Synthesis Experiments with a Web-Based Audience Response System. In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings. Paper presented at Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, May 20-25, 2024 (pp. 14111-14121). European Language Resources Association (ELRA)
2024 (English). In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings, European Language Resources Association (ELRA), 2024, pp. 14111-14121. Conference paper, Published paper (Refereed)
Abstract [en]

In order to investigate the strengths and weaknesses of Audience Response Systems (ARS) in text-to-speech synthesis (TTS) evaluations, we revisit three previously published TTS studies and perform an ARS-based evaluation on the stimuli used in each study. The experiments are performed with a participant pool of 39 respondents, using a web-based tool that emulates an ARS experiment. The results of the first experiment confirm that ARS is highly useful for evaluating long and continuous stimuli, particularly if we wish for a diagnostic result rather than a single overall metric, while the second and third experiments highlight weaknesses in ARS with unsuitable materials as well as the importance of framing and instruction when conducting ARS-based evaluation.

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2024
Keywords
audience response system, evaluation methodology, TTS evaluation
National subject category
Language Processing and Computational Linguistics
Identifiers
urn:nbn:se:kth:diva-348784 (URN); 2-s2.0-85195897862 (Scopus ID)
Conference
Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, May 20-25, 2024
Note

Part of ISBN 9782493814104

QC 20240701

Available from: 2024-06-27. Created: 2024-06-27. Last updated: 2025-02-07. Bibliographically approved
Tånnander, C., House, D. & Edlund, J. (2023). Analysis-by-synthesis: phonetic-phonological variation in deep neural network-based text-to-speech synthesis. In: Radek Skarnitzl and Jan Volín (Ed.), Proceedings of the 20th International Congress of Phonetic Sciences, Prague 2023. Paper presented at the 20th International Congress of Phonetic Sciences (ICPhS), August 7-11, 2023, Prague, Czech Republic (pp. 3156-3160). Prague, Czech Republic: GUARANT International
2023 (English). In: Proceedings of the 20th International Congress of Phonetic Sciences, Prague 2023 / [ed] Radek Skarnitzl and Jan Volín, Prague, Czech Republic: GUARANT International, 2023, pp. 3156-3160. Conference paper, Published paper (Refereed)
Abstract [en]

Text-to-speech synthesis based on deep neural networks can generate highly humanlike speech, which revitalizes the potential for analysis-by-synthesis in speech research. We propose that neural synthesis can provide evidence that a specific distinction in its transcription system represents a robust acoustic/phonetic distinction in the speech used to train the model.

We synthesized utterances with allophones in incorrect contexts and analyzed the results phonetically. Our assumption was that if we gained control over the allophonic variation in this way, it would provide strong evidence that the variation is governed robustly by the phonological context used to create the transcriptions.

Of three allophonic variations investigated, the first, which was believed to be quite robust, gave us robust control over the variation, while the other two, which are less categorical, did not afford us such control. These findings are consistent with our hypothesis and support the notion that neural TTS can be a valuable analysis-by-synthesis tool for speech research.

Place, publisher, year, edition, pages
Prague, Czech Republic: GUARANT International, 2023
Keywords
analysis-by-synthesis, latent phonetic features, phonological variation, neural TTS
National subject category
Other Engineering and Technologies
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-336586 (URN)
Conference
20th International Congress of Phonetic Sciences (ICPhS), August 7-11, 2023, Prague, Czech Republic
Funder
Vinnova, 2018-02427
Note

Part of ISBN 978-80-908114-2-3

QC 20230915

Available from: 2023-09-14. Created: 2023-09-14. Last updated: 2025-02-10. Bibliographically approved
Tånnander, C. & Edlund, J. (2022). Mapping specific characteristics of spoken text to listener ratings. In: Proceedings of Fonetik 2022. Paper presented at the phonetics meeting Fonetik 2022, 13-15 June 2022, at Tal, musik och hörsel (Speech, Music and Hearing), Kungliga Tekniska Högskolan (KTH Royal Institute of Technology). Stockholm, Sweden
2022 (English). In: Proceedings of Fonetik 2022, Stockholm, Sweden, 2022. Conference paper, Published paper (Other academic)
Place, publisher, year, edition, pages
Stockholm, Sweden, 2022
National subject category
Language Processing and Computational Linguistics
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-314986 (URN)
Conference
Phonetics meeting Fonetik 2022, 13-15 June 2022, at Tal, musik och hörsel (Speech, Music and Hearing), Kungliga Tekniska Högskolan (KTH Royal Institute of Technology)
Note

QCR 20220628

Available from: 2022-06-27. Created: 2022-06-27. Last updated: 2025-02-07. Bibliographically approved
Tånnander, C. & Edlund, J. (2022). Sardin: speech-oriented text processing. In: Proceedings of Fonetik 2022. Paper presented at the phonetics meeting Fonetik 2022, 13-15 June 2022, at Tal, musik och hörsel (Speech, Music and Hearing), Kungliga Tekniska Högskolan (KTH Royal Institute of Technology). Stockholm, Sweden
2022 (English). In: Proceedings of Fonetik 2022, Stockholm, Sweden, 2022. Conference paper, Published paper (Other academic)
Place, publisher, year, edition, pages
Stockholm, Sweden, 2022
National subject category
Language Processing and Computational Linguistics
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-314985 (URN)
Conference
Phonetics meeting Fonetik 2022, 13-15 June 2022, at Tal, musik och hörsel (Speech, Music and Hearing), Kungliga Tekniska Högskolan (KTH Royal Institute of Technology)
Note

QCR 20220628

Available from: 2022-06-27. Created: 2022-06-27. Last updated: 2025-02-07. Bibliographically approved
Tånnander, C., House, D. & Edlund, J. (2022). Syllable duration as a proxy to latent prosodic features. In: Proceedings of Speech Prosody 2022. Paper presented at Speech Prosody 2022, 23-26 May 2022, Lisbon, Portugal (pp. 220-224). Lisbon, Portugal: International Speech Communication Association
2022 (English). In: Proceedings of Speech Prosody 2022, Lisbon, Portugal: International Speech Communication Association, 2022, pp. 220-224. Conference paper, Published paper (Refereed)
Abstract [en]

Recent advances in deep-learning have pushed text-to-speech synthesis (TTS) very close to human speech. In deep-learning, latent features refer to features that are hidden from us; notwithstanding, we may meaningfully observe their effects. Analogously, latent prosodic features refer to the exact features that constitute e.g. prominence that are unknown to us, although we know (some of) the functions of prominence and (some of) its acoustic correlates. Deep-learned speech models capture prosody well, but leave us with little control and few insights. Previously, we explored average syllable duration on word level - a simple and accessible metric - as a proxy for prominence: in Swedish TTS, where verb particles and numerals tend to receive too little prominence, these were nudged towards lengthening while allowing the TTS models to otherwise operate freely. Listener panels overwhelmingly preferred the nudged versions to the unmodified TTS. In this paper, we analyse utterances from the modified TTS. The analysis shows that duration-nudging of relevant words changes the following features in an observable manner: duration is predictably lengthened, word-initial glottalization occurs, and the general intonation pattern changes. This supports the view of latent prosodic features that can be reflected in deep-learned models and accessed by proxy.

Place, publisher, year, edition, pages
Lisbon, Portugal: International Speech Communication Association, 2022
National subject category
Other Humanities not elsewhere specified
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-314984 (URN); 10.21437/SpeechProsody.2022-45 (DOI); 2-s2.0-85166333598 (Scopus ID)
Conference
Speech Prosody 2022, 23-26 May 2022, Lisbon, Portugal
Note

QC 20220628

Available from: 2022-06-27. Created: 2022-06-27. Last updated: 2024-08-28. Bibliographically approved
Tånnander, C. & Edlund, J. (2022). Towards a Swedish test set for speech-oriented text normalisation. Paper presented at Swedish Language Technology Conference (SLTC), November 18-20 2020, Göteborg. Göteborg: Göteborgs universitet
2022 (English). Conference paper, Published paper (Other academic)
Abstract [en]

Text-to-speech synthesis (TTS) can be split into two steps: the preprocessor, which takes input text, including its encoding and formatting, and turns it into a representation that is accepted by the synthesizer, which in turn converts this representation into an acoustic waveform representing speech. TTS is commonly evaluated in terms of how intelligible or humanlike the speech is, where different synthesizers working on the same input representation are regularly compared, whereas the preprocessing is habitually ignored in TTS evaluation. Were we to evaluate preprocessing, we could evaluate it as a whole (e.g. compare its output for some input representation to a target phonemic representation) or as individual processes such as sentence detection, tokenisation, text normalisation (TN) and pronunciation generation.

This paper focuses on the evaluation of speech-oriented text normalisation (STN), that is, the conversion of the input text into an expanded string of the words to be spoken, for example expansions of abbreviations and different types of numerals. It is a request for comments for the creation of a test set for the evaluation of Swedish STN, which can be used as a baseline for future STN models, and as part of the overall evaluation of Swedish speech-oriented preprocessing.

Place, publisher, year, edition, pages
Göteborg: Göteborgs universitet, 2022
Keywords
speech-oriented text processing, test set
National subject category
Language Processing and Computational Linguistics
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-323669 (URN)
Conference
Swedish Language Technology Conference (SLTC), November 18-20 2020, Göteborg
Funder
Vinnova, 2018-02427
Note

QC 20230215

Available from: 2023-02-08. Created: 2023-02-08. Last updated: 2025-02-07. Bibliographically approved