Edlund, Jens, Docent/Associate Professor (ORCID iD: orcid.org/0000-0001-9327-9482)
Publications (10 of 152)
Edlund, J., Tånnander, C., Le Maguer, S. & Wagner, P. (2024). Assessing the impact of contextual framing on subjective TTS quality. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 1205-1209). International Speech Communication Association
Assessing the impact of contextual framing on subjective TTS quality
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, p. 1205-1209. Conference paper, Published paper (Refereed)
Abstract [en]

Text-To-Speech (TTS) evaluations are habitually carried out without contextual and situational framing. Since humans adapt their speaking style to situation specific communicative needs, such evaluations may not generalize across situations. Without clearly defined framing, it is even unclear in which situations evaluation results hold at all. We test the hypothesized impact of framing on TTS evaluation in a crowdsourced MOS evaluation of four TTS voices, systematically varying (a) the intended TTS task (domestic humanoid robot, child's voice replacement, fiction audio books and long and information-rich texts) and (b) the framing of that task. The results show that framing differentiated MOS responses, with individual TTS performance varying significantly across tasks and framings. This corroborates the assumption that decontextualized MOS evaluations do not generalize, and suggests that TTS evaluations should not be reported without the type of framing that was employed, if any.
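
For readers unfamiliar with MOS bookkeeping, here is a minimal sketch, not the authors' code, of how crowdsourced ratings could be pooled per voice, task and framing condition to expose the kind of framing effect the abstract reports; all condition names and ratings below are illustrative.

```python
# Minimal sketch: pool 1-5 MOS ratings per (voice, task, framing) condition.
# All names and numbers below are invented, not the study's data.
from collections import defaultdict
from statistics import mean

# Each response: (voice, task, framing, rating on the 1-5 MOS scale)
responses = [
    ("voice_a", "audiobook", "framed", 4),
    ("voice_a", "audiobook", "unframed", 3),
    ("voice_a", "robot", "framed", 2),
    ("voice_a", "robot", "unframed", 4),
]

by_condition = defaultdict(list)
for voice, task, framing, rating in responses:
    by_condition[(voice, task, framing)].append(rating)

# A framing effect shows up as the same voice scoring differently on the
# same task depending on how that task was framed to the raters.
for (voice, task, framing), ratings in sorted(by_condition.items()):
    print(f"{voice}/{task}/{framing}: MOS {mean(ratings):.2f} (n={len(ratings)})")
```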

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
evaluation, framing, methodology, MOS
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-358870 (URN); 10.21437/Interspeech.2024-781 (DOI); 2-s2.0-85214812427 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250127

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-01-27. Bibliographically approved.
Tånnander, C., Mehta, S., Beskow, J. & Edlund, J. (2024). Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024 (pp. 2815-2819). International Speech Communication Association
Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, p. 2815-2819. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce continuous phonological features as input to TTS with the dual objective of more precise control over phonological aspects and better potential for exploration of latent features in TTS models for speech science purposes. In our framework, the TTS is conditioned on continuous values between 0.0 and 1.0, where each phoneme has a specified position on each feature axis. We chose 11 features to represent US English and trained a voice with Matcha-TTS. Effectiveness was assessed by investigating two selected features in two ways: through a categorical perception experiment confirming the expected alignment of feature positions and phoneme perception, and through analysis of acoustic correlates confirming a gradual, monotonic change of acoustic features consistent with changes in the phonemic input features.
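
As an illustration of the input representation, the following sketch maps phonemes to fixed positions on continuous feature axes; the three axes and their values are invented for brevity, whereas the paper defines 11 features for US English and trains with Matcha-TTS.

```python
# Minimal sketch of the conditioning input: each phoneme has a fixed
# position in [0.0, 1.0] on every feature axis. Axes and values here are
# invented stand-ins for the paper's 11-feature inventory.
FEATURES = ["voicing", "nasality", "height"]

PHONEME_FEATURES = {
    "p": {"voicing": 0.0, "nasality": 0.0, "height": 0.5},
    "b": {"voicing": 1.0, "nasality": 0.0, "height": 0.5},
    "m": {"voicing": 1.0, "nasality": 1.0, "height": 0.5},
    "i": {"voicing": 1.0, "nasality": 0.0, "height": 1.0},
}

def phoneme_to_vector(phoneme: str) -> list[float]:
    """Return the continuous feature vector the TTS is conditioned on."""
    spec = PHONEME_FEATURES[phoneme]
    return [spec[f] for f in FEATURES]

# Because the axes are continuous, inputs between two phonemes' positions
# can also be synthesized, which is what makes the categorical-perception
# experiment described in the abstract possible.
print(phoneme_to_vector("m"))  # -> [1.0, 1.0, 0.5]
```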

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
analysis-by-synthesis, controllable text-to-speech synthesis, phonological features
National Category
Natural Language Processing; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-358877 (URN); 10.21437/Interspeech.2024-1565 (DOI); 2-s2.0-85214785956 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024
Note

QC 20250128

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-01-28. Bibliographically approved.
Ekström, A. G., Gannon, C., Edlund, J., Moran, S. & Lameira, A. R. (2024). Chimpanzee utterances refute purported missing links for novel vocalizations and syllabic speech. Scientific Reports, 14(1), Article ID 17135.
Chimpanzee utterances refute purported missing links for novel vocalizations and syllabic speech
2024 (English). In: Scientific Reports, E-ISSN 2045-2322, Vol. 14, no 1, article id 17135. Article in journal (Refereed). Published.
Abstract [en]

Nonhuman great apes have been claimed to be unable to learn human words due to a lack of the necessary neural circuitry. We recovered original footage of two enculturated chimpanzees uttering the word “mama” and subjected recordings to phonetic analysis. Our analyses demonstrate that chimpanzees are capable of syllabic production, achieving consonant-to-vowel phonetic contrasts via the simultaneous recruitment and coupling of voice, jaw and lips. In an online experiment, human listeners naive to the recordings’ origins reliably perceived chimpanzee utterances as syllabic utterances, primarily as “ma-ma”, among foil syllables. Our findings demonstrate that in the absence of direct data-driven examination, great ape vocal production capacities have been underestimated. Chimpanzees possess the neural building blocks necessary for speech.

Place, publisher, year, edition, pages
Springer Nature, 2024
Keywords
Phonetics, Primatology, Vocal learning
National Category
Zoology
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-351240 (URN); 10.1038/s41598-024-67005-w (DOI); 001278002800007 (); 39054330 (PubMedID); 2-s2.0-85199430867 (Scopus ID)
Funder
Swedish Research Council, 2017-00626; KTH Royal Institute of Technology
Note

QC 20240805

Available from: 2024-08-04 Created: 2024-08-04 Last updated: 2024-08-27. Bibliographically approved.
Tånnander, C., O'Regan, J., House, D., Edlund, J. & Beskow, J. (2024). Prosodic characteristics of English-accented Swedish neural TTS. In: Proceedings of Speech Prosody 2024. Paper presented at Speech Prosody 2024, Leiden, The Netherlands, 2-5 July 2024 (pp. 1035-1039). Leiden, The Netherlands: International Speech Communication Association
Prosodic characteristics of English-accented Swedish neural TTS
2024 (English). In: Proceedings of Speech Prosody 2024, Leiden, The Netherlands: International Speech Communication Association, 2024, p. 1035-1039. Conference paper, Published paper (Refereed)
Abstract [en]

Neural text-to-speech synthesis (TTS) captures prosodic features strikingly well, notwithstanding the lack of prosodic labels in training or synthesis. We trained a voice on a single Swedish speaker reading in Swedish and English. The resulting TTS allows us to control the degree of English-accentedness in Swedish sentences. English-accented Swedish commonly exhibits well-known prosodic characteristics such as erroneous tonal accents and understated or missed durational differences. TTS quality was verified in three ways. Automatic speech recognition resulted in low errors, verifying intelligibility. Automatic language classification had Swedish as the majority choice, while the likelihood of English increased with our targeted degree of English-accentedness. Finally, a rank of perceived English-accentedness acquired through pairwise comparisons by 20 human listeners demonstrated a strong correlation with the targeted English-accentedness. We report on phonetic and prosodic analyses of the accented TTS. In addition to the anticipated segmental differences, the analyses revealed temporal and prominence-related variations coherent with Swedish spoken by English speakers, such as missing Swedish stress patterns and overly reduced unstressed syllables. With this work, we aim to glean insights into speech prosody from the latent prosodic features of neural TTS models. In addition, it will help implement speech phenomena such as code switching in TTS.
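
The paper does not specify its ranking procedure, but as a hedged illustration, one standard way to turn pairwise accentedness judgments into a rank is to sort stimuli by win rate, as sketched below with invented stimulus names.

```python
# Minimal sketch: rank stimuli by win rate over pairwise judgments of
# "which sounds more English-accented?". Stimulus names are invented;
# the paper's actual ranking procedure may differ.
from collections import Counter

# Each judgment: (stimulus_a, stimulus_b, winner)
judgments = [
    ("acc_000", "acc_050", "acc_050"),
    ("acc_050", "acc_100", "acc_100"),
    ("acc_000", "acc_100", "acc_100"),
]

wins, seen = Counter(), Counter()
for a, b, winner in judgments:
    seen[a] += 1
    seen[b] += 1
    wins[winner] += 1

ranked = sorted(seen, key=lambda s: wins[s] / seen[s], reverse=True)
print(ranked)  # most to least perceived English-accentedness
```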

Place, publisher, year, edition, pages
Leiden, The Netherlands: International Speech Communication Association, 2024
Keywords
foreign-accented text-to-speech synthesis, neural text-to-speech synthesis, latent prosodic features
National Category
Humanities and the Arts; General Language Studies and Linguistics
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-349946 (URN); 10.21437/SpeechProsody.2024-209 (DOI)
Conference
Speech Prosody 2024, Leiden, The Netherlands, 2-5 July 2024
Projects
Deep learning based speech synthesis for reading aloud of lengthy and information rich texts in Swedish (2018-02427); Språkbanken Tal (2017-00626)
Funder
Vinnova, 2018-02427
Note

QC 20240705

Available from: 2024-07-03 Created: 2024-07-03 Last updated: 2024-07-05. Bibliographically approved.
Tånnander, C., Edlund, J. & Gustafsson, J. (2024). Revisiting Three Text-to-Speech Synthesis Experiments with a Web-Based Audience Response System. In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings. Paper presented at Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, May 20 2024 - May 25 2024 (pp. 14111-14121). European Language Resources Association (ELRA)
Revisiting Three Text-to-Speech Synthesis Experiments with a Web-Based Audience Response System
2024 (English). In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings, European Language Resources Association (ELRA), 2024, p. 14111-14121. Conference paper, Published paper (Refereed)
Abstract [en]

To investigate the strengths and weaknesses of the Audience Response System (ARS) in text-to-speech synthesis (TTS) evaluations, we revisit three previously published TTS studies and perform an ARS-based evaluation on the stimuli used in each study. The experiments are performed with a participant pool of 39 respondents, using a web-based tool that emulates an ARS experiment. The results of the first experiment confirm that ARS is highly useful for evaluating long and continuous stimuli, particularly if we wish for a diagnostic result rather than a single overall metric, while the second and third experiments highlight weaknesses of ARS with unsuitable materials, as well as the importance of framing and instruction when conducting ARS-based evaluations.
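
To make the "diagnostic result" concrete: an ARS lets listeners respond continuously while a long stimulus plays, so responses can be pooled into a timeline that localizes weak passages rather than yielding one overall score. The sketch below is purely illustrative and is not the web tool used in the paper.

```python
# Minimal sketch of diagnostic ARS output: pool the timestamps at which
# respondents pressed a button while listening into a per-second count
# that points at problematic passages. Data here are invented.
from collections import Counter

presses = [3.2, 3.9, 4.1, 17.5, 18.0, 18.2, 18.4, 42.7]  # seconds, all listeners

histogram = Counter(int(t) for t in presses)
for second in sorted(histogram):
    print(f"{second:3d}s {'#' * histogram[second]}")
```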

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2024
Keywords
audience response system, evaluation methodology, TTS evaluation
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-348784 (URN); 2-s2.0-85195897862 (Scopus ID)
Conference
Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, May 20 2024 - May 25 2024
Note

Part of ISBN 9782493814104

QC 20240701

Available from: 2024-06-27 Created: 2024-06-27 Last updated: 2025-02-07. Bibliographically approved.
Esfandiari-Baiat, G. & Edlund, J. (2024). The MEET Corpus: Collocated, Distant and Hybrid Three-party Meetings with a Ranking Task. In: ISA 2024: 20th Joint ACL - ISO Workshop on Interoperable Semantic Annotation at LREC-COLING 2024, Workshop Proceedings. Paper presented at 20th Joint ACL - ISO Workshop on Interoperable Semantic Annotation, ISA 2024, Torino, Italy, May 20 2024 (pp. 1-7). European Language Resources Association (ELRA)
The MEET Corpus: Collocated, Distant and Hybrid Three-party Meetings with a Ranking Task
2024 (English). In: ISA 2024: 20th Joint ACL - ISO Workshop on Interoperable Semantic Annotation at LREC-COLING 2024, Workshop Proceedings, European Language Resources Association (ELRA), 2024, p. 1-7. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce the MEET corpus. The corpus was collected with the aim of systematically studying the effects of collocated (physical), remote (digital) and hybrid work meetings on collaborative decision-making. It consists of 10 sessions, where each session contains three recordings: a collocated, a remote and a hybrid meeting between three participants. The participants are working on a different survival ranking task during each meeting. The duration of each meeting ranges from 10 to 18 minutes, resulting in 380 minutes of conversation altogether. We also present the annotation scheme designed specifically to target our research questions. The recordings are currently being transcribed and annotated in accordance with this scheme.

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2024
Keywords
annotation scheme, meetings, multimodal corpora
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-347701 (URN); 2-s2.0-85195184356 (Scopus ID)
Conference
20th Joint ACL - ISO Workshop on Interoperable Semantic Annotation, ISA 2024, Torino, Italy, May 20 2024
Note

QC 20240613

Part of ISBN 978-249381432-6

Available from: 2024-06-13 Created: 2024-06-13 Last updated: 2025-02-07. Bibliographically approved.
Tånnander, C., House, D. & Edlund, J. (2023). Analysis-by-synthesis: phonetic-phonological variation in deep neural network-based text-to-speech synthesis. In: Radek Skarnitzl and Jan Volín (Eds.), Proceedings of the 20th International Congress of Phonetic Sciences, Prague 2023. Paper presented at 20th International Congress of Phonetic Sciences (ICPhS), August 7-11, 2023, Prague, Czech Republic (pp. 3156-3160). Prague, Czech Republic: GUARANT International
Analysis-by-synthesis: phonetic-phonological variation in deep neural network-based text-to-speech synthesis
2023 (English). In: Proceedings of the 20th International Congress of Phonetic Sciences, Prague 2023 / [ed] Radek Skarnitzl and Jan Volín, Prague, Czech Republic: GUARANT International, 2023, p. 3156-3160. Conference paper, Published paper (Refereed)
Abstract [en]

Text-to-speech synthesis based on deep neural networks can generate highly humanlike speech, which revitalizes the potential for analysis-by-synthesis in speech research. We propose that neural synthesis can provide evidence that a specific distinction in its transcription system represents a robust acoustic/phonetic distinction in the speech used to train the model. We synthesized utterances with allophones in incorrect contexts and analyzed the results phonetically. Our assumption was that if we gained control over the allophonic variation in this way, it would provide strong evidence that the variation is governed robustly by the phonological context used to create the transcriptions. Of three allophonic variations investigated, the first, which was believed to be quite robust, gave us robust control over the variation, while the other two, which are less categorical, did not afford us such control. These findings are consistent with our hypothesis and support the notion that neural TTS can be a valuable analysis-by-synthesis tool for speech research.
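
As a sketch of the manipulation, the snippet below forces an allophone symbol into a context where the transcription convention would not place it; the symbols are hypothetical stand-ins, not the paper's transcription system.

```python
# Minimal sketch: put an allophone in an "incorrect" context by replacing
# its sibling symbol in the input transcription, then synthesize and
# analyze the output. Symbols are hypothetical stand-ins.
def swap_allophone(transcription: list[str], target: str, replacement: str) -> list[str]:
    """Replace every occurrence of one allophone symbol with another."""
    return [replacement if ph == target else ph for ph in transcription]

original = ["f", "a", "t"]                         # plain /t/, neutral context
manipulated = swap_allophone(original, "t", "rt")  # force a retroflex variant
print(manipulated)  # ['f', 'a', 'rt'] -> feed to the TTS, analyze acoustically
```

If the synthesized output robustly realizes the swapped allophone, the variation is under the transcription's control, which is the kind of evidence the abstract describes.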

Place, publisher, year, edition, pages
Prague, Czech Republic: GUARANT International, 2023
Keywords
analysis-by-synthesis, latent phonetic features, phonological variation, neural TTS
National Category
Other Engineering and Technologies
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-336586 (URN)
Conference
20th International Congress of Phonetic Sciences (ICPhS), August 7-11, 2023, Prague, Czech Republic
Funder
Vinnova, 2018-02427
Note

Part of ISBN 978-80-908114-2-3

QC 20230915

Available from: 2023-09-14 Created: 2023-09-14 Last updated: 2025-02-10. Bibliographically approved.
Fallgren, P. & Edlund, J. (2023). Crowdsource-based validation of the audio cocktail as a sound browsing tool. In: Interspeech 2023. Paper presented at the 24th Annual Conference of the International Speech Communication Association, Interspeech 2023, August 20-24, 2023, Dublin, Ireland (pp. 2178-2182). International Speech Communication Association
Crowdsource-based validation of the audio cocktail as a sound browsing tool
2023 (English). In: Interspeech 2023, International Speech Communication Association, 2023, p. 2178-2182. Conference paper, Published paper (Refereed)
Abstract [en]

We conduct two crowdsourcing experiments designed to examine the usefulness of audio cocktails for quickly finding out about the contents of large audio data. Several thousand crowd workers were engaged to listen to audio cocktails with systematically varied composition. They were then asked to state either which sound out of four categories (Children, Women, Men, Orchestra) they heard the most of, or whether they heard anything of a specific category at all. The results show that their responses have high reliability and provide information as to whether a specific task can be performed using audio cocktails. We also propose that the combination of crowd workers and audio cocktails can be used directly as a tool to investigate the contents of large audio data.
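
For intuition, an audio cocktail can be thought of as a gain-weighted mix of category recordings; the sketch below, using invented synthetic signals, shows one way the systematically varied compositions could be produced.

```python
# Minimal sketch: mix per-category signals with chosen gains to produce an
# "audio cocktail" whose composition is systematically varied. Assumes
# equal-length mono float arrays; signals here are synthetic noise.
import numpy as np

def audio_cocktail(sources: dict[str, np.ndarray],
                   gains: dict[str, float]) -> np.ndarray:
    """Mix category signals with the given gains and peak-normalize."""
    mix = sum(gains[name] * signal for name, signal in sources.items())
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 0 else mix

rng = np.random.default_rng(0)
sources = {name: rng.standard_normal(16_000)
           for name in ("children", "women", "men", "orchestra")}
# e.g. a cocktail that is 75% children and 25% men
cocktail = audio_cocktail(sources, {"children": 0.75, "women": 0.0,
                                    "men": 0.25, "orchestra": 0.0})
```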

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
annotation, exploration, found speech, hearing, human-in-the-loop
National Category
Natural Language Processing; Other Humanities not elsewhere specified
Identifiers
urn:nbn:se:kth:diva-337834 (URN); 10.21437/Interspeech.2023-2473 (DOI); 001186650302072 (); 2-s2.0-85171584146 (Scopus ID)
Conference
24th Annual Conference of the International Speech Communication Association, Interspeech 2023, August 20-24, 2023, Dublin, Ireland
Note

QC 20241014

Available from: 2023-10-09 Created: 2023-10-09 Last updated: 2025-02-01. Bibliographically approved.
Ekström, A. G. & Edlund, J. (2023). Evolution of the human tongue and emergence of speech biomechanics. Frontiers in Psychology, 14, Article ID 1150778.
Evolution of the human tongue and emergence of speech biomechanics
2023 (English). In: Frontiers in Psychology, E-ISSN 1664-1078, Vol. 14, article id 1150778. Article, review/survey (Refereed). Published.
Abstract [en]

The tongue is one of the organs most central to human speech. Here, the evolution and species-unique properties of the human tongue are traced, via reference to the apparent articulatory behavior of extant non-human great apes and fossil findings from early hominids, from the point of view of articulatory phonetics, the science of human speech production. Increased lingual flexibility provided the possibility of mapping articulatory targets, possibly via exaptation of the manual-gestural mapping capacities evident in extant great apes. The emergence of the human-specific tongue, its properties, and its morphology were crucial to the evolution of human articulate speech.

Place, publisher, year, edition, pages
Frontiers Media SA, 2023
Keywords
evolution of speech, speech articulation, human evolution, speech production, primatology, articulatory phonetics, coarticulation, speech motor control
National Category
Other Medical Sciences not elsewhere specified
Identifiers
urn:nbn:se:kth:diva-330517 (URN); 10.3389/fpsyg.2023.1150778 (DOI); 001004893900001 (); 37325743 (PubMedID); 2-s2.0-85162047256 (Scopus ID)
Note

QC 20230630

Available from: 2023-06-30 Created: 2023-06-30 Last updated: 2023-06-30. Bibliographically approved.
Borin, L., Domeij, R., Edlund, J. & Forsberg, M. (2023). Language Report Swedish. In: Cognitive Technologies (pp. 219-222). Springer Nature, Part F280
Language Report Swedish
2023 (English). In: Cognitive Technologies, Springer Nature, 2023, Vol. Part F280, p. 219-222. Chapter in book (Other academic)
Abstract [en]

Swedish speech and language technology (LT) research goes back over 70 years. This has paid off: there is a national research infrastructure, as well as significant research projects, and Swedish is well endowed with language resources (LRs) and tools. However, there are gaps that need to be filled, especially the high-quality gold-standard LRs required by the most recent deep-learning methods. In the future, we would like to see closer collaboration and communication between the “traditional” LT research community and the burgeoning AI field, the establishment of dedicated academic LT training programmes, and national funding for LT research.

Place, publisher, year, edition, pages
Springer Nature, 2023
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-333012 (URN)10.1007/978-3-031-28819-7_36 (DOI)2-s2.0-85161882703 (Scopus ID)
Note

QC 20230725

Available from: 2023-07-25 Created: 2023-07-25 Last updated: 2025-02-07. Bibliographically approved.