Edlund, Jens, Docent/Associate Professor. ORCID iD: orcid.org/0000-0001-9327-9482
Publications (10 of 157)
Pandey, A., Edlund, J., Le Maguer, S. & Harte, N. (2026). The use of variable length stimuli for assessing segmental distortion in TTS evaluation. Computer speech & language (Print), 97, Article ID 101894.
The use of variable length stimuli for assessing segmental distortion in TTS evaluation
2026 (English). In: Computer speech & language (Print), ISSN 0885-2308, E-ISSN 1095-8363, Vol. 97, article id 101894. Article in journal (Refereed). Published.
Abstract [en]

This paper presents the use of variable length stimuli for assessing segmental distortion in Text-to-Speech synthesizers. The design is based on the well-established stimulus accumulation phenomenon in psychophysics. The length of the stimuli is varied logarithmically, in accordance with the Weber–Fechner law. User opinion is collected in a binary, two-choice format, suspending the vagueness of the term “naturalness”. The participants’ responses are captured using a 2-alternative forced choice task. The study found that while the length of the stimuli did not reliably affect participants’ accuracy in the task, the concentration of voiceless obstruents did have a significant effect. Participants were consistently more accurate in identifying WaveNet stimuli as machine-made when the phrases were obstruent-rich. These findings show that the deviation in obstruents reported in WaveNet voices is perceivable by human listeners. The design of the subjective listening test shows similar trends to Mean-Opinion-Score evaluation, suggesting that the design may be of utility to the wider community of Text-to-Speech evaluation.

Place, publisher, year, edition, pages
Elsevier BV, 2026
Keywords
Naturalness, Neural TTS, Obstruents, Segmental evaluation, Sonorants, Text-to-speech evaluation
National Category
Psychology
Identifiers
urn:nbn:se:kth:diva-373143 (URN); 10.1016/j.csl.2025.101894 (DOI); 001607689800001 (); 2-s2.0-105020921824 (Scopus ID)
Note

QC 20251121

Available from: 2025-11-21. Created: 2025-11-21. Last updated: 2025-11-21. Bibliographically approved.
Ekström, A. G., Gärdenfors, P., Snyder, W. D., Friedrichs, D., McCarthy, R. C., Tsapos, M., . . . Moran, S. (2025). Correlates of Vocal Tract Evolution in Late Pliocene and Pleistocene Hominins. Human Nature, 36(1), 22-69
Correlates of Vocal Tract Evolution in Late Pliocene and Pleistocene Hominins
2025 (English). In: Human Nature, ISSN 1045-6767, E-ISSN 1936-4776, Vol. 36, no 1, p. 22-69. Article in journal (Refereed). Published.
Abstract [en]

Despite decades of research on the emergence of human speech capacities, an integrative account consistent with hominin evolution remains lacking. We review paleoanthropological and archaeological findings in search of a timeline for the emergence of modern human articulatory morphological features. Our synthesis shows that several behavioral innovations coincide with morphological changes to the would-be speech articulators. We find that significant reductions of the mandible and masticatory muscles and vocal tract anatomy coincide in the hominin fossil record with the incorporation of processed and (ultimately) cooked food, the appearance and development of rudimentary stone tools, increases in brain size, and likely changes to social life and organization. Many changes are likely mutually reinforcing; for example, gracilization of the hominin mandible may have been maintainable in the lineage because food processing had already been outsourced to the hands and stone tools, reducing selection pressures for robust mandibles in the process. We highlight correlates of the evolution of craniofacial and vocal tract features in the hominin lineage and outline a timeline by which our ancestors became ‘pre-adapted’ for the evolution of fully modern human speech.

Place, publisher, year, edition, pages
Springer Nature, 2025
National Category
Comparative Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-372082 (URN); 10.1007/s12110-025-09487-9 (DOI); 001469002100001 (); 40244547 (PubMedID); 2-s2.0-105002813677 (Scopus ID)
Funder
Swedish Research Council, 2017-00626; KTH Royal Institute of Technology
Note

Correction in DOI 10.1007/s12110-025-09501-0

QC 20251023

Available from: 2025-10-23. Created: 2025-10-23. Last updated: 2025-10-28. Bibliographically approved.
Tånnander, C., House, D., Beskow, J. & Edlund, J. (2025). Intrasentential English in Swedish TTS: perceived English-accentedness. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025 (pp. 1638-1642). International Speech Communication Association
Intrasentential English in Swedish TTS: perceived English-accentedness
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, p. 1638-1642. Conference paper, Published paper (Refereed).
Abstract [en]

English names and expressions are frequently inserted into Swedish text. Humans intuitively adjust the degree of English pronunciation of such insertions. This work aims at a Swedish text-to-speech synthesis (TTS) capable of similar controlled adaptation. We focus on two key aspects: (1) the development of a TTS system with controllable degrees of perceived English-accentedness (PEA); and (2) the exploration of human preferences related to PEA. We trained a Swedish TTS voice on Swedish and English sentences with a conditioning parameter for language (English-accentedness, EA) on a scale from 0 to 1, and estimated a psychometric mapping of the perceived effect of EA to a perceptual scale (PEA) through perception tests. PEA was then used in Best-Worst listening tests presenting English insertions with varying PEA. The results confirm the effectiveness of the training and the PEA scale, and that listener preferences change with different insertions.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
controllable TTS, mixed language, read speech
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-372797 (URN); 10.21437/Interspeech.2025-762 (DOI); 2-s2.0-105020040227 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025
Note

QC 20251118

Available from: 2025-11-18. Created: 2025-11-18. Last updated: 2025-11-18. Bibliographically approved.
Kirkland, A. & Edlund, J. (2025). Who knows best? Effects of speech disfluencies on incentivized decision-making. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025 (pp. 4508-4512). International Speech Communication Association
Who knows best? Effects of speech disfluencies on incentivized decision-making
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, p. 4508-4512. Conference paper, Published paper (Refereed).
Abstract [en]

Previous work has shown that speech disfluencies can negatively impact judgments about a speaker's competence and confidence. However, these effects have primarily been examined with Likert-type rating scales, which are not informative about how judgments might translate to behavior. Does the presence of disfluencies actually guide decision-making when listeners stand to gain concretely from making the correct choice? We sought to address this question with a web-based decision task in which participants were asked to choose between two conflicting sources of information. Our results suggest that listeners do take speech fluency into account when deciding who or what to believe.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
decision-making, paralinguistics, speech perception, text-to-speech
National Category
Comparative Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-372789 (URN); 10.21437/Interspeech.2025-1990 (DOI); 2-s2.0-105020069715 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025
Note

QC 20251119

Available from: 2025-11-19. Created: 2025-11-19. Last updated: 2025-11-19. Bibliographically approved.
Edlund, J., Tånnander, C., Le Maguer, S. & Wagner, P. (2024). Assessing the impact of contextual framing on subjective TTS quality. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 1205-1209). International Speech Communication Association
Assessing the impact of contextual framing on subjective TTS quality
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, p. 1205-1209. Conference paper, Published paper (Refereed).
Abstract [en]

Text-To-Speech (TTS) evaluations are habitually carried out without contextual and situational framing. Since humans adapt their speaking style to situation specific communicative needs, such evaluations may not generalize across situations. Without clearly defined framing, it is even unclear in which situations evaluation results hold at all. We test the hypothesized impact of framing on TTS evaluation in a crowdsourced MOS evaluation of four TTS voices, systematically varying (a) the intended TTS task (domestic humanoid robot, child's voice replacement, fiction audio books and long and information-rich texts) and (b) the framing of that task. The results show that framing differentiated MOS responses, with individual TTS performance varying significantly across tasks and framings. This corroborates the assumption that decontextualized MOS evaluations do not generalize, and suggests that TTS evaluations should not be reported without the type of framing that was employed, if any.

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
evaluation, framing, methodology, MOS
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-358870 (URN); 10.21437/Interspeech.2024-781 (DOI); 001331850101070 (); 2-s2.0-85214812427 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250127

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-12-08. Bibliographically approved.
Tånnander, C., Mehta, S., Beskow, J. & Edlund, J. (2024). Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 2815-2819). International Speech Communication Association
Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, p. 2815-2819. Conference paper, Published paper (Refereed).
Abstract [en]

We introduce continuous phonological features as input to TTS with the dual objective of more precise control over phonological aspects and better potential for exploration of latent features in TTS models for speech science purposes. In our framework, the TTS is conditioned on continuous values between 0.0 and 1.0, where each phoneme has a specified position on each feature axis. We chose 11 features to represent US English and trained a voice with Matcha-TTS. Effectiveness was assessed by investigating two selected features in two ways: through a categorical perception experiment confirming the expected alignment of feature positions and phoneme perception, and through analysis of acoustic correlates confirming a gradual, monotonic change of acoustic features consistent with changes in the phonemic input features.

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
analysis-by-synthesis, controllable text-to-speech synthesis, phonological features
National Category
Natural Language Processing; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-358877 (URN); 10.21437/Interspeech.2024-1565 (DOI); 001331850102192 (); 2-s2.0-85214785956 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250128

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-12-08. Bibliographically approved.
Ekström, A. G., Gannon, C., Edlund, J., Moran, S. & Lameira, A. R. (2024). Chimpanzee utterances refute purported missing links for novel vocalizations and syllabic speech. Scientific Reports, 14(1), Article ID 17135.
Chimpanzee utterances refute purported missing links for novel vocalizations and syllabic speech
2024 (English). In: Scientific Reports, E-ISSN 2045-2322, Vol. 14, no 1, article id 17135. Article in journal (Refereed). Published.
Abstract [en]

Nonhuman great apes have been claimed to be unable to learn human words due to a lack of the necessary neural circuitry. We recovered original footage of two enculturated chimpanzees uttering the word “mama” and subjected recordings to phonetic analysis. Our analyses demonstrate that chimpanzees are capable of syllabic production, achieving consonant-to-vowel phonetic contrasts via the simultaneous recruitment and coupling of voice, jaw and lips. In an online experiment, human listeners naive to the recordings’ origins reliably perceived chimpanzee utterances as syllabic utterances, primarily as “ma-ma”, among foil syllables. Our findings demonstrate that in the absence of direct data-driven examination, great ape vocal production capacities have been underestimated. Chimpanzees possess the neural building blocks necessary for speech.

Place, publisher, year, edition, pages
Springer Nature, 2024
Keywords
Phonetics, Primatology, Vocal learning
National Category
Zoology
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-351240 (URN); 10.1038/s41598-024-67005-w (DOI); 001278002800007 (); 39054330 (PubMedID); 2-s2.0-85199430867 (Scopus ID)
Funder
Swedish Research Council, 2017-00626; KTH Royal Institute of Technology
Note

QC 20240805

Available from: 2024-08-04. Created: 2024-08-04. Last updated: 2024-08-27. Bibliographically approved.
Tånnander, C., O'Regan, J., House, D., Edlund, J. & Beskow, J. (2024). Prosodic characteristics of English-accented Swedish neural TTS. In: Proceedings of Speech Prosody 2024. Paper presented at Speech Prosody 2024, Leiden, The Netherlands, 2-5 July 2024 (pp. 1035-1039). Leiden, The Netherlands: International Speech Communication Association
Prosodic characteristics of English-accented Swedish neural TTS
2024 (English). In: Proceedings of Speech Prosody 2024, Leiden, The Netherlands: International Speech Communication Association, 2024, p. 1035-1039. Conference paper, Published paper (Refereed).
Abstract [en]

Neural text-to-speech synthesis (TTS) captures prosodic features strikingly well, notwithstanding the lack of prosodic labels in training or synthesis. We trained a voice on a single Swedish speaker reading in Swedish and English. The resulting TTS allows us to control the degree of English-accentedness in Swedish sentences. English-accented Swedish commonly exhibits well-known prosodic characteristics such as erroneous tonal accents and understated or missed durational differences. TTS quality was verified in three ways. Automatic speech recognition resulted in low errors, verifying intelligibility. Automatic language classification had Swedish as the majority choice, while the likelihood of English increased with our targeted degree of English-accentedness. Finally, a rank of perceived English-accentedness acquired through pairwise comparisons by 20 human listeners demonstrated a strong correlation with the targeted English-accentedness. We report on phonetic and prosodic analyses of the accented TTS. In addition to the anticipated segmental differences, the analyses revealed temporal and prominence-related variations coherent with Swedish spoken by English speakers, such as missing Swedish stress patterns and overly reduced unstressed syllables. With this work, we aim to glean insights into speech prosody from the latent prosodic features of neural TTS models. In addition, it will help implement speech phenomena such as code switching in TTS.

Place, publisher, year, edition, pages
Leiden, The Netherlands: International Speech Communication Association, 2024
Keywords
foreign-accented text-to-speech synthesis, neural text-to-speech synthesis, latent prosodic features
National Category
Humanities and the Arts; General Language Studies and Linguistics
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-349946 (URN); 10.21437/SpeechProsody.2024-209 (DOI); 2-s2.0-105008058763 (Scopus ID)
Conference
Speech Prosody 2024, Leiden, The Netherlands, 2-5 July 2024
Projects
Deep learning based speech synthesis for reading aloud of lengthy and information rich texts in Swedish (2018-02427); Språkbanken Tal (2017-00626)
Funder
Vinnova, 2018-02427
Note

QC 20240705

Available from: 2024-07-03. Created: 2024-07-03. Last updated: 2025-07-01. Bibliographically approved.
Tånnander, C., Edlund, J. & Gustafsson, J. (2024). Revisiting Three Text-to-Speech Synthesis Experiments with a Web-Based Audience Response System. In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings. Paper presented at Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, May 20-25, 2024 (pp. 14111-14121). European Language Resources Association (ELRA)
Revisiting Three Text-to-Speech Synthesis Experiments with a Web-Based Audience Response System
2024 (English). In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings, European Language Resources Association (ELRA), 2024, p. 14111-14121. Conference paper, Published paper (Refereed).
Abstract [en]

In order to investigate the strengths and weaknesses of Audience Response System (ARS) in text-to-speech synthesis (TTS) evaluations, we revisit three previously published TTS studies and perform an ARS-based evaluation on the stimuli used in each study. The experiments are performed with a participant pool of 39 respondents, using a web-based tool that emulates an ARS experiment. The results of the first experiment confirms that ARS is highly useful for evaluating long and continuous stimuli, particularly if we wish for a diagnostic result rather than a single overall metric, while the second and third experiments highlight weaknesses in ARS with unsuitable materials as well as the importance of framing and instruction when conducting ARS-based evaluation.

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2024
Keywords
audience response system, evaluation methodology, TTS evaluation
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-348784 (URN); 2-s2.0-85195897862 (Scopus ID)
Conference
Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, May 20-25, 2024
Note

Part of ISBN 9782493814104

QC 20240701

Available from: 2024-06-27. Created: 2024-06-27. Last updated: 2025-02-07. Bibliographically approved.
Ekström, A. G. & Edlund, J. (2024). Sketches of chimpanzee (Pan troglodytes) hoo’s: vowels by any other name? Primates, 65(2), 81-88
Sketches of chimpanzee (Pan troglodytes) hoo’s: vowels by any other name?
2024 (English). In: Primates, ISSN 0032-8332, E-ISSN 1610-7365, Vol. 65, no 2, p. 81-88. Article in journal (Refereed). Published.
Abstract [en]

In human speech, the close back rounded vowel /u/ (the vowel in “boot”) is articulated with the tongue arched toward the dorsal boundary of the hard palate, with the pharyngeal cavity open. Acoustic and perceptual properties of chimpanzee (Pan troglodytes) hoo’s are similar to those of the human vowel /u/. However, the vocal tract morphology of chimpanzees likely limits their phonetic capabilities, so that it is unlikely, or even impossible, that their articulation is comparable to that of a human. To determine how qualities of the vowel /u/ may be achieved given the chimpanzee vocal tract, we calculated transfer functions of the vocal tract area for tube models of vocal tract configurations in which vocal tract length, length and area of a laryngeal air sac simulacrum, length of lip protrusion, and area of lip opening were systematically varied. The method described is principally acoustic; we make no claim as to the actual shape of the chimpanzee vocal tract during call production. Nonetheless, we demonstrate that it may be possible to achieve the acoustic and perceptual qualities of back vowels without a reconfigured human vocal tract. The results, while tentative, suggest that the production of hoo’s by chimpanzees, while achieving comparable vowel-like qualities to the human /u/, may involve articulatory gestures that are beyond the range of the human articulators. The purpose of this study was to (1) stimulate further simulation research on great ape articulation, and (2) show that apparently vowel-like phenomena in nature are not necessarily indicative of evolutionary continuity per se.

Place, publisher, year, edition, pages
Springer Nature, 2024
Keywords
Articulatory phonetics, Primatology, Speech acoustics, Vowel quality
National Category
Comparative Language Studies and Linguistics; Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-367104 (URN); 10.1007/s10329-023-01107-3 (DOI); 001126347800001 (); 38110671 (PubMedID); 2-s2.0-85180178929 (Scopus ID)
Note

QC 20250715

Available from: 2025-07-15. Created: 2025-07-15. Last updated: 2025-07-15. Bibliographically approved.