Edlund, Jens, Docent/Associate Professor (ORCID: orcid.org/0000-0001-9327-9482)
Publications (10 of 157)
Pandey, A., Edlund, J., Le Maguer, S. & Harte, N. (2026). The use of variable length stimuli for assessing segmental distortion in TTS evaluation. Computer speech & language (Print), 97, Article ID 101894.
2026 (English). In: Computer speech & language (Print), ISSN 0885-2308, E-ISSN 1095-8363, Vol. 97, article id 101894. Journal article (peer-reviewed). Published.
Abstract [en]

This paper presents the use of variable length stimuli for assessing segmental distortion in Text-to-Speech synthesizers. The design builds on the well-established stimulus accumulation phenomenon in psychophysics. The length of the stimuli is varied logarithmically, in accordance with the Weber–Fechner law. User opinion is collected in a binary, two-choice format, sidestepping the vagueness of the term “naturalness”: participants’ responses are captured with a 2-alternative forced choice task. The study found that while the length of the stimuli did not reliably affect participants’ accuracy in the task, the concentration of voiceless obstruents did have a significant effect. Participants were consistently more accurate in identifying WaveNet stimuli as machine-made when the phrases were obstruent-rich. These findings show that the deviation in obstruents reported in WaveNet voices is perceivable by human listeners. The subjective listening test shows trends similar to Mean-Opinion-Score evaluation, suggesting that the design may be of utility to the wider community of Text-to-Speech evaluation.
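The logarithmic length progression the abstract describes can be sketched in a few lines. The actual lengths, units, and step count used in the paper are not given here; the values below are purely illustrative.

```python
# Hypothetical sketch of Weber-Fechner-style stimulus lengths: lengths
# grow by a constant ratio rather than a constant increment, so that
# successive steps are (roughly) perceptually equal in size.
def log_spaced_lengths(min_len: int, max_len: int, n_steps: int) -> list[int]:
    """Return n_steps lengths spaced logarithmically between min and max."""
    ratio = (max_len / min_len) ** (1 / (n_steps - 1))
    return [round(min_len * ratio ** i) for i in range(n_steps)]

print(log_spaced_lengths(2, 32, 5))  # [2, 4, 8, 16, 32]
```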

Place, publisher, year, edition, pages
Elsevier BV, 2026
Keywords
Naturalness, Neural TTS, Obstruents, Segmental evaluation, Sonorants, Text-to-speech evaluation
Identifiers
URN: urn:nbn:se:kth:diva-373143; DOI: 10.1016/j.csl.2025.101894; ISI: 001607689800001; Scopus ID: 2-s2.0-105020921824
Note

QC 20251121

Available from: 2025-11-21. Created: 2025-11-21. Last updated: 2025-11-21. Bibliographically checked.
Ekström, A. G., Gärdenfors, P., Snyder, W. D., Friedrichs, D., McCarthy, R. C., Tsapos, M., . . . Moran, S. (2025). Correlates of Vocal Tract Evolution in Late Pliocene and Pleistocene Hominins. Human Nature, 36(1), 22-69
2025 (English). In: Human Nature, ISSN 1045-6767, E-ISSN 1936-4776, Vol. 36, no. 1, pp. 22-69. Journal article (peer-reviewed). Published.
Abstract [en]

Despite decades of research on the emergence of human speech capacities, an integrative account consistent with hominin evolution remains lacking. We review paleoanthropological and archaeological findings in search of a timeline for the emergence of modern human articulatory morphological features. Our synthesis shows that several behavioral innovations coincide with morphological changes to the would-be speech articulators. We find that significant reductions of the mandible and masticatory muscles and vocal tract anatomy coincide in the hominin fossil record with the incorporation of processed and (ultimately) cooked food, the appearance and development of rudimentary stone tools, increases in brain size, and likely changes to social life and organization. Many changes are likely mutually reinforcing; for example, gracilization of the hominin mandible may have been maintainable in the lineage because food processing had already been outsourced to the hands and stone tools, reducing selection pressures for robust mandibles in the process. We highlight correlates of the evolution of craniofacial and vocal tract features in the hominin lineage and outline a timeline by which our ancestors became ‘pre-adapted’ for the evolution of fully modern human speech.

Place, publisher, year, edition, pages
Springer Nature, 2025
Identifiers
URN: urn:nbn:se:kth:diva-372082; DOI: 10.1007/s12110-025-09487-9; ISI: 001469002100001; PubMed ID: 40244547; Scopus ID: 2-s2.0-105002813677
Research funders
Swedish Research Council, 2017-00626; KTH Royal Institute of Technology
Note

Correction in DOI 10.1007/s12110-025-09501-0

QC 20251023

Available from: 2025-10-23. Created: 2025-10-23. Last updated: 2025-10-28. Bibliographically checked.
Tånnander, C., House, D., Beskow, J. & Edlund, J. (2025). Intrasentential English in Swedish TTS: perceived English-accentedness. In: Interspeech 2025. Paper presented at the 26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025 (pp. 1638-1642). International Speech Communication Association
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, pp. 1638-1642. Conference paper, published paper (peer-reviewed).
Abstract [en]

English names and expressions are frequently inserted into Swedish text. Humans intuitively adjust the degree of English pronunciation of such insertions. This work aims at a Swedish text-to-speech synthesis (TTS) capable of similar controlled adaptation. We focus on two key aspects: (1) the development of a TTS system with controllable degrees of perceived English-accentedness (PEA); and (2) the exploration of human preferences related to PEA. We trained a Swedish TTS voice on Swedish and English sentences with a conditioning parameter for language (English-accentedness, EA) on a scale from 0 to 1, and estimated a psychometric mapping of the perceived effect of EA to a perceptual scale (PEA) through perception tests. PEA was then used in Best-Worst listening tests presenting English insertions with varying PEA. The results confirm the effectiveness of the training and the PEA scale, and that listener preferences change with different insertions.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
controllable TTS, mixed language, read speech
Identifiers
URN: urn:nbn:se:kth:diva-372797; DOI: 10.21437/Interspeech.2025-762; Scopus ID: 2-s2.0-105020040227
Conference
26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025
Note

QC 20251118

Available from: 2025-11-18. Created: 2025-11-18. Last updated: 2025-11-18. Bibliographically checked.
Kirkland, A. & Edlund, J. (2025). Who knows best? Effects of speech disfluencies on incentivized decision-making. In: Interspeech 2025. Paper presented at the 26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025 (pp. 4508-4512). International Speech Communication Association
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, pp. 4508-4512. Conference paper, published paper (peer-reviewed).
Abstract [en]

Previous work has shown that speech disfluencies can negatively impact judgments about a speaker's competence and confidence. However, these effects have primarily been examined with Likert-type rating scales, which are not informative about how judgments might translate to behavior. Does the presence of disfluencies actually guide decision-making when listeners stand to gain concretely from making the correct choice? We sought to address this question with a web-based decision task in which participants were asked to choose between two conflicting sources of information. Our results suggest that listeners do take speech fluency into account when deciding who or what to believe.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
decision-making, paralinguistics, speech perception, text-to-speech
Identifiers
URN: urn:nbn:se:kth:diva-372789; DOI: 10.21437/Interspeech.2025-1990; Scopus ID: 2-s2.0-105020069715
Conference
26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025
Note

QC 20251119

Available from: 2025-11-19. Created: 2025-11-19. Last updated: 2025-11-19. Bibliographically checked.
Edlund, J., Tånnander, C., Le Maguer, S. & Wagner, P. (2024). Assessing the impact of contextual framing on subjective TTS quality. In: Interspeech 2024. Paper presented at the 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 1205-1209). International Speech Communication Association
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, pp. 1205-1209. Conference paper, published paper (peer-reviewed).
Abstract [en]

Text-To-Speech (TTS) evaluations are habitually carried out without contextual and situational framing. Since humans adapt their speaking style to situation specific communicative needs, such evaluations may not generalize across situations. Without clearly defined framing, it is even unclear in which situations evaluation results hold at all. We test the hypothesized impact of framing on TTS evaluation in a crowdsourced MOS evaluation of four TTS voices, systematically varying (a) the intended TTS task (domestic humanoid robot, child's voice replacement, fiction audio books and long and information-rich texts) and (b) the framing of that task. The results show that framing differentiated MOS responses, with individual TTS performance varying significantly across tasks and framings. This corroborates the assumption that decontextualized MOS evaluations do not generalize, and suggests that TTS evaluations should not be reported without the type of framing that was employed, if any.

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
evaluation, framing, methodology, MOS
Identifiers
URN: urn:nbn:se:kth:diva-358870; DOI: 10.21437/Interspeech.2024-781; ISI: 001331850101070; Scopus ID: 2-s2.0-85214812427
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250127

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-12-08. Bibliographically checked.
Tånnander, C., Mehta, S., Beskow, J. & Edlund, J. (2024). Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis. In: Interspeech 2024. Paper presented at the 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 2815-2819). International Speech Communication Association
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, pp. 2815-2819. Conference paper, published paper (peer-reviewed).
Abstract [en]

We introduce continuous phonological features as input to TTS with the dual objective of more precise control over phonological aspects and better potential for exploration of latent features in TTS models for speech science purposes. In our framework, the TTS is conditioned on continuous values between 0.0 and 1.0, where each phoneme has a specified position on each feature axis. We chose 11 features to represent US English and trained a voice with Matcha-TTS. Effectiveness was assessed by investigating two selected features in two ways: through a categorical perception experiment confirming the expected alignment of feature positions and phoneme perception, and through analysis of acoustic correlates confirming a gradual, monotonic change of acoustic features consistent with changes in the phonemic input features.
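The conditioning scheme described above can be illustrated roughly as follows. The paper's 11 features and their exact per-phoneme values are not reproduced here; the feature axes and numbers in this sketch are hypothetical.

```python
# Hypothetical sketch: each phoneme is assigned a position in [0.0, 1.0]
# on a small set of continuous phonological feature axes; a TTS model
# could then be conditioned on these vectors instead of phoneme IDs.
FEATURES = ("voicing", "height", "backness")

PHONEME_POSITIONS = {
    "p": {"voicing": 0.0, "height": 0.0, "backness": 0.0},
    "b": {"voicing": 1.0, "height": 0.0, "backness": 0.0},
    "i": {"voicing": 1.0, "height": 1.0, "backness": 0.0},
    "u": {"voicing": 1.0, "height": 1.0, "backness": 1.0},
}

def encode(phonemes: list[str]) -> list[list[float]]:
    """Map a phoneme sequence to continuous feature vectors."""
    return [[PHONEME_POSITIONS[p][f] for f in FEATURES] for p in phonemes]

print(encode(["b", "u"]))  # [[1.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
```

Because the positions are continuous, intermediate values (e.g. voicing 0.5) remain valid inputs, which is what enables the gradual acoustic changes the abstract reports.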

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
analysis-by-synthesis, controllable text-to-speech synthesis, phonological features
Identifiers
URN: urn:nbn:se:kth:diva-358877; DOI: 10.21437/Interspeech.2024-1565; ISI: 001331850102192; Scopus ID: 2-s2.0-85214785956
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250128

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-12-08. Bibliographically checked.
Ekström, A. G., Gannon, C., Edlund, J., Moran, S. & Lameira, A. R. (2024). Chimpanzee utterances refute purported missing links for novel vocalizations and syllabic speech. Scientific Reports, 14(1), Article ID 17135.
2024 (English). In: Scientific Reports, E-ISSN 2045-2322, Vol. 14, no. 1, article id 17135. Journal article (peer-reviewed). Published.
Abstract [en]

Nonhuman great apes have been claimed to be unable to learn human words due to a lack of the necessary neural circuitry. We recovered original footage of two enculturated chimpanzees uttering the word “mama” and subjected recordings to phonetic analysis. Our analyses demonstrate that chimpanzees are capable of syllabic production, achieving consonant-to-vowel phonetic contrasts via the simultaneous recruitment and coupling of voice, jaw and lips. In an online experiment, human listeners naive to the recordings’ origins reliably perceived chimpanzee utterances as syllabic utterances, primarily as “ma-ma”, among foil syllables. Our findings demonstrate that in the absence of direct data-driven examination, great ape vocal production capacities have been underestimated. Chimpanzees possess the neural building blocks necessary for speech.

Place, publisher, year, edition, pages
Springer Nature, 2024
Keywords
Phonetics, Primatology, Vocal learning
Research programme
Tal- och musikkommunikation
Identifiers
URN: urn:nbn:se:kth:diva-351240; DOI: 10.1038/s41598-024-67005-w; ISI: 001278002800007; PubMed ID: 39054330; Scopus ID: 2-s2.0-85199430867
Research funders
Swedish Research Council, 2017-00626; KTH Royal Institute of Technology
Note

QC 20240805

Available from: 2024-08-04. Created: 2024-08-04. Last updated: 2024-08-27. Bibliographically checked.
Tånnander, C., O'Regan, J., House, D., Edlund, J. & Beskow, J. (2024). Prosodic characteristics of English-accented Swedish neural TTS. In: Proceedings of Speech Prosody 2024. Paper presented at Speech Prosody 2024, Leiden, The Netherlands, 2-5 July 2024 (pp. 1035-1039). Leiden, The Netherlands: International Speech Communication Association
2024 (English). In: Proceedings of Speech Prosody 2024, Leiden, The Netherlands: International Speech Communication Association, 2024, pp. 1035-1039. Conference paper, published paper (peer-reviewed).
Abstract [en]

Neural text-to-speech synthesis (TTS) captures prosodic features strikingly well, notwithstanding the lack of prosodic labels in training or synthesis. We trained a voice on a single Swedish speaker reading in Swedish and English. The resulting TTS allows us to control the degree of English-accentedness in Swedish sentences. English-accented Swedish commonly exhibits well-known prosodic characteristics such as erroneous tonal accents and understated or missed durational differences. TTS quality was verified in three ways. Automatic speech recognition resulted in low errors, verifying intelligibility. Automatic language classification had Swedish as the majority choice, while the likelihood of English increased with our targeted degree of English-accentedness. Finally, a rank of perceived English-accentedness acquired through pairwise comparisons by 20 human listeners demonstrated a strong correlation with the targeted English-accentedness. We report on phonetic and prosodic analyses of the accented TTS. In addition to the anticipated segmental differences, the analyses revealed temporal and prominence-related variations coherent with Swedish spoken by English speakers, such as missing Swedish stress patterns and overly reduced unstressed syllables. With this work, we aim to glean insights into speech prosody from the latent prosodic features of neural TTS models. In addition, it will help implement speech phenomena such as code switching in TTS.

Place, publisher, year, edition, pages
Leiden, The Netherlands: International Speech Communication Association, 2024
Keywords
foreign-accented text-to-speech synthesis, neural text-to-speech synthesis, latent prosodic features
Research programme
Tal- och musikkommunikation
Identifiers
URN: urn:nbn:se:kth:diva-349946; DOI: 10.21437/SpeechProsody.2024-209; Scopus ID: 2-s2.0-105008058763
Conference
Speech Prosody 2024, Leiden, The Netherlands, 2-5 July 2024
Projects
Deep learning based speech synthesis for reading aloud of lengthy and information rich texts in Swedish (2018-02427); Språkbanken Tal (2017-00626)
Research funders
Vinnova, 2018-02427
Note

QC 20240705

Available from: 2024-07-03. Created: 2024-07-03. Last updated: 2025-07-01. Bibliographically checked.
Tånnander, C., Edlund, J. & Gustafsson, J. (2024). Revisiting Three Text-to-Speech Synthesis Experiments with a Web-Based Audience Response System. In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings. Paper presented at the Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, May 20-25, 2024 (pp. 14111-14121). European Language Resources Association (ELRA)
2024 (English). In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings, European Language Resources Association (ELRA), 2024, pp. 14111-14121. Conference paper, published paper (peer-reviewed).
Abstract [en]

In order to investigate the strengths and weaknesses of Audience Response System (ARS) in text-to-speech synthesis (TTS) evaluations, we revisit three previously published TTS studies and perform an ARS-based evaluation on the stimuli used in each study. The experiments are performed with a participant pool of 39 respondents, using a web-based tool that emulates an ARS experiment. The results of the first experiment confirms that ARS is highly useful for evaluating long and continuous stimuli, particularly if we wish for a diagnostic result rather than a single overall metric, while the second and third experiments highlight weaknesses in ARS with unsuitable materials as well as the importance of framing and instruction when conducting ARS-based evaluation.

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2024
Keywords
audience response system, evaluation methodology, TTS evaluation
Identifiers
URN: urn:nbn:se:kth:diva-348784; Scopus ID: 2-s2.0-85195897862
Conference
Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, May 20-25, 2024
Note

Part of ISBN 9782493814104

QC 20240701

Available from: 2024-06-27. Created: 2024-06-27. Last updated: 2025-02-07. Bibliographically checked.
Ekström, A. G. & Edlund, J. (2024). Sketches of chimpanzee (Pan troglodytes) hoo’s: vowels by any other name? Primates, 65(2), 81-88
2024 (English). In: Primates, ISSN 0032-8332, E-ISSN 1610-7365, Vol. 65, no. 2, pp. 81-88. Journal article (peer-reviewed). Published.
Abstract [en]

In human speech, the close back rounded vowel /u/ (the vowel in “boot”) is articulated with the tongue arched toward the dorsal boundary of the hard palate, with the pharyngeal cavity open. Acoustic and perceptual properties of chimpanzee (Pan troglodytes) hoo’s are similar to those of the human vowel /u/. However, the vocal tract morphology of chimpanzees likely limits their phonetic capabilities, so that it is unlikely, or even impossible, that their articulation is comparable to that of a human. To determine how qualities of the vowel /u/ may be achieved given the chimpanzee vocal tract, we calculated transfer functions of the vocal tract area for tube models of vocal tract configurations in which vocal tract length, length and area of a laryngeal air sac simulacrum, length of lip protrusion, and area of lip opening were systematically varied. The method described is principally acoustic; we make no claim as to the actual shape of the chimpanzee vocal tract during call production. Nonetheless, we demonstrate that it may be possible to achieve the acoustic and perceptual qualities of back vowels without a reconfigured human vocal tract. The results, while tentative, suggest that the production of hoo’s by chimpanzees, while achieving comparable vowel-like qualities to the human /u/, may involve articulatory gestures that are beyond the range of the human articulators. The purpose of this study was to (1) stimulate further simulation research on great ape articulation, and (2) show that apparently vowel-like phenomena in nature are not necessarily indicative of evolutionary continuity per se.
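The tube-model approach above rests on standard tube acoustics. As a minimal, hypothetical sketch (a single uniform tube, not the multi-parameter configurations the paper varies), the quarter-wavelength relation F_n = (2n - 1) * c / (4 * L) gives the resonances of a tube closed at the glottis and open at the lips:

```python
# Minimal sketch (not the paper's model): resonance frequencies of a
# single uniform tube, closed at one end and open at the other, follow
# the quarter-wavelength relation F_n = (2n - 1) * c / (4 * L).
def tube_resonances(length_m: float, n: int = 3, c: float = 350.0) -> list[float]:
    """First n resonance frequencies (Hz) of a uniform closed-open tube."""
    return [(2 * k - 1) * c / (4 * length_m) for k in range(1, n + 1)]

# A 0.175 m (roughly human-length) tract gives the classic schwa-like
# resonances of ~500, 1500, 2500 Hz:
print(tube_resonances(0.175))
# Lengthening the tract, e.g. via lip protrusion, lowers all resonances:
print(tube_resonances(0.20))
```

Varying tube length and section areas, as the study does across many configurations, shifts these resonances and thereby the vowel-like quality of the output.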

Place, publisher, year, edition, pages
Springer Nature, 2024
Keywords
Articulatory phonetics, Primatology, Speech acoustics, Vowel quality
Identifiers
URN: urn:nbn:se:kth:diva-367104; DOI: 10.1007/s10329-023-01107-3; ISI: 001126347800001; PubMed ID: 38110671; Scopus ID: 2-s2.0-85180178929
Note

QC 20250715

Available from: 2025-07-15. Created: 2025-07-15. Last updated: 2025-07-15. Bibliographically checked.
Identifiers
ORCID: orcid.org/0000-0001-9327-9482