Publications (10 of 114)
Cai, H. & Ternström, S. (2025). A WaveNet-based model for predicting the electroglottographic signal from the acoustic voice signal. Journal of the Acoustical Society of America, 157(4), 3033-3044
2025 (English). In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 157, no 4, p. 3033-3044. Article in journal (Refereed), Published
Abstract [en]

The electroglottographic (EGG) signal offers a non-invasive approach to analyze phonation. It is known, if not obvious, that the onset of vocal fold contacting has a substantial effect on how the vocal folds vibrate and on the quality of the voice. Given that the presence or absence of vocal fold contacting has major consequences also for the interpretation of acoustic metrics, it is compelling to consider the possibility of predicting EGG signals directly from the microphone speech signal. This retrospective study presents a neural network model for EGG signal estimation utilizing a WaveNet architecture augmented with a self-attention mechanism. The model was trained on an existing dataset that comprehensively recorded participants' full voice range. The proposed model effectively captures the temporal dynamics and morphological characteristics of normophonic EGG waveforms, achieving outputs that closely resemble the ground truth in terms of EGG waveshape and extracted EGG metrics. For evaluation, voice mapping was used to display the distribution similarities of extracted metrics from predicted and ground truth EGG waveforms. The model exhibits proficiency in accurately estimating EGG signals in areas of stable and contacting voicing but displays reduced accuracy in transitional and breathy phonatory conditions.

Place, publisher, year, edition, pages
American Institute of Physics (AIP), 2025
Keywords
Phonetics, Vocalization, Vocal folds, Microphones, Speech analysis, Speech processing systems, Electroglottography, Acoustic signal processing, Artificial neural networks
National Category
Oto-rhino-laryngology Medical Instrumentation Signal Processing
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-362580 (URN); 10.1121/10.0036514 (DOI); 001472395500002 (); 40249176 (PubMedID); 2-s2.0-105003174138 (Scopus ID)
Note

A precursor to this article was included in Huanchen Cai's doctoral thesis. This is the revised and accepted version.

QC 20250425

Available from: 2025-04-20. Created: 2025-04-20. Last updated: 2025-12-08. Bibliographically approved
Ternström, S. & Pabon, P. (2025). From Voice Signals to Voice Maps. International Journal of Voice Sciences
2025 (English). In: International Journal of Voice Sciences, E-ISSN 3054-4343. Article in journal (Refereed), Epub ahead of print
Abstract [en]

This article is intended as an introductory tutorial for technically inclined clinicians, vocologists and voice pedagogues who want to understand the principles and potentials of voice mapping. Voice mapping has its origins in the Voice Range Profile, or phonetogram, but it is less concerned with the extremes of the voice range, and more with what happens within a relevant range of the voice. It is a voice instrumentation paradigm that is intended to improve the evidential value of voice measurements. It exposes and automatically accounts for the strong co-variation that most voice metrics exhibit with fundamental frequency and sound level. Very many data points are automatically collected in a short time, and their means are mapped by colour onto maps. This results in a robust representation of voice status and function. While individual voices are very different, a voice map’s appearance is reproducible within individuals. Comparing maps across interventions gives rich information, even on subtle changes in a voice. Further, by automatically clustering multiple metrics, phonation types can be identified and mapped automatically, which can increase the clinical relevance, and facilitate a better understanding of voice data in general.

Place, publisher, year, edition, pages
Hildesheim/Holzminden/Göttingen: Paradigm Publishers, 2025
Keywords
Voice map, voice range profile, voice measurement, electroglottography, clinical evidence
National Category
Medical Instrumentation Oto-rhino-laryngology Signal Processing
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-376855 (URN); 10.2478/ijvs-2025-0002 (DOI)
Projects
Språkbanken Tal; HumInfra
Note

QC 20260218

Available from: 2026-02-18. Created: 2026-02-18. Last updated: 2026-02-18. Bibliographically approved
Park, M., Ontakhrai, S., Kittimathaveenan, K., Alfredsson, J. & Ternström, S. (2025). How to make closed-back headphones transparent for a vocalist’s own direct sound. Paper presented at AES 159th Convention 2025 October 23–25, Long Beach, CA, USA (pp. 8). Audio Engineering Society, Inc., Article ID 371.
2025 (English). Conference paper, Published paper (Refereed)
Abstract [en]

In stage acoustics research, it is common to use virtual acoustic environments over headphones to simulate various room conditions for musicians. When experiments are conducted in suboptimal physical environments (e.g., without an anechoic chamber), it is often challenging to reduce the inherent reverberation of the test room while ensuring that musicians can hear their own direct sound through the headphones as if the headphones were transparent. In the present study, two methods were developed and tested using an acoustic dummy head, with the aim of faithfully replicating the direct sound of a solo singer over a pair of closed-back headphones. The results showed that the first method - creating and applying an exact finite-impulse-response (FIR) filter - may lead to undesirable effects, primarily due to the inherent delay of the playback system. The second method, which utilized a multiband equalizer, proved more effective when evaluated with stationary broadband noise. For non-stationary, real-world sounds such as singing and speaking voices, the comparison between the reference sound and the signal processed by the multiband equalizer remained reasonably accurate, with differences typically less than ~1 dB across much of the frequency range. Future work may further refine and evaluate the proposed methods through listening tests.

Place, publisher, year, edition, pages
Audio Engineering Society, Inc., 2025
Series
AES E-Library
Keywords
singing, headphones, hearing-of-self, acoustic transparency
National Category
Signal Processing Music Other Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-373098 (URN)
Conference
AES 159th Convention 2025 October 23–25, Long Beach, CA, USA
Note

AES 159th Convention Express Paper 371.

This study was supported by the King Mongkut’s Institute of Technology Ladkrabang research grant (KREF046818), the Karl Engver Foundation, and OCSC, Thailand. 

QC 20251219

Available from: 2025-11-18. Created: 2025-11-18. Last updated: 2025-12-19. Bibliographically approved
Herbst, C. T., Tokuda, I. T., Nishimura, T., Ternström, S., Ossio, V., Levy, M., . . . Dunn, J. C. (2025). ‘Monkey yodels’—frequency jumps in New World monkey vocalizations greatly surpass human vocal register transitions. Philosophical Transactions of the Royal Society of London. Biological Sciences, 380(1923), Article ID 20240005.
2025 (English). In: Philosophical Transactions of the Royal Society of London. Biological Sciences, ISSN 0962-8436, E-ISSN 1471-2970, Vol. 380, no 1923, article id 20240005. Article in journal (Refereed), Published
Abstract [en]

We investigated the causal basis of abrupt frequency jumps in a unique database of New World monkey vocalizations. We used a combination of acoustic and electroglottographic recordings in vivo, excised larynx investigations of vocal fold dynamics, and computational modelling. We particularly attended to the contribution of the vocal membranes: thin upward extensions of the vocal folds found in most primates but absent in humans. In three of the six investigated species, we observed two distinct modes of vocal fold vibration. The first, involving vocal fold vibration alone, produced low-frequency oscillations, and is analogous to that underlying human phonation. The second, incorporating the vocal membranes, resulted in much higher-frequency oscillation. Abrupt fundamental frequency shifts were observed in all three datasets. While these data are reminiscent of the rapid transitions in frequency observed in certain human singing styles (e.g. yodelling), the frequency jumps are considerably larger in the nonhuman primates studied. Our data suggest that peripheral modifications of vocal anatomy provide an important source of variability and complexity in the vocal repertoires of nonhuman primates. We further propose that the call repertoire is crucially related to a species’ ability to vocalize with different laryngeal mechanisms, analogous to human vocal registers.

Place, publisher, year, edition, pages
The Royal Society, 2025
Keywords
vocal membrane, laryngeal mechanism, call repertoire, NLP vocalization, fundamental frequency control
National Category
Oto-rhino-laryngology Applied Mechanics Structural Biology
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-219581 (URN); 10.1098/rstb.2024.0005 (DOI); 001461623200021 (); 40176522 (PubMedID); 2-s2.0-105001836522 (Scopus ID)
Note

QC 20250520

Available from: 2025-04-03. Created: 2025-04-03. Last updated: 2025-05-20. Bibliographically approved
Capobianco, S., Björck, G., Forli, F., Berrettini, S. & Ternström, S. (2025). Voice mapping in clinical practice: tracking objective changes after injection laryngoplasty. Otorinolaryngologie a foniatrie, 74(S1), s28-s28
2025 (English). In: Otorinolaryngologie a foniatrie, ISSN 1210-7867, Vol. 74, no S1, p. s28-s28. Article in journal (Refereed), Published
National Category
Oto-rhino-laryngology
Identifiers
urn:nbn:se:kth:diva-374602 (URN); 10.48095/ccorl2025s1_46 (DOI)
Note

QC 20251219

Available from: 2025-12-19. Created: 2025-12-19. Last updated: 2025-12-19. Bibliographically approved
Capobianco, S., Björck, G., Forli, F., Bruschini, L., Nacci, A. & Ternström, S. (2025). Voice mapping in clinical practice: Tracking objective changes after injection laryngoplasty. In: Frassineti, L Lanata, A Manfredi, C (Ed.), Models and analysis of vocal emissions for biomedical applications: . Paper presented at 14th International Workshop on MODELS AND ANALYSIS OF VOCAL EMISSIONS FOR BIOMEDICAL APPLICATIONS-MAVEBA, DEC 16-17, 2025, Firenze, ITALY (pp. 15-18). Firenze Univ Press, 139
2025 (English). In: Models and analysis of vocal emissions for biomedical applications / [ed] Frassineti, L Lanata, A Manfredi, C, Firenze Univ Press, 2025, Vol. 139, p. 15-18. Conference paper, Published paper (Refereed)
Abstract [en]

Objective: to explore the use of voice mapping for assessing changes in phonatory function following injection laryngoplasty in patients with unilateral vocal fold paralysis (UVFP). Materials and methods: Two patient cohorts were analyzed. Cohort 1 (N=8) received in-office injections of hyaluronic acid or calcium hydroxylapatite, with voice recordings acquired before and immediately after treatment. Cohort 2 (N=4) underwent autologous fat injection under general anesthesia, with follow-ups at 1 and 3 months. All patients completed standard speech tasks with simultaneous acquisition of acoustic and electroglottographic (EGG) signals. Voice maps were computed using the FonaDyn system. Perceptual GRBAS ratings were provided by three blinded expert raters. Results: Voice mapping was feasible in all patients and revealed consistent treatment effects. Across both cohorts, the cycle-rate sample entropy (CSE) decreased, while the normalized peak dEGG (Q(Delta)) and the Index of Contacting (I-c) both increased, indicating improved phonatory stability and vocal fold contact. Perceptual ratings showed corresponding reductions in breathiness and overall dysphonia. Conclusions: The voice map representation clearly visualized and quantified phonatory changes post-treatment for UVFP, with potential applications in clinical monitoring and outcome evaluation.

Place, publisher, year, edition, pages
Firenze Univ Press, 2025
Series
Proceedings E Report, ISSN 2704-601X
Keywords
vocal fold paralysis, injection laryngoplasty, voice mapping, electroglottography, vocal fold contact
National Category
Oto-rhino-laryngology
Identifiers
urn:nbn:se:kth:diva-379481 (URN); 001686443700001 ()
Conference
14th International Workshop on MODELS AND ANALYSIS OF VOCAL EMISSIONS FOR BIOMEDICAL APPLICATIONS-MAVEBA, DEC 16-17, 2025, Firenze, ITALY
Note

Part of ISBN 979-12-215-0820-8; 979-12-215-0821-5

QC 20260416

Available from: 2026-04-16. Created: 2026-04-16. Last updated: 2026-04-16. Bibliographically approved
Ternström, S. & Pabon, P. (2025). "Voice Range Profile" or "Voice Map"?: On terms, rationales and techniques. In: L. Frassineti, A. Lanatà, C. Manfredi (Ed.), Models and Analysis of Vocal Emissions for Biomedical Applications: 14th International Workshop. Paper presented at 14th MAVEBA Workshop, 16-17 Dec, Florence, Italy (pp. 135-138). Firenze, Italy: Firenze University Press (FUP)
2025 (English). In: Models and Analysis of Vocal Emissions for Biomedical Applications: 14th International Workshop / [ed] L. Frassineti, A. Lanatà, C. Manfredi, Firenze, Italy: Firenze University Press (FUP), 2025, p. 135-138. Conference paper, Published paper (Refereed)
Abstract [en]

Let “voice range profile” (a.k.a. “phonetogram”) be the term for a graph of the maximum phonatory range of a voice on the fo×SPL plane, i.e., a closed contour. Let “voice map” be the term for a map of a scalar metric over some relevant range, not necessarily to the extremes, on that same plane, i.e., a 2D scalar field. For imaging several metrics, one voice map can have several “layers”, all derived from the same recording. This paradigm for collection and collation of voice data is useful, because it accounts for how the chosen metrics vary systematically with fo and SPL. Both fo and SPL are influential and typically nonlinear covariates of other voice metrics. Not accounting for them can obscure the effects of an intervention. Here we summarize some central concepts, rationales and techniques related to voice mapping.

Place, publisher, year, edition, pages
Firenze, Italy: Firenze University Press (FUP), 2025
Series
Models and Analysis of Vocal Emissions for Biomedical Applications, ISSN 2704-601X, E-ISSN 2704-5846 ; 139
Keywords
voice analysis, voice map, voice range profile, electroglottography
National Category
Signal Processing
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-374335 (URN)
Conference
14th MAVEBA Workshop, 16-17 Dec, Florence, Italy
Projects
Språkbanken Tal
Funder
Swedish Research Council
Note

Part of ISBN 9791221508208, 9791221508215

QC 20251218

Available from: 2025-12-17. Created: 2025-12-17. Last updated: 2026-02-25. Bibliographically approved
Ternström, S., Bernardoni, N. H., Birkholz, P., Guasch, O. & Gully, A. (Eds.). (2024). Computational Analysis and Simulation of the Human Voice (Dagstuhl Seminar 24242). Paper presented at Dagstuhl Seminar 24242. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 14(6)
2024 (English). Conference proceedings (editor) (Other academic)
Abstract [en]

This report documents the program and the outcomes of Dagstuhl Seminar 24242 "Computational Analysis and Simulation of the Human Voice", which was held from the 9th to the 14th of June, 2024. The seminar addressed key issues for a better understanding of the human voice by focusing on four main areas: voice analysis, visualisation techniques, simulation methods, and data analysis with machine learning. There has been enormous progress in recent years in all these fields. The seminar brought together a number of experts from fields as diverse as computer science, logopedics and phoniatrics, clinicians, acoustics and audio engineering, electronics, musicology, speech and hearing sciences, physics and mathematics. The schedule was quite flexible, including inspirational talks in the main areas, interactive working groups, sharing of conclusions and discussions, presentation of successes and failures to learn from, and a large number of free talks that emerged throughout the days. The variety of topics and participants created a highly enriching environment from which novel proposals for future research and collaboration emerged, as well as the collective writing of a paper on the state of the art and future perspectives in human voice research.

Place, publisher, year, edition, pages
Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024. p. 24
Series
Dagstuhl Reports, ISSN 2192-5283 ; 14
Keywords
voice analysis, voice simulation, voice visualization
National Category
Bioinformatics (Computational Biology) Other Computer and Information Science
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-357969 (URN); 10.4230/DagRep.14.6.84 (DOI)
Conference
Dagstuhl Seminar 24242
Note

QC 20250113

Available from: 2024-12-21. Created: 2024-12-21. Last updated: 2025-01-13. Bibliographically approved
Iob, N. A., He, L., Ternström, S., Cai, H. & Brockmann-Bauser, M. (2024). Effects of Speech Characteristics on Electroglottographic and Instrumental Acoustic Voice Analysis Metrics in Women With Structural Dysphonia Before and After Treatment. Journal of Speech, Language and Hearing Research, 67(6), 1660-1681
2024 (English). In: Journal of Speech, Language and Hearing Research, ISSN 1092-4388, E-ISSN 1558-9102, Vol. 67, no 6, p. 1660-1681. Article in journal (Refereed), Published
Abstract [en]

Purpose: Literature suggests a dependency of the acoustic metrics, smoothed cepstral peak prominence (CPPS) and harmonics-to-noise ratio (HNR), on human voice loudness and fundamental frequency (fo). Even though this has been explained with different oscillatory patterns of the vocal folds, so far, it has not been specifically investigated. In the present work, the influence of three elicitation levels, calibrated sound pressure level (SPL), fo and vowel on the electroglottographic (EGG) and time-differentiated EGG (dEGG) metrics hybrid open quotient (OQ), dEGG OQ and peak dEGG, as well as on the acoustic metrics CPPS and HNR, was examined, and their suitability for voice assessment was evaluated. Method: In a retrospective study, 29 women with a mean age of 25 years (± 8.9, range: 18–53) diagnosed with structural vocal fold pathologies were examined before and after voice therapy or phonosurgery. Both acoustic and EGG signals were recorded simultaneously during the phonation of the sustained vowels /ɑ/, /i/, and /u/ at three elicited levels of loudness (soft/comfortable/loud) and unconstrained fo conditions. Results: A linear mixed-model analysis showed a significant effect of elicitation effort levels on peak dEGG, HNR, and CPPS (all p < .01). Calibrated SPL significantly influenced HNR and CPPS (both p < .01). Furthermore, fo had a significant effect on peak dEGG and CPPS (p < .0001). All metrics showed significant changes with regard to vowel (all p < .05). However, the treatment had no effect on the examined metrics, regardless of the treatment type (surgery vs. voice therapy). Conclusions: The value of the investigated metrics for voice assessment purposes when sampled without sufficient control of SPL and fo is limited, in that they are significantly influenced by the phonatory context, be it speech or elicited sustained vowels. Future studies should explore the diagnostic value of new data collation approaches such as voice mapping, which take SPL and fo effects into account.

Place, publisher, year, edition, pages
American Speech Language Hearing Association, 2024
National Category
Otorhinolaryngology
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-346605 (URN); 10.1044/2024_JSLHR-23-00253 (DOI); 001245110000002 (); 38758676 (PubMedID); 2-s2.0-85192238446 (Scopus ID)
Note

QC 20240703

Available from: 2024-05-20. Created: 2024-05-20. Last updated: 2025-02-21. Bibliographically approved
Cai, H., Ternström, S., Chaffanjon, P. & Henrich Bernardoni, N. (2024). Effects on Voice Quality of Thyroidectomy: A Qualitative and Quantitative Study Using Voice Maps. Journal of Voice
2024 (English). In: Journal of Voice, ISSN 0892-1997, E-ISSN 1873-4588. Article in journal (Refereed), Epub ahead of print
Abstract [en]

Objectives: This study aims to explore the effects of thyroidectomy—a surgical intervention involving the removal of the thyroid gland—on voice quality, as represented by acoustic and electroglottographic measures. Given the thyroid gland's proximity to the inferior and superior laryngeal nerves, thyroidectomy carries a potential risk of affecting vocal function. While earlier studies have documented effects on the voice range, few studies have looked at voice quality after thyroidectomy. Since voice quality effects could manifest in many ways, that a priori are unknown, we wish to apply an exploratory approach that collects many data points from several metrics.

Methods: A voice-mapping analysis paradigm was applied retrospectively on a corpus of spoken and sung sentences produced by patients who had thyroid surgery. Voice quality changes were assessed objectively for 57 patients prior to surgery and 2 months after surgery, by making comparative voice maps, pre- and post-intervention, of six acoustic and electroglottographic (EGG) metrics.

Results: After thyroidectomy, statistically significant changes consistent with a worsening of voice quality were observed in most metrics. For all individual metrics, however, the effect sizes were too small to be clinically relevant. Statistical clustering of the metrics helped to clarify the nature of these changes. While partial thyroidectomy demonstrated greater uniformity than did total thyroidectomy, the type of perioperative damage had no discernible impact on voice quality.

Conclusions: Changes in voice quality after thyroidectomy were related mostly to increased phonatory instability in both the acoustic and EGG metrics. Clustered voice metrics exhibited a higher correlation to voice complaints than did individual voice metrics.

Place, publisher, year, edition, pages
Elsevier, 2024
Keywords
thyroidectomy, voice quality, electroglottography, voice classification, voice mapping
National Category
Otorhinolaryngology
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-346224 (URN); 10.1016/j.jvoice.2024.03.012 (DOI); 2-s2.0-85192255370 (Scopus ID)
Funder
KTH Royal Institute of Technology, 6308
Note

QC 20240508

Available from: 2024-05-07. Created: 2024-05-07. Last updated: 2025-02-21. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0002-3362-7518