Publications (10 of 112)
Cai, H. & Ternström, S. (2025). A WaveNet-based model for predicting the electroglottographic signal from the acoustic voice signal. Journal of the Acoustical Society of America, 157(4), 3033-3044
A WaveNet-based model for predicting the electroglottographic signal from the acoustic voice signal
2025 (English) In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 157, no 4, p. 3033-3044. Article in journal (Refereed), Published
Abstract [en]

The electroglottographic (EGG) signal offers a non-invasive approach to analyze phonation. It is known, if not obvious, that the onset of vocal fold contacting has a substantial effect on how the vocal folds vibrate and on the quality of the voice. Given that the presence or absence of vocal fold contacting has major consequences also for the interpretation of acoustic metrics, it is compelling to consider the possibility of predicting EGG signals directly from the microphone speech signal. This retrospective study presents a neural network model for EGG signal estimation utilizing a WaveNet architecture augmented with a self-attention mechanism. The model was trained on an existing dataset that comprehensively recorded participants' full voice range. The proposed model effectively captures the temporal dynamics and morphological characteristics of normophonic EGG waveforms, achieving outputs that closely resemble the ground truth in terms of EGG waveshape and extracted EGG metrics. For evaluation, voice mapping was used to display the distribution similarities of extracted metrics from predicted and ground truth EGG waveforms. The model exhibits proficiency in accurately estimating EGG signals in areas of stable and contacting voicing but displays reduced accuracy in transitional and breathy phonatory conditions.
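The defining ingredient of the WaveNet architecture named above is the dilated causal convolution, whose receptive field grows exponentially with depth. The sketch below illustrates only that general idea in plain Python; the function names, toy kernels, and impulse input are invented for illustration and are not from the paper, which additionally augments WaveNet with self-attention.

```python
# Minimal sketch of WaveNet's core building block: dilated causal
# convolution. Output at time t depends only on x[t], x[t-d], x[t-2d], ...

def dilated_causal_conv(x, kernel, dilation):
    """1-D causal convolution with the given dilation factor."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(kernel):
            j = t - i * dilation          # look back i*dilation samples
            if j >= 0:
                acc += w * x[j]
        out.append(acc)
    return out

def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated layers, in samples."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

signal = [0.0] * 8 + [1.0] + [0.0] * 7        # unit impulse at t = 8
layer1 = dilated_causal_conv(signal, [0.5, 0.5], dilation=1)
layer2 = dilated_causal_conv(layer1, [0.5, 0.5], dilation=2)

print(receptive_field(2, [1, 2, 4, 8]))       # 16 samples
```

Stacking such layers with dilations 1, 2, 4, 8, ... is what lets WaveNet-style models condition each output sample on hundreds of past samples at modest cost, which is what makes waveform-to-waveform mapping such as acoustics-to-EGG feasible.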

Place, publisher, year, edition, pages
American Institute of Physics (AIP), 2025
Keywords
Phonetics, Vocalization, Vocal folds, Microphones, Speech analysis, Speech processing systems, Electroglottography, Acoustic signal processing, Artificial neural networks
National Category
Oto-rhino-laryngology; Medical Instrumentation; Signal Processing
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-362580 (URN)
10.1121/10.0036514 (DOI)
001472395500002 ()
40249176 (PubMedID)
2-s2.0-105003174138 (Scopus ID)
Note

A precursor to this article was included in Huanchen Cai's doctoral thesis. This is the revised and accepted version.

QC 20250425

Available from: 2025-04-20. Created: 2025-04-20. Last updated: 2025-12-08. Bibliographically approved.
Park, M., Ontakhrai, S., Kittimathaveenan, K., Alfredsson, J. & Ternström, S. (2025). How to make closed-back headphones transparent for a vocalist’s own direct sound. Paper presented at AES 159th Convention, 2025 October 23–25, Long Beach, CA, USA (pp. 8). Audio Engineering Society, Inc., Article ID 371.
How to make closed-back headphones transparent for a vocalist’s own direct sound
2025 (English) Conference paper, Published paper (Refereed)
Abstract [en]

In stage acoustics research, it is common to use virtual acoustic environments over headphones to simulate various room conditions for musicians. When experiments are conducted in suboptimal physical environments (e.g., without an anechoic chamber), it is often challenging to reduce the inherent reverberation of the test room while ensuring that musicians can hear their own direct sound through the headphones as if the headphones were transparent. In the present study, two methods were developed and tested using an acoustic dummy head, with the aim of faithfully replicating the direct sound of a solo singer over a pair of closed-back headphones. The results showed that the first method - creating and applying an exact finite-impulse-response (FIR) filter - may lead to undesirable effects, primarily due to the inherent delay of the playback system. The second method, which utilized a multiband equalizer, proved more effective when evaluated with stationary broadband noise. For non-stationary, real-world sounds such as singing and speaking voices, the comparison between the reference sound and the signal processed by the multiband equalizer remained reasonably accurate, with differences typically less than ~1 dB across much of the frequency range. Future work may further refine and evaluate the proposed methods through listening tests.
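The multiband-equalizer method described above amounts to measuring band-by-band level differences between the reference (open-ear) sound and the headphone playback at the dummy head, then applying the inverse differences as static gains. A minimal sketch of that correction step follows; all band levels and the clamping limit are invented for illustration and are not measurements from the paper.

```python
# Hypothetical band levels (dB) measured at a dummy head: the reference
# (open ear) vs. the same source heard through closed-back headphones.
reference_db = {250: 62.0, 500: 65.0, 1000: 64.0, 2000: 60.0, 4000: 55.0}
headphone_db = {250: 58.5, 500: 66.2, 1000: 61.0, 2000: 62.5, 4000: 50.0}

def equalizer_gains(ref, meas, limit_db=12.0):
    """Per-band correction gain in dB, clamped to a safe range."""
    return {f: max(-limit_db, min(limit_db, ref[f] - meas[f])) for f in ref}

gains = equalizer_gains(reference_db, headphone_db)

# Residual band error after applying the gains (0 dB when unclamped):
residual = {f: headphone_db[f] + gains[f] - reference_db[f] for f in reference_db}
```

Because the gains are static, this approach adds no filter delay, which is the advantage over the exact FIR method that the abstract reports.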

Place, publisher, year, edition, pages
Audio Engineering Society, Inc., 2025
Series
AES E-Library
Keywords
singing, headphones, hearing-of-self, acoustic transparency
National Category
Signal Processing; Music; Other Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-373098 (URN)
Conference
AES 159th Convention 2025 October 23–25, Long Beach, CA, USA
Note

AES 159th Convention Express Paper 371.

This study was supported by the King Mongkut’s Institute of Technology Ladkrabang research grant (KREF046818), the Karl Engver Foundation, and OCSC, Thailand. 

QC 20251219

Available from: 2025-11-18. Created: 2025-11-18. Last updated: 2025-12-19. Bibliographically approved.
Herbst, C. T., Tokuda, I. T., Nishimura, T., Ternström, S., Ossio, V., Levy, M., . . . Dunn, J. C. (2025). ‘Monkey yodels’—frequency jumps in New World monkey vocalizations greatly surpass human vocal register transitions. Philosophical Transactions of the Royal Society of London. Biological Sciences, 380(1923), Article ID 20240005.
‘Monkey yodels’—frequency jumps in New World monkey vocalizations greatly surpass human vocal register transitions
2025 (English) In: Philosophical Transactions of the Royal Society of London. Biological Sciences, ISSN 0962-8436, E-ISSN 1471-2970, Vol. 380, no 1923, article id 20240005. Article in journal (Refereed), Published
Abstract [en]

We investigated the causal basis of abrupt frequency jumps in a unique database of New World monkey vocalizations. We used a combination of acoustic and electroglottographic recordings in vivo, excised larynx investigations of vocal fold dynamics, and computational modelling. We particularly attended to the contribution of the vocal membranes: thin upward extensions of the vocal folds found in most primates but absent in humans. In three of the six investigated species, we observed two distinct modes of vocal fold vibration. The first, involving vocal fold vibration alone, produced low-frequency oscillations, and is analogous to that underlying human phonation. The second, incorporating the vocal membranes, resulted in much higher-frequency oscillation. Abrupt fundamental frequency shifts were observed in all three datasets. While these data are reminiscent of the rapid transitions in frequency observed in certain human singing styles (e.g. yodelling), the frequency jumps are considerably larger in the nonhuman primates studied. Our data suggest that peripheral modifications of vocal anatomy provide an important source of variability and complexity in the vocal repertoires of nonhuman primates. We further propose that the call repertoire is crucially related to a species’ ability to vocalize with different laryngeal mechanisms, analogous to human vocal registers.

Place, publisher, year, edition, pages
The Royal Society, 2025
Keywords
vocal membrane, laryngeal mechanism, call repertoire, NHP vocalization, fundamental frequency control
National Category
Oto-rhino-laryngology; Applied Mechanics; Structural Biology
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-219581 (URN)
10.1098/rstb.2024.0005 (DOI)
001461623200021 ()
40176522 (PubMedID)
2-s2.0-105001836522 (Scopus ID)
Note

QC 20250520

Available from: 2025-04-03. Created: 2025-04-03. Last updated: 2025-05-20. Bibliographically approved.
Capobianco, S., Björck, G., Forli, F., Berrettini, S. & Ternström, S. (2025). Voice mapping in clinical practice: tracking objective changes after injection laryngoplasty. Otorinolaryngologie a foniatrie, 74(S1), s28-s28
Voice mapping in clinical practice: tracking objective changes after injection laryngoplasty
2025 (English) In: Otorinolaryngologie a foniatrie, ISSN 1210-7867, Vol. 74, no S1, p. s28-s28. Article in journal (Refereed), Published
National Category
Oto-rhino-laryngology
Identifiers
urn:nbn:se:kth:diva-374602 (URN)
10.48095/ccorl2025s1_46 (DOI)
Note

QC 20251219

Available from: 2025-12-19. Created: 2025-12-19. Last updated: 2025-12-19. Bibliographically approved.
Ternström, S. & Pabon, P. (2025). "Voice Range Profile" or "Voice Map"?: On terms, rationales and techniques. In: L. Frassineti, A. Lanatà, C. Manfredi (Eds.), Models and Analysis of Vocal Emissions for Biomedical Applications: 14th International Workshop. Paper presented at 14th MAVEBA Workshop, 16-17 Dec, Florence, Italy (pp. 135-138). Firenze, Italy: Firenze University Press (FUP)
"Voice Range Profile" or "Voice Map"?: On terms, rationales and techniques
2025 (English) In: Models and Analysis of Vocal Emissions for Biomedical Applications: 14th International Workshop / [ed] L. Frassineti, A. Lanatà, C. Manfredi, Firenze, Italy: Firenze University Press (FUP), 2025, p. 135-138. Conference paper, Published paper (Refereed)
Abstract [en]

Let “voice range profile” (a.k.a. “phonetogram”) be the term for a graph of the maximum phonatory range of a voice on the fo×SPL plane, i.e., a closed contour. Let “voice map” be the term for a map of a scalar metric over some relevant range, not necessarily to the extremes, on that same plane, i.e., a 2D scalar field. For imaging several metrics, one voice map can have several “layers”, all derived from the same recording. This paradigm for collection and collation of voice data is useful, because it accounts for how the chosen metrics vary systematically with fo and SPL. Both fo and SPL are influential and typically nonlinear covariates of other voice metrics. Not accounting for them can obscure the effects of an intervention. Here we summarize some central concepts, rationales and techniques related to voice mapping.
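The "2D scalar field" notion of a voice map described above can be sketched as a simple binning-and-averaging step: each analysis frame contributes its metric value to the cell of the fo×SPL plane it falls in. The cell sizes, sample data, and function names below are illustrative only, not taken from the paper.

```python
# Sketch of voice-map accumulation: average a scalar metric per cell of
# the fo (semitones) x SPL (dB) plane. One recording can feed several
# such "layers", one per metric.
from collections import defaultdict

def build_voice_map(frames, fo_step=1.0, spl_step=1.0):
    """frames: iterable of (fo_semitones, spl_db, metric_value) tuples."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for fo, spl, value in frames:
        cell = (int(fo // fo_step), int(spl // spl_step))
        sums[cell] += value
        counts[cell] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}

# Three toy frames: the first two land in the same 1 st x 1 dB cell.
frames = [(57.2, 70.3, 10.0), (57.8, 70.9, 14.0), (60.1, 80.5, 5.0)]
vmap = build_voice_map(frames)
```

Because each cell averages only frames produced at (nearly) the same fo and SPL, comparing two such maps cell by cell factors out the fo and SPL covariation that the abstract warns about.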

Place, publisher, year, edition, pages
Firenze, Italy: Firenze University Press (FUP), 2025
Series
Models and Analysis of Vocal Emissions for Biomedical Applications, ISSN 2704-601X, E-ISSN 2704-5846 ; 139
Keywords
voice analysis, voice map, voice range profile, electroglottography
National Category
Signal Processing
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-374335 (URN)
Conference
14th MAVEBA Workshop, 16-17 Dec, Florence, Italy
Projects
Språkbanken Tal
Note

Part of ISBN 9791221508208, 9791221508215

QC 20251218

Available from: 2025-12-17. Created: 2025-12-17. Last updated: 2025-12-22. Bibliographically approved.
Ternström, S., Bernardoni, N. H., Birkholz, P., Guasch, O. & Gully, A. (Eds.). (2024). Computational Analysis and Simulation of the Human Voice (Dagstuhl Seminar 24242). Paper presented at Dagstuhl Seminar 24242. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 14(6)
Computational Analysis and Simulation of the Human Voice (Dagstuhl Seminar 24242)
2024 (English) Conference proceedings (editor) (Other academic)
Abstract [en]

This report documents the program and the outcomes of Dagstuhl Seminar 24242 "Computational Analysis and Simulation of the Human Voice", which was held from the 9th to the 14th of June, 2024. The seminar addressed key issues for a better understanding of the human voice by focusing on four main areas: voice analysis, visualisation techniques, simulation methods, and data analysis with machine learning. There has been enormous progress in recent years in all these fields. The seminar brought together a number of experts from fields as diverse as computer science, logopedics and phoniatrics, clinicians, acoustics and audio engineering, electronics, musicology, speech and hearing sciences, physics and mathematics. The schedule was quite flexible, including inspirational talks in the main areas, interactive working groups, sharing of conclusions and discussions, presentation of successes and failures to learn from, and a large number of free talks that emerged throughout the days. The variety of topics and participants created a highly enriching environment from which novel proposals for future research and collaboration emerged, as well as the collective writing of a paper on the state of the art and future perspectives in human voice research.

Place, publisher, year, edition, pages
Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024. p. 24
Series
Dagstuhl Reports, ISSN 2192-5283 ; 14
Keywords
voice analysis, voice simulation, voice visualization
National Category
Bioinformatics (Computational Biology); Other Computer and Information Science
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-357969 (URN)
10.4230/DagRep.14.6.84 (DOI)
Conference
Dagstuhl Seminar 24242
Note

QC 20250113

Available from: 2024-12-21. Created: 2024-12-21. Last updated: 2025-01-13. Bibliographically approved.
Iob, N. A., He, L., Ternström, S., Cai, H. & Brockmann-Bauser, M. (2024). Effects of Speech Characteristics on Electroglottographic and Instrumental Acoustic Voice Analysis Metrics in Women With Structural Dysphonia Before and After Treatment. Journal of Speech, Language and Hearing Research, 67(6), 1660-1681
Effects of Speech Characteristics on Electroglottographic and Instrumental Acoustic Voice Analysis Metrics in Women With Structural Dysphonia Before and After Treatment
2024 (English) In: Journal of Speech, Language and Hearing Research, ISSN 1092-4388, E-ISSN 1558-9102, Vol. 67, no 6, p. 1660-1681. Article in journal (Refereed), Published
Abstract [en]

Purpose: Literature suggests a dependency of the acoustic metrics smoothed cepstral peak prominence (CPPS) and harmonics-to-noise ratio (HNR) on human voice loudness and fundamental frequency (fo). Even though this has been explained with different oscillatory patterns of the vocal folds, so far, it has not been specifically investigated. In the present work, the influence of three elicitation levels, calibrated sound pressure level (SPL), fo, and vowel on the electroglottographic (EGG) and time-differentiated EGG (dEGG) metrics hybrid open quotient (OQ), dEGG OQ, and peak dEGG, as well as on the acoustic metrics CPPS and HNR, was examined, and their suitability for voice assessment was evaluated. Method: In a retrospective study, 29 women with a mean age of 25 years (±8.9, range: 18–53) diagnosed with structural vocal fold pathologies were examined before and after voice therapy or phonosurgery. Both acoustic and EGG signals were recorded simultaneously during the phonation of the sustained vowels /ɑ/, /i/, and /u/ at three elicited levels of loudness (soft/comfortable/loud) and unconstrained fo conditions. Results: A linear mixed-model analysis showed a significant effect of elicitation effort levels on peak dEGG, HNR, and CPPS (all p < .01). Calibrated SPL significantly influenced HNR and CPPS (both p < .01). Furthermore, fo had a significant effect on peak dEGG and CPPS (p < .0001). All metrics showed significant changes with regard to vowel (all p < .05). However, the treatment had no effect on the examined metrics, regardless of the treatment type (surgery vs. voice therapy). Conclusions: The value of the investigated metrics for voice assessment purposes when sampled without sufficient control of SPL and fo is limited, in that they are significantly influenced by the phonatory context, be it speech or elicited sustained vowels. Future studies should explore the diagnostic value of new data collation approaches such as voice mapping, which take SPL and fo effects into account.
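The conclusion above, that metrics sampled without sufficient control of SPL and fo are of limited diagnostic value, can be illustrated with a toy simulation: if a metric rises with SPL, two sessions recorded at different loudness levels will show an apparent "treatment effect" even when the underlying voice is unchanged. The slope, noise level, and session SPLs below are invented purely for illustration.

```python
import random

# Toy confound demo: a fake metric that tracks SPL with an invented
# slope of 0.3 per dB plus Gaussian measurement noise.
random.seed(1)

def fake_metric(spl_db):
    return 0.3 * spl_db + random.gauss(0.0, 0.5)

pre  = [fake_metric(65.0) for _ in range(200)]   # patient spoke softly
post = [fake_metric(75.0) for _ in range(200)]   # same voice, 10 dB louder

mean_pre = sum(pre) / len(pre)
mean_post = sum(post) / len(post)
apparent_change = mean_post - mean_pre           # ~3.0, purely from SPL
```

The ~3-unit shift here is entirely an artifact of the 10 dB loudness difference, which is why approaches that condition on SPL and fo, such as voice mapping, are proposed in the abstract.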

Place, publisher, year, edition, pages
American Speech Language Hearing Association, 2024
National Category
Otorhinolaryngology
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-346605 (URN)
10.1044/2024_JSLHR-23-00253 (DOI)
001245110000002 ()
38758676 (PubMedID)
2-s2.0-85192238446 (Scopus ID)
Note

QC 20240703

Available from: 2024-05-20. Created: 2024-05-20. Last updated: 2025-02-21. Bibliographically approved.
Cai, H., Ternström, S., Chaffanjon, P. & Henrich Bernardoni, N. (2024). Effects on Voice Quality of Thyroidectomy: A Qualitative and Quantitative Study Using Voice Maps. Journal of Voice
Effects on Voice Quality of Thyroidectomy: A Qualitative and Quantitative Study Using Voice Maps
2024 (English) In: Journal of Voice, ISSN 0892-1997, E-ISSN 1873-4588. Article in journal (Refereed), Epub ahead of print
Abstract [en]

Objectives: This study aims to explore the effects of thyroidectomy—a surgical intervention involving the removal of the thyroid gland—on voice quality, as represented by acoustic and electroglottographic measures. Given the thyroid gland's proximity to the inferior and superior laryngeal nerves, thyroidectomy carries a potential risk of affecting vocal function. While earlier studies have documented effects on the voice range, few studies have looked at voice quality after thyroidectomy. Since voice quality effects could manifest in many ways, that a priori are unknown, we wish to apply an exploratory approach that collects many data points from several metrics.

Methods: A voice-mapping analysis paradigm was applied retrospectively on a corpus of spoken and sung sentences produced by patients who had thyroid surgery. Voice quality changes were assessed objectively for 57 patients prior to surgery and 2 months after surgery, by making comparative voice maps, pre- and post-intervention, of six acoustic and electroglottographic (EGG) metrics.

Results: After thyroidectomy, statistically significant changes consistent with a worsening of voice quality were observed in most metrics. For all individual metrics, however, the effect sizes were too small to be clinically relevant. Statistical clustering of the metrics helped to clarify the nature of these changes. While partial thyroidectomy demonstrated greater uniformity than did total thyroidectomy, the type of perioperative damage had no discernible impact on voice quality. Conclusions: Changes in voice quality after thyroidectomy were related mostly to increased phonatory instability in both the acoustic and EGG metrics. Clustered voice metrics exhibited a higher correlation to voice complaints than did individual voice metrics.

Place, publisher, year, edition, pages
Elsevier, 2024
Keywords
thyroidectomy, voice quality, electroglottography, voice classification, voice mapping
National Category
Otorhinolaryngology
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-346224 (URN)
10.1016/j.jvoice.2024.03.012 (DOI)
2-s2.0-85192255370 (Scopus ID)
Funder
KTH Royal Institute of Technology, 6308
Note

QC 20240508

Available from: 2024-05-07. Created: 2024-05-07. Last updated: 2025-02-21. Bibliographically approved.
Engström, H., Włodarczak, M. & Ternström, S. (2024). Mapping the effect of body position: Voice quality differences in connected speech. In: Proceedings of FONETIK 2024, Stockholm, June 3–5, 2024. Paper presented at FONETIK 2024, Stockholm, June 3-5, 2024 (pp. 21-26). Stockholm University
Mapping the effect of body position: Voice quality differences in connected speech
2024 (English) In: Proceedings of FONETIK 2024, Stockholm, June 3–5, 2024, Stockholm University, 2024, p. 21-26. Conference paper, Published paper (Refereed)
Abstract [en]

This work investigates the effect of body position on voice quality, based on cepstral peak prominence (CPP) and spectrum balance (SB) metrics layered on a mapped speech range profile (SRP) across a sound pressure level (SPL) and fundamental frequency (fo) plane. Eight participants were tested in an upright position, a supine position at 0°, and an inverted position at -10°. Findings show varied and small changes in voice quality in connected speech between positions and that effects may occur at specific SPL and fo ranges among some participants.

Place, publisher, year, edition, pages
Stockholm University, 2024
Keywords
phonation, respiratory plethysmography, body position, electroglottography, voice analysis
National Category
General Language Studies and Linguistics
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-352421 (URN)
10.5281/zenodo.11396054 (DOI)
Conference
FONETIK 2024, Stockholm, June 3-5, 2024
Note

This conference paper is a summary of the lead author's Bachelor thesis, which can be found at https://www.diva-portal.org/smash/record.jsf?dswid=6218&pid=diva2%3A1768562

QC 20240902

Available from: 2024-09-01. Created: 2024-09-01. Last updated: 2024-09-02. Bibliographically approved.
Ternström, S. (2024). Pragmatic De-Noising of Electroglottographic Signals. Bioengineering, 11(5), 479
Pragmatic De-Noising of Electroglottographic Signals
2024 (English) In: Bioengineering, E-ISSN 2306-5354, Vol. 11, no 5, article id 479. Article in journal (Refereed), Published
Abstract [en]

In voice analysis, the electroglottographic (EGG) signal has long been recognized as a useful complement to the acoustic signal, but only when the vocal folds are actually contacting, such that this signal has an appreciable amplitude. However, phonation can also occur without the vocal folds contacting, as in breathy voice, in which case the EGG amplitude is low, but not zero. It is of great interest to identify the transition from non-contacting to contacting, because this will substantially change the nature of the vocal fold oscillations; however, that transition is not in itself audible. The magnitude of the cycle-normalized peak derivative of the EGG signal is a convenient indicator of vocal fold contacting, but no current EGG hardware has a sufficient signal-to-noise ratio of the derivative. We show how the textbook techniques of spectral thresholding and static notch filtering are straightforward to implement, can run in real time, and can mitigate several noise problems in EGG hardware. This can be useful to researchers in vocology.
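Of the two textbook techniques named in the abstract, static notch filtering is the simpler to sketch. Below is a self-contained second-order (biquad) notch, using the standard RBJ audio-EQ-cookbook coefficient formulas, applied to a synthetic 50 Hz hum; the sampling rate, Q, and hum frequency are illustrative choices, not the paper's actual settings.

```python
import math

# Biquad notch filter (RBJ cookbook form) for removing a known
# interference tone, e.g. mains hum, from a sampled EGG signal.

def notch_coeffs(f0, fs, q=5.0):
    """Normalized (b, a) coefficients for a notch centered at f0 Hz."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    b = [1.0, -2.0 * math.cos(w0), 1.0]
    a = [1.0 + alpha, -2.0 * math.cos(w0), 1.0 - alpha]
    return [bi / a[0] for bi in b], [ai / a[0] for ai in a]

def biquad(x, b, a):
    """Direct-form-I filtering of the sequence x."""
    y, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for xn in x:
        yn = b[0]*xn + b[1]*x1 + b[2]*x2 - a[1]*y1 - a[2]*y2
        y.append(yn)
        x1, x2, y1, y2 = xn, x1, yn, y1
    return y

fs = 8000.0
b, a = notch_coeffs(50.0, fs)                      # notch at 50 Hz
hum = [math.sin(2*math.pi*50.0*i/fs) for i in range(4000)]
out = biquad(hum, b, a)
# After the initial transient, the 50 Hz tone is strongly attenuated.
```

Being a short recursive filter, such a notch runs comfortably in real time, consistent with the abstract's point; spectral thresholding, the other technique named, instead zeroes low-magnitude spectral bins and is equally standard.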

Place, publisher, year, edition, pages
MDPI AG, 2024
Keywords
electroglottography, de-noising, contact quotient, peak dEGG, spectral thresholding, notch filtering
National Category
Medical Instrumentation; Signal Processing; Otorhinolaryngology
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-346291 (URN)
10.3390/bioengineering11050479 (DOI)
001233023700001 ()
2-s2.0-85194385596 (Scopus ID)
Funder
KTH Royal Institute of Technology, 6308
Note

QC 20240513

Available from: 2024-05-11. Created: 2024-05-11. Last updated: 2025-02-10. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0002-3362-7518