Publications (10 of 109)
Cai, H. & Ternström, S. (2025). A WaveNet-based model for predicting the electroglottographic signal from the acoustic voice signal. Journal of the Acoustical Society of America, 157(4), 3033-3044
2025 (English) In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 157, no 4, p. 3033-3044. Article in journal (Refereed) Published
Abstract [en]

The electroglottographic (EGG) signal offers a non-invasive approach to analyze phonation. It is known, if not obvious, that the onset of vocal fold contacting has a substantial effect on how the vocal folds vibrate and on the quality of the voice. Given that the presence or absence of vocal fold contacting has major consequences also for the interpretation of acoustic metrics, it is compelling to consider the possibility of predicting EGG signals directly from the microphone speech signal. This retrospective study presents a neural network model for EGG signal estimation utilizing a WaveNet architecture augmented with a self-attention mechanism. The model was trained on an existing dataset that comprehensively recorded participants' full voice range. The proposed model effectively captures the temporal dynamics and morphological characteristics of normophonic EGG waveforms, achieving outputs that closely resemble the ground truth in terms of EGG waveshape and extracted EGG metrics. For evaluation, voice mapping was used to display the distribution similarities of extracted metrics from predicted and ground truth EGG waveforms. The model exhibits proficiency in accurately estimating EGG signals in areas of stable and contacting voicing but displays reduced accuracy in transitional and breathy phonatory conditions.

Place, publisher, year, edition, pages
American Institute of Physics (AIP), 2025
Keywords
Phonetics, Vocalization, Vocal folds, Microphones, Speech analysis, Speech processing systems, Electroglottography, Acoustic signal processing, Artificial neural networks
National Category
Oto-rhino-laryngology Medical Instrumentation Signal Processing
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-362580 (URN); 10.1121/10.0036514 (DOI); 40249176 (PubMedID); 2-s2.0-105003174138 (Scopus ID)
Note

A precursor to this article was included in Huanchen Cai's doctoral thesis. This is the revised and accepted version.

QC 20250425

Available from: 2025-04-20 Created: 2025-04-20 Last updated: 2025-05-27. Bibliographically approved
Herbst, C. T., Tokuda, I. T., Nishimura, T., Ternström, S., Ossio, V., Levy, M., . . . Dunn, J. C. (2025). ‘Monkey yodels’—frequency jumps in New World monkey vocalizations greatly surpass human vocal register transitions. Philosophical Transactions of the Royal Society of London. Biological Sciences, 380(1923), Article ID 20240005.
2025 (English) In: Philosophical Transactions of the Royal Society of London. Biological Sciences, ISSN 0962-8436, E-ISSN 1471-2970, Vol. 380, no 1923, article id 20240005. Article in journal (Refereed) Published
Abstract [en]

We investigated the causal basis of abrupt frequency jumps in a unique database of New World monkey vocalizations. We used a combination of acoustic and electroglottographic recordings in vivo, excised larynx investigations of vocal fold dynamics, and computational modelling. We particularly attended to the contribution of the vocal membranes: thin upward extensions of the vocal folds found in most primates but absent in humans. In three of the six investigated species, we observed two distinct modes of vocal fold vibration. The first, involving vocal fold vibration alone, produced low-frequency oscillations, and is analogous to that underlying human phonation. The second, incorporating the vocal membranes, resulted in much higher-frequency oscillation. Abrupt fundamental frequency shifts were observed in all three datasets. While these data are reminiscent of the rapid transitions in frequency observed in certain human singing styles (e.g. yodelling), the frequency jumps are considerably larger in the nonhuman primates studied. Our data suggest that peripheral modifications of vocal anatomy provide an important source of variability and complexity in the vocal repertoires of nonhuman primates. We further propose that the call repertoire is crucially related to a species’ ability to vocalize with different laryngeal mechanisms, analogous to human vocal registers.

Place, publisher, year, edition, pages
The Royal Society, 2025
Keywords
vocal membrane, laryngeal mechanism, call repertoire, NLP vocalization, fundamental frequency control
National Category
Oto-rhino-laryngology Applied Mechanics Structural Biology
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-219581 (URN); 10.1098/rstb.2024.0005 (DOI); 001461623200021 (); 40176522 (PubMedID); 2-s2.0-105001836522 (Scopus ID)
Note

QC 20250520

Available from: 2025-04-03 Created: 2025-04-03 Last updated: 2025-05-20. Bibliographically approved
Ternström, S., Bernardoni, N. H., Birkholz, P., Guasch, O. & Gully, A. (Eds.). (2024). Computational Analysis and Simulation of the Human Voice (Dagstuhl Seminar 24242). Paper presented at Dagstuhl Seminar 24242. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 14(6)
2024 (English) Conference proceedings (editor) (Other academic)
Abstract [en]

This report documents the program and the outcomes of Dagstuhl Seminar 24242 "Computational Analysis and Simulation of the Human Voice", which was held from the 9th to the 14th of June, 2024. The seminar addressed key issues for a better understanding of the human voice by focusing on four main areas: voice analysis, visualisation techniques, simulation methods, and data analysis with machine learning. There has been enormous progress in recent years in all these fields. The seminar brought together a number of experts from fields as diverse as computer science, logopedics and phoniatrics, clinicians, acoustics and audio engineering, electronics, musicology, speech and hearing sciences, physics and mathematics. The schedule was quite flexible, including inspirational talks in the main areas, interactive working groups, sharing of conclusions and discussions, presentation of successes and failures to learn from, and a large number of free talks that emerged throughout the days. The variety of topics and participants created a highly enriching environment from which novel proposals for future research and collaboration emerged, as well as the collective writing of a paper on the state of the art and future perspectives in human voice research.

Place, publisher, year, edition, pages
Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024. p. 24
Series
Dagstuhl Reports, ISSN 2192-5283 ; 14
Keywords
voice analysis, voice simulation, voice visualization
National Category
Bioinformatics (Computational Biology) Other Computer and Information Science
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-357969 (URN); 10.4230/DagRep.14.6.84 (DOI)
Conference
Dagstuhl Seminar 24242
Note

QC 20250113

Available from: 2024-12-21 Created: 2024-12-21 Last updated: 2025-01-13. Bibliographically approved
Iob, N. A., He, L., Ternström, S., Cai, H. & Brockmann-Bauser, M. (2024). Effects of Speech Characteristics on Electroglottographic and Instrumental Acoustic Voice Analysis Metrics in Women With Structural Dysphonia Before and After Treatment. Journal of Speech, Language and Hearing Research, 67(6), 1660-1681
2024 (English) In: Journal of Speech, Language and Hearing Research, ISSN 1092-4388, E-ISSN 1558-9102, Vol. 67, no 6, p. 1660-1681. Article in journal (Refereed) Published
Abstract [en]

Purpose: Literature suggests a dependency of the acoustic metrics, smoothed cepstral peak prominence (CPPS) and harmonics-to-noise ratio (HNR), on human voice loudness and fundamental frequency (fo). Even though this has been explained with different oscillatory patterns of the vocal folds, so far, it has not been specifically investigated. In the present work, the influence of three elicitation levels, calibrated sound pressure level (SPL), fo and vowel on the electroglottographic (EGG) and time-differentiated EGG (dEGG) metrics hybrid open quotient (OQ), dEGG OQ and peak dEGG, as well as on the acoustic metrics CPPS and HNR, was examined, and their suitability for voice assessment was evaluated.

Method: In a retrospective study, 29 women with a mean age of 25 years (± 8.9, range: 18–53) diagnosed with structural vocal fold pathologies were examined before and after voice therapy or phonosurgery. Both acoustic and EGG signals were recorded simultaneously during the phonation of the sustained vowels /ɑ/, /i/, and /u/ at three elicited levels of loudness (soft/comfortable/loud) and unconstrained fo conditions.

Results: A linear mixed-model analysis showed a significant effect of elicitation effort levels on peak dEGG, HNR, and CPPS (all p < .01). Calibrated SPL significantly influenced HNR and CPPS (both p < .01). Furthermore, F0 had a significant effect on peak dEGG and CPPS (p < .0001). All metrics showed significant changes with regard to vowel (all p < .05). However, the treatment had no effect on the examined metrics, regardless of the treatment type (surgery vs. voice therapy).

Conclusions: The value of the investigated metrics for voice assessment purposes when sampled without sufficient control of SPL and fo is limited, in that they are significantly influenced by the phonatory context, be it speech or elicited sustained vowels. Future studies should explore the diagnostic value of new data collation approaches such as voice mapping, which take SPL and fo effects into account.

Place, publisher, year, edition, pages
American Speech Language Hearing Association, 2024
National Category
Otorhinolaryngology
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-346605 (URN); 10.1044/2024_JSLHR-23-00253 (DOI); 001245110000002 (); 38758676 (PubMedID); 2-s2.0-85192238446 (Scopus ID)
Note

QC 20240703

Available from: 2024-05-20 Created: 2024-05-20 Last updated: 2025-02-21. Bibliographically approved
Cai, H., Ternström, S., Chaffanjon, P. & Henrich Bernardoni, N. (2024). Effects on Voice Quality of Thyroidectomy: A Qualitative and Quantitative Study Using Voice Maps. Journal of Voice
2024 (English) In: Journal of Voice, ISSN 0892-1997, E-ISSN 1873-4588. Article in journal (Refereed) Epub ahead of print
Abstract [en]

Objectives: This study aims to explore the effects of thyroidectomy—a surgical intervention involving the removal of the thyroid gland—on voice quality, as represented by acoustic and electroglottographic measures. Given the thyroid gland's proximity to the inferior and superior laryngeal nerves, thyroidectomy carries a potential risk of affecting vocal function. While earlier studies have documented effects on the voice range, few studies have looked at voice quality after thyroidectomy. Since voice quality effects could manifest in many ways, that a priori are unknown, we wish to apply an exploratory approach that collects many data points from several metrics.

Methods: A voice-mapping analysis paradigm was applied retrospectively on a corpus of spoken and sung sentences produced by patients who had thyroid surgery. Voice quality changes were assessed objectively for 57 patients prior to surgery and 2 months after surgery, by making comparative voice maps, pre- and post-intervention, of six acoustic and electroglottographic (EGG) metrics.

Results: After thyroidectomy, statistically significant changes consistent with a worsening of voice quality were observed in most metrics. For all individual metrics, however, the effect sizes were too small to be clinically relevant. Statistical clustering of the metrics helped to clarify the nature of these changes. While partial thyroidectomy demonstrated greater uniformity than did total thyroidectomy, the type of perioperative damage had no discernible impact on voice quality.

Conclusions: Changes in voice quality after thyroidectomy were related mostly to increased phonatory instability in both the acoustic and EGG metrics. Clustered voice metrics exhibited a higher correlation to voice complaints than did individual voice metrics.

Place, publisher, year, edition, pages
Elsevier, 2024
Keywords
thyroidectomy, voice quality, electroglottography, voice classification, voice mapping
National Category
Otorhinolaryngology
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-346224 (URN); 10.1016/j.jvoice.2024.03.012 (DOI); 2-s2.0-85192255370 (Scopus ID)
Funder
KTH Royal Institute of Technology, 6308
Note

QC 20240508

Available from: 2024-05-07 Created: 2024-05-07 Last updated: 2025-02-21. Bibliographically approved
Engström, H., Włodarczak, M. & Ternström, S. (2024). Mapping the effect of body position: Voice quality differences in connected speech. In: Proceedings of FONETIK 2024, Stockholm, June 3-5, 2024. Paper presented at FONETIK 2024, Stockholm, June 3-5, 2024 (pp. 21-26). Stockholm University
2024 (English) In: Proceedings of FONETIK 2024, Stockholm, June 3-5, 2024, Stockholm University, 2024, p. 21-26. Conference paper, Published paper (Refereed)
Abstract [en]

This work investigates the effect of body position on voice quality, based on cepstral peak prominence (CPP) and spectrum balance (SB) metrics layered on a mapped speech range profile (SRP) across a sound pressure level (SPL) and fundamental frequency (fo) plane. Eight participants were tested in an upright position, supine position at 0° and an inverted position at -10°. Findings show varied and small changes in voice quality in connected speech between positions and that effects may occur at specific SPL and fo ranges among some participants.

Place, publisher, year, edition, pages
Stockholm University, 2024
Keywords
phonation, respiratory plethysmography, body position, electroglottography, voice analysis
National Category
General Language Studies and Linguistics
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-352421 (URN); 10.5281/zenodo.11396054 (DOI)
Conference
FONETIK 2024, Stockholm, June 3-5, 2024
Note

This conference paper is a summary of the lead author's Bachelor thesis, which can be found at https://www.diva-portal.org/smash/record.jsf?dswid=6218&pid=diva2%3A1768562

QC 20240902

Available from: 2024-09-01 Created: 2024-09-01 Last updated: 2024-09-02. Bibliographically approved
Ternström, S. (2024). Pragmatic De-Noising of Electroglottographic Signals. Bioengineering, 11(5), 479
2024 (English) In: Bioengineering, E-ISSN 2306-5354, Vol. 11, no 5, p. 479. Article in journal (Refereed) Published
Abstract [en]

In voice analysis, the electroglottographic (EGG) signal has long been recognized as a useful complement to the acoustic signal, but only when the vocal folds are actually contacting, such that this signal has an appreciable amplitude. However, phonation can also occur without the vocal folds contacting, as in breathy voice, in which case the EGG amplitude is low, but not zero. It is of great interest to identify the transition from non-contacting to contacting, because this will substantially change the nature of the vocal fold oscillations; however, that transition is not in itself audible. The magnitude of the cycle-normalized peak derivative of the EGG signal is a convenient indicator of vocal fold contacting, but no current EGG hardware has a sufficient signal-to-noise ratio of the derivative. We show how the textbook techniques of spectral thresholding and static notch filtering are straightforward to implement, can run in real time, and can mitigate several noise problems in EGG hardware. This can be useful to researchers in vocology.
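The two textbook techniques named in this abstract can be illustrated with a minimal sketch. It is not the article's implementation: the function names, the relative threshold, the frame-wise processing and the naive DFT are all assumptions made for the illustration, and a real-time system would use an FFT with overlapping windows.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2)); adequate for short frames."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft(), returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def denoise_frame(frame, rel_threshold=0.1, notch_bins=()):
    """Hypothetical helper: spectral thresholding plus a static notch.

    Bins whose magnitude is below rel_threshold times the largest bin
    magnitude are zeroed (thresholding). Bins listed in notch_bins, and
    their mirror images, are always zeroed (static notch, for instance
    for a fixed-frequency interference).
    """
    spectrum = dft(frame)
    n = len(spectrum)
    peak = max(abs(c) for c in spectrum)
    cleaned = []
    for k, c in enumerate(spectrum):
        mirrored = (n - k) % n
        in_notch = k in notch_bins or mirrored in notch_bins
        if in_notch or abs(c) < rel_threshold * peak:
            cleaned.append(0j)
        else:
            cleaned.append(c)
    return idft(cleaned)
```

As a usage example, a frame containing a strong component at bin 4, a weak component at bin 10 and an interfering tone at bin 20 would be reduced to the bin-4 component alone by `denoise_frame(frame, rel_threshold=0.1, notch_bins={20})`.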

Place, publisher, year, edition, pages
MDPI AG, 2024
Keywords
electroglottography, de-noising, contact quotient, peak dEGG, spectral thresholding, notch filtering
National Category
Medical Instrumentation Signal Processing Otorhinolaryngology
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-346291 (URN); 10.3390/bioengineering11050479 (DOI); 001233023700001 (); 2-s2.0-85194385596 (Scopus ID)
Funder
KTH Royal Institute of Technology, 6308
Note

QC 20240513

Available from: 2024-05-11 Created: 2024-05-11 Last updated: 2025-02-10. Bibliographically approved
Körner Gustafsson, J., Södersten, M., Ternström, S. & Schalling, E. (2024). Treatment of Hypophonia in Parkinson’s Disease Through Biofeedback in Daily Life Administered with A Portable Voice Accumulator. Journal of Voice, 38(3), 800.e27-800.e38
2024 (English) In: Journal of Voice, ISSN 0892-1997, E-ISSN 1873-4588, Vol. 38, no 3, p. 800.e27-800.e38. Article in journal (Refereed) Published
Abstract [en]

Objectives

The purpose of this study was to assess the outcome following continuous tactile biofeedback of voice sound level, administered with a portable voice accumulator to individuals with Parkinson's disease (PD).

Method

Nine out of 16 participants with PD completed a 4-week intervention program where biofeedback of voice sound level was administered with the portable voice accumulator VoxLog during speech in daily life. The feedback, a tactile vibration signal from the device, was activated when the wearer used a voice sound level below an individually predetermined threshold level, reminding the wearer to increase voice sound level during speech. Voice use was registered in daily life with the VoxLog during the intervention period as well as during one baseline week, one follow-up week post intervention and 1 week 3 months post intervention. Self-to-other ratio (SOR), which is the difference between voice sound level and environmental noise, was studied in multiple noise ranges.

Results

A significant increase in SOR across all noise ranges of 2.28 dB (SD: 0.55) was seen for participants with scores above the cut-off for normal function (>26 points) on the cognitive screening test Montreal Cognitive Assessment (MoCA) (n = 5). No significant increase was seen for the group of participants with MoCA scores below 26 (n = 4). Forty-four percent ended their participation early, all of whom scored below 26 on MoCA (n = 7).

Conclusions

Biofeedback administered in daily life regarding voice level may help individuals with PD to increase their voice sound level in relation to environmental noise in daily life, but only for a limited subset. Only participants with normal cognitive function as screened by MoCA improved their voice sound level in relation to environmental noise.
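The abstract defines the self-to-other ratio (SOR) as the difference between voice sound level and environmental noise, studied in multiple noise ranges. A minimal sketch of that arithmetic follows; the function names, the sample format and the noise-range boundaries are hypothetical and are not taken from the study or from the VoxLog device.

```python
from statistics import mean


def self_to_other_ratio(voice_db, noise_db):
    """SOR in dB: voice sound level minus environmental noise level."""
    return voice_db - noise_db


def mean_sor_by_noise_range(samples, edges):
    """Average SOR within each noise range.

    samples: iterable of (voice_db, noise_db) pairs.
    edges:   ascending noise-level boundaries in dB (assumed values);
             each range is half-open, low <= noise < high.
    """
    groups = {}
    for voice_db, noise_db in samples:
        for low, high in zip(edges, edges[1:]):
            if low <= noise_db < high:
                sor = self_to_other_ratio(voice_db, noise_db)
                groups.setdefault((low, high), []).append(sor)
                break
    return {noise_range: mean(values) for noise_range, values in groups.items()}
```

Grouping by noise range makes it possible to compare SOR before and after an intervention at comparable background noise levels, which is how the abstract reports its result.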

Place, publisher, year, edition, pages
Elsevier, 2024
Keywords
voice use, Parkinson's disease, portable voice accumulator, biofeedback
National Category
Neurology
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-307122 (URN); 10.1016/j.jvoice.2021.10.009 (DOI); 001239982200001 (); 34893384 (PubMedID); 2-s2.0-85121732742 (Scopus ID)
Funder
Swedish Association of Persons with Neurological Disabilities; Promobilia foundation
Note

QC 20240619

Available from: 2022-01-12 Created: 2022-01-12 Last updated: 2024-08-28. Bibliographically approved
Ternström, S. (2024). Update 3.1 to FonaDyn: A system for real-time analysis of the electroglottogram, over the voice range. SoftwareX, 26
2024 (English) In: SoftwareX, E-ISSN 2352-7110, Vol. 26. Article in journal (Refereed) Published
Abstract [en]

The human voice is notoriously variable, and conventional measurement paradigms are weak in terms of providing evidence for effects of treatment and/or training of voices. New methods are needed that can take into account the variability of metrics and types of phonation across the voice range. The “voice map” is a generalization of the Voice Range Profile (a.k.a. the phonetogram), with the potential to be used in many ways, for teaching, training, therapy and research. FonaDyn is intended as a proof-of-concept workbench for education and research on phonation, and for exploring and validating the analysis paradigm of voice-mapping. Version 3.1 of the FonaDyn system adds many new functions, including listening from maps; displaying multiple maps and difference maps to track effects of voice interventions; smoothing/interpolation of voice maps; clustering not only of EGG shapes but also of acoustic and EGG metrics into phonation types; extended multichannel acquisition; 24-bit recording with optional max 140 dB SPL; a built-in SPL calibration and signal diagnostics tool; EGG noise suppression; more Matlab integration; script control; the acoustic metrics Spectrum Balance, Cepstral Peak Prominence and Harmonic Richness Factor (of the EGG); and better window layout control. Stability and usability are further improved. Apple M-series processors are now supported natively.
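The idea of a voice map, a metric displayed over the plane of fundamental frequency and sound pressure level, can be sketched as a simple binning step. This is an illustration only and not FonaDyn code (FonaDyn is written in SuperCollider): the function names, the input format and the cell size of one semitone by one decibel are assumptions made here.

```python
import math
from collections import defaultdict


def midi_from_hz(fo_hz):
    """Convert frequency in Hz to a MIDI note number (A4 = 440 Hz = 69)."""
    return 69 + 12 * math.log2(fo_hz / 440.0)


def voice_map(cycles):
    """Average a metric per (semitone, dB) cell.

    cycles: iterable of (fo_hz, spl_db, metric_value) observations,
            for example one tuple per phonatory cycle.
    Returns a dict mapping (midi_note, spl_db) cells to the mean metric.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for fo_hz, spl_db, value in cycles:
        cell = (math.floor(midi_from_hz(fo_hz)), math.floor(spl_db))
        sums[cell] += value
        counts[cell] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}
```

Two maps built this way, for instance before and after an intervention, could be compared cell by cell over the cells they share, which is the idea behind the difference maps mentioned in the abstract.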

Place, publisher, year, edition, pages
Elsevier BV, 2024
Keywords
Voice mapping, Electroglottography, Real-time analysis, Voice range profile, Phonation types, Supercollider
National Category
Otorhinolaryngology Biomedical Laboratory Science/Technology Medical Laboratory Technologies
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-343437 (URN); 10.1016/j.softx.2024.101653 (DOI); 001187872400001 ()
Funder
Swedish Research Council, 2010-4565
Note

QC 20240214

Available from: 2024-02-14 Created: 2024-02-14 Last updated: 2025-02-09. Bibliographically approved
D'Amario, S., Ternström, S., Goebl, W. & Bishop, L. (2023). Body motion of choral singers. Frontiers in Psychology, 14
2023 (English) In: Frontiers in Psychology, E-ISSN 1664-1078, Vol. 14. Article in journal (Refereed) Published
Abstract [en]

Recent investigations on music performances have shown the relevance of singers’ body motion for pedagogical as well as performance purposes. However, little is known about how the perception of voice-matching or task complexity affects choristers’ body motion during ensemble singing. This study focussed on the body motion of choral singers who perform in duo along with a pre-recorded tune presented over a loudspeaker. Specifically, we examined the effects of the perception of voice-matching, operationalized in terms of sound spectral envelope, and task complexity on choristers’ body motion. Fifteen singers with advanced choral experience first manipulated the spectral components of a pre-recorded short tune composed for the study, by choosing the settings they felt most and least together with. Then, they performed the tune in unison (i.e., singing the same melody simultaneously) and in canon (i.e., singing the same melody but at a temporal delay) with the chosen filter settings. Motion data of the choristers’ upper body and audio of the repeated performances were collected and analyzed. Results show that the settings perceived as least together relate to extreme differences between the spectral components of the sound. The singers’ wrists and torso motion was more periodic, their upper body posture was more open, and their bodies were more distant from the music stand when singing in unison than in canon. These findings suggest that unison singing promotes an expressive-periodic motion of the upper body.

Place, publisher, year, edition, pages
Frontiers Media SA, 2023
Keywords
togetherness, ensemble singing, motion capture, joint-actions, music perception, flow, voice matching
National Category
Musicology
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-341573 (URN); 10.3389/fpsyg.2023.1220904 (DOI); 001136436500001 (); 2-s2.0-85181732914 (Scopus ID)
Funder
EU, Horizon 2020, 101108755
Note

QC 20231228

Available from: 2023-12-22 Created: 2023-12-22 Last updated: 2024-03-18. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0002-3362-7518