A WaveNet-based model for predicting the electroglottographic signal from the acoustic voice signal
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. (Music Acoustics) ORCID iD: 0000-0003-0700-7216
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology. ORCID iD: 0000-0002-3362-7518
2025 (English). In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 157, no 4, p. 3033-3044. Article in journal (Refereed). Published
Abstract [en]

The electroglottographic (EGG) signal offers a non-invasive approach to analyze phonation. It is known, if not obvious, that the onset of vocal fold contacting has a substantial effect on how the vocal folds vibrate and on the quality of the voice. Given that the presence or absence of vocal fold contacting has major consequences also for the interpretation of acoustic metrics, it is compelling to consider the possibility of predicting EGG signals directly from the microphone speech signal. This retrospective study presents a neural network model for EGG signal estimation utilizing a WaveNet architecture augmented with a self-attention mechanism. The model was trained on an existing dataset that comprehensively recorded participants' full voice range. The proposed model effectively captures the temporal dynamics and morphological characteristics of normophonic EGG waveforms, achieving outputs that closely resemble the ground truth in terms of EGG waveshape and extracted EGG metrics. For evaluation, voice mapping was used to display the distribution similarities of extracted metrics from predicted and ground truth EGG waveforms. The model exhibits proficiency in accurately estimating EGG signals in areas of stable and contacting voicing but displays reduced accuracy in transitional and breathy phonatory conditions.

Place, publisher, year, edition, pages
American Institute of Physics (AIP), 2025. Vol. 157, no 4, p. 3033-3044
Keywords [en]
Phonetics, Vocalization, Vocal folds, Microphones, Speech analysis, Speech processing systems, Electroglottography, Acoustic signal processing, Artificial neural networks
National Category
Oto-rhino-laryngology; Medical Instrumentation; Signal Processing
Research subject
Speech and Music Communication
Identifiers
URN: urn:nbn:se:kth:diva-362580; DOI: 10.1121/10.0036514; PubMedID: 40249176; OAI: oai:DiVA.org:kth-362580; DiVA, id: diva2:1953319
Note

A precursor to this article was included in Huanchen Cai's doctoral thesis. This is the revised and accepted version.

QC 20250425

Available from: 2025-04-20 Created: 2025-04-20 Last updated: 2025-04-25 Bibliographically approved

Open Access in DiVA

No full text in DiVA


Authority records

Cai, Huanchen; Ternström, Sten
