Speech Data Augmentation for Improving Phoneme Transcriptions of Aphasic Speech Using Wav2Vec 2.0 for the PSST Challenge
Moell, Birger; O'Regan, Jim; Mehta, Shivam; Kirkland, Ambika; Lameris, Harm; Gustafsson, Joakim; Beskow, Jonas
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
2022 (English). In: The RaPID4 Workshop: Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments / [ed] Dimitrios Kokkinakis, Charalambos K. Themistocleous, Kristina Lundholm Fors, Athanasios Tsanas, Kathleen C. Fraser, Marseille, France, 2022, p. 62-70. Conference paper, Published paper (Refereed).
Abstract [en]

As part of the PSST challenge, we explore how data augmentations, data sources, and model size affect phoneme transcription accuracy on speech produced by individuals with aphasia. We evaluate model performance in terms of feature error rate (FER) and phoneme error rate (PER). We find that data augmentation techniques, such as pitch shift, improve model performance. Additionally, increasing the size of the model decreases FER and PER. Our experiments also show that adding manually transcribed speech from non-aphasic speakers (TIMIT) improves performance when Room Impulse Response is used to augment the data. The best-performing model combines aphasic and non-aphasic data and achieves a 21.0% PER and a 9.2% FER, a relative improvement of 9.8% over the baseline model on the primary outcome measure. We show that data augmentation, larger model size, and additional non-aphasic data sources can help improve automatic phoneme recognition models for people with aphasia.
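
To make the methods named above concrete, the sketch below illustrates (a) the two signal-level augmentations the abstract mentions, pitch shift and Room Impulse Response (RIR) convolution, and (b) phoneme error rate computed as a length-normalised Levenshtein edit distance over phoneme sequences. This is a minimal sketch only, not the paper's pipeline: torchaudio, the file names, and the per helper are assumptions made for illustration.

import torch
import torchaudio
import torchaudio.functional as AF

# Illustrative augmentations (assumed tooling; not the authors' code).
waveform, sr = torchaudio.load("utterance.wav")   # hypothetical aphasic utterance

# Pitch shift: transpose up two semitones while preserving duration.
shifted = AF.pitch_shift(waveform, sr, n_steps=2)

# RIR augmentation: convolve the speech with a room impulse response
# to simulate reverberant recording conditions.
rir, _ = torchaudio.load("room_response.wav")     # hypothetical measured RIR
rir = rir / torch.linalg.vector_norm(rir, ord=2)  # energy-normalise the RIR
reverberant = AF.fftconvolve(waveform, rir)[:, : waveform.size(1)]

# Phoneme error rate: edit distance between reference and hypothesis
# phoneme sequences, divided by the reference length.
def per(ref: list[str], hyp: list[str]) -> float:
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, sub)
    return d[len(ref)][len(hyp)] / len(ref)

print(per(["h", "@", "l", "oU"], ["h", "@", "l", "l", "oU"]))  # 0.25: one insertion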

Place, publisher, year, edition, pages
Marseille, France, 2022. p. 62-70
Keywords [en]
aphasia, data augmentation, phoneme transcription, phonemes, speech, speech data augmentation, wav2vec 2.0
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Speech and Music Communication
Identifiers
URN: urn:nbn:se:kth:diva-314262
Scopus ID: 2-s2.0-85145876107
OAI: oai:DiVA.org:kth-314262
DiVA, id: diva2:1671565
Conference
4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments, RAPID 2022, Marseille, France, Jun 25 2022
Note

QC 20220815

Available from: 2022-06-17. Created: 2022-06-17. Last updated: 2025-10-17. Bibliographically approved.
In thesis
1. Evaluation of Artificial Intelligence in the Medical Domain: Speech, Language and Applications
2025 (English). Doctoral thesis, comprehensive summary (Other academic).
Abstract [en]

This doctoral thesis investigates the potential of advanced speech and language technologies, driven by deep learning, to improve clinical diagnostics and patient care, primarily within the Swedish healthcare context. The research encompasses eight key papers, which are presented across three main sections:

(1) Data Capture and Machine Learning for Speech: This section explores the use of multimodal data and advanced speech processing techniques for clinical applications. It includes research on utilizing multimodal data capture (speech, gaze, and digital pen input) from clinical interviews to identify potential digital biomarkers for the early detection and differentiation of dementia (Paper A). It also develops an automated deep learning system to evaluate the oral diadochokinesis test for motor speech disorders, which demonstrates higher accuracy than human raters and proposes a human-in-the-loop clinical interface (Paper B). Furthermore, this section evaluates the performance of Automatic Speech Recognition (ASR) systems, comparing word error rates between native (L1) and non-native (L2) Swedish speakers (Paper C), and investigates data augmentation techniques to improve ASR accuracy for individuals with aphasia, demonstrating a path towards more inclusive technology (Paper D).

(2) Evaluation of LLMs in the Medical Domain: This section focuses on establishing robust methods for assessing Large Language Models (LLMs) within a medical context. It details the development of a specialized Swedish Medical LLM Benchmark, comprising over 2600 questions across various medical domains, designed to assess LLM performance in a clinically relevant, language-specific manner (Paper E). Additionally, the medical reasoning capabilities of LLMs, such as DeepSeek R1, are rigorously assessed, focusing on their capacity for general medical diagnostic reasoning (Paper F).

(3) Application and Best Practice for Working with AI in Healthcare: This section addresses the practical, ethical, and user experience (UX) considerations for implementing AI in healthcare. It proposes a novel user interface paradigm through an AI-powered journaling application designed for personal health management, illustrating a low-risk, user-centric approach to AI integration (Paper G). Complementing this, it develops harm reduction strategies for the thoughtful use of LLMs in the medical domain, providing perspectives for both patients and clinicians to maximize utility while mitigating risks, thereby establishing best practices for responsible AI engagement (Paper H).

Collectively, this work advances the field by providing new tools and methodologies for early disease detection using speech and multimodal data, establishing robust evaluation methods for ASR and LLMs in the medical domain, and offering pathways and frameworks for responsible, user-centered, and effective AI implementation in healthcare.
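
As a point of reference for the error-rate comparison described for Paper C: word error rate (WER) counts word-level substitutions, deletions, and insertions, divided by the number of words in the reference transcript. A minimal sketch, assuming the jiwer package and invented transcripts (the thesis's actual evaluation tooling is not specified in this record):

import jiwer

# Hypothetical reference transcript and ASR output (invented examples).
reference = "patienten har ont i huvudet"
hypothesis = "patienten har ond i huvet"

# WER = (substitutions + deletions + insertions) / reference word count.
print(jiwer.wer(reference, hypothesis))  # 0.4: two of five words wrong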

Abstract [sv]

This doctoral thesis investigates the potential of advanced speech and language technologies, driven by deep learning, to improve clinical diagnostics and patient care, primarily within Swedish healthcare. The research comprises eight key papers, presented in three main sections:

(1) Data capture and machine learning for speech: This section explores the use of multimodal data and advanced speech processing techniques for clinical applications. It includes research on multimodal data capture from clinical interviews to identify digital biomarkers for dementia (Paper A). It further develops an automated deep learning system for evaluating the oral diadochokinesis test in motor speech disorders, which shows higher accuracy than human raters and proposes a human-in-the-loop clinical interface (Paper B). The section also evaluates the performance of Automatic Speech Recognition (ASR) systems by comparing error rates between speakers with Swedish as a first versus second language (Paper C), and investigates data augmentation techniques to improve ASR accuracy for people with aphasia (Paper D).

(2) Evaluation of large language models (LLMs) in the medical domain: This section focuses on establishing robust methods for assessing large language models (LLMs) in a medical context. It describes the development of a specialized Swedish medical LLM benchmark, consisting of over 2600 questions across various medical domains, intended to evaluate LLM performance in a clinically relevant and language-specific manner (Paper E). In addition, the medical reasoning ability of LLMs such as DeepSeek R1 is carefully assessed, with a focus on their capacity for general medical diagnostic reasoning (Paper F).

(3) Applications and best practice for AI in healthcare: This section addresses practical, ethical, and user experience (UX) considerations when implementing AI in healthcare. A novel user interface paradigm is proposed through an AI-driven application for keeping a personal health journal; it is designed for personal health management and illustrates a low-risk, user-centered strategy for AI integration (Paper G). As a complement, harm reduction strategies are developed for the thoughtful use of LLMs in the medical domain. These strategies offer perspectives for both patients and clinicians to maximize benefit while minimizing risk, thereby establishing best practice for responsible AI engagement (Paper H).

Overall, this work contributes to the research field by providing new tools and methods for early disease detection using speech and multimodal data, establishing robust evaluation methods for ASR and LLMs in the medical domain, and offering guidance and frameworks for responsible, user-centered, and effective implementation of AI in healthcare.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2025. p. xxi, 82
Series
TRITA-EECS-AVL ; 2025:83
Keywords
Large Language Models (LLMs), Automatic Speech Recognition (ASR), Neurodegenerative Disorders, Swedish Language, Clinical Diagnostics, AI Ethics, Medical Reasoning, Multimodal Data, speech and language technology, machine learning, deep learning, medical diagnostics, digital biomarkers, aphasia, dementia, healthcare, user experience (UX), harm reduction, AI integration
National Category
Artificial Intelligence
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-371738 (URN)
978-91-8106-404-9 (ISBN)
Public defence
2025-12-12, https://kth-se.zoom.us/j/69936124469, Kollegiesalen, Brinellvägen 8, Stockholm, 13:00 (English)
Note

QC 20251022

Available from: 2025-10-22. Created: 2025-10-17. Last updated: 2025-11-13. Bibliographically approved.

Open Access in DiVA

fulltext (889 kB)
File information
File name: FULLTEXT01.pdf
File size: 889 kB
Checksum (SHA-512): 29a771b08606cd23c18f3553581acd2e2497e739586585a022399f027f7d6b6fbbe845446fcfb25c52f387607664ad4850a7680b7fe7a21206ff319bf3882e49
Type: fulltext
Mimetype: application/pdf

Scopus

Authority records

Moell, Birger; O'Regan, Jim; Mehta, Shivam; Kirkland, Ambika; Lameris, Harm; Gustafsson, Joakim; Beskow, Jonas
