Optimizing ASR Models with Semantic Information
NTNU, Dept Elect Syst, Trondheim, Norway.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. NTNU, Dept Elect Syst, Trondheim, Norway; KTH Royal Inst Technol, EECS, Stockholm, Sweden. ORCID iD: 0000-0002-3323-5311
NTNU, Dept Elect Syst, Trondheim, Norway.
2026 (English). In: TEXT, SPEECH, AND DIALOGUE, TSD 2025, PT I / [ed] Ekstein, K; Konopik, M; Prazak, O; Partl, F. Springer Nature, 2026, Vol. 16029, p. 25-35. Conference paper, Published paper (Refereed)
Abstract [en]

ASR models are commonly trained with an objective function different from the evaluation criteria. For example, ASR models trained with CTC loss maximize the likelihood of the output symbols, but these models are evaluated using WER. Direct optimization to minimize WER has been proposed to alleviate this gap. However, we believe a better approach is to utilize a semantic-based metric instead, such as Aligned Semantic Distance (ASD). In this work, we propose a joint loss function using ASD and CTC to fine-tune a wav2vec2 model for speech recognition. We define ASD loss as the expected ASD score over the N-best hypotheses from the model. Our results show that our approach achieves about 3% relative improvement on ASD scores and WERs. Moreover, we analyze the errors by looking at their distribution with respect to part-of-speech types. Finally, we demonstrate that training with our proposed loss function improves the model's performance on downstream NLP tasks.
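
The abstract describes the ASD loss as an expected ASD score over the N-best hypotheses, combined with CTC in a joint loss. The following is a minimal sketch of that idea, not the authors' implementation. It assumes that hypothesis probabilities are renormalized over the N-best list and that the two terms are combined with an interpolation weight `alpha`; neither detail is stated in the abstract. The function names and the numeric ASD values are hypothetical, and computing ASD itself is not shown.

```python
import math


def expected_risk(log_probs, risks):
    """Expected risk over an N-best list.

    log_probs: model log-scores of the N hypotheses (assumed input).
    risks: per-hypothesis risk values, e.g. ASD scores against the reference.
    Scores are renormalized over the list (an assumption, see lead-in).
    """
    if len(log_probs) != len(risks) or not log_probs:
        raise ValueError("log_probs and risks must be non-empty and equal in length")
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_probs)
    weights = [math.exp(lp - top) for lp in log_probs]
    total = sum(weights)
    return sum((w / total) * r for w, r in zip(weights, risks))


def joint_loss(ctc_loss, log_probs, asd_scores, alpha=0.5):
    """Hypothetical joint loss: weighted sum of CTC loss and expected ASD.

    alpha is an assumed interpolation weight, not a value from the paper.
    """
    return alpha * ctc_loss + (1.0 - alpha) * expected_risk(log_probs, asd_scores)


if __name__ == "__main__":
    # Two hypotheses with made-up log-scores and made-up ASD values.
    n_best_log_probs = [math.log(0.6), math.log(0.2)]
    n_best_asd = [0.1, 0.5]
    print(expected_risk(n_best_log_probs, n_best_asd))
    print(joint_loss(2.0, n_best_log_probs, n_best_asd, alpha=0.5))
```

In an actual fine-tuning setup the same computation would be expressed with differentiable tensor operations so that gradients flow through the hypothesis probabilities; the plain-Python version above only illustrates the arithmetic.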

Place, publisher, year, edition, pages
Springer Nature, 2026. Vol. 16029, p. 25-35
Series
Lecture Notes in Artificial Intelligence, ISSN 2945-9133
Keywords [en]
speech recognition, semantic context
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-375661
DOI: 10.1007/978-3-032-02548-7_3
ISI: 001576343000003
Scopus ID: 2-s2.0-105014445412
ISBN: 978-3-032-02547-0 (print)
ISBN: 978-3-032-02548-7 (print)
OAI: oai:DiVA.org:kth-375661
DiVA, id: diva2:2029104
Conference
28th International Conference on Text Speech and Dialogue-TSD-Annual, AUG 25-28, 2025, Erlangen, GERMANY
Note

QC 20260116

Available from: 2026-01-16. Created: 2026-01-16. Last updated: 2026-01-16. Bibliographically approved

Open Access in DiVA

No full text in DiVA

Authority records

Salvi, Giampiero