Automatic speech recognition (ASR) models are commonly trained with an objective function that differs from the criteria used to evaluate them. For example, models trained with connectionist temporal classification (CTC) loss maximize the likelihood of the output symbols, yet they are evaluated with word error rate (WER). Direct optimization to minimize WER has been proposed to close this gap. However, we argue that a better approach is to optimize a semantics-based metric instead, such as Aligned Semantic Distance (ASD). In this work, we propose a joint loss function combining ASD and CTC to fine-tune a wav2vec2 model for speech recognition. We define the ASD loss as the expected ASD score over the N-best hypotheses produced by the model. Our results show that this approach yields about a 3% relative improvement in both ASD score and WER. Moreover, we analyze the errors by examining their distribution across part-of-speech categories. Finally, we demonstrate that training with the proposed loss function improves the model's performance on downstream NLP tasks.
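The expectation over N-best hypotheses mentioned above can be sketched as follows; this is a minimal illustration, not the paper's implementation, and the `asd` scoring function, argument names, and posterior-weighting scheme are assumptions for the example.

```python
import math

def expected_asd_loss(hypotheses, log_probs, reference, asd):
    """Expected ASD over N-best hypotheses, weighted by normalized posteriors.

    hypotheses : list of N-best hypothesis strings from the ASR model
    log_probs  : log-probability the model assigns to each hypothesis
    reference  : ground-truth transcript
    asd        : placeholder callable returning an ASD score (assumed here)
    """
    # Softmax over hypothesis log-probabilities (max-subtraction for stability)
    m = max(log_probs)
    weights = [math.exp(lp - m) for lp in log_probs]
    z = sum(weights)
    # Posterior-weighted average of per-hypothesis ASD scores
    return sum((w / z) * asd(h, reference) for w, h in zip(weights, hypotheses))
```

In practice the ASD score between a hypothesis and the reference would come from an alignment over semantic embeddings; the sketch only shows how per-hypothesis scores could be combined into a single differentiable-style expectation.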