kth.se Publications
Publications (10 of 88)
Cao, X., Fan, Z., Svendsen, T. & Salvi, G. (2024). A Framework for Phoneme-Level Pronunciation Assessment Using CTC. In: Interspeech 2024. Paper presented at the 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 302-306). International Speech Communication Association
A Framework for Phoneme-Level Pronunciation Assessment Using CTC
2024 (English) In: Interspeech 2024, International Speech Communication Association, 2024, p. 302-306. Conference paper, Published paper (Refereed)
Abstract [en]

Traditional phoneme-level goodness of pronunciation (GOP) methods require phoneme-to-speech alignment. The drawback is that these methods are, by definition, prone to alignment errors and preclude the possibility of deletion and insertion errors in pronunciation. We produce experimental evidence that CTC-based methods can be used in traditional GOP estimation in spite of their “peaky” output behaviour and may be less prone to alignment errors than traditional methods. We also propose a new framework for GOP estimation based on a CTC-trained model that is independent of speech-phoneme alignment. By accounting for deletions and insertions as well as substitution errors, we show that our framework outperforms alignment-based methods. Our experimental results are based on the CMU Kids dataset for child speech and on Speechocean762, which contains both child and adult speakers. Our best method achieves a 29.02% relative improvement over the baseline GOP methods.
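For context, the alignment-based GOP score that this line of work builds on is usually defined, following Witt and Young, as a duration-normalized log posterior of the canonical phone over its aligned segment. A sketch in standard notation (not the paper's own), where O^(p) is the acoustic segment aligned to phone p, NF(p) its number of frames, and Q the phone inventory:

```latex
\mathrm{GOP}(p) = \frac{1}{NF(p)}
  \log \frac{P\left(O^{(p)} \mid p\right) P(p)}
            {\max_{q \in Q} P\left(O^{(p)} \mid q\right) P(q)}
```

The dependence on the aligned segment O^(p) is exactly what makes the score sensitive to alignment errors and blind to deletions and insertions, which the alignment-free framework above avoids.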

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
child speech, CTC, end-to-end, goodness of pronunciation, pronunciation assessment
National Category
Computer Sciences, Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-358874 (URN), 10.21437/Interspeech.2024-459 (DOI), 2-s2.0-85214811904 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250127

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-01-27. Bibliographically approved.
Kynych, F., Cerva, P., Zdansky, J., Svendsen, T. & Salvi, G. (2024). A lightweight approach to real-time speaker diarization: from audio toward audio-visual data streams. EURASIP Journal on Audio, Speech, and Music Processing, 2024(1), Article ID 62.
A lightweight approach to real-time speaker diarization: from audio toward audio-visual data streams
2024 (English) In: EURASIP Journal on Audio, Speech, and Music Processing, ISSN 1687-4714, E-ISSN 1687-4722, Vol. 2024, no 1, article id 62. Article in journal (Refereed), Published
Abstract [en]

This manuscript deals with the task of real-time speaker diarization (SD) for stream-wise data processing. In contrast to most existing papers, it therefore considers not only the accuracy but also the computational demands of the investigated methods. We first propose a new lightweight scheme allowing us to perform speaker diarization of streamed audio data. Our approach utilizes a modified residual network with squeeze-and-excitation blocks (SE-ResNet-34) to extract speaker embeddings in an optimized way using cached buffers. These embeddings are subsequently used for voice activity detection (VAD) and block-online k-means clustering with a look-ahead mechanism. The described scheme yields results similar to the reference offline system while operating solely on a CPU with a low real-time factor (RTF) below 0.1 and a constant latency of around 5.5 s. In the next part of the work, our research moves toward the much more demanding and complex real-time processing of audio-visual data streams. For this purpose, we extend the above-mentioned scheme for audio data processing by adding an audio-video module. This module utilizes SyncNet combined with visual embeddings for identity tracking. Our resulting multi-modal SD framework then combines the outputs from the audio and audio-video modules using a new overlap-based fusion strategy. It yields diarization error rates that are competitive with existing state-of-the-art offline audio-visual methods while allowing us to process various audio-video streams, e.g., from Internet or TV broadcasts, in real time on a GPU and with the same latency as for audio-only processing.
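As a rough, hedged sketch of the block-online clustering idea described above (not the authors' implementation; the cosine threshold and running-mean update are assumptions, and the paper's look-ahead buffering is omitted):

```python
import numpy as np

def assign_block_online(embedding, centroids, counts, threshold=0.6):
    """Assign a speaker embedding to the nearest centroid by cosine
    similarity, or open a new cluster when no centroid is close enough.
    Centroids are updated as running means. Illustrative only."""
    if centroids:
        sims = [float(np.dot(embedding, c) /
                      (np.linalg.norm(embedding) * np.linalg.norm(c)))
                for c in centroids]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            counts[best] += 1
            # incremental mean update of the matched centroid
            centroids[best] += (embedding - centroids[best]) / counts[best]
            return best
    # no centroid is similar enough: start a new speaker cluster
    centroids.append(embedding.astype(float).copy())
    counts.append(1)
    return len(centroids) - 1
```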

Place, publisher, year, edition, pages
Springer Nature, 2024
Keywords
Speaker diarization, Streamed data processing, Multi-modal, Audio-visual, Deep learning
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-357545 (URN), 10.1186/s13636-024-00382-2 (DOI), 001365828000001 (ISI), 2-s2.0-85210595217 (Scopus ID)
Note

QC 20241209

Available from: 2024-12-09 Created: 2024-12-09 Last updated: 2024-12-09. Bibliographically approved.
Adiban, M., Siniscalchi, S. M. & Salvi, G. (2023). A step-by-step training method for multi generator GANs with application to anomaly detection and cybersecurity. Neurocomputing, 537, 296-308
A step-by-step training method for multi generator GANs with application to anomaly detection and cybersecurity
2023 (English) In: Neurocomputing, ISSN 0925-2312, E-ISSN 1872-8286, Vol. 537, p. 296-308. Article in journal (Refereed), Published
Abstract [en]

Cyber-attack detection and anomaly detection are problems where the data is often highly unbalanced towards normal observations. Furthermore, the anomalies observed in real applications may be significantly different from the ones contained in the training data. It is, therefore, desirable to study methods that are able to detect anomalies based only on the distribution of the normal data. To address this problem, we propose a novel objective function for generative adversarial networks (GANs), referred to as STEP-GAN. STEP-GAN simulates the distribution of possible anomalies by learning a modified version of the distribution of the task-specific normal data. It leverages multiple generators in a step-by-step interaction with a discriminator in order to capture different modes in the data distribution. The discriminator is optimized to distinguish not only between normal data and anomalies but also between the different generators, thus encouraging each generator to model a different mode in the distribution. This considerably reduces the well-known mode collapse problem in GAN models. We tested our method in the areas of power systems and network traffic control systems (NTCSs) using two publicly available, highly imbalanced datasets: the ICS (Industrial Control System) security dataset and UNSW-NB15, respectively. In both application domains, STEP-GAN outperforms the state-of-the-art systems as well as the two baseline systems we implemented as a comparison. In order to assess the generality of our model, additional experiments were carried out on seven real-world numerical datasets for anomaly detection in a variety of domains. In all datasets, the number of normal samples greatly exceeds that of abnormal samples. Experimental results show that STEP-GAN outperforms several semi-supervised methods while being competitive with supervised methods.
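One hedged way to picture the multi-generator interaction (an illustrative objective, not the exact STEP-GAN loss): treat the discriminator as a (K+1)-way classifier that assigns real data to class 0 and samples from generator G_k to class k, so each generator is pushed away from the others and toward its own mode:

```latex
% Illustrative (K+1)-class discriminator objective; D(x) outputs a
% distribution over classes 0..K. Not the paper's exact formulation.
\max_{D} \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log D_0(x)\right]
 + \sum_{k=1}^{K} \mathbb{E}_{z \sim p_z}\left[\log D_k\big(G_k(z)\big)\right]
```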

Place, publisher, year, edition, pages
Elsevier BV, 2023
Keywords
Anomaly detection, One-class classification, GAN, Mode collapse, Cyber security
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-327437 (URN), 10.1016/j.neucom.2023.03.056 (DOI), 000978367500001 (ISI), 2-s2.0-85151669864 (Scopus ID)
Note

QC 20230529

Available from: 2023-05-29 Created: 2023-05-29 Last updated: 2023-05-29. Bibliographically approved.
Cao, X., Fan, Z., Svendsen, T. & Salvi, G. (2023). An Analysis of Goodness of Pronunciation for Child Speech. In: Interspeech 2023. Paper presented at the 24th Annual Conference of the International Speech Communication Association (Interspeech 2023), August 20-24, 2023, Dublin, Ireland (pp. 4613-4617). International Speech Communication Association
An Analysis of Goodness of Pronunciation for Child Speech
2023 (English) In: Interspeech 2023, International Speech Communication Association, 2023, p. 4613-4617. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper, we study the use of goodness of pronunciation (GOP) scores for child speech. We first compare the distributions of GOP scores on several open datasets representing various dimensions of speech variability. We show that the GOP distribution over CMU Kids, corresponding to young age, has a larger spread than those over datasets representing other dimensions, i.e., accent, dialect, spontaneity, and environmental conditions. We hypothesize that the increased variability of pronunciation at a young age may impair the use of traditional mispronunciation detection methods for children. To support this hypothesis, we perform simulated mispronunciation experiments for both children and adults using different variants of the GOP algorithm. We also compare the results to real mispronunciations by native children, showing that GOP is less effective for child speech than for adult speech.
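The simulated mispronunciation experiments mentioned above can be pictured roughly as follows: corrupt the canonical phone sequence with random substitutions and test whether the GOP variants flag the corrupted positions. A sketch only; the function name, substitution rate, and uniform sampling of replacement phones are illustrative assumptions:

```python
import random

def simulate_substitutions(phones, phone_set, rate=0.2, seed=0):
    """Return a corrupted phone sequence plus the indices that were
    substituted, mimicking mispronunciations at a given rate."""
    rng = random.Random(seed)
    corrupted, changed = list(phones), []
    for i, p in enumerate(phones):
        if rng.random() < rate:
            # replace the canonical phone with a different random phone
            corrupted[i] = rng.choice([q for q in phone_set if q != p])
            changed.append(i)
    return corrupted, changed
```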

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
ASR, child speech, data scarcity, GOP, mispronunciation detection and diagnosis, speech assessment
National Category
Computer Sciences, Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-337872 (URN), 10.21437/Interspeech.2023-743 (DOI), 001186650304155 (ISI), 2-s2.0-85171580096 (Scopus ID)
Conference
24th Annual Conference of the International Speech Communication Association (Interspeech 2023), August 20-24, 2023, Dublin, Ireland
Note

QC 20241011

Available from: 2023-10-10 Created: 2023-10-10 Last updated: 2025-02-01. Bibliographically approved.
Stenwig, E., Salvi, G., Rossi, P. S. & Skjaervold, N. K. (2023). Comparison of correctly and incorrectly classified patients for in-hospital mortality prediction in the intensive care unit. BMC Medical Research Methodology, 23(1), Article ID 102.
Comparison of correctly and incorrectly classified patients for in-hospital mortality prediction in the intensive care unit
2023 (English) In: BMC Medical Research Methodology, E-ISSN 1471-2288, Vol. 23, no 1, article id 102. Article in journal (Refereed), Published
Abstract [en]

Background

The use of machine learning is becoming increasingly popular in many disciplines, but there is still an implementation gap of machine learning models in clinical settings. Lack of trust in models is one of the issues that need to be addressed in an effort to close this gap. No models are perfect, and it is crucial to know in which use cases we can trust a model and for which cases it is less reliable.

Methods

Four different algorithms are trained on the eICU Collaborative Research Database using similar features as the APACHE IV severity-of-disease scoring system to predict hospital mortality in the ICU. The training and testing procedure is repeated 100 times on the same dataset to investigate whether predictions for single patients change with small changes in the models. Features are then analysed separately to investigate potential differences between patients consistently classified correctly and incorrectly.

Results

A total of 34 056 patients (58.4%) are classified as true negatives, 6 527 patients (11.3%) as false positives, 3 984 patients (6.8%) as true positives, and 546 patients (0.9%) as false negatives. The remaining 13 108 patients (22.5%) are inconsistently classified across models and rounds. Histograms and distributions of feature values are compared visually to investigate differences between groups.

Conclusions

It is impossible to distinguish the groups using single features alone. Considering a combination of features, the difference between the groups is clearer. Incorrectly classified patients have features more similar to patients with the same prediction rather than the same outcome.
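A minimal sketch of the grouping logic implied by the Methods and Results above — split patients by whether the 100 repeated runs agree, and only then by correctness. Names and the array layout are assumptions, not the paper's code:

```python
import numpy as np

def consistency_groups(y_true, predictions):
    """predictions: (n_runs, n_patients) array of 0/1 model outputs over
    repeated train/test rounds. Patients on whom all runs agree fall into
    the four confusion-matrix groups; the rest are 'inconsistent'."""
    y_true = np.asarray(y_true)
    predictions = np.asarray(predictions)
    consistent = (predictions == predictions[0]).all(axis=0)
    pred = predictions[0]  # the shared prediction where runs agree
    groups = {
        "true_negative":  consistent & (pred == 0) & (y_true == 0),
        "false_positive": consistent & (pred == 1) & (y_true == 0),
        "true_positive":  consistent & (pred == 1) & (y_true == 1),
        "false_negative": consistent & (pred == 0) & (y_true == 1),
        "inconsistent":   ~consistent,
    }
    return {name: int(mask.sum()) for name, mask in groups.items()}
```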

Place, publisher, year, edition, pages
Springer Nature, 2023
Keywords
Machine learning, Explainability, Mortality prediction, eICU, SHAP values
National Category
Computer Sciences, Clinical Medicine
Identifiers
urn:nbn:se:kth:diva-327383 (URN), 10.1186/s12874-023-01921-9 (DOI), 000974652000002 (ISI), 37095430 (PubMedID), 2-s2.0-85153687506 (Scopus ID)
Note

QC 20230526

Available from: 2023-05-26 Created: 2023-05-26 Last updated: 2024-01-17. Bibliographically approved.
Getman, Y., Phan, N., Al-Ghezi, R., Voskoboinik, E., Singh, M., Grosz, T., . . . Ylinen, S. (2023). Developing an AI-Assisted Low-Resource Spoken Language Learning App for Children. IEEE Access, 11, 86025-86037
Developing an AI-Assisted Low-Resource Spoken Language Learning App for Children
2023 (English) In: IEEE Access, E-ISSN 2169-3536, Vol. 11, p. 86025-86037. Article in journal (Refereed), Published
Abstract [en]

Computer-assisted Language Learning (CALL) is a rapidly developing area accelerated by advancements in the field of AI. A well-designed and reliable CALL system allows students to practice language skills, like pronunciation, any time outside of the classroom. Furthermore, gamification via mobile applications has shown encouraging results on learning outcomes and motivates young users to practice more and perceive language learning as a positive experience. In this work, we adapt the latest speech recognition technology to be part of an online pronunciation training system for young children. As part of our gamified mobile application, our models assess the pronunciation quality of young Swedish children diagnosed with Speech Sound Disorder and participating in speech therapy. Additionally, the models provide feedback to young non-native children learning to pronounce Swedish and Finnish words. Our experiments revealed that these new models fit into an online game, as they function as speech recognizers and pronunciation evaluators simultaneously. To make our systems more trustworthy and explainable, we investigated whether the combination of modern input attribution algorithms and time-aligned transcripts can explain the decisions made by the models, give us insight into how the models work, and provide a tool for developing more reliable solutions.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
Keywords
ASR, children's speech, L2 speech, speech rating, SSD, wav2vec2
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-335200 (URN), 10.1109/ACCESS.2023.3304274 (DOI), 001051656600001 (ISI), 2-s2.0-85167833032 (Scopus ID)
Note

QC 20230925

Available from: 2023-09-25 Created: 2023-09-25 Last updated: 2025-02-07. Bibliographically approved.
Abdelnour, J., Rouat, J. & Salvi, G. (2023). NAAQA: A Neural Architecture for Acoustic Question Answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4), 4997-5009
NAAQA: A Neural Architecture for Acoustic Question Answering
2023 (English) In: IEEE Transactions on Pattern Analysis and Machine Intelligence, ISSN 0162-8828, E-ISSN 1939-3539, Vol. 45, no 4, p. 4997-5009. Article in journal (Refereed), Published
Abstract [en]

The goal of the Acoustic Question Answering (AQA) task is to answer a free-form text question about the content of an acoustic scene. It was inspired by the Visual Question Answering (VQA) task. In this paper, based on the previously introduced CLEAR dataset, we propose a new benchmark for AQA, namely CLEAR2, that emphasizes the specific challenges of acoustic inputs. These include the handling of variable-duration scenes and of scenes built from elementary sounds that differ between training and test sets. We also introduce NAAQA, a neural architecture that leverages specific properties of acoustic inputs. The use of 1D convolutions in time and frequency to process 2D spectro-temporal representations of acoustic content shows promising results and enables reductions in model complexity. We show that time coordinate maps augment temporal localization capabilities, which enhances the performance of the network by ∼17 percentage points. Frequency coordinate maps, on the other hand, have little influence on this task. NAAQA achieves 79.5% accuracy on the AQA task with ∼4 times fewer parameters than the previously explored VQA model. We evaluate the performance of NAAQA on an independent dataset reconstructed from DAQA. We also test the addition of a MALiMo module in our model on both CLEAR2 and DAQA. We provide a detailed analysis of the results for the different question types. We release the code to produce CLEAR2 as well as NAAQA to foster research in this newly emerging machine learning task.
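The time coordinate maps mentioned above follow the CoordConv idea: concatenate a channel encoding each frame's normalized temporal position, so that convolutions can exploit absolute location. A minimal sketch assuming a (frequency × time) spectrogram input — not NAAQA's actual code:

```python
import numpy as np

def add_time_coordinate_map(spectrogram):
    """spectrogram: (freq_bins, time_frames) array. Returns a
    (2, freq_bins, time_frames) tensor whose second channel encodes
    the normalized time coordinate in [-1, 1] for every frame."""
    f, t = spectrogram.shape
    time_coords = np.linspace(-1.0, 1.0, t, dtype=spectrogram.dtype)
    coord_map = np.broadcast_to(time_coords, (f, t))  # same value per column
    return np.stack([spectrogram, coord_map], axis=0)
```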

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-324766 (URN), 10.1109/tpami.2022.3194311 (DOI), 000947840300064 (ISI), 36121954 (PubMedID), 2-s2.0-85139450848 (Scopus ID)
Projects
IGLU
Note

QC 20250611

Available from: 2023-03-15 Created: 2023-03-15 Last updated: 2025-06-11. Bibliographically approved.
Rugayan, J., Salvi, G. & Svendsen, T. (2023). Perceptual and Task-Oriented Assessment of a Semantic Metric for ASR Evaluation. In: Interspeech 2023. Paper presented at the 24th Annual Conference of the International Speech Communication Association (Interspeech 2023), August 20-24, 2023, Dublin, Ireland (pp. 2158-2162). International Speech Communication Association
Perceptual and Task-Oriented Assessment of a Semantic Metric for ASR Evaluation
2023 (English) In: Interspeech 2023, International Speech Communication Association, 2023, p. 2158-2162. Conference paper, Published paper (Refereed)
Abstract [en]

Automatic speech recognition (ASR) systems have become a vital part of our everyday lives through their many applications. However, despite this progress, the most common evaluation method for ASR systems remains the word error rate (WER). WER does not give information on the severity of errors, which strongly impacts practical performance. As such, we examine a semantic-based metric called Aligned Semantic Distance (ASD) against WER and demonstrate its advantage over WER in two facets. First, we conduct a survey asking participants to score reference text and ASR transcription pairs. We perform a correlation analysis and show that ASD correlates better with the human evaluation scores than WER does. We also explore the feasibility of predicting human perception using ASD. Second, we demonstrate that ASD is more effective than WER as an indicator of performance on downstream NLP tasks such as named entity recognition and sentiment classification.
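For contrast, the WER that the paper argues against weights every edit equally: with S substitutions, D deletions, and I insertions against an N-word reference,

```latex
\mathrm{WER} = \frac{S + D + I}{N}
```

so a meaning-destroying substitution costs exactly as much as a harmless one — the severity blindness that a semantic metric such as ASD is meant to address.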

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
ASR evaluation metric, semantic context, user perception
National Category
Computer Sciences, Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-337837 (URN), 10.21437/Interspeech.2023-1778 (DOI), 001186650302068 (ISI), 2-s2.0-85171598286 (Scopus ID)
Conference
24th Annual Conference of the International Speech Communication Association (Interspeech 2023), August 20-24, 2023, Dublin, Ireland
Note

QC 20241015

Available from: 2023-10-09 Created: 2023-10-09 Last updated: 2025-02-01. Bibliographically approved.
Shahrebabaki, A. S., Salvi, G., Svendsen, T. & Siniscalchi, S. M. (2022). Acoustic-to-Articulatory Mapping With Joint Optimization of Deep Speech Enhancement and Articulatory Inversion Models. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 135-147
Acoustic-to-Articulatory Mapping With Joint Optimization of Deep Speech Enhancement and Articulatory Inversion Models
2022 (English) In: IEEE/ACM Transactions on Audio, Speech, and Language Processing, ISSN 2329-9290, Vol. 30, p. 135-147. Article in journal (Refereed), Published
Abstract [en]

We investigate the problem of speaker-independent acoustic-to-articulatory inversion (AAI) in noisy conditions within the deep neural network (DNN) framework. In contrast with recent results in the literature, we argue that a DNN vector-to-vector regression front-end for speech enhancement (DNN-SE) can play a key role in AAI when used to enhance spectral features prior to AAI back-end processing. We experimented with single- and multi-task training strategies for the DNN-SE block, finding the latter beneficial to AAI. Furthermore, we show that coupling a DNN-SE producing enhanced speech features with an AAI model trained on clean speech outperforms a multi-condition AAI (AAI-MC) when tested on noisy speech. We observe a 15% relative improvement in Pearson's correlation coefficient (PCC) between our system and AAI-MC at 0 dB signal-to-noise ratio on the Haskins corpus. Our approach also compares favourably against a conventional DSP approach to speech enhancement (MMSE with IMCRA) in the front-end. Finally, we demonstrate the utility of articulatory inversion in a downstream speech application. We report significant WER improvements on an automatic speech recognition task in mismatched conditions based on the Wall Street Journal (WSJ) corpus when leveraging articulatory information estimated by the AAI-MC system over spectral-only speech features.
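The reported 15% relative gain is measured with Pearson's correlation coefficient between the measured articulatory trajectory y and its estimate ŷ over T frames, i.e., the standard

```latex
\mathrm{PCC}(y, \hat{y}) =
  \frac{\sum_{t=1}^{T} (y_t - \bar{y})(\hat{y}_t - \bar{\hat{y}})}
       {\sqrt{\sum_{t=1}^{T} (y_t - \bar{y})^2}\,
        \sqrt{\sum_{t=1}^{T} (\hat{y}_t - \bar{\hat{y}})^2}}
```

where the bars denote means over frames; values closer to 1 indicate that the estimated trajectory tracks the measured one more faithfully.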

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2022
Keywords
Noise measurement, Speech enhancement, Task analysis, Mel frequency cepstral coefficient, Training, Hidden Markov models, Deep learning, Deep neural network, acoustic-to-articulatory inversion, multi-task training, speaker independent models
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-307335 (URN), 10.1109/TASLP.2021.3133218 (DOI), 000735507400007 (ISI), 2-s2.0-85121342065 (Scopus ID)
Note

QC 20220124

Available from: 2022-01-24 Created: 2022-01-24 Last updated: 2025-02-07. Bibliographically approved.
Ásgrímsson, D. S., González, I., Salvi, G. & Karoumi, R. (2022). Bayesian Deep Learning for Vibration-Based Bridge Damage Detection. In: Structural Integrity (pp. 27-43). Springer Nature, Vol. 21
Bayesian Deep Learning for Vibration-Based Bridge Damage Detection
2022 (English) In: Structural Integrity, Springer Nature, 2022, Vol. 21, p. 27-43. Chapter in book (Refereed)
Abstract [en]

A machine learning approach to damage detection is presented for a bridge structural health monitoring (SHM) system. The method is validated on the renowned Z24 bridge benchmark dataset, where a sensor-instrumented three-span bridge was monitored for almost a year before being deliberately damaged in a realistic and controlled way. Several damage cases were successfully detected, making this a viable approach for a data-based bridge SHM system. The method directly addresses a critical issue in most data-based SHM systems: the collected training data will not contain all natural weather events and load conditions. An SHM system trained on such limited data must be able to handle uncertainty in its predictions to prevent false damage detections. A Bayesian autoencoder neural network is trained to reconstruct raw sensor data sequences, with uncertainty bounds on its predictions. The uncertainty-adjusted reconstruction error of an unseen sequence is compared to a healthy-state error distribution, and the sequence is accepted or rejected based on the fidelity of the reconstruction. If the proportion of rejected sequences exceeds a predetermined threshold, the bridge is determined to be in a damaged state. The result is a fully operational, machine-learning-based bridge damage detection system learned directly from raw sensor data.
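The decision rule described above can be sketched as follows. Dividing the error by the predictive standard deviation is one plausible reading of "uncertainty-adjusted", and the percentile cutoff and rejection budget are illustrative assumptions, not the chapter's values:

```python
import numpy as np

def bridge_state(errors, sigmas, healthy_adjusted,
                 percentile=99.0, reject_budget=0.05):
    """errors, sigmas: per-sequence reconstruction error and predictive
    standard deviation from the Bayesian autoencoder; healthy_adjusted:
    uncertainty-adjusted errors collected in the healthy state. Reject a
    sequence when its adjusted error exceeds a high percentile of the
    healthy distribution; flag damage when too many are rejected."""
    adjusted = np.asarray(errors) / np.asarray(sigmas)
    cutoff = np.percentile(healthy_adjusted, percentile)
    reject_rate = float((adjusted > cutoff).mean())
    return ("damaged" if reject_rate > reject_budget else "healthy",
            reject_rate)
```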

Place, publisher, year, edition, pages
Springer Nature, 2022
Series
Structural Integrity, ISSN 2522-560X ; 21
Keywords
Autoencoders, Bayesian deep learning, Bridge damage detection, Machine learning, Structural health monitoring, Z24 bridge benchmark
National Category
Infrastructure Engineering
Identifiers
urn:nbn:se:kth:diva-312838 (URN), 10.1007/978-3-030-81716-9_2 (DOI), 2-s2.0-85117941432 (Scopus ID)
Note

QC 20220530

Available from: 2022-05-30 Created: 2022-05-30 Last updated: 2022-06-25. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0002-3323-5311