A lightweight approach to real-time speaker diarization: from audio toward audio-visual data streams
2024 (English). In: EURASIP Journal on Audio, Speech, and Music Processing, ISSN 1687-4714, E-ISSN 1687-4722, Vol. 2024, no 1, article id 62
Article in journal (Refereed) Published
Abstract [en]
This manuscript deals with the task of real-time speaker diarization (SD) for stream-wise data processing. Therefore, in contrast to most of the existing papers, it considers not only the accuracy but also the computational demands of individual investigated methods. We first propose a new lightweight scheme allowing us to perform speaker diarization of streamed audio data. Our approach utilizes a modified residual network with squeeze-and-excitation blocks (SE-ResNet-34) to extract speaker embeddings in an optimized way using cached buffers. These embeddings are subsequently used for voice activity detection (VAD) and block-online k-means clustering with a look-ahead mechanism. The described scheme yields results similar to the reference offline system while operating solely on a CPU with a low real-time factor (RTF) below 0.1 and a constant latency of around 5.5 s. In the next part of the work, our research moves toward much more demanding and complex real-time processing of audio-visual data streams. For this purpose, we extend the above-mentioned scheme for audio data processing by adding an audio-video module. This module utilizes SyncNet combined with visual embeddings for identity tracking. Our resulting multi-modal SD framework then combines the outputs from audio and audio-video modules by using a new overlap-based fusion strategy. It yields diarization error rates that are competitive with the existing state-of-the-art offline audio-visual methods while allowing us to process various audio-video streams, e.g., from Internet or TV broadcasts, in real-time using GPU and with the same latency as for audio data processing.
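The real-time factor (RTF) quoted in the abstract is conventionally the ratio of processing time to the duration of the processed audio, so an RTF below 1 means the system keeps pace with the stream. The following is a minimal sketch of that arithmetic only, not the authors' code; the function names and the example timings are hypothetical.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (hypothetical helper)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def keeps_up_with_stream(rtf: float, limit: float = 1.0) -> bool:
    """A stream can be processed in real time only if RTF stays below the limit."""
    return rtf < limit


# Illustrative numbers, not taken from the paper: 60 s of audio processed in 4.5 s.
rtf = real_time_factor(processing_seconds=4.5, audio_seconds=60.0)
print(f"RTF = {rtf:.3f}, real-time capable: {keeps_up_with_stream(rtf)}")
```

With these assumed timings the RTF is roughly 0.075, which would fall under the 0.1 bound reported for the CPU-only audio scheme. Note that RTF measures throughput and is separate from the constant latency of around 5.5 s, which the abstract attributes to the scheme as a whole.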
Place, publisher, year, edition, pages
Springer Nature, 2024. Vol. 2024, no 1, article id 62
Keywords [en]
Speaker diarization, Streamed data processing, Multi-modal, Audio-visual, Deep learning
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-357545
DOI: 10.1186/s13636-024-00382-2
ISI: 001365828000001
Scopus ID: 2-s2.0-85210595217
OAI: oai:DiVA.org:kth-357545
DiVA, id: diva2:1919448
Note
QC 20241209
2024-12-09, 2024-12-09, 2024-12-09. Bibliographically approved