KTH Publications (DiVA)
A lightweight approach to real-time speaker diarization: from audio toward audio-visual data streams
Tech Univ Liberec, Fac Mechatron Informat & Interdisciplinary Studies, Studentska 2, Liberec 46117, Czech Republic.
Norwegian Univ Sci & Technol, Dept Elect Syst, NO-7491 Trondheim, Norway.
2024 (English). In: EURASIP Journal on Audio, Speech, and Music Processing, ISSN 1687-4714, E-ISSN 1687-4722, Vol. 2024, no. 1, article id 62. Article in journal (Refereed). Published.
Abstract [en]

This manuscript addresses real-time speaker diarization (SD) for stream-wise data processing. In contrast to most existing papers, it therefore considers not only the accuracy but also the computational demands of the investigated methods. We first propose a new lightweight scheme that performs speaker diarization of streamed audio data. Our approach uses a modified residual network with squeeze-and-excitation blocks (SE-ResNet-34) to extract speaker embeddings efficiently by means of cached buffers. These embeddings are then used for voice activity detection (VAD) and block-online k-means clustering with a look-ahead mechanism. The described scheme yields results similar to the reference offline system while running solely on a CPU with a low real-time factor (RTF) below 0.1 and a constant latency of around 5.5 s. The second part of the work moves toward the much more demanding real-time processing of audio-visual data streams. For this purpose, we extend the above audio scheme with an audio-video module that combines SyncNet with visual embeddings for identity tracking. The resulting multi-modal SD framework fuses the outputs of the audio and audio-video modules using a new overlap-based fusion strategy. It yields diarization error rates competitive with existing state-of-the-art offline audio-visual methods while processing various audio-video streams, e.g., from Internet or TV broadcasts, in real time on a GPU and with the same latency as for audio-only processing.
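The abstract mentions block-online k-means clustering with a look-ahead mechanism but this record does not spell the algorithm out. The sketch below is a minimal, hypothetical illustration of the general idea (not the authors' implementation): embeddings arrive block by block, centroids are updated online, and labels for a block are emitted only after `lookahead` further blocks have been seen, trading a fixed latency for more stable assignments. The class name, parameters, and update rule are all assumptions for illustration.

```python
import numpy as np

class BlockOnlineKMeans:
    """Illustrative block-online k-means with a look-ahead buffer.

    Embeddings arrive in blocks; each block's labels are emitted only
    after `lookahead` subsequent blocks have refined the centroids.
    """

    def __init__(self, n_clusters, lookahead=2, lr=0.1, seed=0):
        self.k = n_clusters
        self.lookahead = lookahead      # blocks of delay before labeling
        self.lr = lr                    # online centroid learning rate
        self.rng = np.random.default_rng(seed)
        self.centroids = None
        self.buffer = []                # blocks awaiting delayed labeling

    def _update(self, block):
        # Online update: pull the nearest centroid toward each embedding.
        for x in block:
            j = np.argmin(np.linalg.norm(self.centroids - x, axis=1))
            self.centroids[j] += self.lr * (x - self.centroids[j])

    def push(self, block):
        """Feed one block of embeddings; return labels for the block
        leaving the look-ahead window, or None while it is filling."""
        block = np.asarray(block, dtype=float)
        if self.centroids is None:
            # Initialize centroids from distinct rows of the first block.
            idx = self.rng.choice(len(block), self.k, replace=False)
            self.centroids = block[idx].copy()
        self._update(block)
        self.buffer.append(block)
        if len(self.buffer) <= self.lookahead:
            return None                 # still accumulating context
        oldest = self.buffer.pop(0)
        # Label the oldest block with the centroids refined by look-ahead.
        d = np.linalg.norm(oldest[:, None, :] - self.centroids[None], axis=2)
        return d.argmin(axis=1)
```

The constant latency reported in the abstract (around 5.5 s) would, under this kind of design, correspond to the look-ahead depth times the block duration plus embedding-extraction delay; the exact breakdown is not given in this record.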

Place, publisher, year, edition, pages
Springer Nature, 2024. Vol. 2024, no. 1, article id 62
Keywords [en]
Speaker diarization, Streamed data processing, Multi-modal, Audio-visual, Deep learning
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-357545
DOI: 10.1186/s13636-024-00382-2
ISI: 001365828000001
Scopus ID: 2-s2.0-85210595217
OAI: oai:DiVA.org:kth-357545
DiVA, id: diva2:1919448
Note

QC 20241209

Available from: 2024-12-09. Created: 2024-12-09. Last updated: 2024-12-09. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text | Scopus

Authority records

Salvi, Giampiero

Search in DiVA

By author/editor
Salvi, Giampiero
By organisation
Speech, Music and Hearing, TMH
In the same journal
EURASIP Journal on Audio, Speech, and Music Processing
Computer Sciences
