Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection
Kyoto University, Japan.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-3513-4132
Kyoto University, Japan.
2024 (English) Conference paper, Oral presentation with published abstract (Refereed)
Abstract [en]

A demonstration of a real-time, continuous turn-taking prediction system is presented. The system is based on a voice activity projection (VAP) model, which directly maps stereo dialogue audio to future voice activities. The VAP model includes contrastive predictive coding (CPC) and self-attention transformers, followed by a cross-attention transformer. We examine the effect of the input context audio length and demonstrate that the proposed system can operate in real time on a CPU with minimal performance degradation.
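
The role of the input context length in continuous prediction can be illustrated with a minimal sketch. It is not the authors' implementation: the names `ContextWindow`, `placeholder_predictor`, and `run_stream` are hypothetical, and the predictor is a simple energy-ratio stand-in rather than the VAP model. The sketch only shows how a fixed-length window over incoming stereo audio could be refreshed chunk by chunk and handed to a model at every step.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

# One stereo sample: (speaker A amplitude, speaker B amplitude).
Frame = Tuple[float, float]


class ContextWindow:
    """Keeps only the most recent `context_seconds` of stereo audio.

    Hypothetical helper for illustration; not part of any VAP codebase.
    """

    def __init__(self, sample_rate: int, context_seconds: float) -> None:
        self.max_samples = int(sample_rate * context_seconds)
        # A deque with maxlen silently drops the oldest samples when full.
        self.buffer: Deque[Frame] = deque(maxlen=self.max_samples)

    def push(self, chunk: List[Frame]) -> None:
        self.buffer.extend(chunk)

    def snapshot(self) -> List[Frame]:
        return list(self.buffer)


def placeholder_predictor(context: List[Frame]) -> Tuple[float, float]:
    """Stand-in for a model: each channel's share of energy in the window.

    ASSUMPTION: this is NOT voice activity projection, only a stub that
    returns one score per speaker so the streaming loop can be exercised.
    """
    energy_a = sum(a * a for a, _ in context)
    energy_b = sum(b * b for _, b in context)
    total = energy_a + energy_b
    if total == 0:
        return (0.5, 0.5)
    return (energy_a / total, energy_b / total)


def run_stream(
    chunks: List[List[Frame]],
    window: ContextWindow,
    predict: Callable[[List[Frame]], Tuple[float, float]],
) -> List[Tuple[float, float]]:
    """Feeds audio chunk by chunk and predicts once after each chunk."""
    outputs = []
    for chunk in chunks:
        window.push(chunk)
        outputs.append(predict(window.snapshot()))
    return outputs
```

In this sketch, a shorter `context_seconds` means less audio per prediction and therefore less computation per step, at the cost of giving the predictor less history. That is the kind of trade-off the abstract refers to when it examines input context length.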

Place, publisher, year, edition, pages
2024.
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:kth:diva-359141
DOI: 10.48550/arXiv.2401.04868
OAI: oai:DiVA.org:kth-359141
DiVA, id: diva2:1931347
Conference
The 14th International Workshop on Spoken Dialogue Systems Technology (IWSDS), Sapporo, Japan, March 4-6, 2024
Projects
tmh_turntaking
Note

QC 20250325

Available from: 2025-01-27 Created: 2025-01-27 Last updated: 2025-03-25 Bibliographically approved

Authority records

Jiang, Bing'er; Ekstedt, Erik; Skantze, Gabriel
