Multilingual Turn-taking Prediction Using Voice Activity Projection
Graduate School of Informatics, Kyoto University, Japan.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-3513-4132
Graduate School of Informatics, Kyoto University, Japan.
2024 (English). In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings, European Language Resources Association (ELRA), 2024, p. 11873-11883. Conference paper, Published paper (Refereed)
Abstract [en]

This paper investigates the application of voice activity projection (VAP), a predictive turn-taking model for spoken dialogue, to multilingual data, encompassing English, Mandarin, and Japanese. The VAP model continuously predicts the upcoming voice activities of participants in dyadic dialogue, leveraging a cross-attention Transformer to capture the dynamic interplay between participants. The results show that a monolingual VAP model trained on one language does not make good predictions when applied to other languages. However, a multilingual model, trained on all three languages, demonstrates predictive performance on par with monolingual models across all languages. Further analyses show that the multilingual model has learned to discern the language of the input signal. We also analyze the sensitivity to pitch, a prosodic cue that is thought to be important for turn-taking. Finally, we compare two different audio encoders: contrastive predictive coding (CPC), pre-trained on English, and a recent model based on multilingual wav2vec 2.0 (MMS).

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2024. p. 11873-11883
Keywords [en]
Multilingual, Spoken Dialogue System, Turn-taking, Voice Activity Projection
National Category
Natural Language Processing, General Language Studies and Linguistics, Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-348790
Scopus ID: 2-s2.0-85195914079
OAI: oai:DiVA.org:kth-348790
DiVA, id: diva2:1878700
Conference
Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, May 20-25, 2024, Torino, Italy
Projects
tmh_turntaking
Note

Part of ISBN 978-249381410-4

QC 20241028

Available from: 2024-06-27 Created: 2024-06-27 Last updated: 2025-02-01 Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Scopus
https://aclanthology.org/2024.lrec-main.1036

Authority records

Jiang, Binger; Ekstedt, Erik; Skantze, Gabriel
