Show & Tell: Voice Activity Projection and Turn-taking
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-3513-4132
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-8579-1790
2023 (English). In: Interspeech 2023, International Speech Communication Association, 2023, p. 2020-2021. Conference paper, Published paper (Refereed)
Abstract [en]

We present Voice Activity Projection (VAP), a model trained on spontaneous spoken dialog with the objective to incrementally predict future voice activity. Similar to a language model, it is trained through self-supervised learning and outputs a probability distribution over discrete states that corresponds to the joint future voice activity of the dialog interlocutors. The model is well-defined over overlapping speech regions, resilient towards microphone “bleed-over” and considers the speech of both speakers (e.g., a user and an agent) to provide the most likely next speaker. VAP is a general turn-taking model which can serve as the base for turn-taking decisions in spoken dialog systems, an automatic tool useful for linguistics and conversational analysis, an automatic evaluation metric for conversational text-to-speech models, and possibly many other tasks related to spoken dialog interaction.
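
The idea of a probability distribution over discrete states that encode the joint future voice activity of two speakers can be illustrated with a small sketch. This is not the authors' implementation: the number of future bins (N_BINS), the bit layout of the state index, and the function names encode_state, decode_state and next_speaker_scores are all hypothetical choices made for illustration. The sketch packs each speaker's binary future activity into one state index, then reads a next-speaker score off a distribution over those states by comparing expected future activity.

```python
# Illustrative sketch only: the bin count and the bit layout below are
# assumptions, not details taken from the abstract.
N_BINS = 2  # hypothetical number of future time bins per speaker


def encode_state(bins_a, bins_b):
    """Pack two speakers' binary future-activity bins into one state index."""
    index = 0
    for bit in list(bins_a) + list(bins_b):
        index = (index << 1) | int(bool(bit))
    return index


def decode_state(index, n_bins=N_BINS):
    """Unpack a state index into (bins_a, bins_b), the inverse of encode_state."""
    total_bits = 2 * n_bins
    bits = [(index >> shift) & 1 for shift in range(total_bits - 1, -1, -1)]
    return bits[:n_bins], bits[n_bins:]


def next_speaker_scores(state_probs, n_bins=N_BINS):
    """Turn a distribution over joint states into normalized per-speaker scores.

    Each speaker's score is the expected number of active future bins,
    divided by the sum over both speakers.
    """
    if len(state_probs) != 2 ** (2 * n_bins):
        raise ValueError("state_probs must have one entry per joint state")
    act_a = 0.0
    act_b = 0.0
    for index, prob in enumerate(state_probs):
        bins_a, bins_b = decode_state(index, n_bins)
        act_a += prob * sum(bins_a)
        act_b += prob * sum(bins_b)
    total = act_a + act_b
    if total == 0.0:
        return 0.5, 0.5  # all mass on mutual silence: no preference
    return act_a / total, act_b / total


if __name__ == "__main__":
    # All probability mass on "speaker A active in both bins, B silent".
    probs = [0.0] * 2 ** (2 * N_BINS)
    probs[encode_state([1, 1], [0, 0])] = 1.0
    print(next_speaker_scores(probs))
```

Because every joint state has its own index, states in which both speakers are active are represented like any other, which is one way to read the abstract's remark that the model is well-defined over overlapping speech.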

Place, publisher, year, edition, pages
International Speech Communication Association, 2023. p. 2020-2021
Keywords [en]
spoken dialog, text-to-speech, turn-taking
National Category
Natural Language Processing; Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-337875
ISI: 001186650302038
Scopus ID: 2-s2.0-85171575920
OAI: oai:DiVA.org:kth-337875
DiVA, id: diva2:1803881
Conference
24th Annual Conference of the International Speech Communication Association, Interspeech 2023, August 20-24, 2023, Dublin, Ireland
Note

QC 20241014

Available from: 2023-10-10. Created: 2023-10-10. Last updated: 2025-02-01. Bibliographically approved

Authority records

Ekstedt, Erik; Skantze, Gabriel