Publications (10 of 19)
Stefanov, K., Beskow, J. & Salvi, G. (2020). Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition. IEEE Transactions on Cognitive and Developmental Systems, 12(2), 250-259, Article ID 8758947.
Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition
2020 (English) In: IEEE Transactions on Cognitive and Developmental Systems, ISSN 2379-8920, E-ISSN 2379-8939, Vol. 12, no 2, p. 250-259, article id 8758947. Article in journal (Refereed). Published.
Abstract [en]

This paper presents a self-supervised method for visual detection of the active speaker in a multi-person spoken interaction scenario. Active speaker detection is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. The proposed method is intended to complement the acoustic detection of the active speaker, thus improving the system's robustness in noisy conditions. The method can detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their faces. Furthermore, the method does not rely on external annotations, thus remaining consistent with cognitive development. Instead, the method uses information from the auditory modality to support learning in the visual domain. This paper reports an extensive evaluation of the proposed method using a large multi-person face-to-face interaction dataset. The results show good performance in a speaker-dependent setting. However, in a speaker-independent setting the proposed method yields significantly lower performance. We believe that the proposed method represents an essential component of any artificial cognitive system or robotic platform engaging in social interactions.
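
For readers who want a concrete picture of the self-supervision scheme the abstract describes, the following minimal Python sketch pairs audio-derived speech/non-speech labels with visual face features from the same frames, so a visual classifier is trained without manual annotation. The energy-based voice activity detector, the feature dimensions, and the classifier choice are illustrative assumptions, not the pipeline used in the paper.

# Minimal sketch: frame-level labels from each person's audio channel act as
# (noisy) targets for a classifier that only sees visual features of that
# person's face. All names and the simple energy-based VAD are assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier

def energy_vad(audio_frames, threshold=0.01):
    """Label a frame as 'speaking' if its RMS energy exceeds a threshold."""
    rms = np.sqrt((audio_frames ** 2).mean(axis=1))
    return (rms > threshold).astype(int)

def train_visual_speaker_detector(face_features, audio_frames):
    """Train a visual classifier using audio-derived labels (no manual annotation)."""
    labels = energy_vad(audio_frames)              # self-supervision signal
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300)
    clf.fit(face_features, labels)                 # learns speaking vs. silent faces
    return clf

# Toy usage: 1000 synchronized frames, 128-dim face descriptors, 400-sample audio frames
# (random data standing in for real features; loud first half, quiet second half).
rng = np.random.default_rng(0)
faces = rng.normal(size=(1000, 128))
audio = np.concatenate([rng.normal(scale=0.05, size=(500, 400)),
                        rng.normal(scale=0.001, size=(500, 400))])
detector = train_visual_speaker_detector(faces, audio)
print(detector.predict(faces[:5]))                 # per-frame active-speaker decisions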

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2020
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-260126 (URN); 10.1109/TCDS.2019.2927941 (DOI); 000542972700013 (); 2-s2.0-85069908129 (Scopus ID)
Note

QC 20200625

Available from: 2019-09-25. Created: 2019-09-25. Last updated: 2024-06-24. Bibliographically approved.
Stefanov, K., Salvi, G., Kontogiorgos, D., Kjellström, H. & Beskow, J. (2019). Modeling of Human Visual Attention in Multiparty Open-World Dialogues. ACM Transactions on Human-Robot Interaction, 8(2), Article ID UNSP 8.
Modeling of Human Visual Attention in Multiparty Open-World Dialogues
2019 (English) In: ACM Transactions on Human-Robot Interaction, E-ISSN 2573-9522, Vol. 8, no 2, article id UNSP 8. Article in journal (Refereed). Published.
Abstract [en]

This study proposes, develops, and evaluates methods for modeling the eye-gaze direction and head orientation of a person in multiparty open-world dialogues, as a function of low-level communicative signals generated by his/her interlocutors. These signals include speech activity, eye-gaze direction, and head orientation, all of which can be estimated in real time during the interaction. By utilizing these signals and novel data representations suitable for the task and context, the developed methods can generate plausible candidate gaze targets in real time. The methods are based on Feedforward Neural Networks and Long Short-Term Memory Networks. The proposed methods are developed using several hours of unrestricted interaction data, and their performance is compared with a heuristic baseline method. The study offers an extensive evaluation of the proposed methods that investigates the contribution of different predictors to the accurate generation of candidate gaze targets. The results show that the methods can accurately generate candidate gaze targets when the person being modeled is in a listening state. However, when the person being modeled is in a speaking state, the proposed methods yield significantly lower performance.
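
As an illustration of the kind of sequence model named in the abstract, the sketch below maps per-frame interlocutor signals (speech activity, eye-gaze direction, head orientation) to a distribution over candidate gaze targets with a small LSTM. The feature layout, hidden size, and four-way target set are assumptions for illustration and do not reproduce the paper's data representation.

# Hedged sketch of an LSTM that predicts frame-level gaze targets for the
# modeled person from low-level signals of the other participants.
import torch
import torch.nn as nn

class GazeTargetLSTM(nn.Module):
    def __init__(self, n_features=10, hidden=64, n_targets=4):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_targets)   # e.g. {person A, person B, robot, elsewhere}

    def forward(self, x):                          # x: (batch, time, n_features)
        out, _ = self.lstm(x)
        return self.head(out)                      # (batch, time, n_targets) logits

# Toy training step on random data standing in for synchronized interaction features.
model = GazeTargetLSTM()
features = torch.randn(8, 100, 10)                 # 8 sequences, 100 frames, 10 signals
targets = torch.randint(0, 4, (8, 100))            # frame-level gaze-target labels
logits = model(features)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 4), targets.reshape(-1))
loss.backward()
print(float(loss))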

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2019
Keywords
Human-human interaction, open-world dialogue, eye-gaze direction, head orientation, multiparty
National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:kth:diva-255203 (URN); 10.1145/3323231 (DOI); 000472066800003 ()
Note

QC 20190904

Available from: 2019-09-04. Created: 2019-09-04. Last updated: 2025-02-07. Bibliographically approved.
Stefanov, K. & Beskow, J. (2017). A Real-time Gesture Recognition System for Isolated Swedish Sign Language Signs. In: Proceedings of the 4th European and 7th Nordic Symposium on Multimodal Communication (MMSYM 2016). Paper presented at the 4th European and 7th Nordic Symposium on Multimodal Communication (MMSYM 2016), Copenhagen, 29-30 September 2016. Linköping University Electronic Press.
A Real-time Gesture Recognition System for Isolated Swedish Sign Language Signs
2017 (English) In: Proceedings of the 4th European and 7th Nordic Symposium on Multimodal Communication (MMSYM 2016), Linköping University Electronic Press, 2017. Conference paper, Published paper (Refereed).
Place, publisher, year, edition, pages
Linköping University Electronic Press, 2017
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-218328 (URN)
Conference
4th European and 7th Nordic Symposium on Multimodal Communication (MMSYM 2016), Copenhagen, 29-30 September 2016
Note

QC 20171128

Available from: 2017-11-27. Created: 2017-11-27. Last updated: 2024-03-18. Bibliographically approved.
Stefanov, K., Beskow, J. & Salvi, G. (2017). Vision-based Active Speaker Detection in Multiparty Interaction. In: Grounding Language Understanding. Paper presented at Grounding Language Understanding (GLU2017), August 25, 2017, KTH Royal Institute of Technology, Stockholm, Sweden.
Vision-based Active Speaker Detection in Multiparty Interaction
2017 (English) In: Grounding Language Understanding, 2017. Conference paper, Published paper (Refereed).
National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:kth:diva-211651 (URN); 10.21437/GLU.2017-10 (DOI)
Conference
Grounding Language Understanding (GLU2017), August 25, 2017, KTH Royal Institute of Technology, Stockholm, Sweden
Note

QC 20170809

Available from: 2017-08-08. Created: 2017-08-08. Last updated: 2025-02-07. Bibliographically approved.
Stefanov, K. & Beskow, J. (2016). A Multi-party Multi-modal Dataset for Focus of Visual Attention in Human-human and Human-robot Interaction. In: Proceedings of the 10th edition of the Language Resources and Evaluation Conference. Paper presented at the 10th edition of the Language Resources and Evaluation Conference (LREC 2016), 23-28 May. ELRA.
A Multi-party Multi-modal Dataset for Focus of Visual Attention in Human-human and Human-robot Interaction
2016 (English) In: Proceedings of the 10th edition of the Language Resources and Evaluation Conference, ELRA, 2016. Conference paper, Published paper (Refereed).
Place, publisher, year, edition, pages
ELRA, 2016
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-187954 (URN); 000526952504105 (); 2-s2.0-85016436223 (Scopus ID)
Conference
The 10th edition of the Language Resources and Evaluation Conference (LREC 2016), 23-28 May
Note

QC 20211018

Available from: 2016-06-02. Created: 2016-06-02. Last updated: 2022-06-22. Bibliographically approved.
Stefanov, K. & Beskow, J. (2016). Gesture Recognition System for Isolated Sign Language Signs. Paper presented at the 4th European and 7th Nordic Symposium on Multimodal Communication, 29-30 September 2016, University of Copenhagen, Denmark (pp. 57-59).
Gesture Recognition System for Isolated Sign Language Signs
2016 (English). Conference paper, Published paper (Refereed).
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-193939 (URN)
Conference
The 4th European and 7th Nordic Symposium on Multimodal Communication, 29-30 September 2016, University of Copenhagen, Denmark
Note

QC 20161013

Available from: 2016-10-12. Created: 2016-10-12. Last updated: 2024-03-18. Bibliographically approved.
Stefanov, K., Sugimoto, A. & Beskow, J. (2016). Look Who’s Talking: Visual Identification of the Active Speaker in Multi-party Human-robot Interaction. Paper presented at the 2nd Workshop on Advancements in Social Signal Processing for Multimodal Interaction 2016 (ASSP4MI 2016), held in conjunction with the 18th ACM International Conference on Multimodal Interaction 2016 (ICMI 2016) (pp. 22-27). Association for Computing Machinery (ACM).
Look Who’s Talking: Visual Identification of the Active Speaker in Multi-party Human-robot Interaction
2016 (English). Conference paper, Published paper (Refereed).
Abstract [en]

This paper presents analysis of a previously recorded multimodal interaction dataset. The primary purpose of that dataset is to explore patterns in the focus of visual attention of humans under three different conditions: two humans involved in task-based interaction with a robot; the same two humans involved in task-based interaction where the robot is replaced by a third human; and a free three-party human interaction. The paper presents a data-driven methodology for automatic visual identification of the active speaker based on facial action units (AUs). The paper also presents an evaluation of the proposed methodology on 12 different interactions with an approximate length of 4 hours. The methodology will be implemented on a robot and used to generate natural focus of visual attention behavior during multi-party human-robot interactions.
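
A hedged sketch of the data-driven idea in the abstract: per-frame facial action unit (AU) intensities for a participant are fed to a classifier that decides whether that participant is currently the active speaker. The AU count, the random-forest classifier, and the synthetic data below are assumptions for illustration; the paper's features and labels come from the recorded interactions.

# Illustrative sketch: classify 'speaking' vs. 'not speaking' from AU intensities.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n_frames, n_aus = 5000, 17                          # e.g. 17 AU intensity values per frame
au_features = rng.random((n_frames, n_aus))
speaking = (au_features[:, 10] > 0.5).astype(int)   # stand-in for mouth-related AU activity

X_train, X_test, y_train, y_test = train_test_split(au_features, speaking, test_size=0.2)
clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print("frame-level accuracy:", clf.score(X_test, y_test))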

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2016
Keywords
Active speaker identification, Human-robot interaction, Multi-modal interaction
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-193940 (URN); 10.1145/3005467.3005470 (DOI); 2-s2.0-85006754103 (Scopus ID)
Conference
2nd Workshop on Advancements in Social Signal Processing for Multimodal Interaction 2016 (ASSP4MI 2016), held in conjunction with the 18th ACM International Conference on Multimodal Interaction 2016 (ICMI 2016)
Note

QCR 20161013

QC 20170314

Available from: 2016-10-12. Created: 2016-10-12. Last updated: 2024-03-18. Bibliographically approved.
Chollet, M., Stefanov, K., Prendinger, H. & Scherer, S. (2015). Public Speaking Training with a Multimodal Interactive Virtual Audience Framework. In: ICMI '15 Proceedings of the 2015 ACM on International Conference on Multimodal Interaction. Paper presented at the 17th ACM International Conference on Multimodal Interaction (ICMI 2015), New York, NY (pp. 367-368). ACM Digital Library.
Public Speaking Training with a Multimodal Interactive Virtual Audience Framework
2015 (English) In: ICMI '15 Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, ACM Digital Library, 2015, p. 367-368. Conference paper, Published paper (Refereed).
Abstract [en]

We have developed an interactive virtual audience platform for public speaking training. Users' public speaking behavior is automatically analyzed using multimodal sensors, and multimodal feedback is produced by virtual characters and generic visual widgets depending on the user's behavior. The flexibility of our system allows us to compare different interaction mediums (e.g., virtual reality vs. normal interaction), social situations (e.g., one-on-one meetings vs. large audiences), and trained behaviors (e.g., general public speaking performance vs. specific behaviors).

Place, publisher, year, edition, pages
ACM Digital Library, 2015
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-180569 (URN); 10.1145/2818346.2823294 (DOI); 000380609500058 (); 2-s2.0-84959308165 (Scopus ID)
Conference
17th ACM International Conference on Multimodal Interaction (ICMI 2015), New York, NY
Note

QC 20160125

Available from: 2016-01-19. Created: 2016-01-19. Last updated: 2022-06-23. Bibliographically approved.
Meena, R., Dabbaghchian, S. & Stefanov, K. (2014). A Data-driven Approach to Detection of Interruptions in Human–human Conversations. Paper presented at FONETIK, Stockholm, Sweden.
A Data-driven Approach to Detection of Interruptions in Human–human Conversations
2014 (English). Conference paper, Published paper (Refereed).
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-158181 (URN)
Conference
FONETIK, Stockholm, Sweden
Note

QC 20161017

Available from: 2014-12-30. Created: 2014-12-30. Last updated: 2022-06-23. Bibliographically approved.
Al Moubayed, S., Beskow, J., Bollepalli, B., Gustafson, J., Hussen-Abdelaziz, A., Johansson, M., . . . Varol, G. (2014). Human-robot Collaborative Tutoring Using Multiparty Multimodal Spoken Dialogue. Paper presented at the 9th Annual ACM/IEEE International Conference on Human-Robot Interaction, Bielefeld, Germany. IEEE conference proceedings.
Human-robot Collaborative Tutoring Using Multiparty Multimodal Spoken Dialogue
2014 (English). Conference paper, Published paper (Refereed).
Abstract [en]

In this paper, we describe a project that explores a novel experimental setup towards building a spoken, multi-modally rich, and human-like multiparty tutoring robot. A human-robot interaction setup is designed, and a human-human dialogue corpus is collected. The corpus targets the development of a dialogue system platform to study verbal and nonverbal tutoring strategies in multiparty spoken interactions with robots which are capable of spoken dialogue. The dialogue task is centered on two participants involved in a dialogue aiming to solve a card-ordering game. Along with the participants sits a tutor (robot) that helps the participants perform the task, and organizes and balances their interaction. Different multimodal signals, captured and auto-synchronized by different audio-visual capture technologies such as a microphone array, Kinects, and video cameras, were coupled with manual annotations. These are used to build a situated model of the interaction based on the participants' personalities, their state of attention, their conversational engagement and verbal dominance, and how that is correlated with the verbal and visual feedback, turn-management, and conversation regulatory actions generated by the tutor. Driven by the analysis of the corpus, we will also show the detailed design methodologies for an affective and multimodally rich dialogue system that allows the robot to incrementally measure the attention states and the dominance for each participant, allowing the robot head Furhat to maintain a well-coordinated, balanced, and engaging conversation that attempts to maximize the agreement and the contribution to solve the task. This project sets the first steps to explore the potential of using multimodal dialogue systems to build interactive robots that can serve in educational, team building, and collaborative task solving applications.

Place, publisher, year, edition, pages
IEEE conference proceedings, 2014
Keywords
Furhat robot; Human-robot collaboration; Human-robot interaction; Multiparty interaction; Spoken dialog
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-145511 (URN); 10.1145/2559636.2563681 (DOI); 000455229400029 (); 2-s2.0-84896934381 (Scopus ID)
Conference
9th Annual ACM/IEEE International Conference on Human-Robot Interaction, Bielefeld, Germany
Note

QC 20161018

Available from: 2014-05-21. Created: 2014-05-21. Last updated: 2024-03-15. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0002-0861-8660
