Vision-based Active Speaker Detection in Multiparty Interaction
Stefanov, Kalin; Beskow, Jonas; Salvi, Giampiero
KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
ORCID iD: 0000-0002-3323-5311
2017 (English). In: Grounding Language Understanding, 2017. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
2017.
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:kth:diva-211651
DOI: 10.21437/GLU.2017-10
OAI: oai:DiVA.org:kth-211651
DiVA, id: diva2:1130226
Conference
Grounding Language Understanding (GLU2017), August 25, 2017, KTH Royal Institute of Technology, Stockholm, Sweden
Note

QC 20170809

Available from: 2017-08-08. Created: 2017-08-08. Last updated: 2025-02-07. Bibliographically approved.
In thesis
1. Recognition and Generation of Communicative Signals: Modeling of Hand Gestures, Speech Activity and Eye-Gaze in Human-Machine Interaction
2018 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Nonverbal communication is essential for natural and effective face-to-face human-human interaction. It is the process of communicating through sending and receiving wordless (mostly visual, but also auditory) signals between people. Consequently, natural and effective face-to-face human-machine interaction requires machines (e.g., robots) to understand and produce such human-like signals. There are many types of nonverbal signals used in this form of communication, including body postures, hand gestures, facial expressions, eye movements, touch, and use of space. This thesis investigates two of these nonverbal signals: hand gestures and eye-gaze. The main goal of the thesis is to propose computational methods for real-time recognition and generation of these two signals in order to facilitate natural and effective human-machine interaction.

The first topic addressed in the thesis is the real-time recognition of hand gestures and its application to the recognition of isolated sign language signs. Hand gestures can also provide important cues during human-robot interaction; for example, emblems are a type of hand gesture with a specific meaning, used as a substitute for spoken words. The thesis has two main contributions with respect to the recognition of hand gestures: 1) a newly collected dataset of isolated Swedish Sign Language signs, and 2) a real-time hand gesture recognition method.
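
A hedged illustration, since the record itself contains no code: a recognizer is "real-time" when it can emit a running hypothesis after every incoming video frame, which a recurrent classifier does naturally by carrying a hidden state across frames. Everything below is an invented placeholder, not the thesis method: the OnlineGestureClassifier name, the 63 input features (e.g., 21 hand keypoints times x/y/z), and the 51 output classes.

    import torch
    import torch.nn as nn

    class OnlineGestureClassifier(nn.Module):
        """Toy online classifier: one GRU step per incoming video frame."""
        def __init__(self, n_features=63, n_classes=51, hidden=128):
            super().__init__()
            self.gru = nn.GRU(n_features, hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_classes)

        def forward(self, frame_feats, state=None):
            # frame_feats: (batch=1, time=1, n_features), one frame of keypoints
            out, state = self.gru(frame_feats, state)
            return self.head(out[:, -1]), state  # logits now, state for next frame

    model = OnlineGestureClassifier()
    state = None
    for _ in range(10):                         # stand-in for a live camera loop
        feats = torch.randn(1, 1, 63)           # fake per-frame keypoint features
        logits, state = model(feats, state)
        sign_id = logits.argmax(dim=-1).item()  # running hypothesis for the sign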

The second topic addressed in the thesis is the general problem of real-time speech activity detection in noisy and dynamic environments and its application to socially-aware language acquisition. Speech activity can also provide important information during human-robot interaction; for example, the hand gestures and eye-gaze direction or head orientation of the current active speaker can play an important role in understanding the state of the interaction. The thesis has one main contribution with respect to speech activity detection: a real-time vision-based speech activity detection method.
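
Again purely as a hedged sketch, not the vision-based method contributed in the thesis: the idea of deciding "speaking or not speaking" from video alone can be shown with a tiny 3D-convolutional classifier over a short clip of mouth-region crops, plus moving-average smoothing of the noisy per-frame scores. The MouthClipNet name, clip length, crop size, and window length are arbitrary placeholders.

    import numpy as np
    import torch
    import torch.nn as nn

    class MouthClipNet(nn.Module):
        """Toy clip classifier: is the person in this mouth-region clip speaking?"""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv3d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1), nn.Flatten(),
                nn.Linear(8, 1),               # one logit: speaking vs. not
            )

        def forward(self, clip):               # clip: (batch, 1, frames, H, W)
            return self.net(clip)

    def smooth(scores, win=15):
        """Moving-average smoothing of noisy per-frame speaking scores."""
        return np.convolve(scores, np.ones(win) / win, mode="same")

    model = MouthClipNet()
    clip = torch.randn(1, 1, 5, 32, 32)             # five 32x32 grayscale mouth crops
    p_speaking = torch.sigmoid(model(clip)).item()  # probability in [0, 1]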

The third topic addressed in the thesis is the real-time generation of eye-gaze direction or head orientation and its application to human-robot interaction. Eye-gaze direction or head orientation can provide important cues during human-robot interaction; for example, it can regulate who is allowed to speak when, and coordinate changes in the roles on the conversational floor (e.g., speaker, addressee, and bystander). The thesis has two main contributions with respect to the generation of eye-gaze direction or head orientation: 1) a newly collected dataset of face-to-face interactions, and 2) a real-time eye-gaze direction or head orientation generation method.
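
One last hedged sketch. The thesis contributes a data-driven generation method learned from recorded interactions; the hand-written rule below is only meant to show the shape of the problem (per-frame active-speaker labels in, a smoothly changing head orientation out). The look-at vectors and the smoothing factor are invented.

    import numpy as np

    # Made-up unit "look-at" directions toward two interlocutors, A and B.
    LOOK_AT = {
        "A": np.array([-0.5, 0.0, 0.87]),
        "B": np.array([0.5, 0.0, 0.87]),
    }

    def update_head(current, active_speaker, alpha=0.1):
        """Blend the head direction a fraction alpha toward the active speaker."""
        new = (1 - alpha) * current + alpha * LOOK_AT[active_speaker]
        return new / np.linalg.norm(new)       # renormalize to a unit vector

    head = np.array([0.0, 0.0, 1.0])           # start looking straight ahead
    for spk in ["A", "A", "B", "B", "B"]:      # per-frame active-speaker labels
        head = update_head(head, spk)          # smooth turn toward the speaker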

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2018. p. 54
Series
TRITA-EECS-AVL ; 2018:46
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-227986 (URN)
978-91-7729-810-6 (ISBN)
Public defence
2018-06-07, Hörsal K2, Teknikringen 28, Stockholm, 14:00 (English)
Note

QC 20180516

Available from: 2018-05-16. Created: 2018-05-16. Last updated: 2022-06-26. Bibliographically approved.

Open Access in DiVA

fulltext: FULLTEXT01.pdf (758 kB, application/pdf)
