Publications (10 of 16)
Kontogiorgos, D., Avramova, V., Alexanderson, S., Jonell, P., Oertel, C., Beskow, J., . . . Gustafson, J. (2018). A Multimodal Corpus for Mutual Gaze and Joint Attention in Multiparty Situated Interaction. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018): . Paper presented at International Conference on Language Resources and Evaluation (LREC 2018) (pp. 119-127). Paris
A Multimodal Corpus for Mutual Gaze and Joint Attention in Multiparty Situated Interaction
2018 (English) In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, 2018, p. 119-127. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper we present a corpus of multiparty situated interaction where participants collaborated on moving virtual objects on a large touch screen. A moderator facilitated the discussion and directed the interaction. The corpus contains recordings of a variety of multimodal data: speech, eye gaze and gesture were captured using a multisensory setup (wearable eye trackers, motion capture and audio/video). Furthermore, in the description of the multimodal corpus, we investigate four different types of social gaze: referential gaze, joint attention, mutual gaze and gaze aversion, from the perspectives of both speaker and listener. We annotated the groups’ object references during object manipulation tasks and analysed the groups’ proportional referential eye gaze with regard to the referent object. When investigating the distributions of gaze during and before referring expressions, we could corroborate the differences in timing between speakers’ and listeners’ eye gaze found in earlier studies. This corpus is of particular interest to researchers studying social eye-gaze patterns in turn-taking and referring language in situated multiparty interaction.
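To make the proportional referential gaze measure concrete, here is a minimal Python sketch (not part of the corpus tooling; the function and field names are invented) that computes, for each annotated referring expression, the proportion of gaze frames directed at the referent object in a window around the expression:

# Minimal sketch (not the authors' code): proportion of gaze frames on the
# referent object during each referring expression. Assumes per-frame gaze
# target labels and annotated referring expressions; all names hypothetical.

def proportional_referential_gaze(gaze_targets, fps, references, window=1.0):
    """gaze_targets: list of object labels, one per video frame.
    references: list of (start_s, end_s, referent) tuples from the annotation.
    window: seconds before onset to include (gaze often precedes speech).
    Returns a list of proportions in [0, 1], one per referring expression."""
    proportions = []
    for start_s, end_s, referent in references:
        first = max(0, int((start_s - window) * fps))
        last = min(len(gaze_targets), int(end_s * fps))
        frames = gaze_targets[first:last]
        if not frames:
            proportions.append(0.0)
            continue
        on_target = sum(1 for t in frames if t == referent)
        proportions.append(on_target / len(frames))
    return proportions

# Example: 10 fps toy track, one reference to "mug" between 0.5 s and 1.0 s
track = ["table"] * 5 + ["mug"] * 5 + ["table"] * 5
print(proportional_referential_gaze(track, fps=10, references=[(0.5, 1.0, "mug")]))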

Place, publisher, year, edition, pages
Paris, 2018
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-230238 (URN) 979-10-95546-00-9 (ISBN)
Conference
International Conference on Language Resources and Evaluation (LREC 2018)
Note

QC 20180614

Available from: 2018-06-13 Created: 2018-06-13 Last updated: 2018-06-14. Bibliographically approved
Karipidou, K., Ahnlund, J., Friberg, A., Alexanderson, S. & Kjellström, H. (2017). Computer Analysis of Sentiment Interpretation in Musical Conducting. In: Proceedings - 12th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2017: . Paper presented at 12th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2017, Washington, United States, 30 May 2017 through 3 June 2017 (pp. 400-405). IEEE, Article ID 7961769.
Computer Analysis of Sentiment Interpretation in Musical Conducting
2017 (English) In: Proceedings - 12th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2017, IEEE, 2017, p. 400-405, article id 7961769. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents a unique dataset consisting of 20 recordings of the same musical piece, conducted with 4 different musical intentions in mind. The upper body and baton motion of a professional conductor were recorded, as well as the sound of each instrument in a professional string quartet following the conductor. The dataset is made available for benchmarking of motion recognition algorithms. An HMM-based emotion intent classification method is trained with subsets of the data, and classification of other subsets of the data shows, firstly, that the motion of the baton communicates energetic intention to a high degree; secondly, that the conductor’s torso, head and other arm convey calm intention to a high degree; and thirdly, that positive vs. negative sentiments are communicated to a high degree through channels other than the body and baton motion – most probably through facial expression and muscle tension conveyed through articulated hand and finger motion. The long-term goal of this work is to develop a computer model of the entire conductor-orchestra communication process; the studies presented here indicate that computer modeling of the conductor-orchestra communication is feasible.
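As an illustration of the classification scheme described in the abstract, the following hedged Python sketch trains one Gaussian HMM per intention class and labels a new recording with the best-scoring model. It assumes the hmmlearn package and pre-extracted motion-feature sequences; it is not the authors' implementation:

# Sketch only, assuming hmmlearn is installed; feature extraction from the
# baton/upper-body motion capture is outside this snippet.
import numpy as np
from hmmlearn import hmm

def train_intent_models(sequences_by_class, n_states=5):
    """sequences_by_class: dict mapping class label -> list of (T_i, D) arrays."""
    models = {}
    for label, seqs in sequences_by_class.items():
        X = np.vstack(seqs)                      # concatenate training sequences
        lengths = [len(s) for s in seqs]         # hmmlearn needs the segment lengths
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify(models, sequence):
    """Pick the intention whose HMM assigns the highest log-likelihood."""
    return max(models, key=lambda label: models[label].score(sequence))

# Toy example with random 6-D motion features
rng = np.random.default_rng(0)
data = {"calm": [rng.normal(0, 1, (100, 6)) for _ in range(3)],
        "energetic": [rng.normal(3, 1, (100, 6)) for _ in range(3)]}
models = train_intent_models(data)
print(classify(models, rng.normal(3, 1, (80, 6))))   # expected: "energetic"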

Place, publisher, year, edition, pages
IEEE, 2017
National Category
Signal Processing
Identifiers
urn:nbn:se:kth:diva-208886 (URN) 10.1109/FG.2017.57 (DOI) 000414287400054 () 2-s2.0-85026288976 (Scopus ID) 9781509040230 (ISBN)
Conference
12th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2017, Washington, United States, 30 May 2017 through 3 June 2017
Note

QC 20170616

Available from: 2017-06-12 Created: 2017-06-12 Last updated: 2017-11-20. Bibliographically approved
Alexanderson, S., House, D. & Beskow, J. (2016). Automatic annotation of gestural units in spontaneous face-to-face interaction. In: MA3HMI 2016 - Proceedings of the Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction: . Paper presented at 2016 Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction, MA3HMI 2016, 12 November 2016 through 16 November 2016 (pp. 15-19).
Automatic annotation of gestural units in spontaneous face-to-face interaction
2016 (English) In: MA3HMI 2016 - Proceedings of the Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction, 2016, p. 15-19. Conference paper, Published paper (Refereed)
Abstract [en]

Speech and gesture co-occur in spontaneous dialogue in a highly complex fashion. There is a large variability in the motion that people exhibit during a dialogue, and different kinds of motion occur during different states of the interaction. A wide range of multimodal interface applications, for example in the fields of virtual agents or social robots, can be envisioned where it is important to be able to automatically identify gestures that carry information and discriminate them from other types of motion. While it is easy for a human to distinguish and segment manual gestures from a flow of multimodal information, the same task is not trivial to perform for a machine. In this paper we present a method to automatically segment and label gestural units from a stream of 3D motion capture data. The gestural flow is modeled with a 2-level Hierarchical Hidden Markov Model (HHMM) where the sub-states correspond to gesture phases. The model is trained based on labels of complete gesture units and self-adaptive manipulators. The model is tested and validated on two datasets differing in genre and in method of capturing motion, and outperforms a state-of-the-art SVM classifier on a publicly available dataset.
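The following Python sketch illustrates the general idea of decoding gesture phases from motion data with a flattened state space and Viterbi decoding; it is a toy approximation of a 2-level hierarchical model, with all parameters invented, not the trained model from the paper:

# Illustrative sketch (not the authors' model): the 2-level gesture HHMM can be
# flattened into a single state space whose sub-states are gesture phases.
# Here we decode frame-wise phases from a hand-speed track with Viterbi decoding;
# consecutive non-rest phases can then be read as one gesture unit.
import numpy as np

STATES = ["rest", "prep", "stroke", "retract"]
A = np.array([                      # phase-level transition probabilities (invented)
    [0.95, 0.05, 0.00, 0.00],       # rest -> prep starts a gesture unit
    [0.00, 0.80, 0.20, 0.00],
    [0.00, 0.00, 0.85, 0.15],
    [0.10, 0.05, 0.00, 0.85],       # retract -> rest ends the unit
])
log_A = np.log(A + 1e-12)
means = np.array([0.1, 1.0, 2.5, 1.0])   # expected hand speed per phase
stds = np.array([0.2, 0.5, 0.8, 0.5])

def log_emission(speed):
    # Gaussian log-density of the observed hand speed under each phase (up to a constant)
    return -0.5 * ((speed - means) / stds) ** 2 - np.log(stds)

def viterbi(speeds):
    T, N = len(speeds), len(STATES)
    delta = np.full((T, N), -np.inf)
    psi = np.zeros((T, N), dtype=int)
    delta[0] = np.log(np.array([0.9, 0.1, 0.0, 0.0]) + 1e-12) + log_emission(speeds[0])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A      # scores[i, j]: from phase i to phase j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emission(speeds[t])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return [STATES[s] for s in reversed(path)]

print(viterbi([0.1, 0.1, 0.9, 2.3, 2.6, 1.2, 0.2, 0.1]))   # e.g. rest, rest, prep, stroke, ...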

Keywords
Gesture recognition, Motion capture, Spontaneous dialogue, Hidden Markov models, Man machine systems, Markov processes, Online systems, 3D motion capture, Automatic annotation, Face-to-face interaction, Hierarchical hidden markov models, Multi-modal information, Multi-modal interfaces, Classification (of information)
National Category
Robotics
Identifiers
urn:nbn:se:kth:diva-202135 (URN) 10.1145/3011263.3011268 (DOI) 2-s2.0-85003571594 (Scopus ID) 9781450345620 (ISBN)
Conference
2016 Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction, MA3HMI 2016, 12 November 2016 through 16 November 2016
Funder
Swedish Research Council, 2010-4646
Note

Funding text: The work reported here is carried out within the projects: "Timing of intonation and gestures in spoken communication," (P12-0634:1) funded by the Bank of Sweden Tercentenary Foundation, and "Large-scale massively multimodal modelling of non-verbal behaviour in spontaneous dialogue," (VR 2010-4646) funded by Swedish Research Council.

Available from: 2017-03-13 Created: 2017-03-13 Last updated: 2017-11-24. Bibliographically approved
Zellers, M., House, D. & Alexanderson, S. (2016). Prosody and hand gesture at turn boundaries in Swedish. In: Proceedings of the International Conference on Speech Prosody: . Paper presented at 8th Speech Prosody 2016, 31 May 2016 through 3 June 2016 (pp. 831-835). International Speech Communications Association
Prosody and hand gesture at turn boundaries in Swedish
2016 (English) In: Proceedings of the International Conference on Speech Prosody, International Speech Communications Association, 2016, p. 831-835. Conference paper, Published paper (Refereed)
Abstract [en]

In order to ensure smooth turn-taking between conversational participants, interlocutors must have ways of providing information to one another about whether they have finished speaking or intend to continue. The current work investigates Swedish speakers’ use of hand gestures in conjunction with turn change or turn hold in unrestricted, spontaneous speech. As has been reported by other researchers, we find that speakers’ gestures end before the end of speech in cases of turn change, while they may extend well beyond the end of a given speech chunk in the case of turn hold. We investigate the degree to which prosodic cues and gesture cues to turn transition in Swedish face-to-face conversation are complementary or function additively. The co-occurrence of acoustic prosodic features and gesture at potential turn boundaries gives strong support for considering hand gestures as part of the prosodic system, particularly in the context of discourse-level information such as maintaining smooth turn transition.
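For illustration only, a small Python sketch of one way to quantify the reported asymmetry, comparing the lag between gesture offset and end of speech for turn changes versus turn holds (toy data, not from the study):

# Sketch (toy data): negative lag = gesture ends before the speech chunk ends,
# as reported for turn changes; positive lag = gesture extends past it (turn hold).
import statistics

def gesture_speech_lags(events):
    """events: list of (turn_type, gesture_offset_s, speech_end_s)."""
    lags = {"change": [], "hold": []}
    for turn_type, gesture_offset, speech_end in events:
        lags[turn_type].append(gesture_offset - speech_end)
    return {k: statistics.mean(v) for k, v in lags.items() if v}

toy = [("change", 11.8, 12.3), ("change", 20.1, 20.6),
       ("hold", 33.0, 32.1), ("hold", 41.5, 40.9)]
print(gesture_speech_lags(toy))   # e.g. {'change': -0.5, 'hold': 0.75}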

Place, publisher, year, edition, pages
International Speech Communications Association, 2016
Keywords
Gesture, Multimodal communication, Swedish, Turn transition, Co-occurrence, Face-to-face conversation, Multimodal communications, Prosodic features, Smooth turn-taking, Spontaneous speech, Speech
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-195492 (URN) 2-s2.0-84982980451 (Scopus ID)
Conference
8th Speech Prosody 2016, 31 May 2016 through 3 June 2016
Note

QC 20161125

Available from: 2016-11-25 Created: 2016-11-03 Last updated: 2018-01-13. Bibliographically approved
House, D., Alexanderson, S. & Beskow, J. (2015). On the temporal domain of co-speech gestures: syllable, phrase or talk spurt?. In: Lundmark Svensson, M.; Ambrazaitis, G.; van de Weijer, J. (Ed.), Proceedings of Fonetik 2015: . Paper presented at Fonetik 2015, Lund (pp. 63-68).
On the temporal domain of co-speech gestures: syllable, phrase or talk spurt?
2015 (English) In: Proceedings of Fonetik 2015 / [ed] Lundmark Svensson, M.; Ambrazaitis, G.; van de Weijer, J., 2015, p. 63-68. Conference paper, Published paper (Other academic)
Abstract [en]

This study explores the use of automatic methods to detect and extract hand gesture movement co-occurring with speech. Two spontaneous dyadic dialogues were analyzed using 3D motion-capture techniques to track hand movement. Automatic speech/non-speech detection was performed on the dialogues, resulting in a series of connected talk spurts for each speaker. Temporal synchrony of onset and offset of gesture and speech was studied between the automatic hand gesture tracking and talk spurts, and compared to an earlier study of head nods and syllable synchronization. The results indicated onset synchronization between head nods and the syllable in the short temporal domain and between the onset of longer gesture units and the talk spurt in a more extended temporal domain.
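The talk-spurt extraction step can be sketched as follows in Python (an assumed pipeline, not the authors' code): frame-wise speech/non-speech decisions are merged into connected talk spurts by bridging pauses shorter than a threshold:

# Minimal sketch: merge frame-wise VAD decisions into talk spurts.
def talk_spurts(vad, frame_s=0.01, min_pause_s=0.3):
    """vad: iterable of booleans (True = speech) per frame. Returns [(start_s, end_s)]."""
    spurts, start, silence = [], None, 0
    for i, is_speech in enumerate(vad):
        if is_speech:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence * frame_s >= min_pause_s:
                spurts.append((start * frame_s, (i - silence + 1) * frame_s))
                start, silence = None, 0
    if start is not None:
        spurts.append((start * frame_s, (len(vad) - silence) * frame_s))
    return spurts

vad = [False] * 20 + [True] * 50 + [False] * 10 + [True] * 40 + [False] * 60 + [True] * 30
print(talk_spurts(vad))   # two spurts: the 10-frame (0.1 s) pause is bridged

Gesture onsets from the motion-capture tracking can then be compared against the start times of these spurts to measure onset asynchrony.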

National Category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-180407 (URN)
Conference
Fonetik 2015, Lund
Note

QC 20160216

Available from: 2016-01-13 Created: 2016-01-13 Last updated: 2018-01-10. Bibliographically approved
Alexanderson, S. & Beskow, J. (2015). Towards Fully Automated Motion Capture of Signs -- Development and Evaluation of a Key Word Signing Avatar. ACM Transactions on Accessible Computing, 7(2), 7:1-7:17
Towards Fully Automated Motion Capture of Signs -- Development and Evaluation of a Key Word Signing Avatar
2015 (English) In: ACM Transactions on Accessible Computing, ISSN 1936-7228, Vol. 7, no 2, p. 7:1-7:17. Article in journal (Refereed), Published
Abstract [en]

Motion capture of signs provides unique challenges in the field of multimodal data collection. The dense packaging of visual information requires high fidelity and high bandwidth of the captured data. Even though marker-based optical motion capture provides many desirable features such as high accuracy, global fitting, and the ability to record body and face simultaneously, it is not widely used to record finger motion, especially not for articulated and syntactic motion such as signs. Instead, most signing avatar projects use costly instrumented gloves, which require long calibration procedures. In this article, we evaluate the data quality obtained from optical motion capture of isolated signs from Swedish sign language with a large number of low-cost cameras. We also present a novel dual-sensor approach to combine the data with low-cost, five-sensor instrumented gloves to provide a recording method with low manual postprocessing. Finally, we evaluate the collected data and the dual-sensor approach as transferred to a highly stylized avatar. The application of the avatar is a game-based environment for training Key Word Signing (KWS) as augmented and alternative communication (AAC), intended for children with communication disabilities.
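To illustrate the dual-sensor idea at a high level, here is a hypothetical Python sketch that merges the global wrist pose from optical motion capture with finger articulation from an instrumented glove; the data layout and function names are invented, and the real pipeline is considerably more involved:

# Hypothetical illustration of the dual-sensor idea: global hand placement from
# optical mocap, finger articulation from a low-cost glove, merged per frame.
def merge_hand_pose(mocap_frame, glove_frame):
    """mocap_frame: {'wrist_pos': (x, y, z), 'wrist_rot': (qw, qx, qy, qz)}
    glove_frame: {'finger_angles': {finger_name: [j1, j2, j3] in degrees}}"""
    return {
        "wrist_pos": mocap_frame["wrist_pos"],          # global placement from mocap
        "wrist_rot": mocap_frame["wrist_rot"],
        "finger_angles": glove_frame["finger_angles"],  # articulation from the glove
    }

def merge_streams(mocap_stream, glove_stream):
    """Nearest-frame alignment of two streams of (timestamp_s, frame) tuples."""
    merged = []
    for t, mframe in mocap_stream:
        _, gframe = min(glove_stream, key=lambda pair: abs(pair[0] - t))
        merged.append((t, merge_hand_pose(mframe, gframe)))
    return merged

mocap = [(0.00, {"wrist_pos": (0, 1, 2), "wrist_rot": (1, 0, 0, 0)})]
glove = [(0.01, {"finger_angles": {"index": [10, 20, 15]}})]
print(merge_streams(mocap, glove))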

Place, publisher, year, edition, pages
New York, NY, USA: Association for Computing Machinery (ACM), 2015
Keywords
Augmentative and alternative communication (AAC), Motion capture, Sign language, Virtual characters
National Category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-180427 (URN) 10.1145/2764918 (DOI) 000360070800004 () 2-s2.0-84935145760 (Scopus ID)
Note

QC 20160113

Available from: 2016-01-13 Created: 2016-01-13 Last updated: 2018-01-10. Bibliographically approved
Alexanderson, S. & Beskow, J. (2014). Animated Lombard speech: Motion capture, facial animation and visual intelligibility of speech produced in adverse conditions. Computer speech & language (Print), 28(2), 607-618
Animated Lombard speech: Motion capture, facial animation and visual intelligibility of speech produced in adverse conditions
2014 (English) In: Computer speech & language (Print), ISSN 0885-2308, E-ISSN 1095-8363, Vol. 28, no 2, p. 607-618. Article in journal (Refereed), Published
Abstract [en]

In this paper we study the production and perception of speech in diverse conditions for the purposes of accurate, flexible and highly intelligible talking face animation. We recorded audio, video and facial motion capture data of a talker uttering a set of 180 short sentences, under three conditions: normal speech (in quiet), Lombard speech (in noise), and whispering. We then produced an animated 3D avatar with similar shape and appearance to the original talker and used an error minimization procedure to drive the animated version of the talker in a way that matched the original performance as closely as possible. In a perceptual intelligibility study with degraded audio we then compared the animated talker against the real talker and the audio alone, in terms of audio-visual word recognition rate across the three different production conditions. We found that the visual intelligibility of the animated talker was on par with the real talker for the Lombard and whisper conditions. In addition we created two incongruent conditions where normal speech audio was paired with animated Lombard speech or whispering. When compared to the congruent normal speech condition, Lombard animation yields a significant increase in intelligibility, despite the AV-incongruence. In a separate evaluation, we gathered subjective opinions on the different animations, and found that some degree of incongruence was generally accepted.
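One common way to realise such an error-minimisation retargeting step, shown here only as a hedged Python sketch (the paper's actual procedure is not detailed in the abstract), is to solve a per-frame least-squares problem for animation parameters that best reproduce the captured marker offsets:

# Sketch: per frame, find blendshape weights w minimising ||B w - m||^2,
# clipped to [0, 1]. Names and dimensions are invented for illustration.
import numpy as np

def fit_blendshape_weights(basis, marker_offsets):
    """basis: (3*n_markers, n_blendshapes) matrix of per-blendshape marker offsets.
    marker_offsets: (3*n_markers,) captured offsets from the neutral face."""
    w, *_ = np.linalg.lstsq(basis, marker_offsets, rcond=None)
    return np.clip(w, 0.0, 1.0)

# Toy example: 2 markers (6 coordinates), 3 blendshapes
rng = np.random.default_rng(1)
B = rng.normal(size=(6, 3))
true_w = np.array([0.2, 0.7, 0.0])
m = B @ true_w
print(np.round(fit_blendshape_weights(B, m), 3))   # close to [0.2, 0.7, 0.0]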

Keywords
Lombard effect, Motion capture, Speech-reading, Lip-reading, Facial animation, Audio-visual intelligibility
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-141052 (URN) 10.1016/j.csl.2013.02.005 (DOI) 000329415400017 () 2-s2.0-84890567121 (Scopus ID)
Funder
Swedish Research Council, VR 2010-4646
Note

QC 20140212

Available from: 2014-02-12 Created: 2014-02-07 Last updated: 2018-01-11. Bibliographically approved
Beskow, J., Alexanderson, S., Stefanov, K., Claesson, B., Derbring, S., Fredriksson, M., . . . Axelsson, E. (2014). Tivoli - Learning Signs Through Games and Interaction for Children with Communicative Disorders. Paper presented at 6th Biennial Conference of the International Society for Augmentative and Alternative Communication, Lisbon, Portugal.
Tivoli - Learning Signs Through Games and Interaction for Children with Communicative Disorders
2014 (English). Conference paper, Published paper (Refereed)
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-165816 (URN)
Conference
6th Biennial Conference of the International Society for Augmentative and Alternative Communication, Lisbon, Portugal
Note

QC 20161018

Available from: 2015-04-29 Created: 2015-04-29 Last updated: 2018-01-11. Bibliographically approved
Alexanderson, S., House, D. & Beskow, J. (2013). Aspects of co-occurring syllables and head nods in spontaneous dialogue. In: Proceedings of 12th International Conference on Auditory-Visual Speech Processing (AVSP2013): . Paper presented at 12th International Conference on Auditory-Visual Speech Processing (AVSP2013), Annecy, France, from August 29th to September 1st, 2013 (pp. 169-172).
Aspects of co-occurring syllables and head nods in spontaneous dialogue
2013 (English) In: Proceedings of 12th International Conference on Auditory-Visual Speech Processing (AVSP2013), 2013, p. 169-172. Conference paper, Published paper (Refereed)
Abstract [en]

This paper reports on the extraction and analysis of head nods taken from motion capture data of spontaneous dialogue in Swedish. The head nods were extracted automatically and then manually classified in terms of gestures having a beat function or multifunctional gestures. Prosodic features were extracted from syllables co-occurring with the beat gestures. While the peak rotation of the nod is on average aligned with the stressed syllable, the results show considerable variation in fine temporal synchronization. The syllables co-occurring with the gestures generally show greater intensity, higher F0, and greater F0 range when compared to the mean across the entire dialogue. A functional analysis shows that the majority of the syllables belong to words bearing a focal accent.
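As a toy illustration of the comparison described above (data structures invented, not the study's scripts), the following Python sketch contrasts syllables co-occurring with nod peaks against the dialogue-wide mean for F0 and intensity:

# Sketch only: syllables whose interval contains a nod peak are compared
# against the mean over all syllables for F0 and intensity.
import statistics

def cooccurring_stats(syllables, nod_peaks):
    """syllables: list of dicts {'start', 'end', 'f0', 'intensity'} (s, Hz, dB).
    nod_peaks: list of peak-rotation times in seconds."""
    with_nod = [s for s in syllables
                if any(s["start"] <= t <= s["end"] for t in nod_peaks)]
    def mean(key, items):
        return statistics.mean(x[key] for x in items)
    return {
        "f0_nod": mean("f0", with_nod), "f0_all": mean("f0", syllables),
        "int_nod": mean("intensity", with_nod), "int_all": mean("intensity", syllables),
    }

syllables = [{"start": 0.0, "end": 0.3, "f0": 110, "intensity": 62},
             {"start": 0.3, "end": 0.6, "f0": 140, "intensity": 70},
             {"start": 0.6, "end": 0.9, "f0": 105, "intensity": 60}]
print(cooccurring_stats(syllables, nod_peaks=[0.45]))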

National Category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-137402 (URN)
Conference
12th International Conference on Auditory-Visual Speech Processing (AVSP2013), Annecy, France, from August 29th to September 1st, 2013
Note

QC 20140604

Available from: 2013-12-13 Created: 2013-12-13 Last updated: 2018-01-11. Bibliographically approved
Alexanderson, S., House, D. & Beskow, J. (2013). Extracting and analysing co-speech head gestures from motion-capture data. In: Eklund, Robert (Ed.), Proceedings of Fonetik 2013: . Paper presented at Fonetik 2013, The XXVIth Annual Phonetics Meeting 12–13 June 2013, Linköping University, Linköping, Sweden (pp. 1-4). Linköping University Electronic Press
Extracting and analysing co-speech head gestures from motion-capture data
2013 (English) In: Proceedings of Fonetik 2013 / [ed] Eklund, Robert, Linköping University Electronic Press, 2013, p. 1-4. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
Linköping University Electronic Press, 2013
Series
Studies in language and culture, ISSN 1403-2570 ; 21
National Category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-137401 (URN) 9789175195797 (ISBN)
Conference
Fonetik 2013, The XXVIth Annual Phonetics Meeting 12–13 June 2013, Linköping University, Linköping, Sweden
Note

QC 20140604

Available from: 2013-12-13 Created: 2013-12-13 Last updated: 2018-01-11. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0002-7801-7617
