Publications (10 of 81)
Klasson, M., Zhang, C. & Kjellström, H. (2019). A hierarchical grocery store image dataset with visual and semantic labels. In: Proceedings - 2019 IEEE Winter Conference on Applications of Computer Vision, WACV 2019: . Paper presented at 19th IEEE Winter Conference on Applications of Computer Vision, WACV 2019, 7 January 2019 through 11 January 2019 (pp. 491-500). Institute of Electrical and Electronics Engineers (IEEE), Article ID 8658240.
A hierarchical grocery store image dataset with visual and semantic labels
2019 (English). In: Proceedings - 2019 IEEE Winter Conference on Applications of Computer Vision, WACV 2019, Institute of Electrical and Electronics Engineers (IEEE), 2019, p. 491-500, article id 8658240. Conference paper, Published paper (Refereed)
Abstract [en]

Image classification models built into visual support systems and other assistive devices need to provide accurate predictions about their environment. We focus on an application of assistive technology for people with visual impairments, for daily activities such as shopping or cooking. In this paper, we provide a new benchmark dataset for a challenging task in this application – classification of fruits, vegetables, and refrigerated products, e.g. milk packages and juice cartons, in grocery stores. To enable the learning process to utilize multiple sources of structured information, this dataset not only contains a large volume of natural images but also includes the corresponding information of the product from an online shopping website. Such information encompasses the hierarchical structure of the object classes, as well as an iconic image of each type of object. This dataset can be used to train and evaluate image classification models for helping visually impaired people in natural environments. Additionally, we provide benchmark results evaluated on pretrained convolutional neural networks often used for image understanding purposes, and also a multi-view variational autoencoder, which is capable of utilizing the rich product information in the dataset.
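The benchmark mentioned above relies on convolutional networks pretrained on generic image data and adapted to the grocery classes. As a rough illustration of that kind of baseline (not the authors' code; the ResNet-50 backbone, directory layout, class count, and hyperparameters are all assumptions), a fine-tuning sketch in PyTorch might look as follows:

```python
# Illustrative sketch: fine-tuning an ImageNet-pretrained CNN on a grocery image dataset.
# Assumes images are arranged in one folder per class under data/train (an assumption,
# not the dataset's actual layout).
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

train_set = datasets.ImageFolder("data/train", transform=preprocess)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))  # new classification head

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```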

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2019
Keywords
Benchmarking, Computer vision, Electronic commerce, Image classification, Large dataset, Learning systems, Neural networks, Semantics, Accurate prediction, Assistive technology, Classification models, Convolutional neural network, Hierarchical structures, Natural environments, Structured information, Visually impaired people, Classification (of information)
National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
urn:nbn:se:kth:diva-252223 (URN), 10.1109/WACV.2019.00058 (DOI), 000469423400051 (ISI), 2-s2.0-85063566822 (Scopus ID), 9781728119755 (ISBN)
Conference
19th IEEE Winter Conference on Applications of Computer Vision, WACV 2019, 7 January 2019 through 11 January 2019
Note

QC 20190611

Available from: 2019-06-11. Created: 2019-06-11. Last updated: 2019-06-26. Bibliographically approved.
Zhang, C., Butepage, J., Kjellström, H. & Mandt, S. (2019). Advances in Variational Inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8), 2008-2026
Advances in Variational Inference
2019 (English). In: IEEE Transactions on Pattern Analysis and Machine Intelligence, ISSN 0162-8828, E-ISSN 1939-3539, Vol. 41, no 8, p. 2008-2026. Article in journal (Refereed), Published
Abstract [en]

Many modern unsupervised or semi-supervised machine learning algorithms rely on Bayesian probabilistic models. These models are usually intractable and thus require approximate inference. Variational inference (VI) lets us approximate a high-dimensional Bayesian posterior with a simpler variational distribution by solving an optimization problem. This approach has been successfully applied to various models and large-scale applications. In this review, we give an overview of recent trends in variational inference. We first introduce standard mean field variational inference, then review recent advances focusing on the following aspects: (a) scalable VI, which includes stochastic approximations, (b) generic VI, which extends the applicability of VI to a large class of otherwise intractable models, such as non-conjugate models, (c) accurate VI, which includes variational models beyond the mean field approximation or with atypical divergences, and (d) amortized VI, which implements the inference over local latent variables with inference networks. Finally, we provide a summary of promising future research directions.
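For orientation, all of these variants optimize the evidence lower bound (ELBO), and the reparameterization gradients named in the keywords are the standard way of differentiating through the sampling step. The sketch below shows a one-sample reparameterized ELBO estimate for a factorized Gaussian variational posterior; the toy model, dimensions, and learning rate are illustrative assumptions, not taken from the review.

```python
# Sketch: stochastic optimization of a one-sample reparameterized ELBO estimate
# for a factorized Gaussian q(z). The joint model below is a toy placeholder.
import torch

def log_joint(z, x):
    # Toy model: standard normal prior on z, unit-variance Gaussian likelihood for x.
    log_prior = -0.5 * (z ** 2).sum()
    log_lik = -0.5 * ((x - z) ** 2).sum()
    return log_prior + log_lik

x = torch.randn(10)                               # toy observed data
mu = torch.zeros(10, requires_grad=True)          # variational mean
log_sigma = torch.zeros(10, requires_grad=True)   # variational log standard deviation

optimizer = torch.optim.Adam([mu, log_sigma], lr=0.05)
for step in range(200):
    eps = torch.randn_like(mu)
    z = mu + log_sigma.exp() * eps                # reparameterization trick
    entropy = log_sigma.sum()                     # Gaussian entropy up to an additive constant
    elbo = log_joint(z, x) + entropy              # ELBO = E_q[log p(x, z)] + H(q)
    (-elbo).backward()
    optimizer.step()
    optimizer.zero_grad()
```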

Place, publisher, year, edition, pages
IEEE Computer Society, 2019
Keywords
Variational inference, approximate Bayesian inference, reparameterization gradients, structured variational approximations, scalable inference, inference networks
National Category
Computational Mathematics
Identifiers
urn:nbn:se:kth:diva-255405 (URN), 10.1109/TPAMI.2018.2889774 (DOI), 000473598800016 (ISI), 30596568 (PubMedID), 2-s2.0-85059288228 (Scopus ID)
Note

QC 20190814

Available from: 2019-08-14. Created: 2019-08-14. Last updated: 2019-08-14. Bibliographically approved.
Kucherenko, T., Hasegawa, D., Henter, G. E., Kaneko, N. & Kjellström, H. (2019). Analyzing Input and Output Representations for Speech-Driven Gesture Generation. In: 19th ACM International Conference on Intelligent Virtual Agents: . Paper presented at 19th ACM International Conference on Intelligent Virtual Agents (IVA '19), July 2-5, 2019, Paris, France. New York, NY, USA: ACM Publications
Analyzing Input and Output Representations for Speech-Driven Gesture Generation
2019 (English). In: 19th ACM International Conference on Intelligent Virtual Agents, New York, NY, USA: ACM Publications, 2019. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents a novel framework for automatic speech-driven gesture generation, applicable to human-agent interaction including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates.

Our approach consists of two steps. First, we learn a lower-dimensional representation of human motion using a denoising autoencoder neural network, consisting of a motion encoder MotionE and a motion decoder MotionD. The learned representation preserves the most important aspects of the human pose variation while removing less relevant variation. Second, we train a novel encoder network SpeechE to map from speech to a corresponding motion representation with reduced dimensionality. At test time, the speech encoder and the motion decoder networks are combined: SpeechE predicts motion representations based on a given speech signal and MotionD then decodes these representations to produce motion sequences.

We evaluate different representation sizes in order to find the most effective dimensionality for the representation. We also evaluate the effects of using different speech features as input to the model. We find that mel-frequency cepstral coefficients (MFCCs), alone or combined with prosodic features, perform the best. The results of a subsequent user study confirm the benefits of the representation learning.
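To make the two-step structure concrete, the sketch below wires a speech encoder and a motion decoder together the way the test-time pipeline is described above. It is a schematic illustration only: the class names follow the paper's terminology (SpeechE, MotionD), but the layer structure, feature dimensions, and the fact that the networks here are untrained placeholders are all assumptions.

```python
# Schematic test-time pipeline: speech features -> learned motion representation -> 3D poses.
# Dimensions (26 speech features, 32-d representation, 45 pose coordinates) are illustrative.
import torch
import torch.nn as nn

class SpeechE(nn.Module):
    """Maps per-frame speech features to the learned motion representation."""
    def __init__(self, speech_dim=26, repr_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(speech_dim, 128), nn.ReLU(),
                                 nn.Linear(128, repr_dim))

    def forward(self, speech):
        return self.net(speech)

class MotionD(nn.Module):
    """Decoder half of the motion autoencoder: representation -> 3D joint coordinates."""
    def __init__(self, repr_dim=32, pose_dim=45):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(repr_dim, 128), nn.ReLU(),
                                 nn.Linear(128, pose_dim))

    def forward(self, representation):
        return self.net(representation)

speech_encoder = SpeechE()   # in practice: trained to predict motion representations from speech
motion_decoder = MotionD()   # in practice: taken from the pretrained denoising autoencoder

speech_frames = torch.randn(100, 26)        # e.g. 100 frames of MFCC-based features
with torch.no_grad():
    motion_repr = speech_encoder(speech_frames)
    poses = motion_decoder(motion_repr)     # sequence of 3D joint coordinates
```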

Place, publisher, year, edition, pages
New York, NY, USA: ACM Publications, 2019
Keywords
Gesture generation, social robotics, representation learning, neural network, deep learning, gesture synthesis, virtual agents
National Category
Human Computer Interaction
Research subject
Human-computer Interaction
Identifiers
urn:nbn:se:kth:diva-255035 (URN), 10.1145/3308532.3329472 (DOI), 2-s2.0-85069654899 (Scopus ID), 978-1-4503-6672-4 (ISBN)
Conference
19th ACM International Conference on Intelligent Virtual Agents (IVA '19), July 2-5, 2019, Paris, France
Projects
EACare
Funder
Swedish Foundation for Strategic Research, RIT15-0107
Note

QC 20190902

Available from: 2019-07-16. Created: 2019-07-16. Last updated: 2019-09-02. Bibliographically approved.
Eriksson, S., Unander-Scharin, Å., Trichon, V., Unander-Scharin, C., Kjellström, H. & Höök, K. (2019). Dancing with Drones: Crafting Novel Artistic Expressions through Intercorporeality. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems: . Paper presented at ACM SIGCHI (pp. 617:1-617:12). New York, NY, USA
Dancing with Drones: Crafting Novel Artistic Expressions through Intercorporeality
2019 (English). In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, New York, NY, USA, 2019, p. 617:1-617:12. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
New York, NY, USA, 2019
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-257746 (URN), 10.1145/3290605.3300847 (DOI), 000474467907074 (ISI), 2-s2.0-85067597620 (Scopus ID)
Conference
ACM SIGCHI
Projects
KAW 2015.0080, Engineering the Interconnected Society: Information, Control, Interaction
Note

QC 20190916

Available from: 2019-09-03. Created: 2019-09-03. Last updated: 2019-09-26. Bibliographically approved.
Stefanov, K., Salvi, G., Kontogiorgos, D., Kjellström, H. & Beskow, J. (2019). Modeling of Human Visual Attention in Multiparty Open-World Dialogues. ACM Transactions on Human-Robot Interaction, 8(2), Article ID 8.
Modeling of Human Visual Attention in Multiparty Open-World Dialogues
2019 (English). In: ACM Transactions on Human-Robot Interaction, ISSN 2573-9522, Vol. 8, no 2, article id 8. Article in journal (Refereed), Published
Abstract [en]

This study proposes, develops, and evaluates methods for modeling the eye-gaze direction and head orientation of a person in multiparty open-world dialogues, as a function of low-level communicative signals generated by his/her interlocutors. These signals include speech activity, eye-gaze direction, and head orientation, all of which can be estimated in real time during the interaction. By utilizing these signals and novel data representations suitable for the task and context, the developed methods can generate plausible candidate gaze targets in real time. The methods are based on Feedforward Neural Networks and Long Short-Term Memory Networks. The proposed methods are developed using several hours of unrestricted interaction data and their performance is compared with a heuristic baseline method. The study offers an extensive evaluation of the proposed methods that investigates the contribution of different predictors to the accurate generation of candidate gaze targets. The results show that the methods can accurately generate candidate gaze targets when the person being modeled is in a listening state. However, when the person being modeled is in a speaking state, the proposed methods yield significantly lower performance.
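As a rough illustration of the recurrent variant, the sketch below sets up an LSTM that maps a window of interlocutor signals (speech activity, eye-gaze direction, head orientation) to a distribution over candidate gaze targets at every time step. The feature dimensionality, number of candidate targets, and network sizes are assumptions for illustration, not the article's configuration.

```python
# Sketch: LSTM predicting a candidate gaze target per frame from interlocutor signals.
import torch
import torch.nn as nn

class GazeTargetLSTM(nn.Module):
    def __init__(self, input_dim=12, hidden_dim=64, num_targets=5):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_targets)

    def forward(self, signals):                # signals: (batch, time, input_dim)
        hidden, _ = self.lstm(signals)
        return self.head(hidden)               # per-frame logits over candidate targets

model = GazeTargetLSTM()
window = torch.randn(8, 30, 12)                # 8 sequences of 30 frames, 12 features each
logits = model(window)
predicted_targets = logits.argmax(dim=-1)      # most likely gaze target per frame
```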

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2019
Keywords
Human-human interaction, open-world dialogue, eye-gaze direction, head orientation, multiparty
National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
urn:nbn:se:kth:diva-255203 (URN), 10.1145/3323231 (DOI), 000472066800003 (ISI)
Note

QC 20190904

Available from: 2019-09-04. Created: 2019-09-04. Last updated: 2019-10-15. Bibliographically approved.
Kucherenko, T., Hasegawa, D., Kaneko, N., Henter, G. E. & Kjellström, H. (2019). On the Importance of Representations for Speech-Driven Gesture Generation: Extended Abstract. Paper presented at International Conference on Autonomous Agents and Multiagent Systems (AAMAS '19), May 13-17, 2019, Montréal, Canada (pp. 2072-2074). The International Foundation for Autonomous Agents and Multiagent Systems (IFAAMAS)
On the Importance of Representations for Speech-Driven Gesture Generation: Extended Abstract
2019 (English). Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents a novel framework for automatic speech-driven gesture generation applicable to human-agent interaction, including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech features as input and produces gestures in the form of sequences of 3D joint coordinates representing motion as output. The results of objective and subjective evaluations confirm the benefits of the representation learning.

Place, publisher, year, edition, pages
The International Foundation for Autonomous Agents and Multiagent Systems (IFAAMAS), 2019
Keywords
Gesture generation, social robotics, representation learning, neural network, deep learning, virtual agents
National Category
Human Computer Interaction
Research subject
Human-computer Interaction
Identifiers
urn:nbn:se:kth:diva-251648 (URN)
Conference
International Conference on Autonomous Agents and Multiagent Systems (AAMAS '19), May 13-17, 2019, Montréal, Canada
Projects
EACare
Funder
Swedish Foundation for Strategic Research, RIT15-0107
Note

QC 20190515

Available from: 2019-05-16. Created: 2019-05-16. Last updated: 2019-05-22. Bibliographically approved.
Wolfert, P., Kucherenko, T., Kjellström, H. & Belpaeme, T. (2019). Should Beat Gestures Be Learned Or Designed?: A Benchmarking User Study. In: ICDL-EPIROB 2019: Workshop on Naturalistic Non-Verbal and Affective Human-Robot Interactions. Paper presented at ICDL-EPIROB 2019 Workshop on Naturalistic Non-Verbal and Affective Human-Robot Interactions. IEEE conference proceedings
Should Beat Gestures Be Learned Or Designed?: A Benchmarking User Study
2019 (English). In: ICDL-EPIROB 2019: Workshop on Naturalistic Non-Verbal and Affective Human-Robot Interactions, IEEE conference proceedings, 2019. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper, we present a user study on generated beat gestures for humanoid agents. It has been shown that Human-Robot Interaction can be improved by including communicative non-verbal behavior, such as arm gestures. Beat gestures are one of the four types of arm gestures, and are known to be used for emphasizing parts of speech. In our user study, we compare beat gestures learned from training data with hand-crafted beat gestures. The first kind of gestures are generated by a machine learning model trained on speech audio and human upper body poses. We compared this approach with three hand-coded beat gesture methods: designed beat gestures, timed beat gestures, and noisy gestures. Forty-one subjects participated in our user study, and a ranking was derived from paired comparisons using the Bradley-Terry-Luce model. We found that for beat gestures, the gestures from the machine learning model are preferred, followed by algorithmically generated gestures. This emphasizes the promise of machine learning for generating communicative actions.
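For readers unfamiliar with the ranking step: the Bradley-Terry-Luce model assigns each condition a positive strength s_i and models the probability that condition i wins a paired comparison against j as s_i / (s_i + s_j); the strengths are then fitted to the observed comparison counts. A minimal maximum-likelihood fit is sketched below with made-up win counts, not the study's data.

```python
# Sketch: maximum-likelihood fit of a Bradley-Terry-Luce model to paired-comparison counts.
import numpy as np
from scipy.optimize import minimize

conditions = ["learned", "designed", "timed", "noisy"]
# wins[i, j]: how often condition i was preferred over condition j (toy numbers).
wins = np.array([[ 0, 25, 28, 35],
                 [16,  0, 22, 30],
                 [13, 19,  0, 27],
                 [ 6, 11, 14,  0]], dtype=float)

def neg_log_likelihood(theta):
    # theta are log-strengths; P(i preferred over j) = s_i / (s_i + s_j) with s = exp(theta).
    s = np.exp(theta)
    p = s[:, None] / (s[:, None] + s[None, :])
    # A tiny ridge pins down the arbitrary additive offset of theta.
    return -(wins * np.log(p)).sum() + 1e-6 * np.dot(theta, theta)

result = minimize(neg_log_likelihood, np.zeros(len(conditions)), method="BFGS")
ranking = sorted(zip(conditions, np.exp(result.x)), key=lambda pair: -pair[1])
print(ranking)   # conditions ordered from most to least preferred
```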

Place, publisher, year, edition, pages
IEEE conference proceedings, 2019
Keywords
gesture generation, machine learning, beat gestures, user study, virtual agents
National Category
Human Computer Interaction
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-255998 (URN)
Conference
ICDL-EPIROB 2019 Workshop on Naturalistic Non-Verbal and Affective Human-Robot Interactions
Note

QC 20190815

Available from: 2019-08-14. Created: 2019-08-14. Last updated: 2019-08-15. Bibliographically approved.
Hamesse, C., Tu, R., Ackermann, P., Kjellström, H. & Zhang, C. (2019). Simultaneous Measurement Imputation and Outcome Prediction for Achilles Tendon Rupture Rehabilitation. In: Proceedings of Machine Learning Research 106: . Paper presented at Machine Learning for Healthcare 2019, University of Michigan, Ann Arbor, MI August 8-10, 2019.
Simultaneous Measurement Imputation and Outcome Prediction for Achilles Tendon Rupture Rehabilitation
2019 (English). In: Proceedings of Machine Learning Research 106, 2019. Conference paper, Published paper (Refereed)
Abstract [en]

Achilles Tendon Rupture (ATR) is one of the typical soft tissue injuries. Rehabilitation after such a musculoskeletal injury remains a prolonged process with a very variable outcome. Accurately predicting rehabilitation outcome is crucial for treatment decision support. However, it is challenging to train an automatic method for predicting the ATR rehabilitation outcome from treatment data, due to a massive amount of missing entries in the data recorded from ATR patients, as well as complex nonlinear relations between measurements and outcomes. In this work, we design an end-to-end probabilistic framework to impute missing data entries and predict rehabilitation outcomes simultaneously. We evaluate our model on a real-life ATR clinical cohort, comparing with various baselines. The proposed method demonstrates its clear superiority over traditional methods which typically perform imputation and prediction in two separate stages.
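One simple way to couple imputation and prediction in a single objective, sketched below for illustration, is a network that reconstructs masked measurements and predicts the outcome from the same hidden representation, trained with a combined loss. This is a stand-in to convey the idea; it is not the probabilistic model proposed in the paper, and the feature count, mask rate, and loss weighting are assumptions.

```python
# Sketch: joint imputation and outcome prediction from the same encoding.
import torch
import torch.nn as nn

class ImputePredictNet(nn.Module):
    def __init__(self, n_features=20, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features * 2, hidden), nn.ReLU())
        self.reconstruct = nn.Linear(hidden, n_features)   # imputation head
        self.predict = nn.Linear(hidden, 1)                # outcome head

    def forward(self, x, observed_mask):
        # Zero-fill the missing entries and pass the observation mask alongside the values.
        h = self.encoder(torch.cat([x * observed_mask, observed_mask], dim=-1))
        return self.reconstruct(h), self.predict(h).squeeze(-1)

model = ImputePredictNet()
x = torch.randn(16, 20)                         # toy patient measurements
mask = (torch.rand(16, 20) > 0.4).float()       # 1 where a value was actually recorded
outcome = torch.randn(16)                       # toy rehabilitation outcome score

recon, pred = model(x, mask)
loss = ((recon - x) ** 2 * mask).mean() + ((pred - outcome) ** 2).mean()
loss.backward()
```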

National Category
Engineering and Technology, Computer Sciences
Identifiers
urn:nbn:se:kth:diva-258070 (URN)
Conference
Machine Learning for Healthcare 2019, University of Michigan, Ann Arbor, MI August 8-10, 2019
Note

QC 20190912

Available from: 2019-09-09. Created: 2019-09-09. Last updated: 2019-09-17. Bibliographically approved.
Butepage, J., Kjellström, H. & Kragic, D. (2018). Anticipating many futures: Online human motion prediction and generation for human-robot interaction. In: 2018 IEEE International Conference on Robotics and Automation (ICRA): . Paper presented at IEEE International Conference on Robotics and Automation (ICRA), May 21-25, 2018, Brisbane, Australia (pp. 4563-4570). IEEE Computer Society
Anticipating many futures: Online human motion prediction and generation for human-robot interaction
2018 (English). In: 2018 IEEE International Conference on Robotics and Automation (ICRA), IEEE Computer Society, 2018, p. 4563-4570. Conference paper, Published paper (Refereed)
Abstract [en]

Fluent and safe interactions of humans and robots require both partners to anticipate the others' actions. The bottleneck of most methods is the lack of an accurate model of natural human motion. In this work, we present a conditional variational autoencoder that is trained to predict a window of future human motion given a window of past frames. Using skeletal data obtained from RGB depth images, we show how this unsupervised approach can be used for online motion prediction for up to 1660 ms. Additionally, we demonstrate online target prediction within the first 300-500 ms after motion onset without the use of target specific training data. The advantage of our probabilistic approach is the possibility to draw samples of possible future motion patterns. Finally, we investigate how movements and kinematic cues are represented on the learned low dimensional manifold.
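The practical payoff of the probabilistic formulation is that one decoder can produce many candidate futures for the same observed past by drawing different latent samples. The sketch below illustrates only that sampling pattern; the decoder is an untrained placeholder, and the window lengths, pose dimensionality, and latent size are assumptions rather than the paper's settings.

```python
# Sketch: sampling several candidate future motion windows from a conditional decoder.
import torch
import torch.nn as nn

pose_dim, past_len, future_len, latent_dim = 45, 30, 45, 16

decoder = nn.Sequential(
    nn.Linear(past_len * pose_dim + latent_dim, 256), nn.ReLU(),
    nn.Linear(256, future_len * pose_dim),
)

past_window = torch.randn(1, past_len * pose_dim)   # flattened window of observed skeleton frames

futures = []
with torch.no_grad():
    for _ in range(10):
        z = torch.randn(1, latent_dim)              # a different latent draw per candidate future
        future = decoder(torch.cat([past_window, z], dim=-1))
        futures.append(future.view(future_len, pose_dim))
```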

Place, publisher, year, edition, pages
IEEE Computer Society, 2018
Series
IEEE International Conference on Robotics and Automation ICRA, ISSN 1050-4729
National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
urn:nbn:se:kth:diva-237164 (URN), 000446394503071 (ISI), 978-1-5386-3081-5 (ISBN)
Conference
IEEE International Conference on Robotics and Automation (ICRA), May 21-25, 2018, Brisbane, Australia
Funder
Swedish Foundation for Strategic Research
Note

QC 20181024

Available from: 2018-10-24. Created: 2018-10-24. Last updated: 2019-08-20. Bibliographically approved.
Mikheeva, O., Ek, C. H. & Kjellström, H. (2018). Perceptual facial expression representation. In: Proceedings - 13th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2018: . Paper presented at 13th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2018, Grand Dynasty Culture Hotel, Xi'an, China, 15 May 2018 through 19 May 2018 (pp. 179-186). Institute of Electrical and Electronics Engineers (IEEE)
Perceptual facial expression representation
2018 (English). In: Proceedings - 13th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2018, Institute of Electrical and Electronics Engineers (IEEE), 2018, p. 179-186. Conference paper, Published paper (Refereed)
Abstract [en]

Dissimilarity measures are often used as a proxy or a handle to reason about data. This can be problematic, as the data representation is often a consequence of the capturing process or how the data is visualized, rather than a reflection of the semantics that we want to extract. Facial expressions are a subtle and essential part of human communication but they are challenging to extract from current representations. In this paper we present a method that is capable of learning semantic representations of faces in a data driven manner. Our approach uses sparse human supervision which our method grounds in the data. We provide experimental justification of our approach showing that our representation improves the performance for emotion classification.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2018
Keywords
Facial expressions, Representation learning, Variational autoencoder
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-238209 (URN), 10.1109/FG.2018.00035 (DOI), 000454996700025 (ISI), 2-s2.0-85049386490 (Scopus ID), 9781538623350 (ISBN)
Conference
13th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2018, Grand Dynasty Culture Hotel, Xi'an, China, 15 May 2018 through 19 May 2018
Note

QC 20181122

Available from: 2018-11-22. Created: 2018-11-22. Last updated: 2019-09-18. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0002-5750-9655
