Leveraging hierarchy in multimodal generative models for effective cross-modality inference
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0002-3599-440X
2022 (English). In: Neural Networks, ISSN 0893-6080, E-ISSN 1879-2782, Vol. 146, p. 238-255. Article in journal (Refereed). Published.
Abstract [en]

This work addresses the problem of cross-modality inference (CMI), i.e., inferring missing data of unavailable perceptual modalities (e.g., sound) using data from available perceptual modalities (e.g., image). We review single-modality variational autoencoder methods and discuss three problems of computational cross-modality inference arising from recent developments in multimodal generative models. Inspired by neural mechanisms of human recognition, we contribute the NEXUS model, a novel hierarchical generative model that can learn a multimodal representation of an arbitrary number of modalities in an unsupervised way. By exploiting hierarchical representation levels, NEXUS is able to generate high-quality, coherent data of missing modalities given any subset of available modalities. To evaluate CMI in a natural scenario with a high number of modalities, we contribute the “Multimodal Handwritten Digit” (MHD) dataset, a novel benchmark dataset that combines image, motion, sound and label information from digit handwriting. We assess the key role of hierarchy in enabling high-quality samples during cross-modality inference and discuss how a novel training scheme enables NEXUS to learn a multimodal representation robust to missing modalities at test time. Our results show that NEXUS outperforms current state-of-the-art multimodal generative models in terms of cross-modality inference capabilities.
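The abstract's core idea, modality-specific encoders feeding a shared higher-level latent that is trained to tolerate missing inputs, can be illustrated with a minimal sketch. The code below is an illustrative PyTorch-style sketch under assumed details: the class names, dimensions and the zeroing-based "modality dropout" aggregation are hypothetical stand-ins and do not reproduce the authors' NEXUS implementation.

# Minimal sketch (assumed architecture, not the published NEXUS code):
# bottom-level VAEs per modality feed a shared top-level latent; during
# training some bottom latents are dropped so the shared code learns to
# infer missing modalities from any available subset.
import torch
import torch.nn as nn

class ModalityVAE(nn.Module):
    """Bottom level: one small VAE per modality (e.g. image, sound, motion)."""
    def __init__(self, in_dim, z_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(),
                                 nn.Linear(64, in_dim))

    def encode(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterisation
        return z, mu, logvar

class SharedTop(nn.Module):
    """Top level: aggregates available bottom latents into one shared code
    and maps it back to a bottom latent for every modality."""
    def __init__(self, n_modalities, z_dim=16, top_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_modalities * z_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * top_dim))
        self.heads = nn.ModuleList([nn.Linear(top_dim, z_dim)
                                    for _ in range(n_modalities)])

    def forward(self, bottom_zs, present):
        # "Modality dropout": zero the latents of missing modalities so the
        # shared code is trained to cope with arbitrary subsets at test time.
        zs = [z * p for z, p in zip(bottom_zs, present)]
        mu, logvar = self.enc(torch.cat(zs, dim=-1)).chunk(2, dim=-1)
        top = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return [head(top) for head in self.heads], mu, logvar

# Cross-modality inference: image available, sound missing (toy dimensions).
image_vae, sound_vae = ModalityVAE(in_dim=784), ModalityVAE(in_dim=128)
shared = SharedTop(n_modalities=2)
x_img = torch.randn(8, 784)
z_img, _, _ = image_vae.encode(x_img)
(z_img_hat, z_snd_hat), _, _ = shared([z_img, torch.zeros(8, 16)], present=[1.0, 0.0])
inferred_sound = sound_vae.dec(z_snd_hat)  # generated data for the missing modality

The zeroing-plus-dropout aggregation here is only a placeholder for whatever fusion and training scheme the paper actually uses; the point is to show how a higher-level shared code lets any available subset of modalities drive generation of the rest.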

Place, publisher, year, edition, pages
Elsevier BV, 2022. Vol. 146, p. 238-255.
Keywords [en]
Cross-modality inference, Deep learning, Multimodal representation learning, Auto encoders, Cross modality, Generative model, High quality, Learn+, Missing data, Multi-modal, article, autoencoder, handwriting, human, human experiment, motion, sound
National Category
Computer Systems; Computer graphics and computer vision; Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-313622
DOI: 10.1016/j.neunet.2021.11.019
ISI: 000799116300007
PubMedID: 34906760
Scopus ID: 2-s2.0-85120979933
OAI: oai:DiVA.org:kth-313622
DiVA id: diva2:1666960
Note

QC 20220609

Available from: 2022-06-09. Created: 2022-06-09. Last updated: 2025-02-01. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
PubMed
Scopus

Authority records

Yin, Hang
