Leveraging hierarchy in multimodal generative models for effective cross-modality inference
2022 (English). In: Neural Networks, ISSN 0893-6080, E-ISSN 1879-2782, Vol. 146, p. 238-255. Article in journal (Refereed), Published
Abstract [en]
This work addresses the problem of cross-modality inference (CMI), i.e., inferring missing data of unavailable perceptual modalities (e.g., sound) using data from available perceptual modalities (e.g., image). We overview single-modality variational autoencoder methods and discuss three problems of computational cross-modality inference arising from recent developments in multimodal generative models. Inspired by neural mechanisms of human recognition, we contribute the NEXUS model, a novel hierarchical generative model that can learn a multimodal representation of an arbitrary number of modalities in an unsupervised way. By exploiting hierarchical representation levels, NEXUS is able to generate high-quality, coherent data of missing modalities given any subset of available modalities. To evaluate CMI in a natural scenario with a high number of modalities, we contribute the “Multimodal Handwritten Digit” (MHD) dataset, a novel benchmark dataset that combines image, motion, sound and label information from digit handwriting. We assess the key role of hierarchy in enabling high-quality samples during cross-modality inference and discuss how a novel training scheme enables NEXUS to learn a multimodal representation robust to missing modalities at test time. Our results show that NEXUS outperforms current state-of-the-art multimodal generative models in their cross-modality inference capabilities.
Place, publisher, year, edition, pages
Elsevier BV, 2022. Vol. 146, p. 238-255
Keywords [en]
Cross-modality inference, Deep learning, Multimodal representation learning, Auto encoders, Cross modality, Generative model, High quality, Learn+, Missing data, Multi-modal, article, autoencoder, handwriting, human, human experiment, motion, sound
National Category
Computer Systems; Computer Graphics and Computer Vision; Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-313622
DOI: 10.1016/j.neunet.2021.11.019
ISI: 000799116300007
PubMedID: 34906760
Scopus ID: 2-s2.0-85120979933
OAI: oai:DiVA.org:kth-313622
DiVA, id: diva2:1666960
Note
QC 20220609
Available from: 2022-06-09. Created: 2022-06-09. Last updated: 2025-02-01. Bibliographically approved