Improving Visual Question Answering by Leveraging Depth and Adapting Explainability
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0002-1733-7019
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0002-2212-4325
2022 (English). In: 2022 31st IEEE International Conference on Robot and Human Interactive Communication (IEEE RO-MAN 2022), Institute of Electrical and Electronics Engineers (IEEE), 2022, p. 252-259. Conference paper, Published paper (Refereed)
Abstract [en]

During human-robot conversation, it is critical that robots answer users' questions accurately and explain why they arrive at a given answer. Depth is a crucial component for building more capable robots, since some questions hinge on spatial relations within the scene, for which 2D RGB data alone is insufficient. Because of the lack of existing depth datasets for the task of Visual Question Answering (VQA), we introduce a new dataset, VQA-SUNRGBD. Comparing our proposed model on this RGB-D dataset against the baseline VQA network on RGB data alone, we show that our model outperforms the baseline, particularly on depth-related questions such as those asking about the proximity of objects and their positions relative to one another. We also provide Grad-CAM activations to gain insight into the predictions on depth-related questions and find that our method produces better visual explanations than Grad-CAM on RGB data. To our knowledge, this work is the first of its kind to leverage depth and an explainability module to produce an explainable VQA system.
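As an illustration of the explainability component mentioned in the abstract, below is a minimal Grad-CAM sketch in PyTorch. It is not the authors' implementation; the model, layer, and argument names are hypothetical placeholders for any VQA network whose visual backbone ends in a convolutional feature map (applied either to an RGB or an RGB-D input).

# Minimal Grad-CAM sketch (illustrative, not the paper's code). Assumes a
# generic PyTorch VQA model `model(image, question) -> answer logits` whose
# visual backbone exposes a convolutional layer `conv_layer`; all names here
# are assumptions, not the authors' API.
import torch
import torch.nn.functional as F

def grad_cam(model, conv_layer, image, question, answer_idx):
    """Heatmap showing which image regions drive the score of one answer."""
    activations, gradients = [], []

    # Hooks capture the layer's forward activations and backward gradients.
    fwd = conv_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    bwd = conv_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    logits = model(image, question)          # shape (1, num_answers)
    model.zero_grad()
    logits[0, answer_idx].backward()         # gradient of the chosen answer's score

    fwd.remove()
    bwd.remove()

    acts, grads = activations[0], gradients[0]             # (1, C, H, W) each
    weights = grads.mean(dim=(2, 3), keepdim=True)          # per-channel importance
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))  # weighted channel sum
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalised to [0, 1]

Running this separately on an RGB-only baseline and on an RGB-D model (assuming the latter also ends in a convolutional feature map) is one way to produce the kind of side-by-side visual-explanation comparison the abstract describes.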

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2022. p. 252-259
Keywords [en]
Visual Question Answering, Leveraging Depth, Explainability
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-322304
DOI: 10.1109/RO-MAN53752.2022.9900586
ISI: 000885903300037
Scopus ID: 2-s2.0-85140744461
OAI: oai:DiVA.org:kth-322304
DiVA, id: diva2:1718210
Conference
31st IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) - Social, Asocial, and Antisocial Robots, AUG 29-SEP 02, 2022, Napoli, ITALY
Note

QC 20221212

Part of proceedings: ISBN 978-1-7281-8859-1

Available from: 2022-12-12 Created: 2022-12-12 Last updated: 2022-12-15. Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Panesar, Amrita; Dogan, Fethiye Irmak; Leite, Iolanda
