kth.sePublikationer KTH
Ändra sökning
RefereraExporteraLänk till posten
Permanent länk

Direktlänk
Referera
Referensformat
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Annat format
Fler format
Språk
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Annat språk
Fler språk
Utmatningsformat
  • html
  • text
  • asciidoc
  • rtf
Improving Visual Question Answering by Leveraging Depth and Adapting Explainability
KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Robotik, perception och lärande, RPL.
KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Robotik, perception och lärande, RPL.ORCID-id: 0000-0002-1733-7019
KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Robotik, perception och lärande, RPL.ORCID-id: 0000-0002-2212-4325
2022 (Engelska)Ingår i: 2022 31St Ieee International Conference On Robot And Human Interactive Communication (Ieee Ro-Man 2022), Institute of Electrical and Electronics Engineers (IEEE) , 2022, s. 252-259Konferensbidrag, Publicerat paper (Refereegranskat)
Abstract [en]

During human-robot conversation, it is critical for robots to be able to answer users' questions accurately and provide a suitable explanation for why they arrive at the answer they provide. Depth is a crucial component in producing more intelligent robots that can respond correctly as some questions might rely on spatial relations within the scene, for which 2D RGB data alone would be insufficient. Due to the lack of existing depth datasets for the task of VQA, we introduce a new dataset, VQA-SUNRGBD. When we compare our proposed model on this RGB-D dataset against the baseline VQN network on RGB data alone, we show that ours outperforms, particularly in questions relating to depth such as asking about the proximity of objects and relative positions of objects to one another. We also provide Grad-CAM activations to gain insight regarding the predictions on depth-related questions and find that our method produces better visual explanations compared to Grad-CAM on RGB data. To our knowledge, this work is the first of its kind to leverage depth and an explainability module to produce an explainable Visual Question Answering (VQA) system.

Ort, förlag, år, upplaga, sidor
Institute of Electrical and Electronics Engineers (IEEE) , 2022. s. 252-259
Nyckelord [en]
Visual Question Answering, Leveraging Depth, Explainability
Nationell ämneskategori
Datavetenskap (datalogi)
Identifikatorer
URN: urn:nbn:se:kth:diva-322304DOI: 10.1109/RO-MAN53752.2022.9900586ISI: 000885903300037Scopus ID: 2-s2.0-85140744461OAI: oai:DiVA.org:kth-322304DiVA, id: diva2:1718210
Konferens
31st IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) - Social, Asocial, and Antisocial Robots, AUG 29-SEP 02, 2022, Napoli, ITALY
Anmärkning

QC 20221212

Part of proceedings: ISBN 978-1-7281-8859-1

Tillgänglig från: 2022-12-12 Skapad: 2022-12-12 Senast uppdaterad: 2022-12-15Bibliografiskt granskad

Open Access i DiVA

Fulltext saknas i DiVA

Övriga länkar

Förlagets fulltextScopus

Person

Panesar, AmritaDogan, Fethiye IrmakLeite, Iolanda

Sök vidare i DiVA

Av författaren/redaktören
Panesar, AmritaDogan, Fethiye IrmakLeite, Iolanda
Av organisationen
Robotik, perception och lärande, RPL
Datavetenskap (datalogi)

Sök vidare utanför DiVA

GoogleGoogle Scholar

doi
urn-nbn

Altmetricpoäng

doi
urn-nbn
Totalt: 71 träffar
RefereraExporteraLänk till posten
Permanent länk

Direktlänk
Referera
Referensformat
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Annat format
Fler format
Språk
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Annat språk
Fler språk
Utmatningsformat
  • html
  • text
  • asciidoc
  • rtf