Explanation-Based Weakly-Supervised Learning of Visual Relations with Graph Networks
Federico Baldassarre, KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0001-8152-767X
Kevin Smith, KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). ORCID iD: 0000-0002-6163-191X
Josephine Sullivan, KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0003-2784-7300
Hossein Azizpour, KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0001-5211-6388
2020 (English). In: Proceedings, Part XXVIII, Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Springer Nature, 2020, p. 612-630.
Conference paper, Published paper (Refereed)
Abstract [en]

Visual relationship detection is fundamental for holistic image understanding. However, the localization and classification of (subject, predicate, object) triplets remain challenging tasks, due to the combinatorial explosion of possible relationships, their long-tailed distribution in natural images, and an expensive annotation process. This paper introduces a novel weakly-supervised method for visual relationship detection that relies on minimal image-level predicate labels. A graph neural network is trained to classify predicates in images from a graph representation of detected objects, implicitly encoding an inductive bias for pairwise relations. We then frame relationship detection as the explanation of such a predicate classifier, i.e. we obtain a complete relation by recovering the subject and object of a predicted predicate. We present results comparable to recent fully- and weakly-supervised methods on three diverse and challenging datasets: HICO-DET for human-object interaction, Visual Relationship Detection for generic object-to-object relations, and UnRel for unusual triplets. These results demonstrate robustness to non-comprehensive annotations and good few-shot generalization.
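
The idea of recovering a subject and object by explaining a predicate classifier can be illustrated with a toy sketch. This is not the paper's implementation: the linear edge scorer, the sum-pooling of edge scores into an image-level logit, the occlusion-style attribution, and all names (`edge_score`, `image_logit`, `explain_predicate`) are assumptions chosen for brevity, since this record does not describe the actual network or explanation technique.

```python
from itertools import permutations


def edge_score(weights, subj_feat, obj_feat):
    """Hypothetical linear scorer for one predicate on an ordered (subject, object) pair."""
    pair = subj_feat + obj_feat  # concatenate the two feature lists
    return sum(w * x for w, x in zip(weights, pair))


def image_logit(weights, objects, edges):
    """Image-level predicate logit, assumed here to be the sum of edge scores."""
    return sum(edge_score(weights, objects[s], objects[o]) for s, o in edges)


def explain_predicate(weights, objects):
    """Rank ordered object pairs by how much the image-level logit drops
    when that pair is removed from the graph (occlusion-style attribution)."""
    edges = list(permutations(range(len(objects)), 2))
    full = image_logit(weights, objects, edges)
    attributions = {}
    for edge in edges:
        kept = [e for e in edges if e != edge]
        attributions[edge] = full - image_logit(weights, objects, kept)
    return sorted(attributions.items(), key=lambda kv: kv[1], reverse=True)


# Toy "detected objects": made-up 2-d features, not real detector outputs.
objects = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
# Made-up weights favouring object 0 as subject and object 1 as object.
weights = [2.0, -1.0, -1.0, 2.0]

ranked = explain_predicate(weights, objects)
best_pair, best_attribution = ranked[0]
print(best_pair, best_attribution)
```

Only an image-level label would be needed to fit such a scorer; the (subject, object) pair is read off afterwards from the attribution ranking. In this toy setup the attribution of a pair equals its own edge score because the pooling is a plain sum; with a learned graph network the attribution would have to come from a proper explanation method.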

Place, publisher, year, edition, pages
Springer Nature, 2020. p. 612-630
Series
Lecture Notes in Computer Science book series ; 12373
Keywords [en]
Computer vision, Image coding, Supervised learning, Combinatorial explosion, Graph neural networks, Graph representation, Human-object interaction, Long-tailed distributions, Object to objects, Supervised methods, Weakly supervised learning, Object detection
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-290838
DOI: 10.1007/978-3-030-58604-1_37
Scopus ID: 2-s2.0-85097054926
OAI: oai:DiVA.org:kth-290838
DiVA, id: diva2:1539214
Conference
Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020
Note

Part of ISBN 9783030586034

QC 20210323

Available from: 2021-03-23. Created: 2021-03-23. Last updated: 2023-05-16. Bibliographically approved
In thesis
1. Structured Representations for Explainable Deep Learning
2023 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Deep learning has revolutionized scientific research and is being used to take decisions in increasingly complex scenarios. With growing power comes a growing demand for transparency and interpretability. The field of Explainable AI aims to provide explanations for the predictions of AI systems. The state of the art of AI explainability, however, is far from satisfactory. For example, in Computer Vision, the most prominent post-hoc explanation methods produce pixel-wise heatmaps over the input domain, which are meant to visualize the importance of individual pixels of an image or video. We argue that such dense attribution maps are poorly interpretable to non-expert users because of the domain in which explanations are formed - we may recognize shapes in a heatmap but they are just blobs of pixels. In fact, the input domain is closer to the raw data of digital cameras than to the interpretable structures that humans use to communicate, e.g. objects or concepts. In this thesis, we propose to move beyond dense feature attributions by adopting structured internal representations as a more interpretable explanation domain. Conceptually, our approach splits a Deep Learning model in two: the perception step that takes as input dense representations and the reasoning step that learns to perform the task at hand. At the interface between the two are structured representations that correspond to well-defined objects, entities, and concepts. These representations serve as the interpretable domain for explaining the predictions of the model, allowing us to move towards more meaningful and informative explanations. The proposed approach introduces several challenges, such as how to obtain structured representations, how to use them for downstream tasks, and how to evaluate the resulting explanations. The works included in this thesis address these questions, validating the approach and providing concrete contributions to the field. 
For the perception step, we investigate how to obtain structured representations from dense representations, whether by manually designing them using domain knowledge or by learning them from data without supervision. For the reasoning step, we investigate how to use structured representations for downstream tasks, from Biology to Computer Vision, and how to evaluate the learned representations. For the explanation step, we investigate how to explain the predictions of models that operate in a structured domain, and how to evaluate the resulting explanations. Overall, we hope that this work inspires further research in Explainable AI and helps bridge the gap between high-performing Deep Learning models and the need for transparency and interpretability in real-world applications.

Abstract [sv]

Deep Learning has revolutionized scientific research and is being used to make decisions in increasingly complex scenarios. With growing power comes a growing demand for transparency and interpretability. The field of Explainable AI aims to provide explanations for the predictions of AI systems. The performance of existing solutions for AI explainability, however, is far from satisfactory. For example, in computer vision, the most prominent post-hoc explanation methods produce pixel-wise heatmaps, which are meant to visualize how important individual pixels are in an image or video. We argue that such methods are hard to interpret because of the domain in which explanations are formed: we may recognize shapes in a heatmap, but they are just pixels. In fact, the input domain is closer to the raw data of digital cameras than to the structures that humans use to communicate, e.g. objects or concepts. In this thesis, we propose to move beyond dense feature attributions by using structured internal representations as a more interpretable explanation domain. Conceptually, our approach splits a Deep Learning model in two: the perception step, which takes dense representations as input, and the reasoning step, which learns to perform the task. At the interface between the two are structured representations that correspond to well-defined objects, entities, and concepts. These representations serve as the interpretable domain for explaining the model's predictions, allowing us to move towards more meaningful and informative explanations. The proposed approach introduces several challenges, such as how to create structured representations, how to use them for downstream tasks, and how to evaluate the resulting explanations. The research included in this thesis addresses these questions, validates the approach, and provides concrete contributions to the field.
For the perception step, we investigate how to obtain structured representations from dense representations, either by manually designing them using domain knowledge or by learning them from data without supervision. For the reasoning step, we investigate how to use structured representations for downstream tasks, from biology to computer vision, and how to evaluate the learned representations. For the explanation step, we investigate how to explain the predictions of models that operate in a structured domain, and how to evaluate the resulting explanations. Overall, we hope that this work inspires further research in Explainable AI and helps bridge the gap between high-performing Deep Learning models and the need for transparency and interpretability in real-world applications.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2023. p. xi, 103
Series
TRITA-EECS-AVL ; 2023:49
Keywords
Explainable AI, Deep Learning, Self-supervised Learning, Transformers, Graph Networks, Computer Vision
National Category
Computer graphics and computer vision
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-326958 (URN)
978-91-8040-606-2 (ISBN)
Public defence
2023-06-12, F3 https://kth-se.zoom.us/j/66725845533, Lindstedtsvägen 26, Stockholm, 14:00 (English)
Opponent
Supervisors
Funder
Swedish Research Council, 2017-04609
Note

QC 20230516

Available from: 2023-05-16. Created: 2023-05-16. Last updated: 2025-02-07. Bibliographically approved

Open Access in DiVA

fulltext (6927 kB), 234 downloads
File information
File name: FULLTEXT01.pdf
File size: 6927 kB
Checksum SHA-512: e4c2a6353f668aa467644022bb94b31958e33ebebb38ea5d68f9770641b6c6bfe47351678bb83f883b97421c6af5b73113763ebb8ed42716ad644d0070d56743
Type: fulltext
Mimetype: application/pdf

Other links

Publisher's full text, Scopus, Publication on ECCV website, GitHub repository

Authority records

Baldassarre, Federico; Smith, Kevin; Sullivan, Josephine; Azizpour, Hossein
