Exploring the effectiveness of object-centric representations in visual question answering: Comparative insights with foundation models
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control). ORCID iD: 0000-0002-6820-948X
University of Amsterdam. ORCID iD: 0000-0002-3221-0788
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control). Digital Futures. ORCID iD: 0000-0001-9940-5929
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control). Helmholtz AI; TU Munich. ORCID iD: 0000-0003-1712-060X
2024 (English) Manuscript (preprint) (Other academic)
Abstract [en]

Object-centric (OC) representations, which represent the state of a visual scene by modeling it as a composition of objects, have the potential to be used in various downstream tasks to achieve systematic compositional generalization and facilitate reasoning. However, these claims have not been thoroughly analyzed yet. Recently, foundation models have demonstrated unparalleled capabilities across diverse domains from language to computer vision, marking them as a potential cornerstone of future research for a multitude of computational tasks. In this paper, we conduct an extensive empirical study on representation learning for downstream Visual Question Answering (VQA), which requires an accurate compositional understanding of the scene. We thoroughly investigate the benefits and trade-offs of OC models and alternative approaches including large pre-trained foundation models on both synthetic and real-world data, and demonstrate a viable way to achieve the best of both worlds. The extensiveness of our study, encompassing over 600 downstream VQA models and 15 different types of upstream representations, also provides several additional insights that we believe will be of interest to the community at large.

Place, publisher, year, edition, pages
2024.
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-359638
DOI: 10.48550/arXiv.2407.15589
OAI: oai:DiVA.org:kth-359638
DiVA, id: diva2:1935345
Note

The manuscript has been accepted at the ICLR 2025 conference.

QC 20250211

Available from: 2025-02-06 Created: 2025-02-06 Last updated: 2025-02-12 Bibliographically approved
In thesis
1. Bayesian Causal Discovery and Object-Centric Representations: Challenges and Insights in Structured Learning
2025 (English) Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

Causality and Representation Learning are foundational to advancing AI systems capable of reasoning, generalizing, and understanding the complex structure of the world. Causality provides tools to uncover the underlying causal structure of a system, understand cause-effect relationships, and reason about interventions. Representation Learning, on the other hand, transforms raw data into structured abstractions essential for modeling the underlying system and decision-making. Causal Representation Learning bridges these paradigms by using representation learning to extract high-level abstractions and entities and integrating causal reasoning principles to uncover cause-effect relationships between these entities. This approach is crucial for real-world systems, where causal relationships are typically defined between high-level entities, such as objects or interactions, rather than low-level sensory inputs like pixels. This thesis explores two key paradigms presented as a collection of two papers: the challenges in the evaluation of Bayesian Causal Discovery, and the effectiveness of structured representations, with a focus on object-centric representations in visual reasoning.

In the first paper, we study the challenges in the evaluation of Bayesian Causal Discovery methods. By analyzing existing metrics on linear additive noise models, we find that current metrics often fail to correlate with the true posterior in high-entropy settings, such as with limited data or non-identifiable causal models. We highlight the importance of considering posterior entropy and recommend evaluating Bayesian Causal Discovery methods on downstream tasks, such as causal effect estimation, for more meaningful evaluation in such scenarios.

In the second paper, we investigate the effectiveness of object-centric representations in visual reasoning tasks, such as Visual Question Answering. We find that while large foundation models often match or surpass object-centric models in performance, they require larger downstream models and more compute due to their less explicit representations. In contrast, object-centric models provide more interpretable representations but face challenges on more complex datasets. Combining object-centric representations with foundation models emerges as a promising solution, reducing computational costs while maintaining high performance. We also provide further insights, such as the relationship between segmentation performance and downstream performance and the effect of factors such as dataset size and question types, to improve our understanding of these models.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2025. p. vii, 53
Series
TRITA-EECS-AVL ; 2025:19
Keywords
Causality, Bayesian Causal Discovery, Representation Learning, Object-Centric Learning, Kausalitet, Bayesiansk Kausal Upptäckt, Representationslärande, Objektcentriskt Lärande
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-359733
ISBN: 978-91-8106-191-8
Presentation
2025-03-07, https://kth-se.zoom.us/j/68284213723, E3, Rum 1563, Osquars backe 18, KTH Campus, Stockholm, 10:00 (English)
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP), 30007
Note

QC 20250212

Available from: 2025-02-12 Created: 2025-02-10 Last updated: 2025-02-17 Bibliographically approved

Authors
Mamaghan, Amir Mohammad Karimi; Papa, Samuele; Johansson, Karl H.; Bauer, Stefan; Dittadi, Andrea