Baldassarre, Federico (ORCID iD: orcid.org/0000-0001-8152-767x)
Publications (9 of 9)
Hu, H., Baldassarre, F. & Azizpour, H. (2023). Learnable Masked Tokens for Improved Transferability of Self-supervised Vision Transformers. In: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2022, Grenoble, France, September 19–23, 2022, Proceedings, Part III. Paper presented at 22nd Joint European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD 2022, Grenoble 19-23 September 2022 (pp. 409-426). Springer Nature
Learnable Masked Tokens for Improved Transferability of Self-supervised Vision Transformers
2023 (English). In: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2022, Grenoble, France, September 19–23, 2022, Proceedings, Part III, Springer Nature, 2023, p. 409-426. Conference paper, Published paper (Refereed)
Abstract [en]

Vision transformers have recently shown remarkable performance in various visual recognition tasks, in particular for self-supervised representation learning. The key advantage of transformers for self-supervised learning, compared to their convolutional counterparts, is their reduced inductive bias, which makes transformers amenable to learning rich representations from massive amounts of unlabelled data. On the other hand, this flexibility makes self-supervised vision transformers susceptible to overfitting when fine-tuned on small labeled target datasets. Therefore, in this work, we make a simple yet effective architectural change by introducing new learnable masked tokens to vision transformers, whereby we reduce the effect of overfitting in transfer learning while retaining the desirable flexibility of vision transformers. Through several experiments based on two seminal self-supervised vision transformers, SiT and DINO, and several small target visual recognition tasks, we show consistent and significant improvements in the accuracy of the fine-tuned models across all target tasks.
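
The architectural change described above is small enough to sketch. The toy PyTorch module below only illustrates the general mechanism of appending a few extra learnable tokens to a transformer encoder's patch sequence before fine-tuning; the class name, dimensions, and hyper-parameters are illustrative assumptions, not the authors' implementation, and the specific masking behaviour of the paper's tokens is not reproduced.

import torch
import torch.nn as nn

class ExtraTokenEncoder(nn.Module):
    """Toy transformer encoder with a few extra learnable tokens appended to the
    patch sequence (illustrative sketch, not the paper's implementation)."""
    def __init__(self, dim=192, depth=4, heads=3, num_extra_tokens=8, num_classes=10):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Extra learnable tokens: during fine-tuning they can absorb target-specific
        # signals, reducing the pressure to distort the pre-trained patch features.
        self.extra_tokens = nn.Parameter(torch.zeros(1, num_extra_tokens, dim))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patch_embeddings):                 # (B, N, dim)
        b = patch_embeddings.size(0)
        tokens = torch.cat([self.cls_token.expand(b, -1, -1),
                            self.extra_tokens.expand(b, -1, -1),
                            patch_embeddings], dim=1)
        return self.head(self.encoder(tokens)[:, 0])     # classify from the CLS token

model = ExtraTokenEncoder()
logits = model(torch.randn(2, 196, 192))                 # 2 images, 14 x 14 patches
print(logits.shape)                                      # torch.Size([2, 10])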

Place, publisher, year, edition, pages
Springer Nature, 2023
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 13715
Keywords
Computer vision, Transfer learning, Vision transformer
National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:kth:diva-325535 (URN)
10.1007/978-3-031-26409-2_25 (DOI)
000999043300025 (ISI)
2-s2.0-85151048008 (Scopus ID)
Conference
22nd Joint European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD 2022, Grenoble 19-23 September 2022
Note

QC 20230620

Available from: 2023-04-27. Created: 2023-04-27. Last updated: 2025-02-07. Bibliographically approved.
Baldassarre, F. (2023). Structured Representations for Explainable Deep Learning. (Doctoral dissertation). Stockholm: KTH Royal Institute of Technology
Structured Representations for Explainable Deep Learning
2023 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Deep learning has revolutionized scientific research and is being used to take decisions in increasingly complex scenarios. With growing power comes a growing demand for transparency and interpretability. The field of Explainable AI aims to provide explanations for the predictions of AI systems. The state of the art of AI explainability, however, is far from satisfactory.

For example, in Computer Vision, the most prominent post-hoc explanation methods produce pixel-wise heatmaps over the input domain, which are meant to visualize the importance of individual pixels of an image or video. We argue that such dense attribution maps are poorly interpretable to non-expert users because of the domain in which explanations are formed - we may recognize shapes in a heatmap but they are just blobs of pixels. In fact, the input domain is closer to the raw data of digital cameras than to the interpretable structures that humans use to communicate, e.g. objects or concepts.

In this thesis, we propose to move beyond dense feature attributions by adopting structured internal representations as a more interpretable explanation domain. Conceptually, our approach splits a Deep Learning model in two: the perception step that takes as input dense representations and the reasoning step that learns to perform the task at hand. At the interface between the two are structured representations that correspond to well-defined objects, entities, and concepts. These representations serve as the interpretable domain for explaining the predictions of the model, allowing us to move towards more meaningful and informative explanations.

The proposed approach introduces several challenges, such as how to obtain structured representations, how to use them for downstream tasks, and how to evaluate the resulting explanations. The works included in this thesis address these questions, validating the approach and providing concrete contributions to the field. For the perception step, we investigate how to obtain structured representations from dense representations, whether by manually designing them using domain knowledge or by learning them from data without supervision. For the reasoning step, we investigate how to use structured representations for downstream tasks, from Biology to Computer Vision, and how to evaluate the learned representations. For the explanation step, we investigate how to explain the predictions of models that operate in a structured domain, and how to evaluate the resulting explanations. Overall, we hope that this work inspires further research in Explainable AI and helps bridge the gap between high-performing Deep Learning models and the need for transparency and interpretability in real-world applications.

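As a rough illustration of the perception/reasoning split described in the abstract, the sketch below maps dense features to a fixed set of entity vectors and predicts from those entities, so that attribution can be done per entity rather than per pixel. All names, shapes, and the ablation-based attribution are hypothetical assumptions; the thesis instantiates the two steps differently in each included work.

import torch
import torch.nn as nn

class PerceptionReasoningModel(nn.Module):
    """Dense features -> K structured entity vectors -> task prediction.
    Minimal sketch of the decomposition only; not a model from the thesis."""
    def __init__(self, in_dim=768, num_entities=8, entity_dim=64, num_classes=5):
        super().__init__()
        # Perception: project dense features into a fixed set of entity slots.
        self.perception = nn.Linear(in_dim, num_entities * entity_dim)
        # Reasoning: score the task classes from each entity, then aggregate.
        self.reasoning = nn.Linear(entity_dim, num_classes)
        self.num_entities, self.entity_dim = num_entities, entity_dim

    def forward(self, dense_features):                               # (B, in_dim)
        b = dense_features.size(0)
        entities = self.perception(dense_features).view(b, self.num_entities, self.entity_dim)
        logits = self.reasoning(entities).mean(dim=1)                # (B, num_classes)
        return logits, entities                                      # entities form the explanation domain

model = PerceptionReasoningModel()
logits, entities = model(torch.randn(2, 768))
# Entity-level attribution by ablation: how much does the predicted score drop
# when one entity is removed from the aggregation?
top = logits.argmax(dim=1)
for k in range(entities.size(1)):
    kept = torch.cat([entities[:, :k], entities[:, k + 1:]], dim=1)
    drop = logits.gather(1, top[:, None]) - model.reasoning(kept).mean(1).gather(1, top[:, None])
    print(f"entity {k}: score drop {drop.squeeze(1).tolist()}")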

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2023. p. xi, 103
Series
TRITA-EECS-AVL ; 2023:49
Keywords
Explainable AI, Deep Learning, Self-supervised Learning, Transformers, Graph Networks, Computer Vision
National Category
Computer graphics and computer vision
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-326958 (URN)
978-91-8040-606-2 (ISBN)
Public defence
2023-06-12, F3 https://kth-se.zoom.us/j/66725845533, Lindstedtsvägen 26, Stockholm, 14:00 (English)
Funder
Swedish Research Council, 2017-04609
Note

QC 20230516

Available from: 2023-05-16. Created: 2023-05-16. Last updated: 2025-02-07. Bibliographically approved.
Baldassarre, F., El-Nouby, A. & Jégou, H. (2023). Variable Rate Allocation for Vector-Quantized Autoencoders. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Paper presented at ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Institute of Electrical and Electronics Engineers (IEEE)
Variable Rate Allocation for Vector-Quantized Autoencoders
2023 (English). In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper, Published paper (Refereed)
Abstract [en]

Vector-quantized autoencoders have recently gained interest in image compression, generation and self-supervised learning. However, as a neural compression method, they lack the ability to allocate a variable number of bits to each image location, e.g. according to semantic content or local saliency. In this paper, we address this limitation in a simple yet effective way. We adopt a product quantizer (PQ) that produces a set of discrete codes for each image patch rather than a single index. This PQ-autoencoder is trained end-to-end with a structured dropout that selectively masks a variable number of codes at each location. These mechanisms force the decoder to reconstruct the original image based on partial information and allow us to control the local rate. The resulting model can compress images over a wide range of operating points on the rate-distortion curve and can be paired with any external method for saliency estimation to control the compression rate at a local level. We demonstrate the effectiveness of our approach on the popular Kodak and ImageNet datasets by measuring both distortion and perceptual quality metrics.
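
The two mechanisms named in the abstract, a product quantizer that yields several sub-codes per patch and a structured dropout over those sub-codes, can be sketched as below. Shapes, codebook sizes, and the nested-masking policy are illustrative assumptions and not the released model.

import torch

def product_quantize(z, codebooks):
    # z: (B, H, W, M*d) continuous features; codebooks: (M, K, d).
    # Each of the M channel groups is snapped to its nearest codebook entry,
    # so every spatial location gets M discrete sub-codes instead of one index.
    B, H, W, _ = z.shape
    M, K, d = codebooks.shape
    groups = z.reshape(B * H * W, M, d).transpose(0, 1)               # (M, BHW, d)
    idx = torch.cdist(groups, codebooks).argmin(dim=-1)               # (M, BHW)
    q = torch.stack([codebooks[m][idx[m]] for m in range(M)])         # (M, BHW, d)
    return q.transpose(0, 1).reshape(B, H, W, M * d), idx.transpose(0, 1).reshape(B, H, W, M)

def nested_code_dropout(q, keep, M):
    # Zero out the trailing sub-codes so that only the first keep[b, h, w] of the
    # M codes survive at each location; the decoder must then reconstruct from
    # partial information, which is what allows a locally variable bit rate.
    B, H, W, Md = q.shape
    d = Md // M
    mask = (torch.arange(M, device=q.device) < keep.unsqueeze(-1)).float()   # (B, H, W, M)
    return (q.reshape(B, H, W, M, d) * mask.unsqueeze(-1)).reshape(B, H, W, Md)

B, H, W, M, K, d = 2, 8, 8, 4, 256, 16
q, codes = product_quantize(torch.randn(B, H, W, M * d), torch.randn(M, K, d))
keep = torch.randint(1, M + 1, (B, H, W))        # random while training; saliency-driven at test time
print(nested_code_dropout(q, keep, M).shape)     # torch.Size([2, 8, 8, 64])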

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:kth:diva-326854 (URN)
10.1109/ICASSP49357.2023.10095451 (DOI)
2-s2.0-85168851171 (Scopus ID)
Conference
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Note

QC 20230516

Available from: 2023-05-12. Created: 2023-05-12. Last updated: 2025-02-07. Bibliographically approved.
Baldassarre, F., Debard, Q., Pontiveros, G. F. & Wijaya, T. K. (2022). Quantitative Metrics for Evaluating Explanations of Video DeepFake Detectors. In: BMVC 2022 - 33rd British Machine Vision Conference Proceedings. Paper presented at 33rd British Machine Vision Conference Proceedings, BMVC 2022, London, United Kingdom of Great Britain and Northern Ireland, Nov 21 2022 - Nov 24 2022. British Machine Vision Association, BMVA
Quantitative Metrics for Evaluating Explanations of Video DeepFake Detectors
2022 (English). In: BMVC 2022 - 33rd British Machine Vision Conference Proceedings, British Machine Vision Association, BMVA, 2022. Conference paper, Published paper (Refereed)
Abstract [en]

The proliferation of DeepFake technology is a rising challenge in today's society, owing to more powerful and accessible generation methods. To counter this, the research community has developed detectors of ever-increasing accuracy. However, the ability to explain the decisions of such models to users lags behind and is treated as an accessory in large-scale benchmarks, despite being a crucial requirement for the correct deployment of automated tools for content moderation. We attribute the issue to the reliance on qualitative comparisons and the lack of established metrics. We describe a simple set of metrics to evaluate the visual quality and informativeness of explanations of video DeepFake classifiers from a human-centric perspective. With these metrics, we compare common approaches to improve explanation quality and discuss their effect on both classification and explanation performance on the recent DFDC and DFD datasets.

Place, publisher, year, edition, pages
British Machine Vision Association, BMVA, 2022
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-348040 (URN)
2-s2.0-85149836351 (Scopus ID)
Conference
33rd British Machine Vision Conference Proceedings, BMVC 2022, London, United Kingdom of Great Britain and Northern Ireland, Nov 21 2022 - Nov 24 2022
Note

QC 20240701

Available from: 2024-07-01. Created: 2024-07-01. Last updated: 2024-07-01. Bibliographically approved.
Baldassarre, F., Debard, Q., Pontiveros, G. F. & Wijaya, T. K. (2022). Quantitative Metrics for Evaluating Explanations of Video DeepFake Detectors. Paper presented at 33rd British Machine Vision Conference (BMVC).
Quantitative Metrics for Evaluating Explanations of Video DeepFake Detectors
2022 (English). Conference paper, Published paper (Refereed)
Abstract [en]

The proliferation of DeepFake technology is a rising challenge in today’s society, owing to more powerful and accessible generation methods. To counter this, the research community has developed detectors of ever-increasing accuracy. However, the ability to explain the decisions of such models to users lags behind performance and is considered an accessory in large-scale benchmarks, despite being a crucial requirement for the correct deployment of automated tools for moderation and censorship. We attribute the issue to the reliance on qualitative comparisons and the lack of established metrics. We describe a simple set of metrics to evaluate the visual quality and informativeness of explanations of video DeepFake classifiers from a human-centric perspective. With these metrics, we compare common approaches to improve explanation quality and discuss their effect on both classification and explanation performance on the recent DFDC and DFD datasets.

National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:kth:diva-326957 (URN)
Conference
33rd British Machine Vision Conference (BMVC)
Note

QC 20230516

Available from: 2023-05-15. Created: 2023-05-15. Last updated: 2025-02-07. Bibliographically approved.
Baldassarre, F. & Azizpour, H. (2022). Towards Self-Supervised Learning of Global and Object-Centric Representations. Paper presented at ICLR Workshop on the Elements of Reasoning, Objects, Structure and Causality.
Towards Self-Supervised Learning of Global and Object-Centric Representations
2022 (English). Conference paper, Published paper (Refereed)
Abstract [en]

Self-supervision allows learning meaningful representations of natural images, which usually contain one central object. How well does it transfer to multi-entity scenes? We discuss key aspects of learning structured object-centric representations with self-supervision and validate our insights through several experiments on the CLEVR dataset. Regarding the architecture, we confirm the importance of competition for attention-based object discovery, where each image patch is exclusively attended by one object. For training, we show that contrastive losses equipped with matching can be applied directly in a latent space, avoiding pixel-based reconstruction. However, such an optimization objective is sensitive to false negatives (recurring objects) and false positives (matching errors). Careful consideration is thus required around data augmentation and negative sample selection.
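
The "competition" the abstract refers to can be illustrated by a single cross-attention step in which the softmax runs over the object slots instead of the image patches, so slots compete for each patch, in the spirit of slot attention. The function below is only a sketch of that mechanism with hypothetical shapes; it is not the architecture evaluated in the paper, which builds on self-supervised vision transformers, and the projection layers and iterative updates are omitted.

import torch
import torch.nn.functional as F

def competitive_slot_attention(slots, patches):
    # slots: (B, S, D) object slots; patches: (B, N, D) patch features.
    # Softmax over the slot axis: slots compete for each patch, so every patch is
    # (softly) claimed by a single object.
    d = slots.size(-1)
    attn_logits = torch.einsum('bsd,bnd->bsn', slots, patches) / d ** 0.5
    attn = F.softmax(attn_logits, dim=1)                       # normalize across slots, not patches
    attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)      # per-slot weighted mean of its patches
    return torch.einsum('bsn,bnd->bsd', attn, patches)         # updated slot representations

slots, patches = torch.randn(2, 7, 64), torch.randn(2, 196, 64)
print(competitive_slot_attention(slots, patches).shape)        # torch.Size([2, 7, 64])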

National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:kth:diva-326853 (URN)
Conference
ICLR Workshop on the Elements of Reasoning, Objects, Structure and Causality
Note

QC 20230516

Available from: 2023-05-12. Created: 2023-05-12. Last updated: 2025-02-07. Bibliographically approved.
Baldassarre, F., Smith, K., Sullivan, J. & Azizpour, H. (2020). Explanation-Based Weakly-Supervised Learning of Visual Relations with Graph Networks. In: Proceedings, Part XXVIII, Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020. Paper presented at Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020 (pp. 612-630). Springer Nature
Explanation-Based Weakly-Supervised Learning of Visual Relations with Graph Networks
2020 (English). In: Proceedings, Part XXVIII, Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Springer Nature, 2020, p. 612-630. Conference paper, Published paper (Refereed)
Abstract [en]

Visual relationship detection is fundamental for holistic image understanding. However, the localization and classification of (subject, predicate, object) triplets remain challenging tasks, due to the combinatorial explosion of possible relationships, their long-tailed distribution in natural images, and an expensive annotation process. This paper introduces a novel weakly-supervised method for visual relationship detection that relies on minimal image-level predicate labels. A graph neural network is trained to classify predicates in images from a graph representation of detected objects, implicitly encoding an inductive bias for pairwise relations. We then frame relationship detection as the explanation of such a predicate classifier, i.e. we obtain a complete relation by recovering the subject and object of a predicted predicate. We present results comparable to recent fully- and weakly-supervised methods on three diverse and challenging datasets: HICO-DET for human-object interaction, Visual Relationship Detection for generic object-to-object relations, and UnRel for unusual triplets; demonstrating robustness to non-comprehensive annotations and good few-shot generalization.
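
A much-simplified sketch of the overall recipe follows: a tiny pairwise classifier scores every ordered pair of detected objects, pools those scores into an image-level predicate prediction that can be trained from image-level labels, and the pair that dominates the prediction is read back as the (subject, predicate, object) triplet. The module, the max-pooling, and the argmax read-out are illustrative stand-ins; the paper uses a graph neural network and explanation techniques rather than this direct argmax.

import torch
import torch.nn as nn

class PairwisePredicateClassifier(nn.Module):
    """Scores every ordered (subject, object) pair of detected objects and max-pools
    the scores into image-level predicate logits. Illustrative stand-in only."""
    def __init__(self, obj_dim=256, num_predicates=10):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * obj_dim, 256), nn.ReLU(),
                                      nn.Linear(256, num_predicates))

    def forward(self, objects):                                     # (N, obj_dim)
        n = objects.size(0)
        pairs = torch.cat([objects.unsqueeze(1).expand(n, n, -1),   # subject features
                           objects.unsqueeze(0).expand(n, n, -1)],  # object features
                          dim=-1)
        edge_logits = self.edge_mlp(pairs)                          # (N, N, P)
        return edge_logits.amax(dim=(0, 1)), edge_logits            # image-level logits, per-pair scores

model = PairwisePredicateClassifier()
objects = torch.randn(5, 256)                                       # features of 5 detected objects
image_logits, edge_logits = model(objects)
predicate = image_logits.argmax().item()
# Read the relation back from the classifier: the pair most responsible for the
# predicted predicate is taken as the (subject, object) of the detected relation.
subj, obj = divmod(edge_logits[..., predicate].argmax().item(), objects.size(0))
print(predicate, subj, obj)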

Place, publisher, year, edition, pages
Springer Nature, 2020
Series
Lecture Notes in Computer Science ; 12373
Keywords
Computer vision, Image coding, Supervised learning, Combinatorial explosion, Graph neural networks, Graph representation, Human-object interaction, Long-tailed distributions, Object to objects, Supervised methods, Weakly supervised learning, Object detection
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-290838 (URN)
10.1007/978-3-030-58604-1_37 (DOI)
2-s2.0-85097054926 (Scopus ID)
Conference
Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020
Note

Part of ISBN 9783030586034

QC 20210323

Available from: 2021-03-23. Created: 2021-03-23. Last updated: 2023-05-16. Bibliographically approved.
Baldassarre, F., Menéndez Hurtado, D., Elofsson, A. & Azizpour, H. (2020). GraphQA: Protein Model Quality Assessment using Graph Convolutional Networks. Bioinformatics, 37(3), 360-366
GraphQA: Protein Model Quality Assessment using Graph Convolutional Networks
2020 (English). In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 37, no 3, p. 360-366. Article in journal (Refereed), Published
Abstract [en]

Motivation

Proteins are ubiquitous molecules whose function in biological processes is determined by their 3D structure. Experimental identification of a protein's structure can be time-consuming, prohibitively expensive, and not always possible. Alternatively, protein folding can be modeled using computational methods, which, however, are not guaranteed to always produce optimal results.

GraphQA is a graph-based method to estimate the quality of protein models that possesses favorable properties such as representation learning, explicit modeling of both sequential and 3D structure, geometric invariance, and computational efficiency.

Results

GraphQA performs similarly to state-of-the-art methods despite using a relatively low number of input features. In addition, the graph network structure provides an improvement over the architecture used in ProQ4 operating on the same input features. Finally, the individual contributions of GraphQA components are carefully evaluated.

Availability and implementation

PyTorch implementation, datasets, experiments, and link to an evaluation server are available through this GitHub repository: github.com/baldassarreFe/graphqa

Supplementary information

Supplementary material is available at Bioinformatics online.
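
For readers unfamiliar with graph-based quality assessment, the sketch below shows the general shape of such a model: residues become nodes, sequential neighbours and spatially close residues become edges, and a small message-passing network predicts per-residue and whole-model quality scores. Feature dimensions, the distance cutoff, and the layers are illustrative assumptions; the actual implementation is the PyTorch code in the repository linked above.

import torch
import torch.nn as nn

def residue_graph(ca_coords, cutoff=8.0):
    # Adjacency over residues: consecutive residues in the sequence plus any pair
    # whose C-alpha atoms lie within `cutoff` angstroms. ca_coords: (N, 3).
    n = ca_coords.size(0)
    spatial = (torch.cdist(ca_coords, ca_coords) < cutoff).float()
    sequential = torch.diag(torch.ones(n - 1), 1) + torch.diag(torch.ones(n - 1), -1)
    adj = ((spatial + sequential) > 0).float()
    adj.fill_diagonal_(0)
    return adj

class TinyGraphScorer(nn.Module):
    """Two rounds of mean-neighbour message passing, then per-residue and global
    quality scores in [0, 1]. A minimal stand-in, not GraphQA itself."""
    def __init__(self, in_dim=21, hidden=64):
        super().__init__()
        self.lin1, self.lin2 = nn.Linear(in_dim, hidden), nn.Linear(hidden, hidden)
        self.local_head, self.global_head = nn.Linear(hidden, 1), nn.Linear(hidden, 1)

    def forward(self, node_feats, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        h = torch.relu(self.lin1(adj @ node_feats / deg))
        h = torch.relu(self.lin2(adj @ h / deg))
        local = torch.sigmoid(self.local_head(h)).squeeze(-1)        # per-residue quality
        global_score = torch.sigmoid(self.global_head(h.mean(0)))    # whole-model quality
        return local, global_score

coords, feats = torch.randn(50, 3) * 10, torch.randn(50, 21)         # toy decoy with 50 residues
local, global_score = TinyGraphScorer()(feats, residue_graph(coords))
print(local.shape, global_score.shape)                               # torch.Size([50]) torch.Size([1])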

Place, publisher, year, edition, pages
Oxford University Press, 2020
Keywords
graph neural networks, protein quality assessment
National Category
Bioinformatics (Computational Biology)
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-284600 (URN)
10.1093/bioinformatics/btaa714 (DOI)
000667755400010 (ISI)
32780838 (PubMedID)
2-s2.0-85105697201 (Scopus ID)
Funder
Swedish Research Council, 2017-04609
Note

QC 20201118

Available from: 2020-10-30. Created: 2020-10-30. Last updated: 2023-05-16. Bibliographically approved.
Baldassarre, F. & Azizpour, H. (2019). Explainability Techniques for Graph Convolutional Networks. Paper presented at International Conference on Machine Learning (ICML) Workshops, 2019 Workshop on Learning and Reasoning with Graph-Structured Representations.
Explainability Techniques for Graph Convolutional Networks
2019 (English). Conference paper, Poster (with or without abstract) (Refereed)
Abstract [en]

Graph Networks are used to make decisions in potentially complex scenarios, but it is usually not obvious how or why those decisions are made. In this work, we study the explainability of Graph Network decisions using two main classes of techniques, gradient-based and decomposition-based, on a toy dataset and a chemistry task. Our study lays the groundwork for future development as well as for application to real-world problems.
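
Of the two families of techniques studied, the gradient-based one is the simplest to sketch: compute the gradient of the predicted output with respect to the node features and use its per-node norm as an importance score. The one-layer graph model below is a hypothetical stand-in for illustration only; decomposition-based methods such as layer-wise relevance propagation are not shown.

import torch
import torch.nn as nn

class OneLayerGCN(nn.Module):
    """Toy graph model: one mean-aggregation layer followed by graph-level logits."""
    def __init__(self, in_dim=8, num_classes=2):
        super().__init__()
        self.lin = nn.Linear(in_dim, num_classes)

    def forward(self, x, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        return self.lin(adj @ x / deg).mean(dim=0)        # (num_classes,)

model = OneLayerGCN()
x = torch.randn(6, 8, requires_grad=True)                 # 6 nodes, 8 features each
adj = (torch.rand(6, 6) > 0.5).float()
logits = model(x, adj)
logits[logits.argmax()].backward()                        # gradient of the winning class
node_saliency = x.grad.norm(dim=1)                        # one sensitivity score per node
print(node_saliency)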

National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:kth:diva-260507 (URN)
Conference
International Conference on Machine Learning (ICML) Workshops, 2019 Workshop on Learning and Reasoning with Graph-Structured Representations
Note

QC 20191001

Available from: 2019-09-30. Created: 2019-09-30. Last updated: 2025-02-07. Bibliographically approved.