Publications (4 of 4)
Fu, J., Zhang, X., Pashami, S., Rahimian, F. & Holst, A. (2025). DiffPAD: Denoising Diffusion-based Adversarial Patch Decontamination. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). Paper presented at 2025 Winter Conference on Applications of Computer Vision-WACV, Feb 28-Mar 04, 2025, Tucson, AZ (pp. 6602-6611). Institute of Electrical and Electronics Engineers (IEEE)
2025 (English). In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Institute of Electrical and Electronics Engineers (IEEE), 2025, p. 6602-6611. Conference paper, Published paper (Refereed)
Abstract [en]

In the ever-evolving adversarial machine learning landscape, developing effective defenses against patch attacks has become a critical challenge, necessitating reliable solutions to safeguard real-world AI systems. Although diffusion models have shown remarkable capacity in image synthesis and have been recently utilized to counter ℓ_p-norm bounded attacks, their potential in mitigating localized patch attacks remains largely underexplored. In this work, we propose DiffPAD, a novel framework that harnesses the power of diffusion models for adversarial patch decontamination. DiffPAD first performs super-resolution restoration on downsampled input images, then adopts binarization, a dynamic thresholding scheme, and a sliding window for effective localization of adversarial patches. Such a design is inspired by the theoretically derived correlation between patch size and diffusion restoration error that is generalized across diverse patch attack scenarios. Finally, DiffPAD applies inpainting techniques to the original input images with the estimated patch region being masked. By integrating closed-form solutions for super-resolution restoration and image inpainting into the conditional reverse sampling process of a pre-trained diffusion model, DiffPAD obviates the need for text guidance or finetuning. Through comprehensive experiments, we demonstrate that DiffPAD not only achieves state-of-the-art adversarial robustness against patch attacks but also excels in recovering naturalistic images without patch remnants. The source code is available at https://github.com/JasonFu1998/DiffPAD.
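
As a rough illustration of the localization step described in the abstract, the sketch below binarizes a toy restoration-error map and slides a square window over it to find the densest high-error region. It is not the DiffPAD implementation: the function names, the fixed threshold (the abstract describes a dynamic thresholding scheme), the window size, and the toy error map are all assumptions made for illustration.

```python
def binarize(error_map, threshold):
    """Mark each pixel whose restoration error exceeds the threshold."""
    return [[1 if value > threshold else 0 for value in row] for row in error_map]


def locate_patch(mask, window):
    """Slide a square window over the binary mask.

    Returns (row, col, count) for the top-left corner of the first window
    that contains the largest number of marked pixels.
    """
    height, width = len(mask), len(mask[0])
    best = (0, 0, -1)
    for r in range(height - window + 1):
        for c in range(width - window + 1):
            count = sum(
                mask[r + i][c + j] for i in range(window) for j in range(window)
            )
            if count > best[2]:
                best = (r, c, count)
    return best


# Hypothetical 6x6 error map: low error everywhere except a 2x2 block
# at rows 2-3, columns 3-4, standing in for an adversarial patch.
error_map = [[0.1] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (3, 4):
        error_map[r][c] = 0.9

mask = binarize(error_map, threshold=0.5)  # fixed threshold, an assumption
row, col, count = locate_patch(mask, window=2)
print(row, col, count)
```

The returned window would then serve as the mask for the inpainting stage; the authors' actual procedure is in the linked repository.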

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2025
Series
IEEE Winter Conference on Applications of Computer Vision, ISSN 2472-6737
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-374026 (URN)
10.1109/WACV61041.2025.00643 (DOI)
001521272600154 ()
2-s2.0-105003628690 (Scopus ID)
979-8-3315-1084-8 (ISBN)
979-8-3315-1083-1 (ISBN)
Conference
2025 Winter Conference on Applications of Computer Vision-WACV, Feb 28-Mar 04, 2025, Tucson, AZ
Note

QC 20251218

Available from: 2025-12-18. Created: 2025-12-18. Last updated: 2025-12-18. Bibliographically approved
Peng, K., Fu, J., Yang, K., Wen, D., Chen, Y., Liu, R., . . . Roitberg, A. (2025). Referring Atomic Video Action Recognition. In: Computer Vision – ECCV 2024 - 18th European Conference, Proceedings. Paper presented at 18th European Conference on Computer Vision, ECCV 2024, Milan, Italy, Sep 29 2024 - Oct 4 2024 (pp. 166-185). Springer Nature
2025 (English). In: Computer Vision – ECCV 2024 - 18th European Conference, Proceedings, Springer Nature, 2025, p. 166-185. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce a new task called Referring Atomic Video Action Recognition (RAVAR), aimed at identifying atomic actions of a particular person based on a textual description and the video data of this person. This task differs from traditional action recognition and localization, where predictions are delivered for all present individuals. In contrast, we focus on recognizing the correct atomic action of a specific individual, guided by text. To explore this task, we present the RefAVA dataset, containing 36,630 instances with manually annotated textual descriptions of the individuals. To establish a strong initial benchmark, we implement and validate baselines from various domains, e.g., atomic action localization, video question answering, and text-video retrieval. Since these existing methods underperform on RAVAR, we introduce RefAtomNet – a novel cross-stream attention-driven method specialized for the unique challenges of RAVAR: the need to interpret a textual referring expression for the targeted individual, utilize this reference to guide the spatial localization, and harvest the prediction of the atomic actions for the referring person. The key ingredients are: (1) a multi-stream architecture that connects video, text, and a new location-semantic stream, and (2) cross-stream agent attention fusion and agent token fusion, which amplify the most relevant information across these streams and consistently surpass standard attention-based fusion on RAVAR. Extensive experiments demonstrate the effectiveness of RefAtomNet and its building blocks for recognizing the action of the described individual. The dataset and code will be made publicly available at RAVAR.

Place, publisher, year, edition, pages
Springer Nature, 2025
National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:kth:diva-358213 (URN)
10.1007/978-3-031-72655-2_10 (DOI)
2-s2.0-85213009172 (Scopus ID)
Conference
18th European Conference on Computer Vision, ECCV 2024, Milan, Italy, Sep 29 2024 - Oct 4 2024
Note

Part of ISBN 9783031726545

QC 20250114

Available from: 2025-01-07. Created: 2025-01-07. Last updated: 2025-02-07. Bibliographically approved
Peng, K., Wen, D., Yang, K., Luo, A., Chen, Y., Fu, J., . . . Stiefelhagen, R. (2024). Advancing Open-Set Domain Generalization Using Evidential Bi-Level Hardest Domain Scheduler. In: Advances in Neural Information Processing Systems 37 - 38th Conference on Neural Information Processing Systems, NeurIPS 2024. Paper presented at 38th Conference on Neural Information Processing Systems, NeurIPS 2024, Vancouver, Canada, December 9-15, 2024. Neural information processing systems foundation
2024 (English). In: Advances in Neural Information Processing Systems 37 - 38th Conference on Neural Information Processing Systems, NeurIPS 2024, Neural information processing systems foundation, 2024. Conference paper, Published paper (Refereed)
Abstract [en]

In Open-Set Domain Generalization (OSDG), the model is exposed to both new variations of data appearance (domains) and open-set conditions, where both known and novel categories are present at test time. The challenges of this task arise from the dual need to generalize across diverse domains and accurately quantify category novelty, which is critical for applications in dynamic environments. Recently, meta-learning techniques have demonstrated superior results in OSDG, effectively orchestrating the meta-train and -test tasks by employing varied random categories and predefined domain partition strategies. These approaches prioritize a well-designed training schedule over traditional methods that focus primarily on data augmentation and the enhancement of discriminative feature learning. The prevailing meta-learning models in OSDG typically utilize a predefined sequential domain scheduler to structure data partitions. However, a crucial aspect that remains inadequately explored is the influence brought by strategies of domain schedulers during training. In this paper, we observe that an adaptive domain scheduler benefits more in OSDG compared with prefixed sequential and random domain schedulers. We propose the Evidential Bi-Level Hardest Domain Scheduler (EBiL-HaDS) to achieve an adaptive domain scheduler. This method strategically sequences domains by assessing their reliabilities in utilizing a follower network, trained with confidence scores learned in an evidential manner, regularized by max rebiasing discrepancy, and optimized in a bi-level manner. We verify our approach on three OSDG benchmarks, i.e., PACS, DigitsDG, and OfficeHome. The results show that our method substantially improves OSDG performance and achieves more discriminative embeddings for both the seen and unseen categories, underscoring the advantage of a judicious domain scheduler for the generalizability to unseen domains and unseen categories. 
The source code is publicly available at https://github.com/KPeng9510/EBiL-HaDS.

Place, publisher, year, edition, pages
Neural information processing systems foundation, 2024
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-361951 (URN)
2-s2.0-105000502155 (Scopus ID)
Conference
38th Conference on Neural Information Processing Systems, NeurIPS 2024, Vancouver, Canada, December 9-15, 2024
Note

Part of ISBN 9798331314385

QC 20250408

Available from: 2025-04-03. Created: 2025-04-03. Last updated: 2025-04-08. Bibliographically approved
Fu, J., Tan, J., Yin, W., Pashami, S. & Björkman, M. (2023). Component attention network for multimodal dance improvisation recognition. In: Proceedings of the 25th International Conference on Multimodal Interaction, ICMI 2023. Paper presented at 25th International Conference on Multimodal Interaction (ICMI), Oct 09-13, 2023, Sorbonne Univ, Paris, France (pp. 114-118). Association for Computing Machinery (ACM)
2023 (English). In: Proceedings of the 25th International Conference on Multimodal Interaction, ICMI 2023, Association for Computing Machinery (ACM), 2023, p. 114-118. Conference paper, Published paper (Refereed)
Abstract [en]

Dance improvisation is an active research topic in the arts. Motion analysis of improvised dance can be challenging due to its unique dynamics. Data-driven dance motion analysis, including recognition and generation, is often limited to skeletal data. However, data of other modalities, such as audio, can be recorded and benefit downstream tasks. This paper explores the application and performance of multimodal fusion methods for human motion recognition in the context of dance improvisation. We propose an attention-based model, component attention network (CANet), for multimodal fusion on three levels: 1) feature fusion with CANet, 2) model fusion with CANet and graph convolutional network (GCN), and 3) late fusion with a voting strategy. We conduct thorough experiments to analyze the impact of each modality in different fusion methods and distinguish critical temporal or component features. We show that our proposed model outperforms the two baseline methods, demonstrating its potential for analyzing improvisation in dance.
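
To make the third fusion level concrete, here is a minimal sketch of late fusion by majority vote over per-model class predictions. It is an illustration only, not the paper's code: the function name, the tie-breaking rule (first prediction wins), and the stream names and labels are hypothetical.

```python
from collections import Counter


def late_fusion_vote(predictions):
    """Return the label predicted by the most models.

    Ties are broken in favor of the label that appears first in
    `predictions` (an assumption; other tie-breaking rules are possible).
    """
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label


# Hypothetical per-model outputs for one motion clip.
per_model = {"skeleton": "spin", "audio": "jump", "fused": "spin"}
result = late_fusion_vote(list(per_model.values()))
print(result)
```

Voting needs only each model's final label, which is why it can combine models trained separately on different modalities.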

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2023
Keywords
Dance Recognition, Multimodal Fusion, Attention Network
National Category
Other Computer and Information Science
Identifiers
urn:nbn:se:kth:diva-343780 (URN)
10.1145/3577190.3614114 (DOI)
001147764700016 ()
2-s2.0-85175844284 (Scopus ID)
Conference
25th International Conference on Multimodal Interaction (ICMI), Oct 09-13, 2023, Sorbonne Univ, Paris, France
Note

Part of proceedings ISBN 979-8-4007-0055-2

QC 20240222

Available from: 2024-02-22. Created: 2024-02-22. Last updated: 2024-03-05. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0009-0004-3798-8603