KTH Publications (kth.se)
Hu, Hao
Publications (2 of 2)
Hu, H., Baldassarre, F. & Azizpour, H. (2023). Learnable Masked Tokens for Improved Transferability of Self-supervised Vision Transformers. In: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2022, Grenoble, France, September 19–23, 2022, Proceedings, Part III. Paper presented at the 22nd Joint European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD 2022, Grenoble, 19-23 September 2022 (pp. 409-426). Springer Nature.
Learnable Masked Tokens for Improved Transferability of Self-supervised Vision Transformers
2023 (English). In: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2022, Grenoble, France, September 19–23, 2022, Proceedings, Part III. Springer Nature, 2023, pp. 409-426. Conference paper, published paper (refereed).
Abstract [en]

Vision transformers have recently shown remarkable performance in various visual recognition tasks, particularly for self-supervised representation learning. The key advantage of transformers for self-supervised learning, compared to their convolutional counterparts, is their reduced inductive biases, which make transformers amenable to learning rich representations from massive amounts of unlabeled data. On the other hand, this flexibility makes self-supervised vision transformers susceptible to overfitting when fine-tuned on small labeled target datasets. Therefore, in this work, we make a simple yet effective architectural change by introducing new learnable masked tokens to vision transformers, which reduces the effect of overfitting in transfer learning while retaining the desirable flexibility of vision transformers. Through several experiments based on two seminal self-supervised vision transformers, SiT and DINO, and several small target visual recognition tasks, we show consistent and significant improvements in the accuracy of the fine-tuned models across all target tasks.
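The architectural change the abstract describes — extra learnable tokens prepended to the patch-token sequence, trained jointly with the model during fine-tuning — can be sketched minimally in plain Python. This is an illustrative sketch of the general idea, not the paper's implementation: token dimensions, initialisation scale, and the number of masked tokens are assumptions, and real models would use tensors rather than nested lists.

```python
import random

def make_learnable_masked_tokens(num_tokens, dim, seed=0):
    # Learnable masked tokens: extra embeddings, randomly initialised,
    # that would be optimised jointly with the network during fine-tuning.
    # Gaussian init with std 0.02 is an assumption, mirroring common
    # transformer embedding initialisation.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)]
            for _ in range(num_tokens)]

def build_input_sequence(cls_token, masked_tokens, patch_tokens):
    # The transformer encoder itself is unchanged; only the input
    # sequence grows: [CLS] + learnable masked tokens + patch tokens.
    return [cls_token] + masked_tokens + patch_tokens
```

For example, with a 14x14 patch grid (196 patch tokens) and 4 masked tokens, the encoder sees a sequence of 201 tokens; at transfer time the masked tokens can absorb task-specific adaptation instead of the pretrained weights overfitting to the small target dataset.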

Place, publisher, year, edition, pages
Springer Nature, 2023
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 13715
Keywords
Computer vision, Transfer learning, Vision transformer
National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:kth:diva-325535 (URN)
10.1007/978-3-031-26409-2_25 (DOI)
000999043300025 ()
2-s2.0-85151048008 (Scopus ID)
Conference
22nd Joint European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD 2022, Grenoble 19-23 September 2022
Note

QC 20230620

Available from: 2023-04-27. Created: 2023-04-27. Last updated: 2025-02-07. Bibliographically approved.
Yao, J., Wang, D., Hu, H., Xing, W. & Wang, L. (2022). ADCNN: Towards learning adaptive dilation for convolutional neural networks. Pattern Recognition, 123, Article ID 108369.
ADCNN: Towards learning adaptive dilation for convolutional neural networks
2022 (English). In: Pattern Recognition, ISSN 0031-3203, E-ISSN 1873-5142, Vol. 123, article id 108369. Article in journal (refereed), published.
Abstract [en]

Dilated convolution kernels are constrained by their shared dilation, keeping them from being aware of diverse spatial contents at different locations. We address this limitation by formulating the dilation as trainable weights with respect to individual positions. We propose Adaptive Dilation Convolutional Neural Networks (ADCNN), a lightweight extension that allows convolutional kernels to adjust their dilation value based on different contents at the pixel level. Unlike previous content-adaptive models, ADCNN dynamically infers pixel-wise dilation via modeling feed-forward inter-patterns, which provides a new perspective for developing adaptive network structures other than sampling kernel spaces. Our evaluation results indicate that ADCNN can be easily integrated into various backbone networks and consistently outperforms its regular counterparts on various visual tasks.
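The core mechanism — each output position sampling its kernel taps at its own dilation rather than a shared one — can be illustrated with a 1D toy version in plain Python. This is a conceptual sketch, not ADCNN itself: in the paper the per-pixel dilation is inferred by the network from feature content, whereas here it is passed in explicitly, and the zero-padded border handling is an assumption.

```python
def adaptive_dilated_conv1d(signal, weights, dilations):
    # Per-position dilation: output position p uses spacing d = dilations[p],
    #   y[p] = sum_i weights[i] * x[p + (i - c) * d]
    # with c the kernel centre and out-of-range taps treated as zero.
    k = len(weights)
    c = k // 2
    n = len(signal)
    out = []
    for p in range(n):
        d = dilations[p]  # content-adaptive in ADCNN; supplied here
        acc = 0.0
        for i, w in enumerate(weights):
            q = p + (i - c) * d
            if 0 <= q < n:  # zero padding outside the signal
                acc += w * signal[q]
        out.append(acc)
    return out
```

With a uniform dilation list this reduces to an ordinary dilated convolution; varying the list per position lets each output location choose its own receptive-field spread, which is the flexibility the abstract describes.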

Place, publisher, year, edition, pages
Elsevier BV, 2022
Keywords
Adaptive dilated convolution, Representation learning, Image classification
National Category
Computer Sciences; Computer graphics and computer vision; Communication Systems
Identifiers
urn:nbn:se:kth:diva-305118 (URN)
10.1016/j.patcog.2021.108369 (DOI)
000711834400003 ()
2-s2.0-85117736740 (Scopus ID)
Note

QC 20211122

Available from: 2021-11-22. Created: 2021-11-22. Last updated: 2025-02-01. Bibliographically approved.