Factors of Transferability for a Generic ConvNet Representation
Azizpour, Hossein (KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP. ORCID iD: 0000-0001-5211-6388)
Sharif Razavian, Ali (KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP)
Sullivan, Josephine (KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP)
Maki, Atsuto (KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP. ORCID iD: 0000-0002-4266-6746)
Carlsson, Stefan (KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP)
2016 (English). In: IEEE Transactions on Pattern Analysis and Machine Intelligence, ISSN 0162-8828, E-ISSN 1939-3539, Vol. 38, no. 9, pp. 1790-1802, article no. 7328311. Article in journal (Refereed). Published.
Abstract [en]

Evidence is mounting that Convolutional Networks (ConvNets) are the most effective representation learning method for visual recognition tasks. In the common scenario, a ConvNet is trained on a large labeled dataset (source) and the feed-forward unit activations of the trained network, at a certain layer, are used as a generic representation of an input image for a task with a relatively small training set (target). Recent studies have shown this form of representation transfer to be suitable for a wide range of target visual recognition tasks. This paper introduces and investigates several factors affecting the transferability of such representations. These include parameters of the source ConvNet's training, such as its architecture and the distribution of the training data, as well as parameters of feature extraction, such as the layer of the trained ConvNet from which features are taken and the dimensionality reduction applied. By optimizing these factors, we show that significant improvements can be achieved on 17 visual recognition tasks. We further show that these tasks can be categorically ordered by their similarity to the source task, such that a correlation is observed between the performance of each task and its similarity to the source task with respect to the proposed factors.
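
The transfer pipeline described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes torchvision (version 0.13 or later) with its bundled ImageNet-pretrained AlexNet as a stand-in for the source network, and scikit-learn for the dimensionality reduction and linear classifier applied to the target task. The layer choice and the reduced dimensionality are two of the transferability factors the paper studies.

```python
# Minimal sketch of ConvNet representation transfer (illustrative only; the
# paper's source networks and training setup differ). Assumes torchvision
# >= 0.13 and scikit-learn.
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

# Source network: trained on the large labeled source dataset (ImageNet).
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
net.eval()

# Factor: which layer to read the representation from. Here we stop after
# the first fully connected layer (dropout, fc6, ReLU in net.classifier).
feature_extractor = torch.nn.Sequential(
    net.features,
    net.avgpool,
    torch.nn.Flatten(),
    *list(net.classifier.children())[:3],
)

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def represent(images):
    """Map a list of PIL images to generic ConvNet feature vectors."""
    batch = torch.stack([preprocess(im) for im in images])
    with torch.no_grad():
        return feature_extractor(batch).numpy()

# Factor: dimensionality reduction, then a simple linear classifier trained
# on the small target dataset. `train_images` and `train_labels` are
# hypothetical placeholders for the target task's data.
# pca = PCA(n_components=500).fit(represent(train_images))
# clf = LinearSVC().fit(pca.transform(represent(train_images)), train_labels)
```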

Place, publisher, year, edition, pages
IEEE Computer Society Digital Library, 2016. Vol. 38, no. 9, pp. 1790-1802, article no. 7328311
National Category
Computer Vision and Robotics (Autonomous Systems)
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-177033
DOI: 10.1109/TPAMI.2015.2500224
ISI: 000381432700006
Scopus ID: 2-s2.0-84981266620
OAI: oai:DiVA.org:kth-177033
DiVA: diva2:870273
Note

QC 20161208

Available from: 2015-11-13. Created: 2015-11-13. Last updated: 2017-12-01. Bibliographically approved.
In thesis
1. Visual Representations and Models: From Latent SVM to Deep Learning
2016 (English). Doctoral thesis, comprehensive summary (Other academic).
Abstract [en]

Two important components of a visual recognition system are the representation and the model. Both involve the selection and learning of features that are indicative for recognition, and the discarding of features that are uninformative. This thesis proposes different techniques within the frameworks of two learning systems for representation and modeling: latent support vector machines (latent SVMs) and deep learning.
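
For reference, a sketch of the standard latent SVM formulation that these contributions build on (notation is ours, following the familiar deformable part model literature, and is not necessarily the thesis's exact notation): a sample is scored by maximizing over latent values, and a hinge-loss objective is minimized over the weights.

```latex
% Scoring function: maximize over latent values z (e.g., part placements).
s_w(x) = \max_{z \in Z(x)} \; w \cdot \Phi(x, z)

% Objective over training pairs (x_i, y_i) with y_i \in \{-1, +1\}:
\min_w \; \frac{1}{2}\lVert w \rVert^2
  + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\, s_w(x_i)\bigr)
```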

First, we propose various approaches to grouping the positive samples into clusters of visually similar instances. Given a fixed representation, the sampled space of the positive distribution is usually structured. The proposed clustering techniques include a novel similarity measure based on exemplar learning, an approach for using additional annotation, and an augmentation of the latent SVM that automatically finds clusters whose members can be reliably distinguished from the background class.

In another effort, a strongly supervised DPM is suggested to study how these models can benefit from privileged information. The extra information comes in the form of semantic part annotations (i.e., their presence and location), which are used to constrain the DPM's latent variables during, or prior to, the optimization of the latent SVM. Its effectiveness is demonstrated on the task of animal detection.

Finally, we generalize the formulation of discriminative latent variable models, including DPMs, to incorporate a new set of latent variables representing the structure or properties of negative samples; we therefore term these negative latent variables. We show that this generalization affects state-of-the-art techniques and helps visual recognition by explicitly searching for counter-evidence of an object's presence.

Following the resurgence of deep networks, the last works of this thesis focus on deep learning in order to produce a generic representation for visual recognition. A Convolutional Network (ConvNet) is trained on a large annotated image classification dataset, ImageNet, with ~1.3 million images. The activations at each layer of the trained ConvNet can then be treated as the representation of an input image. We show that such a representation is surprisingly effective for various recognition tasks, clearly superior to all the handcrafted features previously used in visual recognition (such as HOG in our first works on DPMs). We further investigate ways to improve this representation for the task at hand, proposing various factors, applied before or after the training of the representation, that can improve the efficacy of the ConvNet representation. These factors are analyzed on 16 datasets from various subfields of visual recognition.

Place, publisher, year, edition, pages
Stockholm, Sweden: KTH Royal Institute of Technology, 2016. 172 p.
Series
TRITA-CSC-A, ISSN 1653-5723 ; 21
Keyword
Computer Vision, Machine Learning, Artificial Intelligence, Deep Learning, Learning Representation, Deformable Part Models, Discriminative Latent Variable Models, Convolutional Networks, Object Recognition, Object Detection
National Category
Electrical Engineering, Electronic Engineering, Information Engineering; Computer Systems
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-192289
ISBN: 978-91-7729-110-7
Public defence
2016-09-27, Kollegiesalen, Brinellvägen 8, KTH-huset, floor 4, KTH Campus, Stockholm, 15:26 (English)
Note

QC 20160908

Available from: 2016-09-08. Created: 2016-09-08. Last updated: 2016-09-09. Bibliographically approved.
2. Convolutional Network Representation for Visual Recognition
2017 (English). Doctoral thesis, comprehensive summary (Other academic).
Abstract [en]

Image representation is a key component in visual recognition systems. In a visual recognition problem, the model should be able to learn and infer certain visual semantics in the image. It is therefore important for the model to represent the input image in a way that the semantics of interest can be inferred easily and reliably. This thesis is written as a compilation of publications and looks into the Convolutional Network (ConvNet) representation in visual recognition problems from an empirical perspective. A Convolutional Network is a special class of neural network with a hierarchical structure, where each layer's output (except the last) is the input of the next. ConvNets have been shown to be powerful tools for learning a generic representation of an image. In this body of work, we first show that this is indeed the case: a ConvNet representation with a simple classifier can outperform highly tuned pipelines based on hand-crafted features. To be precise, we first train a ConvNet on a large dataset; then, for every image in another task with a small dataset, we feed the image forward through the ConvNet and take the activations at a certain layer as the image representation. Transferring the knowledge from the large dataset (source task) to the small dataset (target task) proves effective and outperforms baselines on a variety of visual recognition tasks. We also evaluate the presence of spatial visual semantics in the ConvNet representation and observe that it retains significant spatial information, despite never having been explicitly trained to preserve low-level semantics. We then investigate the factors that affect the transferability of these representations, studying various factors on a diverse set of visual recognition tasks and finding a consistent correlation between the effect of those factors and the similarity of the target task to the source task. This intuition, alongside the experimental results, provides a guideline for improving the performance of visual recognition tasks using ConvNet features. Finally, we address the task of visual instance retrieval specifically, as an example of how these simple intuitions can massively increase the performance of the target task.
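
As a concrete illustration of the instance-retrieval use mentioned at the end of the abstract, a common baseline, sketched here under our own assumptions rather than as the thesis's exact pipeline, is to L2-normalize the ConvNet features and rank gallery images by cosine similarity to the query:

```python
# Hedged sketch of ConvNet-feature instance retrieval: L2-normalized
# features ranked by cosine similarity (a simple baseline, not the
# thesis's full pipeline). `query_feat` is one feature vector and
# `gallery_feats` a matrix of feature vectors, one row per image,
# produced by any ConvNet feature extractor.
import numpy as np

def retrieve(query_feat, gallery_feats, top_k=5):
    """Return indices of the top_k gallery features closest to the query."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q                       # cosine similarity per gallery item
    return np.argsort(-sims)[:top_k]   # best matches first
```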

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2017. 130 p.
Series
TRITA-CSC-A, ISSN 1653-5723 ; 2017:01
Keyword
Convolutional Network, Visual Recognition, Transfer Learning
National Category
Robotics
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-197919
ISBN: 978-91-7729-213-5
Public defence
2017-01-13, F3, Lindstedtsvägen 26, Stockholm, 10:00 (English)
Note

QC 20161209

Available from: 2016-12-09. Created: 2016-12-09. Last updated: 2016-12-23. Bibliographically approved.

Open Access in DiVA

No full text

Other links

Publisher's full text
Scopus
IEEE Xplore

Authority records

Azizpour, Hossein; Maki, Atsuto
