CNN features off-the-shelf: An Astounding Baseline for Recognition
Sharif Razavian, Ali: KTH, School of Computer Science and Communication (CSC), Computer Vision and Robotics, CVAP. (Computer Vision)
Azizpour, Hossein: KTH, School of Computer Science and Communication (CSC), Computer Vision and Robotics, CVAP. (Computer Vision). ORCID iD: 0000-0001-5211-6388
Sullivan, Josephine: KTH, School of Computer Science and Communication (CSC), Computer Vision and Robotics, CVAP. (Computer Vision)
Carlsson, Stefan: KTH, School of Computer Science and Communication (CSC), Computer Vision and Robotics, CVAP. (Computer Vision)
2014 (English). In: Proceedings of CVPR 2014, 2014. Conference paper, Published paper (Refereed)
Abstract [en]

Recent results indicate that the generic descriptors extracted from convolutional neural networks are very powerful. This paper adds to the mounting evidence that this is indeed the case. We report on a series of experiments conducted for different recognition tasks using the publicly available code and model of the OverFeat network, which was trained to perform object classification on ILSVRC13. We use features extracted from the OverFeat network as a generic image representation to tackle a diverse range of recognition tasks: object image classification, scene recognition, fine-grained recognition, attribute detection, and image retrieval, applied to a diverse set of datasets. We selected these tasks and datasets because they gradually move further away from the original task and data the OverFeat network was trained to solve. Astonishingly, we report consistently superior results compared to the highly tuned state-of-the-art systems in all the visual classification tasks on various datasets. For instance retrieval, the representation consistently outperforms low-memory-footprint methods except on the sculptures dataset. The results are achieved using a linear SVM classifier (or L2 distance, in the case of retrieval) applied to a feature representation of size 4096 extracted from a layer in the net. The representations are further modified using simple augmentation techniques, e.g., jittering. The results strongly suggest that features obtained from deep learning with convolutional nets should be the primary candidate in most visual recognition tasks.
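
As a concrete illustration of the recipe the abstract describes, here is a minimal sketch: take the activations of the first 4096-d fully-connected layer of an ImageNet-trained network as the image representation and train a linear SVM on top. The paper uses the OverFeat network; the torchvision AlexNet below is only a stand-in, and train_paths/train_labels are hypothetical placeholders, so treat this as a sketch of the pipeline rather than the authors' exact code.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.svm import LinearSVC

# Stand-in for OverFeat: an ImageNet-pretrained AlexNet (assumption).
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.eval()
# Keep everything up to and including the first 4096-d fully-connected
# layer (Dropout, Linear 9216->4096, ReLU); drop the rest of the classifier.
model.classifier = torch.nn.Sequential(*list(model.classifier.children())[:3])

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_feature(path):
    """Return an L2-normalized 4096-d feature vector for one image."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        f = model(x).squeeze(0)
    return torch.nn.functional.normalize(f, dim=0).numpy()

# train_paths / train_labels are placeholders for the target dataset.
features = [extract_feature(p) for p in train_paths]
clf = LinearSVC(C=1.0).fit(features, train_labels)
```

One way to add the jittering the abstract mentions would be to extract features for a few random crops and flips of each image at the preprocessing step and average them before training the SVM.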

Place, publisher, year, edition, pages
2014.
National subject category
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-149178
DOI: 10.1109/CVPRW.2014.131
ISI: 000349552300079
Scopus ID: 2-s2.0-84908537903
OAI: oai:DiVA.org:kth-149178
DiVA id: diva2:738235
Conference
Computer Vision and Pattern Recognition (CVPR) 2014, DeepVision workshop, June 28, 2014, Columbus, Ohio
Note

Best Paper Runner-up Award.

QC 20140825

Available from: 2014-08-16. Created: 2014-08-16. Last updated: 2018-01-11. Bibliographically reviewed.
Part of thesis
1. Visual Representations and Models: From Latent SVM to Deep Learning
2016 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Two important components of a visual recognition system are the representation and the model. Both involve selecting and learning the features that are indicative for recognition and discarding those that are uninformative. This thesis, in its general form, proposes different techniques within the frameworks of two learning systems for representation and modeling: latent support vector machines (latent SVMs) and deep learning.
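
For reference, the latent SVM objective that the first group of works builds on can be written as follows. This is the generic form from the literature, not a formula reproduced from the thesis; z ranges over the latent variables, e.g. part placements or, in the clustering work, cluster membership.

```latex
\min_{w}\;\frac{\lambda}{2}\,\|w\|^{2}
  \;+\; \sum_{i=1}^{n} \max\!\Bigl(0,\; 1 - y_i \max_{z \in Z(x_i)} w^{\top} \Phi(x_i, z)\Bigr)
```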

First, we propose various approaches to group the positive samples into clusters of visually similar instances. Given a fixed representation, the sampled space of the positive distribution is usually structured. The proposed clustering techniques include a novel similarity measure based on exemplar learning, an approach for using additional annotation, and an augmentation of the latent SVM that automatically finds clusters whose members can be reliably distinguished from the background class.

In another effort, a strongly supervised DPM is suggested to study how these models can benefit from privileged information. The extra information comes in the form of semantic part annotations (i.e., their presence and location), which are used to constrain the DPM's latent variables during, or prior to, the optimization of the latent SVM. Its effectiveness is demonstrated on the task of animal detection.

Finally, we generalize the formulation of discriminative latent variable models, including DPMs, to incorporate a new set of latent variables representing the structure or properties of negative samples; we therefore term them negative latent variables. We show that this generalization affects state-of-the-art techniques and helps visual recognition by explicitly searching for counter-evidence of an object's presence.

Following the resurgence of deep networks, the last works of this thesis focus on deep learning in order to produce a generic representation for visual recognition. A convolutional network (ConvNet) is trained on a large annotated image classification dataset, ImageNet, with roughly 1.3 million images. The activations at each layer of the trained ConvNet can then be treated as the representation of an input image. We show that such a representation is surprisingly effective for various recognition tasks, making it clearly superior to all the handcrafted features previously used in visual recognition (such as HOG in our first works on DPMs). We further investigate ways to improve this representation for a task at hand, proposing various factors, applied before or after training the representation, that can improve the efficacy of the ConvNet representation. These factors are analyzed on 16 datasets from various subfields of visual recognition.
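
To make "the activations at each layer can be treated as the representation" concrete, here is a short sketch that grabs activations at several layers with forward hooks. The choice of torchvision AlexNet and the layer indices are illustrative assumptions, not the thesis's exact setup.

```python
import torch
import torchvision.models as models

# Stand-in ImageNet-trained ConvNet (assumption, not the thesis's model).
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
activations = {}

def save_to(name):
    # Return a hook that stores the layer output, flattened per example.
    def hook(module, inputs, output):
        activations[name] = output.detach().flatten(1)  # (batch, features)
    return hook

# Register hooks on two layers of interest (indices are illustrative).
model.features[5].register_forward_hook(save_to("pool2"))
model.classifier[2].register_forward_hook(save_to("fc6"))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))  # dummy input for illustration
print({k: v.shape for k, v in activations.items()})
```

Each stored vector is then a candidate representation, and comparing them across layers is one way to run the kind of layer-wise analysis the abstract describes.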

Place, publisher, year, edition, pages
Stockholm, Sweden: KTH Royal Institute of Technology, 2016. p. 172
Series
TRITA-CSC-A, ISSN 1653-5723 ; 21
Keywords
Computer Vision, Machine Learning, Artificial Intelligence, Deep Learning, Learning Representation, Deformable Part Models, Discriminative Latent Variable Models, Convolutional Networks, Object Recognition, Object Detection
National subject category
Electrical Engineering and Electronics; Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-192289 (URN)
978-91-7729-110-7 (ISBN)
Public defence
2016-09-27, Kollegiesalen, Brinellvägen 8, KTH-huset, floor 4, KTH Campus, Stockholm, 15:26 (English)
Opponent
Supervisors
Note

QC 20160908

Available from: 2016-09-08. Created: 2016-09-08. Last updated: 2016-09-09. Bibliographically reviewed.
2. Convolutional Network Representation for Visual Recognition
2017 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Image representation is a key component in visual recognition systems. In a visual recognition problem, the model should be able to learn and infer the presence of certain visual semantics in the image. It is therefore important for the model to represent the input image in a way that the semantics of interest can be inferred easily and reliably. This thesis is written as a compilation of publications and looks into the Convolutional Network (ConvNet) representation in visual recognition problems from an empirical perspective. A convolutional network is a special class of neural networks with a hierarchical structure, in which every layer's output (except the last layer's) is the input of the next. It was shown that ConvNets are powerful tools to learn a generic representation of an image. In this body of work, we first showed that this is indeed the case: a ConvNet representation with a simple classifier can outperform highly tuned pipelines based on hand-crafted features. To be precise, we first trained a ConvNet on a large dataset; then, for every image in another task with a small dataset, we fed the image forward through the ConvNet and took the ConvNet's activation at a certain layer as the image representation. Transferring the knowledge from the large dataset (source task) to the small dataset (target task) proved to be effective and outperformed baselines on a variety of tasks in visual recognition. We also evaluated the presence of spatial visual semantics in the ConvNet representation and observed that the ConvNet retains significant spatial information, despite the fact that it was never explicitly trained to preserve low-level semantics. We then investigated the factors that affect the transferability of these representations. We studied various factors on a diverse set of visual recognition tasks and found a consistent correlation between the effect of those factors and the similarity of the target task to the source task. This intuition, alongside the experimental results, provides a guideline for improving the performance of visual recognition tasks using ConvNet features. Finally, we addressed the task of visual instance retrieval specifically, as an example of how these simple intuitions can massively increase the performance of the target task.
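
The last point, instance retrieval, reduces to nearest-neighbour search in feature space. A minimal sketch, assuming every image has already been encoded into an L2-normalized ConvNet feature vector (e.g. with the illustrative extract_feature helper sketched earlier); the function and argument names here are hypothetical:

```python
import numpy as np

def retrieve(query_feat, db_feats, db_paths, top_k=5):
    """Rank database images by L2 distance to the query feature.

    query_feat: (d,) array; db_feats: (n, d) array of database features;
    db_paths: list of n image paths aligned with db_feats.
    """
    dists = np.linalg.norm(db_feats - query_feat, axis=1)
    order = np.argsort(dists)[:top_k]
    return [(db_paths[i], float(dists[i])) for i in order]
```

With L2-normalized features, ranking by Euclidean distance is equivalent to ranking by cosine similarity, which is part of why this simple distance works as well as it does.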

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2017. p. 130
Series
TRITA-CSC-A, ISSN 1653-5723 ; 2017:01
Keywords
Convolutional Network, Visual Recognition, Transfer Learning
National subject category
Robotics and Automation
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-197919 (URN)
978-91-7729-213-5 (ISBN)
Public defence
2017-01-13, F3, Lindstedtsvägen 26, Stockholm, 10:00 (English)
Opponent
Supervisors
Note

QC 20161209

Available from: 2016-12-09. Created: 2016-12-09. Last updated: 2016-12-23. Bibliographically reviewed.

Open Access in DiVA

fulltext (429 kB), 868 downloads
File information
File name: FULLTEXT01.pdf. File size: 429 kB. Checksum: SHA-512
7537632484ba3c829d7ba9937d36eee7c9bb0a7bd75da9493fccaab82862223910348cf09546e7e99c95171a7810e026d2a2446ba850db58bb1fb528b29e9a49
Type: fulltext. MIME type: application/pdf

Other links

Publisher's full text · Scopus · Conference website

Total: 868 downloads
The number of downloads is the sum of downloads for all full texts. It may include, for example, earlier versions that are no longer available.
