Interpreting Video Features: A Comparison of 3D Convolutional Networks and Convolutional LSTM Networks
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0003-2171-1429
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0001-5458-3473
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0002-7796-1438
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0002-5750-9655
2020 (English). Conference paper, Published paper (Refereed)
Abstract [en]

A number of techniques for interpretability have been presented for deep learning in computer vision, typically with the goal of understanding what the networks have based their classification on. However, interpretability for deep video architectures is still in its infancy and we do not yet have a clear concept of how to decode spatiotemporal features. In this paper, we present a study comparing how 3D convolutional networks and convolutional LSTM networks learn features across temporally dependent frames. This is the first comparison of two video models that both convolve to learn spatial features but have principally different methods of modeling time. Additionally, we extend the concept of meaningful perturbation introduced by Vedaldi et al. to the temporal dimension, to identify the temporal part of a sequence most meaningful to the network for a classification decision. Our findings indicate that the 3D convolutional model concentrates on shorter events in the input sequence, and places its spatial focus on fewer, contiguous areas.
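To make the temporal extension of meaningful perturbation concrete, here is a minimal, hypothetical sketch (not the authors' released code). It assumes a PyTorch video classifier `model` that maps a clip tensor of shape (1, T, C, H, W) to class logits, a clip `clip` of that shape, and a target class index; it optimizes one mask value per frame so that frames whose perturbation most reduces the target-class score end up marked as the most important part of the sequence.

```python
# Hypothetical sketch of temporally extended meaningful perturbation.
# Assumptions: `model` maps a (1, T, C, H, W) clip to class logits,
# `clip` is such a tensor, `target` is the class index of interest.
import torch

def temporal_importance_mask(model, clip, target, steps=300, lam=0.1, lr=0.05):
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)          # only the mask is optimized
    T = clip.shape[1]
    # Perturbed reference clip: every frame replaced by the temporal mean frame
    # (a simple stand-in for the freezing/blurring used by perturbation methods).
    ref = clip.mean(dim=1, keepdim=True).expand_as(clip)
    logits_m = torch.zeros(1, T, 1, 1, 1, requires_grad=True)  # one value per frame
    opt = torch.optim.Adam([logits_m], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        mask = torch.sigmoid(logits_m)                  # 1 = perturb this frame
        blended = (1.0 - mask) * clip + mask * ref
        score = torch.softmax(model(blended), dim=1)[0, target]
        # Drive the target score down while perturbing as few frames as possible;
        # frames that end up with mask values near 1 are the ones the network relied on.
        loss = score + lam * mask.mean()
        loss.backward()
        opt.step()
    return torch.sigmoid(logits_m).detach().flatten()   # per-frame importance in [0, 1]
```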

Place, publisher, year, edition, pages
IEEE, 2020.
National Category
Computer graphics and computer vision
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-289568
OAI: oai:DiVA.org:kth-289568
DiVA, id: diva2:1525242
Conference
Asian Conference on Computer Vision
Note

QC 20210204

Available from: 2021-02-03. Created: 2021-02-03. Last updated: 2025-02-07. Bibliographically approved.
In thesis
1. Interpretable, Interaction-Aware Vehicle Trajectory Prediction with Uncertainty
2021 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Autonomous driving technologies have recently made great strides in development, with several companies and research groups getting close to producing a vehicle with full autonomy. Self-driving cars introduce many advantages, including increased traffic safety and added ride-sharing capabilities that reduce environmental impact. To achieve these benefits, many modules must work together on an autonomous platform to solve the multiple tasks required. One of these tasks is the prediction of the future positions and maneuvers of surrounding human drivers. It is necessary for autonomous driving platforms to be able to reason about, and predict, the future trajectories of other agents in traffic scenarios so that they can ensure their planned maneuvers remain safe and feasible throughout their execution. Due to the stochastic nature of many traffic scenarios, these predictions should also take into account the inherent uncertainty involved, caused by both the road structure and the driving styles of human drivers. Since many traffic scenarios include vehicles changing their behavior based on the actions of others, for example by yielding or changing lanes, these interactions should be taken into account to produce more robust predictions. Lastly, the prediction methods should also provide a level of transparency and traceability. On a self-driving platform with many safety-critical tasks, it is important to be able to identify where an error occurred in a failure case, and what caused it. This helps prevent the problem from recurring, and can also aid in finding new and relevant test cases for simulation.

In this thesis, we present a deep learning framework for vehicle trajectory prediction that fulfills these criteria. We first show that by operating on a generic representation of the traffic scene, our model can implicitly learn interactions between vehicles by capturing the spatio-temporal features in the data using recurrent and convolutional operations, and produce predictions for all vehicles simultaneously. We then explore different methods for incorporating uncertainty regarding the actions of human drivers, and show that Conditional Variational Autoencoders are highly suited for our prediction method, allowing it to produce multi-modal predictions accounting for different maneuvers as well as variations within them. To address the issue of transparency for deep learning methods, we also develop an interpretability framework for deep learning models operating on sequences of images. This allows us to show, both spatially and temporally, what the models base their output on for all modes of input without requiring a dedicated model architecture, using the proposed Temporal Masks method. Finally, all these extensions are incorporated into one method, and the resulting prediction module is implemented and interfaced with a real-world autonomous driving research platform.
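As a rough illustration of how a Conditional Variational Autoencoder can yield multi-modal trajectory predictions, the following is a hypothetical sketch and not the thesis implementation. The conditioning vector `cond` stands in for the learned spatio-temporal scene encoding described above; the latent size, prediction horizon, and layer widths are arbitrary assumptions.

```python
# Hypothetical CVAE sketch for multi-modal trajectory prediction (not the thesis code).
# Assumptions: `cond` is a (B, d_cond) encoding of the observed scene/history,
# and the future trajectory is a sequence of `horizon` 2-D positions.
import torch
import torch.nn as nn

class TrajectoryCVAE(nn.Module):
    def __init__(self, d_cond=128, d_latent=16, horizon=30):
        super().__init__()
        self.horizon = horizon
        self.d_latent = d_latent
        # Recognition network q(z | future, cond), used only during training.
        self.enc = nn.Sequential(nn.Linear(d_cond + horizon * 2, 128), nn.ReLU(),
                                 nn.Linear(128, 2 * d_latent))
        # Decoder p(future | z, cond).
        self.dec = nn.Sequential(nn.Linear(d_cond + d_latent, 128), nn.ReLU(),
                                 nn.Linear(128, horizon * 2))

    def forward(self, cond, future):
        stats = self.enc(torch.cat([cond, future.flatten(1)], dim=-1))
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        pred = self.dec(torch.cat([cond, z], dim=-1)).view(-1, self.horizon, 2)
        return pred, mu, logvar                                 # train with reconstruction + KL loss

    @torch.no_grad()
    def sample(self, cond, n_modes=5):
        # At test time, different latent samples yield different plausible futures,
        # covering distinct maneuvers as well as variations within them.
        z = torch.randn(n_modes, cond.shape[0], self.d_latent)
        cond_rep = cond.unsqueeze(0).expand(n_modes, -1, -1)
        out = self.dec(torch.cat([cond_rep, z], dim=-1))
        return out.view(n_modes, cond.shape[0], self.horizon, 2)
```

Drawing several latent samples at inference time produces several plausible future trajectories per vehicle, which is one common way to realize the multi-modal behaviour described in the abstract.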

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2021. p. 139
Series
TRITA-EECS-AVL ; 2021:9
Keywords
Trajectory Prediction, Computer Vision, Autonomous Driving, Deep Learning, Interpretability
National Category
Computer graphics and computer vision; Robotics and automation
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-289569
ISBN: 978-91-7873-770-3
Public defence
2021-02-26, F3, Lindstedtsvägen 26, Stockholm, 10:00 (English)
Opponent
Supervisors
Funder
Vinnova, 2016-02547
Note

QC 20210210

Available from: 2021-02-10. Created: 2021-02-03. Last updated: 2025-02-05. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Authority records

Mänttäri, Joonatan; Broomé, Sofia; Folkesson, John; Kjellström, Hedvig
