Beyond Standard Assumptions in Autonomous Driving Perception
Khoche, Ajinkya. KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, Perception and Learning. Traton AB. ORCID iD: 0009-0009-6935-6797
2026 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Autonomous driving perception is commonly developed and evaluated under a set of enabling assumptions: that multi-sensor evidence is physically consistent at the frame level, that geometry is sufficiently dense to support reliable inference about other traffic participants and the surrounding environment, and that learning can rely on either abundant human labels or self-supervised objectives derived from the sensor stream. This thesis examines what remains feasible when these assumptions no longer hold, and develops methods and design principles for perception under asynchronous sensing, long-range sparsity, and weak or unreliable supervision.

We first study physical inconsistency in multi-sensor data. We show that rolling and asynchronous acquisition, motion during aggregation, and annotation practices that implicitly assume temporal coherence can render the perception problem ill-posed before any representation choice is made. We therefore treat data preparation, motion compensation, and annotation consistency as integral parts of the perception pipeline, since errors at this stage can propagate directly into annotation, training, and evaluation.

We then examine representation under long-range sparsity. We show that long-range performance is limited not only by model capacity, but by the representations used to encode and expose ambiguous evidence. In particular, object-centric outputs and dense internal representations can force premature commitment when available evidence collapses at distance. To study this, we present results on long-range 3D object detection and sparse long-range scene flow, showing both the limits of object-centric perception under weak observability and the value of motion-centric estimation as range increases.

Finally, we study learning signals when labels and geometry-derived self-supervision become unreliable. We show that motion supervision can be recovered by importing physically grounded constraints from complementary modalities, using radar Doppler to guide LiDAR scene flow learning. We further show that scalable semantic supervision can be obtained from foundation-model priors through curriculum-based synthetic-to-real adaptation, which anchors language-aligned representations to real LiDAR characteristics.

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2026, p. 103
Series
TRITA-EECS-AVL ; 2026:22
Keywords [en]
Autonomous Driving, Computer Vision, Robotics
National Category
Computer graphics and computer vision; Robotics and automation
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-378742, ISBN: 978-91-8106-558-9 (print), OAI: oai:DiVA.org:kth-378742, DiVA id: diva2:2049001
Public defence
2026-04-17, Kollegiesalen, Brinellvägen 8, Stockholm, 09:00 (English)
Opponent
Supervisors
Note

Zoom link: https://kth-se.zoom.us/s/68091974260

Available from: 2026-03-27. Created: 2026-03-26. Last updated: 2026-04-08. Bibliographically approved.
List of papers
1. Addressing Data Annotation Challenges in Multiple Sensors: A Solution for Scania Collected Datasets
2024 (English) In: 2024 European Control Conference, ECC 2024, Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 1032-1038. Conference paper, Published paper (Refereed)
Abstract [en]

Data annotation in autonomous vehicles is a critical step in the development of Deep Neural Network (DNN) based models and in the performance evaluation of the perception system. It often takes the form of adding 3D bounding boxes to time-sequential, registered series of point sets captured by active sensors such as Light Detection and Ranging (LiDAR) and Radio Detection and Ranging (RADAR). When annotating multiple active sensors, the points must be motion-compensated and translated to a consistent coordinate frame and timestamp. However, highly dynamic objects pose a unique challenge, as they can appear at different timestamps in each sensor's data. Without knowing an object's speed, its position appears different in each sensor's output. Thus, even after motion compensation, highly dynamic objects are not matched across sensors in the same frame, and human annotators struggle to add unique bounding boxes that capture all objects. This article addresses this challenge, primarily within the context of Scania-collected datasets. The proposed solution takes the track of an annotated object as input and uses Moving Horizon Estimation (MHE) to robustly estimate its speed. The estimated speed profile is used to correct the position of the annotated box and to add boxes to object clusters missed by the original annotation.
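
To make the speed-estimation step concrete, the sketch below fits a constant-velocity model over a short horizon of annotated box centres and uses the estimate to shift a box to a sensor's timestamp. It is a simplified illustration of the idea rather than the paper's MHE formulation; the function names, the window length, and the constant-velocity assumption are ours.

```python
import numpy as np

def estimate_velocity_horizon(centers, times, horizon=5):
    """Sliding-window constant-velocity fit over annotated box centres.

    centers : (N, 3) box centres of one object track, one per frame
    times   : (N,) timestamps in seconds
    horizon : number of past frames used in each estimate

    Returns an (N, 3) array of velocity estimates. This is a simplified
    stand-in for Moving Horizon Estimation: each window is solved by
    least squares under a constant-velocity model.
    """
    centers = np.asarray(centers, dtype=float)
    times = np.asarray(times, dtype=float)
    velocities = np.zeros_like(centers)
    for i in range(len(centers)):
        lo = max(0, i - horizon + 1)
        t, p = times[lo:i + 1], centers[lo:i + 1]
        if len(t) < 2:
            continue
        # Fit p(t) = p0 + v * t per axis; the slope row is the velocity.
        A = np.stack([np.ones_like(t), t], axis=1)      # (w, 2)
        coef, *_ = np.linalg.lstsq(A, p, rcond=None)    # (2, 3)
        velocities[i] = coef[1]
    return velocities

def shift_box_to_timestamp(center, velocity, t_box, t_target):
    """Move a box centre from its annotation time to a sensor's timestamp."""
    return np.asarray(center) + np.asarray(velocity) * (t_target - t_box)

if __name__ == "__main__":
    times = np.arange(10) * 0.1                                   # 10 Hz track
    centers = np.stack([5.0 * times, np.zeros(10), np.zeros(10)], axis=1)
    v = estimate_velocity_horizon(centers, times)
    print(v[-1])                                                  # ~[5, 0, 0] m/s
    print(shift_box_to_timestamp(centers[-1], v[-1], times[-1], times[-1] + 0.05))
```

A full MHE would typically also enforce dynamics and measurement-noise models over the horizon, which is what makes the paper's estimate robust to annotation jitter.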

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-351940 (URN), 10.23919/ECC64448.2024.10590958 (DOI), 001290216500156, 2-s2.0-85200563187 (Scopus ID)
Conference
2024 European Control Conference, ECC 2024, Stockholm, Sweden, Jun 25 2024 - Jun 28 2024
Note

Part of ISBN [9783907144107]

QC 20240830

Available from: 2024-08-19. Created: 2024-08-19. Last updated: 2026-03-26. Bibliographically approved.
2. HiMo: High-Speed Objects Motion Compensation in Point Clouds
2025 (English) In: IEEE Transactions on Robotics, ISSN 1552-3098, E-ISSN 1941-0468, Vol. 41, p. 5896-5911. Article in journal (Refereed), Published
Abstract [en]

LiDAR point clouds are essential for autonomous vehicles, but motion distortions from dynamic objects degrade data quality. While previous work has considered distortions caused by ego motion, distortions caused by other moving objects remain largely overlooked, leading to errors in object shape and position. This distortion is particularly pronounced in high-speed environments such as highways and in multi-LiDAR configurations, a common setup for heavy vehicles. To address this challenge, we introduce HiMo, a pipeline that repurposes scene flow estimation for non-ego motion compensation, correcting the representation of dynamic objects in point clouds. During the development of HiMo, we observed that existing self-supervised scene flow estimators often produce degenerate or inconsistent estimates under high-speed distortion. We further propose SeFlow++, a real-time scene flow estimator that achieves state-of-the-art performance on both scene flow and motion compensation. Since well-established motion distortion metrics are absent in the literature, we introduce two evaluation metrics: point-level compensation accuracy and object shape similarity. We validate HiMo through extensive experiments on Argoverse 2, ZOD, and a newly collected real-world dataset featuring highway driving and multi-LiDAR-equipped heavy vehicles. Our findings show that HiMo improves the geometric consistency and visual fidelity of dynamic objects in LiDAR point clouds, benefiting downstream tasks such as semantic segmentation and 3D detection. See https://kin-zhang.github.io/HiMo for more details.
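
The compensation step can be sketched as shifting each dynamic point along its estimated scene flow by the fraction of the sweep separating its capture time from a reference timestamp. This is an illustrative reading of the idea, not the HiMo implementation; the function name, the per-point timestamps, and the fixed sweep interval are assumptions.

```python
import numpy as np

def compensate_non_ego_motion(points, flow, point_times, t_ref, sweep_dt=0.1):
    """De-skew dynamic object points to a common reference timestamp.

    points      : (N, 3) LiDAR points, already ego-motion compensated
    flow        : (N, 3) per-point scene flow over one sweep interval (m/sweep)
    point_times : (N,) per-point capture timestamps in seconds
    t_ref       : reference timestamp all points should be expressed at
    sweep_dt    : duration of the sweep the flow was estimated over (s)

    Each point is moved along its estimated flow by the fraction of the
    sweep between its capture time and the reference time. A simplified
    sketch; association and estimator details are beyond this snippet.
    """
    frac = ((t_ref - np.asarray(point_times)) / sweep_dt)[:, None]   # (N, 1)
    return np.asarray(points) + np.asarray(flow) * frac

if __name__ == "__main__":
    pts = np.array([[10.0, 0.0, 0.0], [11.0, 0.0, 0.0]])   # same object, skewed
    flw = np.array([[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]])     # 2 m per 0.1 s sweep
    t_pts = np.array([0.00, 0.05])
    print(compensate_non_ego_motion(pts, flw, t_pts, t_ref=0.1))  # both ~[12, 0, 0]
```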

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2025
Keywords
Autonomous Driving Navigation, Computer Vision for Transportation, Motion Compensation, Range Sensing
National Category
Computer graphics and computer vision; Robotics and automation; Vehicle and Aerospace Engineering; Signal Processing
Identifiers
urn:nbn:se:kth:diva-372474 (URN), 10.1109/TRO.2025.3619042 (DOI), 2-s2.0-105019222489 (Scopus ID)
Note

QC 20251107

Available from: 2025-11-07. Created: 2025-11-07. Last updated: 2026-03-26. Bibliographically approved.
3. Towards Long-Range 3D Object Detection for Autonomous Vehicles
2024 (English) In: 35th IEEE Intelligent Vehicles Symposium, IV 2024, Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 2206-2212. Conference paper, Published paper (Refereed)
Abstract [en]

3D object detection at long range is crucial for ensuring the safety and efficiency of self-driving vehicles, allowing them to accurately perceive and react to objects, obstacles, and potential hazards from a distance. However, most current state-of-the-art LiDAR-based methods are range-limited due to sparsity at long range, which creates a form of domain gap between points closer to and farther from the ego vehicle. A related problem is the label imbalance for faraway objects, which inhibits the performance of deep neural networks at long range. To address these limitations, we investigate two ways to improve the long-range performance of current LiDAR-based 3D detectors. First, we combine two 3D detection networks, referred to as range experts: one specializing in near- to mid-range objects and one in long-range 3D detection. To train a detector at long range under a scarce label regime, we further weight the loss according to the labelled object's distance from the ego vehicle. Second, we augment LiDAR scans with virtual points generated using Multimodal Virtual Points (MVP), a readily available image-based depth completion algorithm. Our experiments on the long-range Argoverse 2 (AV2) dataset indicate that MVP is more effective at improving long-range performance while maintaining a straightforward implementation. The range experts, on the other hand, offer a computationally efficient and simpler alternative that avoids dependence on image-based segmentation networks and perfect camera-LiDAR calibration.
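
The distance-dependent loss weighting can be illustrated with a simple per-object scheme in which faraway labels contribute more to the loss. The linear ramp, the parameter names, and the normalisation below are illustrative choices; the abstract only states that the loss is weighted by a label's distance from the ego vehicle.

```python
import numpy as np

def range_weight(box_centers, max_range=200.0, alpha=2.0):
    """Per-object loss weights that grow with distance from the ego vehicle.

    box_centers : (N, 3) labelled box centres in the ego frame
    max_range   : range at which the weight saturates at 1 + alpha
    alpha       : how strongly faraway labels are up-weighted

    The linear ramp is an illustrative assumption, not the paper's formula.
    """
    dist = np.linalg.norm(np.asarray(box_centers)[:, :2], axis=1)
    return 1.0 + alpha * np.clip(dist / max_range, 0.0, 1.0)

def weighted_detection_loss(per_object_losses, box_centers):
    """Combine per-object losses with range-dependent weights."""
    w = range_weight(box_centers)
    return float(np.sum(w * per_object_losses) / np.sum(w))

if __name__ == "__main__":
    losses = np.array([1.0, 1.0, 1.0])
    centers = np.array([[10.0, 0, 0], [100.0, 0, 0], [190.0, 0, 0]])
    print(range_weight(centers))                 # faraway labels weigh more
    print(weighted_detection_loss(losses, centers))
```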

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
National Category
Computer graphics and computer vision; Robotics and automation
Identifiers
urn:nbn:se:kth:diva-351752 (URN), 10.1109/IV55156.2024.10588513 (DOI), 001275100902040, 2-s2.0-85199779839 (Scopus ID)
Conference
35th IEEE Intelligent Vehicles Symposium, IV 2024, Jeju Island, Korea, Jun 2 2024 - Jun 5 2024
Note

Part of ISBN [9798350348811]

QC 20240823

Available from: 2024-08-13. Created: 2024-08-13. Last updated: 2026-03-26. Bibliographically approved.
4. SSF: Sparse Long-Range Scene Flow for Autonomous Driving
2025 (English) In: 2025 IEEE International Conference on Robotics and Automation, ICRA 2025, Institute of Electrical and Electronics Engineers (IEEE), 2025, p. 6394-6400. Conference paper, Published paper (Refereed)
Abstract [en]

Scene flow enables an understanding of the motion characteristics of the environment in the 3D world. It gains particular significance at long range, where object-based perception methods might fail due to sparse observations far away. Although significant advancements have been made in scene flow pipelines to handle large-scale point clouds, a gap remains in scalability with respect to range. We attribute this limitation to the common design choice of dense feature grids, whose size scales quadratically with range. In this paper, we propose Sparse Scene Flow (SSF), a general pipeline for long-range scene flow that adopts a sparse-convolution-based backbone for feature extraction. This approach introduces a new challenge: a mismatch in the size and ordering of sparse feature maps between time-sequential point scans. To address this, we propose a sparse feature fusion scheme that augments the feature maps with virtual voxels at missing locations. Additionally, we propose a range-wise metric that implicitly gives greater importance to faraway points. Our method, SSF, achieves state-of-the-art results on the Argoverse 2 dataset, demonstrating strong performance in long-range scene flow estimation. Our code is open-sourced at https://github.com/KTH-RPL/SSF.git.
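
The fusion step can be sketched as aligning the two sparse voxel feature maps on the union of their occupied coordinates and inserting zero-valued "virtual" features where a location is occupied in only one scan, so both maps share size and ordering. The function below is an illustrative sketch under those assumptions; the zero initialisation and the coordinate hashing are ours, not the paper's exact scheme.

```python
import numpy as np

def fuse_sparse_features(coords_t0, feats_t0, coords_t1, feats_t1):
    """Align two sparse voxel feature maps before temporal fusion.

    coords_* : (N, 3) integer voxel coordinates of occupied voxels
    feats_*  : (N, C) features of those voxels

    Occupied locations from both scans are merged into one ordered union;
    where a scan has no voxel at a location, a zero "virtual" feature is
    inserted, so the maps can be concatenated channel-wise.
    """
    keys0 = [tuple(map(int, c)) for c in coords_t0]
    keys1 = [tuple(map(int, c)) for c in coords_t1]
    union = sorted(set(keys0) | set(keys1))
    lut0, lut1 = dict(zip(keys0, feats_t0)), dict(zip(keys1, feats_t1))
    zeros = np.zeros(feats_t0.shape[1])
    f0 = np.stack([lut0.get(k, zeros) for k in union])
    f1 = np.stack([lut1.get(k, zeros) for k in union])
    return np.array(union), np.concatenate([f0, f1], axis=1)   # (M, 3), (M, 2C)

if __name__ == "__main__":
    c0 = np.array([[0, 0, 0], [1, 0, 0]]); f0 = np.ones((2, 4))
    c1 = np.array([[1, 0, 0], [2, 0, 0]]); f1 = 2 * np.ones((2, 4))
    coords, fused = fuse_sparse_features(c0, f0, c1, f1)
    print(coords.shape, fused.shape)        # (3, 3) (3, 8)
```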

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2025
National Category
Computer graphics and computer vision; Computer Sciences; Condensed Matter Physics; Signal Processing
Identifiers
urn:nbn:se:kth:diva-371385 (URN), 10.1109/ICRA55743.2025.11128770 (DOI), 2-s2.0-105016555490 (Scopus ID), 979-8-3315-4139-2 (ISBN)
Conference
2025 IEEE International Conference on Robotics and Automation, ICRA 2025, Atlanta, United States of America, May 19 2025 - May 23 2025
Note

Part of ISBN 979-8-3315-4139-2

QC 20251009

Available from: 2025-10-09. Created: 2025-10-09. Last updated: 2026-03-26. Bibliographically approved.
5. DoGFlow: Self-Supervised LiDAR Scene Flow via Cross-Modal Doppler Guidance
2026 (English) In: IEEE Robotics and Automation Letters, E-ISSN 2377-3766, Vol. 11, no. 3, p. 3836-3843. Article in journal (Refereed), Published
Abstract [en]

Accurate 3D scene flow estimation is critical for autonomous systems to navigate dynamic environments safely, but creating the necessary large-scale, manually annotated datasets remains a significant bottleneck for developing robust perception models. Current self-supervised methods struggle to match the performance of fully supervised approaches, especially in challenging long-range and adverse weather scenarios, while supervised methods are not scalable due to their reliance on expensive human labeling. We introduce DoGFlow, a novel self-supervised framework that recovers full 3D object motions for LiDAR scene flow estimation without requiring any manual ground truth annotations. This paper presents our cross-modal label transfer approach, where DoGFlow computes motion labels directly from 4D radar Doppler measurements and transfers them to the LiDAR domain using dynamic-aware association and ambiguity-resolved propagation. On the challenging MAN TruckScenes dataset, DoGFlow substantially outperforms existing self-supervised methods and improves label efficiency by enabling LiDAR backbones to achieve over 90% of fully supervised performance with only 10% of the ground truth data. For more details, including supplementary material, please visit https://ajinkyakhoche.github.io/DoGFlow/.
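
The cross-modal label transfer can be sketched as associating dynamic LiDAR points with nearby 4D radar detections and converting each detection's Doppler (radial) velocity into a displacement pseudo-label over one sweep. The nearest-neighbour association and the purely radial motion label below are simplifications of the paper's dynamic-aware association and ambiguity-resolved propagation; all names and thresholds are illustrative.

```python
import numpy as np

def doppler_pseudo_labels(lidar_pts, radar_pts, radar_vr, dt=0.1, max_dist=2.0):
    """Turn radar Doppler measurements into LiDAR scene-flow pseudo-labels.

    lidar_pts : (N, 3) LiDAR points (ego-motion compensated)
    radar_pts : (M, 3) radar detections in the same frame
    radar_vr  : (M,) radial (Doppler) velocity of each detection, m/s
    dt        : sweep interval the flow label should cover, s
    max_dist  : association radius between LiDAR points and detections, m

    Each LiDAR point takes the radial velocity of its nearest radar
    detection, expressed as a displacement along that detection's line of
    sight. A simplification: only the radial motion component is kept.
    """
    lidar_pts = np.asarray(lidar_pts, float)
    radar_pts = np.asarray(radar_pts, float)
    labels = np.zeros_like(lidar_pts)
    valid = np.zeros(len(lidar_pts), dtype=bool)
    for i, p in enumerate(lidar_pts):
        d = np.linalg.norm(radar_pts - p, axis=1)
        j = int(np.argmin(d))
        if d[j] > max_dist:
            continue                                   # no radar support nearby
        los = radar_pts[j] / (np.linalg.norm(radar_pts[j]) + 1e-9)
        labels[i] = radar_vr[j] * los * dt             # radial displacement label
        valid[i] = True
    return labels, valid

if __name__ == "__main__":
    lidar = np.array([[30.0, 0.0, 0.0]])
    radar = np.array([[30.5, 0.2, 0.0]])
    vr = np.array([-12.0])                             # approaching at 12 m/s
    print(doppler_pseudo_labels(lidar, radar, vr))
```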

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2026
Keywords
Computer Vision for Transportation, Deep Learning for Visual Perception, Sensor Fusion
National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:kth:diva-377626 (URN), 10.1109/LRA.2026.3662592 (DOI), 001694547300026, 2-s2.0-105030121636 (Scopus ID)
Note

QC 20260305

Available from: 2026-03-05. Created: 2026-03-05. Last updated: 2026-03-26. Bibliographically approved.
6. BlendCLIP: Bridging Synthetic and Real Domains for Zero-Shot 3D Object Classification with Multimodal Pretraining
2026 (English) In: Winter Conference on Applications of Computer Vision, 2026, p. 5766-5775. Conference paper, Published paper (Refereed)
Abstract [en]

Zero-shot 3D object classification is crucial for real-world applications like autonomous driving; however, it is often hindered by a significant domain gap between the synthetic data used for training and the sparse, noisy LiDAR scans encountered in the real world. Current methods trained solely on synthetic data fail to generalize to outdoor scenes, while those trained only on real data lack the semantic diversity to recognize rare or unseen objects. We introduce BlendCLIP, a multimodal pretraining framework that bridges this synthetic-to-real gap by strategically combining the strengths of both domains. We first propose a pipeline to generate a large-scale dataset of object-level triplets (a point cloud, an image, and a text description) mined directly from real-world driving data and human-annotated 3D boxes. Our core contribution is a curriculum-based data mixing strategy that first grounds the model in the semantically rich synthetic CAD data before progressively adapting it to the specific characteristics of real-world scans. Our experiments show that our approach is highly label-efficient: introducing as few as 1.5% real-world samples per batch into training boosts zero-shot accuracy on the nuScenes benchmark by 27%. Consequently, our final model achieves state-of-the-art performance on challenging outdoor datasets like nuScenes and TruckScenes, improving over the best prior method by 19.3% on nuScenes, while maintaining strong generalization on diverse synthetic benchmarks. Our findings demonstrate that effective domain adaptation, not full-scale real-world annotation, is the key to unlocking robust open-vocabulary 3D perception. Our code and dataset will be released upon acceptance on https://github.com/kesu1/BlendCLIP.
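
The curriculum-based data mixing can be sketched as a batch sampler whose fraction of real-world triplets starts at zero and ramps up to a small final share, so early batches are dominated by synthetic CAD data. The linear schedule, warm-up length, and function names below are illustrative assumptions; the abstract only reports that a final real share of about 1.5% per batch is sufficient.

```python
import random

def real_fraction(epoch, total_epochs, final_fraction=0.015, warmup=0.5):
    """Curriculum schedule: the share of real-world samples per batch.

    Batches start fully synthetic and linearly ramp to `final_fraction`
    after `warmup * total_epochs` epochs. The linear ramp and warm-up
    length are illustrative assumptions, not the paper's schedule.
    """
    ramp_end = max(1, int(warmup * total_epochs))
    return final_fraction * min(1.0, epoch / ramp_end)

def sample_batch(synthetic_pool, real_pool, batch_size, epoch, total_epochs):
    """Draw one mixed batch of (point cloud, image, text) triplets."""
    n_real = round(batch_size * real_fraction(epoch, total_epochs))
    batch = random.sample(real_pool, min(n_real, len(real_pool)))
    batch += random.sample(synthetic_pool, batch_size - len(batch))
    random.shuffle(batch)
    return batch

if __name__ == "__main__":
    synth = [("cad", i) for i in range(10000)]
    real = [("lidar", i) for i in range(500)]
    b = sample_batch(synth, real, batch_size=256, epoch=8, total_epochs=10)
    print(sum(1 for src, _ in b if src == "lidar"), "real samples in batch")
```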

National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:kth:diva-378741 (URN)
Conference
The IEEE/CVF Winter Conference on Applications of Computer Vision 2026, Tucson, Arizona, March 6-10, 2026
Note

QC 20260330

Available from: 2026-03-26. Created: 2026-03-26. Last updated: 2026-03-30.

Open Access in DiVA

fulltext (PDF, 22234 kB)
agreement (PDF, 189 kB)
