Improving Spatial Understanding Through Learning and Optimization
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0001-8747-6359
2025 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Spatial understanding comprises various abilities, from pose estimation of objects and cameras within a scene to shape completion from partial observations. These abilities are what enable humans to intuitively navigate and interact with the world. Despite significant progress in large-scale learning, computers still lack the intuitive spatial understanding that humans have. In robotics, this gap limits the applicability of classical robotics pipelines in real-world environments; in augmented reality, it limits both the achievable fidelity and the interaction of virtual content with real-world objects.

This thesis investigates ways to improve the spatial understanding of computers using different learning- and optimization-based techniques. Learning-based methods are employed to learn useful priors about objects and the 3D world, whereas optimization-based techniques are used to find models of objects and scenes that align well with a set of observations. Within this framework, we investigate and propose methods for three subproblems of spatial understanding.

First, we propose a modular framework for categorical object pose and shape estimation, which combines a pre-trained generative shape model with a discriminative initialization network that regresses an initial pose and latent shape description from a partial point cloud of an object. By combining the generative shape model with a differentiable renderer, we further perform iterative, joint pose and shape optimization from one or multiple views. Our approach outperforms existing methods especially on unconstrained orientations, while achieving competitive results for upright, tabletop objects.
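The iterative analysis-by-synthesis refinement described above can be sketched in miniature. Everything below is illustrative rather than the thesis pipeline: the "shape model" is a circle with a single latent radius, the "pose" is a 2D translation, and an analytic forward model stands in for the differentiable renderer.

```python
import numpy as np

def forward(t, r, dirs):
    """'Render' model points: a circle of radius r centered at t."""
    return t + r * dirs

def refine(obs, dirs, t, r, lr=0.1, steps=200):
    """Jointly refine pose t and latent shape r by gradient descent
    on the mean squared point-to-point error against the observation."""
    for _ in range(steps):
        residual = forward(t, r, dirs) - obs                      # (N, 2)
        t = t - lr * 2.0 * residual.mean(axis=0)                  # d loss / d t
        r = r - lr * 2.0 * (residual * dirs).sum(axis=1).mean()   # d loss / d r
    return t, r

# Observed "partial point cloud": circle with t = (1, -0.5), r = 2.
angles = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
obs = np.array([1.0, -0.5]) + 2.0 * dirs

# Start from a coarse initialization, as the initialization network would provide.
t_est, r_est = refine(obs, dirs, np.zeros(2), 1.0)
```

The pattern mirrors the framework's structure: a cheap discriminative initialization, then joint pose-and-shape optimization through a differentiable forward model.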

Second, we investigate the use of neural fields for dense, volumetric mapping. Specifically, we propose to represent the scene by a set of spatially constrained, movable neural fields anchored to a pose graph. We formulate the optimization of the multi-field scene representation as independent optimization of each field, demonstrating that this approach allows real-time loop-closure integration, avoids transition artifacts at field boundaries, and outperforms current neural-field-based SLAM systems on larger scenes in which significant drift can accumulate.
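The key mechanism, anchoring local maps to pose-graph nodes so that a loop-closure correction moves geometry without re-optimizing the maps, can be sketched as follows. The 2D poses and raw point sets standing in for neural fields are illustrative simplifications.

```python
import numpy as np

def se2(theta, tx, ty):
    """Homogeneous 2D rigid transform (rotation theta, translation tx, ty)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]])

def to_world(anchor, local_pts):
    """Transform Nx2 field-local points into the world frame via the anchor."""
    h = np.c_[local_pts, np.ones(len(local_pts))]
    return (anchor @ h.T).T[:, :2]

# One "field" stores its geometry in its own frame, anchored to a graph node.
local_pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
anchor = se2(0.0, 5.0, 0.0)
before = to_world(anchor, local_pts)

# A loop closure corrects the node pose; the field content is untouched and
# simply re-anchored, so no per-field reintegration is needed.
anchor = se2(np.pi / 2.0, 4.0, 0.5)
after = to_world(anchor, local_pts)
```

Because each field only ever references its anchor, a pose-graph update propagates to the dense map in constant time per field.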

Third, we investigate large-scale pre-training for visual relocalization using scene coordinate regression. We split the scene-specific regressor into a scene-agnostic regressor and a scene-specific latent map code, and propose a pre-training scheme that lets the scene-agnostic coordinate regressor better generalize from mapping images to query images with different viewpoints, lighting, and objects. We demonstrate that our approach outperforms existing methods under such dynamic mapping-query splits.
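The regressor/map-code split can be illustrated with a toy linear model. In the actual method the shared part is a pre-trained transformer and mapping is gradient-based; here the shared weights are a frozen random matrix and the scene-specific code is fit by a least-squares solve. All names and shapes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen, scene-agnostic regressor weights ("pre-trained" stand-in).
W_shared = rng.normal(size=(3, 8))

def regress(features, map_code):
    """Predict 3D scene coordinates from per-point features and a map code."""
    code = np.broadcast_to(map_code, (len(features), map_code.size))
    return np.concatenate([features, code], axis=1) @ W_shared.T

# "Mapping" a new scene: with W_shared frozen, only the latent map code is fit.
feats = rng.normal(size=(32, 4))
g_true = np.array([0.5, -1.0, 2.0, 0.0])   # synthetic ground-truth code
targets = regress(feats, g_true)

Wx, Wg = W_shared[:, :4], W_shared[:, 4:]
b = (targets - feats @ Wx.T)[0]            # code contribution, same for every row
g_fit, *_ = np.linalg.lstsq(Wg, b, rcond=None)
```

The point is structural: the same frozen regressor serves every scene, and only the compact map code carries scene-specific information.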


Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2025, p. viii, 75
Series
TRITA-EECS-AVL ; 2025:97
Keywords [en]
Spatial Understanding, Pose and Shape Estimation, Volumetric Mapping, Visual Relocalization
National Category
Computer Vision and Learning Systems
Identifiers
URN: urn:nbn:se:kth:diva-372393
ISBN: 978-91-8106-446-9 (print)
OAI: oai:DiVA.org:kth-372393
DiVA, id: diva2:2011825
Public defence
2025-12-05, https://kth-se.zoom.us/s/65134312330, F3 (Flodis), Lindstedtsvägen 26 & 28, KTH Campus, Stockholm, 13:00 (English)
Opponent
Supervisors
Note

QC 20251106

Available from: 2025-11-06. Created: 2025-11-05. Last updated: 2025-12-09. Bibliographically approved
List of papers
1. SDFEst: Categorical Pose and Shape Estimation of Objects From RGB-D Using Signed Distance Fields
2022 (English). In: IEEE Robotics and Automation Letters, E-ISSN 2377-3766, vol. 7, no. 4, p. 9597-9604. Article in journal (Refereed). Published
Abstract [en]

Rich geometric understanding of the world is an important component of many robotic applications such as planning and manipulation. In this paper, we present a modular pipeline for pose and shape estimation of objects from RGB-D images given their category. The core of our method is a generative shape model, which we integrate with a novel initialization network and a differentiable renderer to enable 6D pose and shape estimation from a single or multiple views. We investigate the use of discretized signed distance fields as an efficient shape representation for fast analysis-by-synthesis optimization. Our modular framework enables multi-view optimization and extensibility. We demonstrate the benefits of our approach over state-of-the-art methods in several experiments on both synthetic and real data. We open-source our approach at https://github.com/roym899/sdfest.
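A discretized signed distance field, the shape representation investigated above, can be sketched as a regular grid of distance samples queried by trilinear interpolation. The grid resolution, extent, and the unit sphere below are illustrative choices, not the paper's configuration.

```python
import numpy as np

n, half = 33, 1.5                                 # grid resolution and half-extent
axis = np.linspace(-half, half, n)
gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
sdf = np.sqrt(gx**2 + gy**2 + gz**2) - 1.0        # signed distance to unit sphere

def query(p):
    """Trilinearly interpolate the SDF grid at a 3D point inside the volume."""
    u = (np.asarray(p) + half) / (2 * half) * (n - 1)   # to voxel coordinates
    i = np.clip(u.astype(int), 0, n - 2)                # lower corner index
    f = u - i                                           # fractional offset
    val = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((f[0] if dx else 1 - f[0]) *
                     (f[1] if dy else 1 - f[1]) *
                     (f[2] if dz else 1 - f[2]))
                val += w * sdf[i[0] + dx, i[1] + dy, i[2] + dz]
    return val
```

Because every query is a handful of array lookups, such a grid supports the fast analysis-by-synthesis optimization the paper describes.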

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2022
Keywords
RGB-D perception, deep learning for visual perception
National Category
Robotics and automation
Identifiers
urn:nbn:se:kth:diva-316243 (URN), 10.1109/LRA.2022.3189792 (DOI), 000831182500039 (), 2-s2.0-85134261140 (Scopus ID)
Note

QC 20220817

Available from: 2022-08-17. Created: 2022-08-17. Last updated: 2025-11-05. Bibliographically approved
2. RGB-D-based categorical object pose and shape estimation: Methods, datasets, and evaluation
2023 (English). In: Robotics and Autonomous Systems, ISSN 0921-8890, E-ISSN 1872-793X, vol. 168, article id 104507. Article in journal (Refereed). Published
Abstract [en]

Recently, various methods for 6D pose and shape estimation of objects at a per-category level have been proposed. This work provides an overview of the field in terms of methods, datasets, and evaluation protocols. First, an overview of existing works and their commonalities and differences is provided. Second, we take a critical look at the predominant evaluation protocol, including metrics and datasets. Based on the findings, we propose a new set of metrics, contribute new annotations for the Redwood dataset, and evaluate state-of-the-art methods in a fair comparison. The results indicate that existing methods do not generalize well to unconstrained orientations and are actually heavily biased towards objects being upright. We provide an easy-to-use evaluation toolbox with well-defined metrics, methods, and dataset interfaces, which allows evaluation and comparison with various state-of-the-art approaches (https://github.com/roym899/pose_and_shape_evaluation).
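The specific metrics proposed in the paper and shipped with the toolbox are defined there; as a hedged illustration of the kind of reconstruction metric involved in such evaluations, below is a plain symmetric chamfer distance between point sets.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric chamfer distance between point sets a (N, d) and b (M, d):
    mean nearest-neighbor distance from a to b plus from b to a."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = a + np.array([0.0, 1.0])   # the same shape, shifted up by one unit
```

For identical sets the distance is zero; for the unit shift above, every nearest-neighbor distance is 1, so the symmetric sum is 2.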

Place, publisher, year, edition, pages
Elsevier BV, 2023
Keywords
Pose estimation, RGB-D-based perception, Shape estimation, Shape reconstruction
National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:kth:diva-336565 (URN), 10.1016/j.robot.2023.104507 (DOI), 001090698300001 (), 2-s2.0-85169011550 (Scopus ID)
Note

QC 20230918

Available from: 2023-09-18. Created: 2023-09-18. Last updated: 2025-11-05. Bibliographically approved
3. Neural Graph Map: Dense Mapping with Efficient Loop Closure Integration
2025 (English). In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Institute of Electrical and Electronics Engineers (IEEE), 2025, p. 2900-2909. Conference paper, Published paper (Refereed)
Abstract [en]

Neural field-based SLAM methods typically employ a single monolithic field as their scene representation. This prevents efficient incorporation of loop closure constraints and limits scalability. To address these shortcomings, we propose a novel RGB-D neural mapping framework in which the scene is represented by a collection of lightweight neural fields that are dynamically anchored to the pose graph of a sparse visual SLAM system. Our approach can integrate large-scale loop closures while requiring only minimal reintegration. Furthermore, we verify the scalability of our approach by demonstrating successful building-scale mapping that takes multiple loop closures into account during the optimization, and we show that our method outperforms existing state-of-the-art approaches on large scenes in terms of quality and runtime. Our code is available open-source at https://github.com/KTH-RPL/neural_graph_mapping.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2025
National Category
Computer Vision and Learning Systems
Identifiers
urn:nbn:se:kth:diva-372392 (URN), 10.1109/WACV61041.2025.00287 (DOI), 001481328900277 (), 2-s2.0-105003634119 (Scopus ID)
Conference
2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, February 26 - March 6, 2025
Note

Part of ISBN 9798331510831, 9798331510848

QC 20251106

Available from: 2025-11-05. Created: 2025-11-05. Last updated: 2025-11-06. Bibliographically approved
4. ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training
2025 (English). In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, p. 26751-26761. Conference paper, Published paper (Refereed)
Abstract [en]

Scene coordinate regression (SCR) has established itself as a promising learning-based approach to visual relocalization. After mere minutes of scene-specific training, SCR models estimate camera poses of query images with high accuracy. Still, SCR methods fall short of the generalization capabilities of more classical feature-matching approaches. When imaging conditions of query images, such as lighting or viewpoint, are too different from the training views, SCR models fail. Failing to generalize is an inherent limitation of previous SCR frameworks, since their training objective is to encode the training views in the weights of the coordinate regressor itself. The regressor essentially overfits to the training views, by design. We propose to separate the coordinate regressor and the map representation into a generic transformer and a scene-specific map code. This separation allows us to pre-train the transformer on tens of thousands of scenes. More importantly, it allows us to train the transformer to generalize from mapping images to unseen query images during pre-training. We demonstrate on multiple challenging relocalization datasets that our method, ACE-G, leads to significantly increased robustness while keeping the computational footprint attractive.

Keywords
visual relocalization, pose estimation, visual localization
National Category
Computer Vision and Learning Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-372329 (URN)
Conference
IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, Hawai'i, October 19-23, 2025
Note

QC 20251105

Available from: 2025-11-05. Created: 2025-11-05. Last updated: 2025-11-05. Bibliographically approved

Open Access in DiVA

thesis_leonard (125023 kB), 94 downloads
File name: FULLTEXT01.pdf
File size: 125023 kB
Checksum (SHA-512): 6c5c5efeeedba182e06063ff5b4427a61bcf2b221c368014c4db0f89e25262615aa4c077cdefbc5c85cfd39b26dc3929808261a765377e229aed30fa83f540c2
Type: fulltext. Mimetype: application/pdf

Authority records

Bruns, Leonard
