1. Feng C, Zhang C, Chen Z, Hu W, Lu K, Ge L. Self-Supervised Monocular Depth Estimation with Dual-Path Encoders and Offset Field Interpolation. IEEE Transactions on Image Processing 2025;PP:939-954. [PMID: 40031307] [DOI: 10.1109/tip.2025.3533207]
Abstract
Although self-supervised learning approaches have demonstrated tremendous potential in multi-frame depth estimation scenarios, existing methods struggle to perform well in cases involving dynamic targets and static ego-camera conditions. To address this issue, we propose a self-supervised monocular depth estimation method featuring dual-path encoders and learnable offset interpolation (LOI). First, we construct a dual-path encoding scheme that utilizes residual and transformer blocks to extract both single- and multi-frame features from the input frames. We design a contrastive learning strategy to effectively decouple single- and multi-frame features, enabling weighted fusion guided by a confidence map. Next, we explore two distinct decoding heads for simultaneously generating low-resolution predictions and offset fields. We then design an LOI module to directly upsample a low-resolution depth map to a full-resolution map. This one-step decoding framework enables accurate and efficient depth prediction. Finally, we evaluate our proposed method on the KITTI and Cityscapes benchmarks, conducting a comprehensive comparison with state-of-the-art approaches. The experimental results demonstrate that our DualDepth method achieves competitive performance in terms of both estimation accuracy and efficiency.
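As a rough illustration of the offset-field idea, the sketch below upsamples a low-resolution depth map by sampling it at positions displaced by a learned offset field. This is a minimal reconstruction from the abstract alone; the function name `offset_upsample`, the normalized-coordinate parameterization, and the bilinear sampling backend are assumptions, not details taken from the paper.

```python
# Hypothetical sketch of offset-field-guided upsampling in the spirit of
# the LOI module; the exact parameterization in the paper is not given
# in the abstract.
import torch
import torch.nn.functional as F

def offset_upsample(depth_lr, offsets):
    """Upsample a low-res depth map to full resolution by sampling it
    at positions displaced by a learned offset field.

    depth_lr: (B, 1, h, w) low-resolution depth prediction
    offsets:  (B, 2, H, W) per-pixel (x, y) offsets in normalized [-1, 1] coords
    """
    B, _, H, W = offsets.shape
    # Base sampling grid over the low-res map in normalized coordinates.
    ys = torch.linspace(-1.0, 1.0, H, device=depth_lr.device)
    xs = torch.linspace(-1.0, 1.0, W, device=depth_lr.device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    base = torch.stack((gx, gy), dim=-1).expand(B, H, W, 2)
    # Displace the grid by the predicted offsets and sample in one step,
    # giving the "one-step decoding" upsampling the abstract describes.
    grid = base + offsets.permute(0, 2, 3, 1)
    return F.grid_sample(depth_lr, grid, mode="bilinear", align_corners=True)
```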

2. Hu J, Fan C, Zhou L, Gao Q, Liu H, Lam TL. Lifelong-MonoDepth: Lifelong Learning for Multidomain Monocular Metric Depth Estimation. IEEE Transactions on Neural Networks and Learning Systems 2025;36:796-806. [PMID: 37874732] [DOI: 10.1109/tnnls.2023.3323487]
Abstract
With the rapid advancements in autonomous driving and robot navigation, there is a growing demand for lifelong learning (LL) models capable of estimating metric (absolute) depth. LL approaches potentially offer significant cost savings in terms of model training, data storage, and collection. However, the quality of RGB images and depth maps is sensor-dependent, and depth maps in the real world exhibit domain-specific characteristics, leading to variations in depth ranges. These challenges limit existing methods to LL scenarios with small domain gaps and relative depth map estimation. To facilitate lifelong metric depth learning, we identify three crucial technical challenges that require attention: 1) developing a model capable of addressing depth scale variation through scale-aware depth learning; 2) devising an effective learning strategy to handle significant domain gaps; and 3) creating an automated solution for domain-aware depth inference in practical applications. Based on these considerations, in this article, we present 1) a lightweight multihead framework that effectively tackles the depth scale imbalance; 2) an uncertainty-aware LL solution that adeptly handles significant domain gaps; and 3) an online domain-specific predictor selection method for real-time inference. Through extensive numerical studies, we show that the proposed method achieves good efficiency, stability, and plasticity, outperforming benchmark methods by 8%-15%. The code is available at https://github.com/FreeformRobotics/Lifelong-MonoDepth.
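The multi-head, uncertainty-driven design lends itself to a short sketch: one shared encoder, one decoder head per domain, and test-time selection of the head that reports the lowest predictive uncertainty, so no domain label is needed at inference. The class name and the selection rule below are assumptions; the abstract does not specify either.

```python
# Hypothetical sketch of multi-head metric depth with uncertainty-based
# domain selection; module names and the "lowest mean uncertainty wins"
# rule are assumptions on my part.
import torch
import torch.nn as nn

class MultiHeadDepth(nn.Module):
    def __init__(self, encoder, heads):
        super().__init__()
        self.encoder = encoder             # shared backbone
        self.heads = nn.ModuleList(heads)  # one decoder per domain; each
                                           # returns (depth, log_variance)

    def forward(self, image):
        feats = self.encoder(image)
        outputs = [head(feats) for head in self.heads]
        # Domain-aware inference: trust the head that is least uncertain
        # about this particular image.
        scores = torch.stack([log_var.mean() for _, log_var in outputs])
        best = int(torch.argmin(scores))
        return outputs[best][0]  # metric depth from the selected head
```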

3. Chen C, Wang B, Lu CX, Trigoni N, Markham A. Deep Learning for Visual Localization and Mapping: A Survey. IEEE Transactions on Neural Networks and Learning Systems 2024;35:17000-17020. [PMID: 37738191] [DOI: 10.1109/tnnls.2023.3309809]
Abstract
Deep-learning-based localization and mapping approaches have recently emerged as a new research direction and have received significant attention from both industry and academia. Instead of creating hand-designed algorithms based on physical models or geometric theories, deep learning solutions offer an alternative, data-driven way to solve the problem. Benefiting from ever-increasing volumes of data and computational power on devices, these learning methods are fast evolving into a new area that shows potential to track self-motion and estimate environmental models accurately and robustly for mobile agents. In this work, we provide a comprehensive survey and propose a taxonomy for localization and mapping methods using deep learning. This survey aims to discuss two basic questions: whether deep learning is promising for localization and mapping, and how deep learning should be applied to solve this problem. To this end, a series of localization and mapping topics are investigated, from learning-based visual odometry and global relocalization to mapping and simultaneous localization and mapping (SLAM). It is our hope that this survey organically weaves together recent works in this vein from the robotics, computer vision, and machine learning communities and serves as a guideline for future researchers applying deep learning to the problem of visual localization and mapping.

4. Zhang Y, Gong M, Zhang M, Li J. Self-Supervised Monocular Depth Estimation With Self-Perceptual Anomaly Handling. IEEE Transactions on Neural Networks and Learning Systems 2024;35:17292-17306. [PMID: 37581977] [DOI: 10.1109/tnnls.2023.3301711]
Abstract
It is attractive to extract plausible 3-D information from a single 2-D image, and self-supervised learning has shown impressive potential in this field. However, when only monocular videos are available as training data, objects moving at speeds similar to the camera's can disturb the reprojection process during training. Existing methods filter out some moving pixels by comparing pixelwise photometric error, but illumination inconsistency between frames leads to incomplete filtering. In addition, existing methods calculate photometric error within local windows, so even a masked-out anomalous pixel can still implicitly disturb the reprojection process as long as it lies in the local neighborhood of a non-anomalous pixel. Moreover, the ill-posed nature of monocular depth estimation means the same scene corresponds to multiple plausible depth maps, which damages the robustness of the model. To alleviate these problems, we propose: 1) a self-reprojection mask to further filter out moving objects while avoiding illumination inconsistency; 2) a self-statistical mask method to prevent the filtered anomalous pixels from implicitly disturbing the reprojection; and 3) a self-distillation augmentation consistency loss to reduce the impact of the ill-posed nature of monocular depth estimation. Our method shows superior performance on the KITTI dataset, especially when evaluating only the depth of potential moving objects.
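The abstract does not spell out the proposed masks, but the baseline they refine, per-pixel photometric auto-masking in the style of Monodepth2, is standard and easy to sketch: a pixel is trusted only when warping via the predicted depth explains it better than the unwarped source frame does. The function names below are illustrative, not the paper's.

```python
# Sketch of the widely used photometric auto-masking baseline that
# methods like this one build on (not the paper's exact masks).
import torch

def photometric_l1(a, b):
    """Mean absolute color difference per pixel: (B,3,H,W) -> (B,1,H,W)."""
    return (a - b).abs().mean(dim=1, keepdim=True)

def automask(target, source, warped):
    """True where the reprojection hypothesis beats the static (identity)
    hypothesis, i.e. where the pixel is consistent with a rigid scene;
    pixels of objects moving with the camera tend to fail this test."""
    reproj_err = photometric_l1(target, warped)
    static_err = photometric_l1(target, source)
    return reproj_err < static_err
```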

5. Liu X, Zhang T, Liu M. Joint estimation of pose, depth, and optical flow with a competition-cooperation transformer network. Neural Networks 2024;171:263-275. [PMID: 38103436] [DOI: 10.1016/j.neunet.2023.12.020]
Abstract
Estimating depth, ego-motion, and optical flow from consecutive frames is a critical task in robot navigation and has received significant attention in recent years. In this study, we propose PDF-Former, an unsupervised joint estimation network comprising a fully transformer-based framework and a competition-cooperation mechanism. The transformer framework captures global feature dependencies and is customized for different task types, thereby improving the performance of sequential tasks. The competition and cooperation mechanisms enable the network to obtain additional supervisory information at different training stages. Specifically, the competition mechanism is applied early in training to iteratively optimize, in a competitive manner, the 6-DOF poses (rotation and translation from the target image to the two reference images), the depth of the target image, and the optical flow (from the target image to the two reference images). In contrast, the cooperation mechanism is applied later in training to facilitate the transmission of results among the three networks and mutually optimize the estimation results. We conducted experiments on the KITTI dataset, and the results indicate that PDF-Former has significant potential to enhance the accuracy and robustness of sequential tasks in robot navigation.
Affiliation(s)
- Xiaochen Liu: School of Instrument Science & Engineering, Southeast University, Nanjing, 210096, China; Key Laboratory of Micro-Inertial Instrument and Advanced Navigation Technology, Ministry of Education, Southeast University, Nanjing, 210096, Jiangsu, China
- Tao Zhang: School of Instrument Science & Engineering, Southeast University, Nanjing, 210096, China; Key Laboratory of Micro-Inertial Instrument and Advanced Navigation Technology, Ministry of Education, Southeast University, Nanjing, 210096, Jiangsu, China
- Mingming Liu: Department of Orthopedic Surgery, The Second People's Hospital of Lianyungang, Lianyungang, 222003, Jiangsu, China; Department of Orthopedic Surgery, The First People's Hospital of Xining, Xining, 810000, Qinghai, China

6. Zhang X, Zhao B, Yao J, Wu G. Unsupervised Monocular Depth and Camera Pose Estimation with Multiple Masks and Geometric Consistency Constraints. Sensors (Basel) 2023;23:5329. [PMID: 37300056] [PMCID: PMC10255976] [DOI: 10.3390/s23115329]
Abstract
This paper presents a novel unsupervised learning framework for estimating scene depth and camera pose from video sequences, fundamental to many high-level tasks such as 3D reconstruction, visual navigation, and augmented reality. Although existing unsupervised methods have achieved promising results, their performance suffers in challenging scenes such as those with dynamic objects and occluded regions. As a result, multiple mask technologies and geometric consistency constraints are adopted in this research to mitigate their negative impacts. Firstly, multiple mask technologies are used to identify outliers in the scene, which are excluded from the loss computation. In addition, the identified outliers are employed as a supervisory signal to train a mask estimation network. The estimated mask is then utilized to preprocess the input to the pose estimation network, mitigating the potential adverse effects of challenging scenes on pose estimation. Furthermore, we propose geometric consistency constraints to reduce sensitivity to illumination changes; these act as additional supervisory signals to train the network. Experimental results on the KITTI dataset demonstrate that our proposed strategies effectively enhance the model's performance, outperforming other unsupervised methods.
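One common form of such a geometric consistency constraint, popularized by SC-SfMLearner-style methods, penalizes disagreement between the depth warped in from the source view and the directly predicted target depth. Whether this matches the authors' exact formulation is an assumption; the abstract only names the idea.

```python
# Sketch of a scale-invariant depth consistency loss; the normalization
# below follows common practice, not necessarily this paper's choice.
import torch

def depth_consistency(d_computed, d_warped, eps=1e-7):
    """Penalize disagreement between the target-frame depth implied by
    the source view (d_warped) and the directly predicted depth
    (d_computed). Normalizing by the sum keeps the loss scale-invariant.
    Both inputs: (B, 1, H, W)."""
    diff = (d_computed - d_warped).abs()
    return (diff / (d_computed + d_warped + eps)).mean()
```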
Affiliation(s)
- Xudong Zhang: School of Information Science and Technology, Nantong University, Nantong 226019, China
- Baigan Zhao: School of Mechanical Engineering, Nantong University, Nantong 226019, China
- Jiannan Yao: School of Mechanical Engineering, Nantong University, Nantong 226019, China
- Guoqing Wu: School of Information Science and Technology, Nantong University, Nantong 226019, China

7. Zhang Y, Gong M, Li J, Zhang M, Jiang F, Zhao H. Self-Supervised Monocular Depth Estimation With Multiscale Perception. IEEE Transactions on Image Processing 2022;31:3251-3266. [PMID: 35439134] [DOI: 10.1109/tip.2022.3167307]
Abstract
Extracting 3D information from a single optical image is very attractive. Recently emerging self-supervised methods can learn depth representations without ground-truth depth maps by transforming the depth prediction task into an image synthesis task. However, existing methods rely on a differentiable bilinear sampler for image synthesis, which means each pixel in a synthetic image is derived from only four pixels in the source image, so each pixel in the depth map perceives only a few pixels in the source image. In addition, when calculating the photometric error between a synthetic image and its corresponding target image, existing methods only consider the photometric error within a small neighborhood of each pixel and therefore ignore correlations between larger areas, which causes the model to tend toward local optima for small patches. To extend the perceptual area of the depth map over the source image, we propose a novel multi-scale method that downsamples the predicted depth map and performs image synthesis at different resolutions, enabling each pixel in the depth map to perceive more pixels in the source image and improving the performance of the model. To address the locality of the photometric error, we propose a structural similarity (SSIM) pyramid loss that allows the model to sense the difference between images over multiple areas of different sizes. Experimental results show that our method achieves superior performance on both outdoor and indoor benchmarks.
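An SSIM pyramid of this kind can be sketched as ordinary windowed SSIM evaluated at several downsampled resolutions and averaged. The 3x3 window follows common self-supervised depth practice; the number of levels and equal level weights below are illustrative choices, not the paper's values.

```python
# Sketch of an SSIM pyramid loss: windowed SSIM dissimilarity computed
# at several scales so the loss "sees" structure over areas of
# different sizes.
import torch
import torch.nn.functional as F

def ssim_dissimilarity(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Per-pixel (1 - SSIM) / 2 over 3x3 average-pooled windows."""
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    var_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    cov = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * cov + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def ssim_pyramid_loss(pred, target, levels=3):
    """Average SSIM dissimilarity over a pyramid of halved resolutions."""
    loss = 0.0
    for _ in range(levels):
        loss = loss + ssim_dissimilarity(pred, target).mean()
        pred = F.avg_pool2d(pred, 2)
        target = F.avg_pool2d(target, 2)
    return loss / levels
```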

8. Zhao C, Tang Y, Sun Q. Unsupervised Monocular Depth Estimation in Highly Complex Environments. IEEE Transactions on Emerging Topics in Computational Intelligence 2022. [DOI: 10.1109/tetci.2022.3182360]
Affiliation(s)
- Chaoqiang Zhao: Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, Shanghai, China
- Yang Tang: Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, Shanghai, China
- Qiyu Sun: Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, Shanghai, China