Abstract
Accurate monocular metric depth estimation (MMDE) is crucial to solving
downstream tasks in 3D perception and modeling. However, the remarkable
accuracy of recent MMDE methods is confined to their training domains. These
methods fail to generalize to unseen domains even in the presence of moderate
domain gaps, which hinders their practical applicability. We propose a new
model, UniDepthV2, capable of reconstructing metric 3D scenes solely from
single images across domains. Departing from the existing MMDE paradigm,
UniDepthV2 directly predicts metric 3D points from the input image at inference
time without any additional information, striving for a universal and flexible
MMDE solution. In particular, UniDepthV2 implements a self-promptable camera
module predicting a dense camera representation to condition depth features.
Our model exploits a pseudo-spherical output representation, which disentangles
the camera and depth representations. In addition, we propose a geometric
invariance loss that promotes the invariance of camera-prompted depth features.
UniDepthV2 improves its predecessor UniDepth model via a new edge-guided loss
which enhances the localization and sharpness of edges in the metric depth
outputs, a revisited, simplified and more efficient architectural design, and
an additional uncertainty-level output which enables downstream tasks requiring
confidence. Thorough evaluations on ten depth datasets in a zero-shot regime
consistently demonstrate the superior performance and generalization of
UniDepthV2. Code and models are available at
https://github.com/lpiccinelli-eth/UniDepth
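The pseudo-spherical output representation described above disentangles the camera from depth: the two angular components depend only on the viewing ray, while the radial (log-depth) component carries the metric scale. As a minimal illustrative sketch of such a representation — not the authors' implementation, and with function names chosen here for clarity — the mapping between metric 3D points and an (azimuth, elevation, log-depth) triple could look like:

```python
import numpy as np

def pseudo_spherical_from_points(points):
    """Map metric 3D points (N, 3) in camera coordinates (x, y, z)
    to a pseudo-spherical triple (azimuth, elevation, log-depth).
    Angles encode the viewing ray (camera); log-depth encodes scale."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)            # radial metric distance
    azimuth = np.arctan2(x, z)                    # horizontal ray angle
    elevation = np.arcsin(np.clip(y / r, -1, 1))  # vertical ray angle
    log_depth = np.log(r)                         # metric scale component
    return np.stack([azimuth, elevation, log_depth], axis=1)

def points_from_pseudo_spherical(rep):
    """Inverse mapping: recover metric 3D points from the triple."""
    az, el, log_r = rep[:, 0], rep[:, 1], rep[:, 2]
    r = np.exp(log_r)
    y = r * np.sin(el)
    h = r * np.cos(el)                            # projection onto xz-plane
    x = h * np.sin(az)
    z = h * np.cos(az)
    return np.stack([x, y, z], axis=1)
```

Because the angular components are independent of scene scale, a loss or feature conditioned on them can be kept invariant to the camera, in the spirit of the geometric invariance loss mentioned in the abstract.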
BibTeX
@online{Piccinelli2502.20110,
  TITLE        = {{UniDepthV2}: Universal Monocular Metric Depth Estimation Made Simpler},
  AUTHOR       = {Piccinelli, Luigi and Sakaridis, Christos and Yang, Yung-Hsu and Segu, Mattia and Li, Siyuan and Abbeloos, Wim and Van Gool, Luc},
  LANGUAGE     = {eng},
  URL          = {https://arxiv.org/abs/2502.20110},
  EPRINT       = {2502.20110},
  EPRINTTYPE   = {arXiv},
  YEAR         = {2025},
  MARGINALMARK = {$\bullet$},
  ABSTRACT     = {Accurate monocular metric depth estimation (MMDE) is crucial to solving downstream tasks in 3D perception and modeling. However, the remarkable accuracy of recent MMDE methods is confined to their training domains. These methods fail to generalize to unseen domains even in the presence of moderate domain gaps, which hinders their practical applicability. We propose a new model, UniDepthV2, capable of reconstructing metric 3D scenes solely from single images across domains. Departing from the existing MMDE paradigm, UniDepthV2 directly predicts metric 3D points from the input image at inference time without any additional information, striving for a universal and flexible MMDE solution. In particular, UniDepthV2 implements a self-promptable camera module predicting a dense camera representation to condition depth features. Our model exploits a pseudo-spherical output representation, which disentangles the camera and depth representations. In addition, we propose a geometric invariance loss that promotes the invariance of camera-prompted depth features. UniDepthV2 improves its predecessor UniDepth model via a new edge-guided loss which enhances the localization and sharpness of edges in the metric depth outputs, a revisited, simplified and more efficient architectural design, and an additional uncertainty-level output which enables downstream tasks requiring confidence. Thorough evaluations on ten depth datasets in a zero-shot regime consistently demonstrate the superior performance and generalization of UniDepthV2. Code and models are available at https://github.com/lpiccinelli-eth/UniDepth},
}
Endnote
%0 Report
%A Piccinelli, Luigi
%A Sakaridis, Christos
%A Yang, Yung-Hsu
%A Segu, Mattia
%A Li, Siyuan
%A Abbeloos, Wim
%A Van Gool, Luc
%+ External Organizations External Organizations External Organizations Computer Vision and Machine Learning, MPI for Informatics, Max Planck Society External Organizations External Organizations External Organizations
%T UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler
%G eng
%U http://hdl.handle.net/21.11116/0000-0010-FAEA-D
%U https://arxiv.org/abs/2502.20110
%D 2025
%X Accurate monocular metric depth estimation (MMDE) is crucial to solving downstream tasks in 3D perception and modeling. However, the remarkable accuracy of recent MMDE methods is confined to their training domains. These methods fail to generalize to unseen domains even in the presence of moderate domain gaps, which hinders their practical applicability. We propose a new model, UniDepthV2, capable of reconstructing metric 3D scenes solely from single images across domains. Departing from the existing MMDE paradigm, UniDepthV2 directly predicts metric 3D points from the input image at inference time without any additional information, striving for a universal and flexible MMDE solution. In particular, UniDepthV2 implements a self-promptable camera module predicting a dense camera representation to condition depth features. Our model exploits a pseudo-spherical output representation, which disentangles the camera and depth representations. In addition, we propose a geometric invariance loss that promotes the invariance of camera-prompted depth features. UniDepthV2 improves its predecessor UniDepth model via a new edge-guided loss which enhances the localization and sharpness of edges in the metric depth outputs, a revisited, simplified and more efficient architectural design, and an additional uncertainty-level output which enables downstream tasks requiring confidence. Thorough evaluations on ten depth datasets in a zero-shot regime consistently demonstrate the superior performance and generalization of UniDepthV2. Code and models are available at https://github.com/lpiccinelli-eth/UniDepth
%K Computer Science, Computer Vision and Pattern Recognition, cs.CV