Publications - Current Year

2025

1
Conference paper
D2
V. Guzov, Y. Jiang, F. Hong, G. Pons-Moll, R. Newcombe, C. K. Liu, Y. Ye, and L. Ma
“HMD2: Environment-aware Motion Generation from Single Egocentric Head-Mounted Device,” in 3DV 2025, 12th International Conference on 3D Vision, Singapore.
2
Conference paper
D2
C. Li, J. Chibane, Y. He, N. Pearl, A. Geiger, and G. Pons-Moll
“Unimotion: Unifying 3D Human Motion Synthesis and Understanding,” in 3DV 2025, 12th International Conference on 3D Vision, Singapore.
3
Conference paper
D2
K. Raj, C. Wewer, R. Yunus, E. Ilg, and J. E. Lenssen
“Spurfies: Sparse-view Surface Reconstruction using Local Geometry Priors,” in 3DV 2025, International Conference on 3D Vision, Singapore.
4
Conference paper
D2
T. Wimmer, M. Oechsle, M. Niemeyer, and F. Tombari
“Gaussians-to-Life: Text-Driven Animation of 3D Gaussian Splatting Scenes,” in 3DV 2025, International Conference on 3D Vision, Singapore.
mehr
Abstract
State-of-the-art novel view synthesis methods achieve impressive results for
multi-view captures of static 3D scenes. However, the reconstructed scenes
still lack "liveliness," a key component for creating engaging 3D experiences.
Recently, novel video diffusion models generate realistic videos with complex
motion and enable animations of 2D images, however they cannot naively be used
to animate 3D scenes as they lack multi-view consistency. To breathe life into
the static world, we propose Gaussians2Life, a method for animating parts of
high-quality 3D scenes in a Gaussian Splatting representation. Our key idea is
to leverage powerful video diffusion models as the generative component of our
model and to combine these with a robust technique to lift 2D videos into
meaningful 3D motion. We find that, in contrast to prior work, this enables
realistic animations of complex, pre-existing 3D scenes and further enables the
animation of a large variety of object classes, while related work is mostly
focused on prior-based character animation, or single 3D objects. Our model
enables the creation of consistent, immersive 3D experiences for arbitrary
scenes.
5
Conference paper
D2
X. Xie, J. E. Lenssen, and G. Pons-Moll
“InterTrack: Tracking Human Object Interaction without Object Templates,” in 3DV 2025, International Conference on 3D Vision, Singapore.
6
Conference paper
D2
X. Zhang, B. L. Bhatnagar, S. Starke, I. A. Petrov, V. Guzov, H. Dhamo, E. Pérez Pellitero, and G. Pons-Moll
“FORCE: Dataset and Method for Intuitive Physics Guided Human-object Interaction,” in 3DV 2025, International Conference on 3D Vision, Singapore.
7
Article
D2
B. Andres, S. Di Gregorio, J. Irmai, and J.-H. Lange
“Corrigendum to ‘A polyhedral study of lifted multicuts’ [Discrete Optim. 47 (2023) 100757],” Discrete Optimization, vol. 55, 2025.
8
Conference paper
D2
M. Asim, C. Wewer, T. Wimmer, B. Schiele, and J. E. Lenssen
“MEt3R: Measuring Multi-View Consistency in Generated Images,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
9
Conference paper
D2
P. Flotho, M. Piening, A. Kukleva, and G. Steidl
“T-FAKE: Synthesizing Thermal Images for Facial Landmarking,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
10
Conference paper
D2
F. Hong, V. Guzov, H. J. Kim, Y. Ye, R. Newcombe, Z. Liu, and L. Ma
“EgoLM: Multi-Modal Language Model of Egocentric Motions,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
11
Conference paper
D2
X. Hu, H. Wang, J. E. Lenssen, and B. Schiele
“PersonaHOI: Effortlessly Improving Personalized Face with Human-Object Interaction Generation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
mehr
Abstract
We introduce PersonaHOI, a training- and tuning-free framework that fuses a
general StableDiffusion model with a personalized face diffusion (PFD) model to
generate identity-consistent human-object interaction (HOI) images. While
existing PFD models have advanced significantly, they often overemphasize
facial features at the expense of full-body coherence, PersonaHOI introduces an
additional StableDiffusion (SD) branch guided by HOI-oriented text inputs. By
incorporating cross-attention constraints in the PFD branch and spatial merging
at both latent and residual levels, PersonaHOI preserves personalized facial
details while ensuring interactive non-facial regions. Experiments, validated
by a novel interaction alignment metric, demonstrate the superior realism and
scalability of PersonaHOI, establishing a new standard for practical
personalized face with HOI generation. Our code will be available at
github.com/JoyHuYY1412/PersonaHOI
12
Conference paper
D2
N. Shvetsova, A. Nagrani, B. Schiele, H. Kuehne, and C. Rupprecht
“Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
13
Conference paper
D2
F. Vogel, W. Bousselham, A. Kukleva, N. Shvetsova, and H. Kuehne
“VideoGEM: Training-free Action Grounding in Videos,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
14
Conference paper
D2
Y. Wu, X. Hu, Y. Sun, Y. Zhou, W. Zhu, F. Rao, B. Schiele, and X. Yang
“Number it: Temporal Grounding Videos like Flipping Manga,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
mehr
Abstract
We introduce PersonaHOI, a training- and tuning-free framework that fuses a
general StableDiffusion model with a personalized face diffusion (PFD) model to
generate identity-consistent human-object interaction (HOI) images. While
existing PFD models have advanced significantly, they often overemphasize
facial features at the expense of full-body coherence, PersonaHOI introduces an
additional StableDiffusion (SD) branch guided by HOI-oriented text inputs. By
incorporating cross-attention constraints in the PFD branch and spatial merging
at both latent and residual levels, PersonaHOI preserves personalized facial
details while ensuring interactive non-facial regions. Experiments, validated
by a novel interaction alignment metric, demonstrate the superior realism and
scalability of PersonaHOI, establishing a new standard for practical
personalized face with HOI generation. Our code will be available at
github.com/JoyHuYY1412/PersonaHOI
15
Conference paper
D2
J. Xie, A. Tonioni, N. Rauschmayr, F. Tombari, and B. Schiele
“Test-Time Visual In-Context Tuning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
16
Conference paper
D2
T. Medi, S. Jung, and M. Keuper
“FAIR-TAT: Improving Model Fairness Using Targeted Adversarial Training,” in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025), Tucson, AZ, USA, 2025.
17
Conference paper
D2
K. Prasse, I. Bravo, S. Walter, and M. Keuper
“I Spy with My Little Eye: A Minimum Cost Multicut Investigation of Dataset Frames,” in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025), Tucson, AZ, USA, 2025.
18
Conference paper
D2
Y. Liu, C. Graf, M. Spies, and M. Keuper
“Segment any Repeated Object,” in IEEE International Conference on Robotics and Automation (ICRA 2025), Hyderabad, India.
19
Article
D2
Q. Fan, M. Segu, B. Schiele, D. Dai, Y.-W. Tai, and C.-K. Tang
“Robust Object Detection with Domain-Invariant Training and Continual Test-Time Adaptation,” International Journal of Computer Vision, 2025.
20
Article
D2
J. Lukasik, M. Moeller, and M. Keuper
“An Evaluation of Zero-Cost Proxies - From Neural Architecture Performance Prediction to Model Robustness,” International Journal of Computer Vision, vol. 133, 2025.
21
Conference paper
D2
A. Anani, T. Lorenz, M. Fritz, and B. Schiele
“Pixel-level Certified Explanations via Randomized Smoothing,” in Proceedings of the 42nd International Conference on Machine Learning (ICML 2025), Vancouver, Canada.
22
Conference paper
D2
R. Hesse, J. Fischer, S. Schaub-Meyer, and S. Roth
“Disentangling Polysemantic Channels in Convolutional Neural Networks,” in The First Workshop on Mechanistic Interpretability for Vision (MIV 2025), Nashville, TN, USA.
23
Conference paper
D2
I. Hossain, J. Fischer, R. Burkholz, and J. Quackenbush
“Pruning Neural Network Models for Gene Regulatory Dynamics Using Data and Domain Knowledge,” in The Second Conference on Parsimony and Learning Recent Spotlight Track (CPAL 2025), Stanford, CA, USA, 2025.
24
Conference paper
D2
Y. Li, W. Beluch, M. Keuper, D. Zhang, and A. Khoreva
“VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis,” in The Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore, 2025.
25
Conference paper
D2
M. Segu, L. Piccinelli, S. Li, Y.-H. Yang, B. Schiele, and L. Van Gool
“Samba: Synchronized Set-of-Sequences Modeling for Multiple Object Tracking,” in The Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore, 2025.
26
Conference paper
D2
S. Gairola, M. Böhle, F. Locatello, and B. Schiele
“How to Probe: Simple Yet Effective Techniques for Improving Post-hoc Explanations,” in The Thirteenth International Conference on Learning Representations (ICLR 2025 ), Singapore, 2025.
27
Conference paper
D2
P. Gavrikov, J. Lukasik, S. Jung, R. Geirhos, M. J. Mirza, M. Keuper, and J. Keuper
“Can We Talk Models Into Seeing the World Differently?,” in The Thirteenth International Conference on Learning Representations (ICLR 2025 ), Singapore, 2025.
28
Conference paper
D2
H. Wang, Y. Fan, M. F. Naeem, Y. Xian, J. E. Lenssen, L. Wang, F. Tombari, and B. Schiele
“TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters,” in Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore.
29
Conference paper
D2
Y. Yuan, Z. Zhang, X. He, A. Nitta, W. Hu, D. Wang, M. Shah, S. Huang, B. Stojanovič, A. Krumholz, J. E. Lenssen, J. Leskovec, and M. Fey
“ContextGNN: Beyond Two-Tower Recommendation Systems,” in Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore.
30
Paper
D2
S. Agnihotri, D. Schader, N. Sharei, M. E. Kaçar, and M. Keuper
“Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions?,” 2025. [Online]. Available: https://arxiv.org/abs/2505.04835.
mehr
Abstract
Deep learning (DL) models are widely used in real-world applications but remain vulnerable to distribution shifts, especially due to weather and lighting changes. Collecting diverse real-world data for testing the robustness of DL models is resource-intensive, making synthetic corruptions an attractive alternative for robustness testing. However, are synthetic corruptions a reliable proxy for real-world corruptions? To answer this, we conduct the largest benchmarking study on semantic segmentation models, comparing performance on real-world corruptions and synthetic corruptions datasets. Our results reveal a strong correlation in mean performance, supporting the use of synthetic corruptions for robustness evaluation. We further analyze corruption-specific correlations, providing key insights to understand when synthetic corruptions succeed in representing real-world corruptions. Open-source Code: github.com/shashankskagnihotri/benchmarking_robustness/tree/segmentation_david/semantic_segmentation
31
Paper
D2
S. Agnihotri, A. Ansari, A. Dackermann, F. Rösch, and M. Keuper
“DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions,” 2025. [Online]. Available: https://arxiv.org/abs/2505.05091.
mehr
Abstract
Deep learning (DL) has surpassed human performance on standard benchmarks, driving its widespread adoption in computer vision tasks. One such task is disparity estimation, estimating the disparity between matching pixels in stereo image pairs, which is crucial for safety-critical applications like medical surgeries and autonomous navigation. However, DL-based disparity estimation methods are highly susceptible to distribution shifts and adversarial attacks, raising concerns about their reliability and generalization. Despite these concerns, a standardized benchmark for evaluating the robustness of disparity estimation methods remains absent, hindering progress in the field.
To address this gap, we introduce DispBench, a comprehensive benchmarking tool for systematically assessing the reliability of disparity estimation methods. DispBench evaluates robustness against synthetic image corruptions such as adversarial attacks and out-of-distribution shifts caused by 2D Common Corruptions across multiple datasets and diverse corruption scenarios. We conduct the most extensive performance and robustness analysis of disparity estimation methods to date, uncovering key correlations between accuracy, reliability, and generalization. Open-source code for DispBench: github.com/shashankskagnihotri/benchmarking_robustness/tree/disparity_estimation/final/disparity_estimation
32
Paper
D2
S. Agnihotri, D. Schader, J. Jakubassa, N. Sharei, S. Kral, M. E. Kaçar, R. Weber, and M. Keuper
“SemSegBench & DetecBench: Benchmarking Reliability and Generalization Beyond Classification,” 2025. .
mehr
Abstract
Reliability and generalization in deep learning are predominantly studied in the context of image classification. Yet, real-world applications in safety-critical domains involve a broader set of semantic tasks, such as semantic segmentation and object detection, which come with a diverse set of dedicated model architectures. To facilitate research towards robust model design in segmentation and detection, our primary objective is to provide benchmarking tools regarding robustness to distribution shifts and adversarial manipulations. We propose the benchmarking tools SEMSEGBENCH and DETECBENCH, along with the most extensive evaluation to date on the reliability and generalization of semantic segmentation and object detection models. In particular, we benchmark 76 segmentation models across four datasets and 61 object detectors across two datasets, evaluating their performance under diverse adversarial attacks and common corruptions. Our findings reveal systematic weaknesses in state-of-the-art models and uncover key trends based on architecture, backbone, and model capacity. SEMSEGBENCH and DETECBENCH are open-sourced in our GitHub repository (https://github.com/shashankskagnihotri/benchmarking_reliability_generalization) along with our complete set of total 6139 evaluations. We anticipate the collected data to foster and encourage future research towards improved model reliability beyond classification.
33
Paper
D2
J. Belouadi, E. Ilg, M. Keuper, H. Tanaka, M. Utiyama, R. Dabre, S. Eger, and S. P. Ponzetto
“TikZero: Zero-Shot Text-Guided Graphics Program Synthesis,” 2025. [Online]. Available: https://arxiv.org/abs/2503.11509.
mehr
Abstract
With the rise of generative AI, synthesizing figures from text captions
becomes a compelling application. However, achieving high geometric precision
and editability requires representing figures as graphics programs in languages
like TikZ, and aligned training data (i.e., graphics programs with captions)
remains scarce. Meanwhile, large amounts of unaligned graphics programs and
captioned raster images are more readily available. We reconcile these
disparate data sources by presenting TikZero, which decouples graphics program
generation from text understanding by using image representations as an
intermediary bridge. It enables independent training on graphics programs and
captioned images and allows for zero-shot text-guided graphics program
synthesis during inference. We show that our method substantially outperforms
baselines that can only operate with caption-aligned graphics programs.
Furthermore, when leveraging caption-aligned graphics programs as a
complementary training signal, TikZero matches or exceeds the performance of
much larger models, including commercial systems like GPT-4o. Our code,
datasets, and select models are publicly available.
34
Paper
D2
C. Chen, E. Saha, J. Fischer, M. B. Guebila, V. Fanfani, K. H. Shutta, M. Padi, K. Glass, D. L. DeMeo, C. M. Lopes-Ramos, and J. Quackenbush
“Identifying Sex Differences in Lung Adenocarcinoma Using Multi-Omics Integrative Protein Signaling Networks.” bioRxiv, 2025.
35
Paper
D2
J. Erbach, D. Narnhofer, A. Dombos, B. Schiele, J. E. Lenssen, and K. Schindler
“Solving Inverse Problems with FLAIR,” 2025. [Online]. Available: https://arxiv.org/abs/2506.02680.
mehr
Abstract
Flow-based latent generative models such as Stable Diffusion 3 are able to
generate images with remarkable quality, even enabling photorealistic
text-to-image generation. Their impressive performance suggests that these
models should also constitute powerful priors for inverse imaging problems, but
that approach has not yet led to comparable fidelity. There are several key
obstacles: (i) the encoding into a lower-dimensional latent space makes the
underlying (forward) mapping non-linear; (ii) the data likelihood term is
usually intractable; and (iii) learned generative models struggle to recover
rare, atypical data modes during inference. We present FLAIR, a novel training
free variational framework that leverages flow-based generative models as a
prior for inverse problems. To that end, we introduce a variational objective
for flow matching that is agnostic to the type of degradation, and combine it
with deterministic trajectory adjustments to recover atypical modes. To enforce
exact consistency with the observed data, we decouple the optimization of the
data fidelity and regularization terms. Moreover, we introduce a time-dependent
calibration scheme in which the strength of the regularization is modulated
according to off-line accuracy estimates. Results on standard imaging
benchmarks demonstrate that FLAIR consistently outperforms existing diffusion-
and flow-based methods in terms of reconstruction quality and sample diversity.
36
Paper
D2
M. Fatima, S. Jung, and M. Keuper
“Corner Cases: How Size and Position of Objects Challenge ImageNet-Trained Models,” 2025. [Online]. Available: https://arxiv.org/abs/2505.03569.
mehr
Abstract
Backgrounds in images play a major role in contributing to spurious correlations among different data points. Owing to aesthetic preferences of humans capturing the images, datasets can exhibit positional (location of the object within a given frame) and size (region-of-interest to image ratio) biases for different classes. In this paper, we show that these biases can impact how much a model relies on spurious features in the background to make its predictions. To better illustrate our findings, we propose a synthetic dataset derived from ImageNet1k, Hard-Spurious-ImageNet, which contains images with various backgrounds, object positions, and object sizes. By evaluating the dataset on different pretrained models, we find that most models rely heavily on spurious features in the background when the region-of-interest (ROI) to image ratio is small and the object is far from the center of the image. Moreover, we also show that current methods that aim to mitigate harmful spurious features, do not take into account these factors, hence fail to achieve considerable performance gains for worst-group accuracies when the size and location of core features in an image change.
37
Paper
D2
A. Görgün, B. Schiele, and J. Fischer
“VITAL: More Understandable Feature Visualization through Distribution Alignment and Relevant Information Flow,” 2025. [Online]. Available: https://arxiv.org/abs/2503.22399.
mehr
Abstract
Neural networks are widely adopted to solve complex and challenging tasks.
Especially in high-stakes decision-making, understanding their reasoning
process is crucial, yet proves challenging for modern deep networks. Feature
visualization (FV) is a powerful tool to decode what information neurons are
responding to and hence to better understand the reasoning behind such
networks. In particular, in FV we generate human-understandable images that
reflect the information detected by neurons of interest. However, current
methods often yield unrecognizable visualizations, exhibiting repetitive
patterns and visual artifacts that are hard to understand for a human. To
address these problems, we propose to guide FV through statistics of real image
features combined with measures of relevant network flow to generate
prototypical images. Our approach yields human-understandable visualizations
that both qualitatively and quantitatively improve over state-of-the-art FVs
across various architectures. As such, it can be used to decode which
information the network uses, complementing mechanistic circuits that identify
where it is encoded. Code is available at: github.com/adagorgun/VITAL
38
Paper
D2
R. Hesse, D. Bağcı, B. Schiele, S. Schaub-Meyer, and S. Roth
“Beyond Accuracy: What Matters in Designing Well-Behaved Models?,” 2025. [Online]. Available: https://arxiv.org/abs/2503.17110.
mehr
Abstract
Deep learning has become an essential part of computer vision, with deep
neural networks (DNNs) excelling in predictive performance. However, they often
fall short in other critical quality dimensions, such as robustness,
calibration, or fairness. While existing studies have focused on a subset of
these quality dimensions, none have explored a more general form of
"well-behavedness" of DNNs. With this work, we address this gap by
simultaneously studying nine different quality dimensions for image
classification. Through a large-scale study, we provide a bird's-eye view by
analyzing 326 backbone models and how different training paradigms and model
architectures affect the quality dimensions. We reveal various new insights
such that (i) vision-language models exhibit high fairness on ImageNet-1k
classification and strong robustness against domain changes; (ii)
self-supervised learning is an effective training paradigm to improve almost
all considered quality dimensions; and (iii) the training dataset size is a
major driver for most of the quality dimensions. We conclude our study by
introducing the QUBA score (Quality Understanding Beyond Accuracy), a novel
metric that ranks models across multiple dimensions of quality, enabling
tailored recommendations based on specific user needs.
39
Paper
D2
C. Leiter, Y. M. Asano, M. Keuper, and S. Eger
“CROC: Evaluating and Training T2I Metrics with Pseudo- and Human-Labeled Contrastive Robustness Checks,” 2025. .
mehr
Abstract
The assessment of evaluation metrics (meta-evaluation) is crucial for determining the suitability of existing metrics in text-to-image (T2I) generation tasks. Human-based meta-evaluation is costly and time-intensive, and automated alternatives are scarce. We address this gap and propose CROC: a scalable framework for automated Contrastive Robustness Checks that systematically probes and quantifies metric robustness by synthesizing contrastive test cases across a comprehensive taxonomy of image properties. With CROC, we generate a pseudo-labeled dataset (CROC$^{syn}$) of over one million contrastive prompt-image pairs to enable a fine-grained comparison of evaluation metrics. We also use the dataset to train CROCScore, a new metric that achieves state-of-the-art performance among open-source methods, demonstrating an additional key application of our framework. To complement this dataset, we introduce a human-supervised benchmark (CROC$^{hum}$) targeting especially challenging categories. Our results highlight robustness issues in existing metrics: for example, many fail on prompts involving negation, and all tested open-source metrics fail on at least 25% of cases involving correct identification of body parts.
40
Paper
D2
P. Müller, A. Braun, and M. Keuper
“Examining the Impact of Optical Aberrations to Image Classification and Object Detection Models,” 2025. [Online]. Available: https://arxiv.org/abs/2504.18510.
mehr
Abstract
Deep neural networks (DNNs) have proven to be successful in various computer
vision applications such that models even infer in safety-critical situations.
Therefore, vision models have to behave in a robust way to disturbances such as
noise or blur. While seminal benchmarks exist to evaluate model robustness to
diverse corruptions, blur is often approximated in an overly simplistic way to
model defocus, while ignoring the different blur kernel shapes that result from
optical systems. To study model robustness against realistic optical blur
effects, this paper proposes two datasets of blur corruptions, which we denote
OpticsBench and LensCorruptions. OpticsBench examines primary aberrations such
as coma, defocus, and astigmatism, i.e. aberrations that can be represented by
varying a single parameter of Zernike polynomials. To go beyond the principled
but synthetic setting of primary aberrations, LensCorruptions samples linear
combinations in the vector space spanned by Zernike polynomials, corresponding
to 100 real lenses. Evaluations for image classification and object detection
on ImageNet and MSCOCO show that for a variety of different pre-trained models,
the performance on OpticsBench and LensCorruptions varies significantly,
indicating the need to consider realistic image corruptions to evaluate a
model's robustness against blur.
41
Paper
D2D6
N. Pham, B. Schiele, A. Kortylewski, and J. Fischer
“Escaping Plato’s Cave: Robust Conceptual Reasoning through Interpretable 3D Neural Object Volumes,” 2025. [Online]. Available: https://arxiv.org/abs/2503.13429.
mehr
Abstract
With the rise of neural networks, especially in high-stakes applications,
these networks need two properties (i) robustness and (ii) interpretability to
ensure their safety. Recent advances in classifiers with 3D volumetric object
representations have demonstrated a greatly enhanced robustness in
out-of-distribution data. However, these 3D-aware classifiers have not been
studied from the perspective of interpretability. We introduce CAVE - Concept
Aware Volumes for Explanations - a new direction that unifies interpretability
and robustness in image classification. We design an inherently-interpretable
and robust classifier by extending existing 3D-aware classifiers with concepts
extracted from their volumetric representations for classification. In an array
of quantitative metrics for interpretability, we compare against different
concept-based approaches across the explainable AI literature and show that
CAVE discovers well-grounded concepts that are used consistently across images,
while achieving superior robustness.
42
Paper
D2
L. Piccinelli, C. Sakaridis, M. Segu, Y.-H. Yang, S. Li, W. Abbeloos, and L. Van Gool
“UniK3D: Universal Camera Monocular 3D Estimation,” 2025. [Online]. Available: https://arxiv.org/abs/2503.16591.
mehr
Abstract
Monocular 3D estimation is crucial for visual perception. However, current
methods fall short by relying on oversimplified assumptions, such as pinhole
camera models or rectified images. These limitations severely restrict their
general applicability, causing poor performance in real-world scenarios with
fisheye or panoramic images and resulting in substantial context loss. To
address this, we present UniK3D, the first generalizable method for monocular
3D estimation able to model any camera. Our method introduces a spherical 3D
representation which allows for better disentanglement of camera and scene
geometry and enables accurate metric 3D reconstruction for unconstrained camera
models. Our camera component features a novel, model-independent representation
of the pencil of rays, achieved through a learned superposition of spherical
harmonics. We also introduce an angular loss, which, together with the camera
module design, prevents the contraction of the 3D outputs for wide-view
cameras. A comprehensive zero-shot evaluation on 13 diverse datasets
demonstrates the state-of-the-art performance of UniK3D across 3D, depth, and
camera metrics, with substantial gains in challenging large-field-of-view and
panoramic settings, while maintaining top accuracy in conventional pinhole
small-field-of-view domains. Code and models are available at
github.com/lpiccinelli-eth/unik3d .
43
Paper
D2
L. Piccinelli, C. Sakaridis, Y.-H. Yang, M. Segu, S. Li, W. Abbeloos, and L. Van Gool
“UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler,” 2025. [Online]. Available: https://arxiv.org/abs/2502.20110.
mehr
Abstract
Accurate monocular metric depth estimation (MMDE) is crucial to solving
downstream tasks in 3D perception and modeling. However, the remarkable
accuracy of recent MMDE methods is confined to their training domains. These
methods fail to generalize to unseen domains even in the presence of moderate
domain gaps, which hinders their practical applicability. We propose a new
model, UniDepthV2, capable of reconstructing metric 3D scenes from solely
single images across domains. Departing from the existing MMDE paradigm,
UniDepthV2 directly predicts metric 3D points from the input image at inference
time without any additional information, striving for a universal and flexible
MMDE solution. In particular, UniDepthV2 implements a self-promptable camera
module predicting a dense camera representation to condition depth features.
Our model exploits a pseudo-spherical output representation, which disentangles
the camera and depth representations. In addition, we propose a geometric
invariance loss that promotes the invariance of camera-prompted depth features.
UniDepthV2 improves its predecessor UniDepth model via a new edge-guided loss
which enhances the localization and sharpness of edges in the metric depth
outputs, a revisited, simplified and more efficient architectural design, and
an additional uncertainty-level output which enables downstream tasks requiring
confidence. Thorough evaluations on ten depth datasets in a zero-shot regime
consistently demonstrate the superior performance and generalization of
UniDepthV2. Code and models are available at
github.com/lpiccinelli-eth/UniDepth
44
Paper
D2
K. Prasse, M. Kleinmann, I. Adam, K. Beckersjuergen, A. Edte, J. Frroku, T. Gumpp, S. Jung, I. Bravo, S. Walter, and M. Keuper
“Deep Learning for Climate Action: Computer Vision Analysis of Visual Narratives on X,” 2025. [Online]. Available: https://arxiv.org/abs/2503.09361.
mehr
Abstract
Climate change is one of the most pressing challenges of the 21st century,
sparking widespread discourse across social media platforms. Activists,
policymakers, and researchers seek to understand public sentiment and
narratives while access to social media data has become increasingly restricted
in the post-API era. In this study, we analyze a dataset of climate
change-related tweets from X (formerly Twitter) shared in 2019, containing 730k
tweets along with the shared images. Our approach integrates statistical
analysis, image classification, object detection, and sentiment analysis to
explore visual narratives in climate discourse. Additionally, we introduce a
graphical user interface (GUI) to facilitate interactive data exploration. Our
findings reveal key themes in climate communication, highlight sentiment
divergence between images and text, and underscore the strengths and
limitations of foundation models in analyzing social media imagery. By
releasing our code and tools, we aim to support future research on the
intersection of climate change, social media, and computer vision.
45
Paper
D2
K. Prasse, P. Knab, S. Marton, C. Bartelt, and M. Keuper
“DCBM: Data-Efficient Visual Concept Bottleneck Models,” 2025. [Online]. Available: https://arxiv.org/abs/2412.11576.
mehr
Abstract
Concept Bottleneck Models (CBMs) enhance the interpretability of neural
networks by basing predictions on human-understandable concepts. However,
current CBMs typically rely on concept sets extracted from large language
models or extensive image corpora, limiting their effectiveness in data-sparse
scenarios. We propose Data-efficient CBMs (DCBMs), which reduce the need for
large sample sizes during concept generation while preserving interpretability.
DCBMs define concepts as image regions detected by segmentation or detection
foundation models, allowing each image to generate multiple concepts across
different granularities. This removes reliance on textual descriptions and
large-scale pre-training, making DCBMs applicable for fine-grained
classification and out-of-distribution tasks. Attribution analysis using
Grad-CAM demonstrates that DCBMs deliver visual concepts that can be localized
in test images. By leveraging dataset-specific concepts instead of predefined
ones, DCBMs enhance adaptability to new domains.
46
Paper
D2
R. Qorbani, G. Villani, T. Panagiotakopoulos, M. B. Colomer, L. Härenstam-Nielsen, M. Segu, P. L. Dovesi, J. Karlgren, D. Cremers, F. Tombari, and M. Poggi
“Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation,” 2025. [Online]. Available: https://arxiv.org/abs/2503.21780.
mehr
Abstract
Open-vocabulary semantic segmentation models associate vision and text to
label pixels from an undefined set of classes using textual queries, providing
versatile performance on novel datasets. However, large shifts between training
and test domains degrade their performance, requiring fine-tuning for effective
real-world applications. We introduce Semantic Library Adaptation (SemLA), a
novel framework for training-free, test-time domain adaptation. SemLA leverages
a library of LoRA-based adapters indexed with CLIP embeddings, dynamically
merging the most relevant adapters based on proximity to the target domain in
the embedding space. This approach constructs an ad-hoc model tailored to each
specific input without additional training. Our method scales efficiently,
enhances explainability by tracking adapter contributions, and inherently
protects data privacy, making it ideal for sensitive applications.
Comprehensive experiments on a 20-domain benchmark built over 10 standard
datasets demonstrate SemLA's superior adaptability and performance across
diverse settings, establishing a new standard in domain adaptation for
open-vocabulary semantic segmentation.
47
Paper
D2
F. Sammani, J. Fischer, and N. Deligiannis
“Unlocking Open-Set Language Accessibility in Vision Models,” 2025. [Online]. Available: https://arxiv.org/abs/2503.10981.
48
Paper
D2
J. Schmalfuss, V. Oei, L. Mehl, M. Bartsch, S. Agnihotri, M. Keuper, and A. Bruhn
“RobustSpring: Benchmarking Robustness to Image Corruptions for Optical Flow, Scene Flow and Stereo,” 2025. .
mehr
Abstract
Standard benchmarks for optical flow, scene flow, and stereo vision algorithms generally focus on model accuracy rather than robustness to image corruptions like noise or rain. Hence, the resilience of models to such real-world perturbations is largely unquantified. To address this, we present RobustSpring, a comprehensive dataset and benchmark for evaluating robustness to image corruptions for optical flow, scene flow, and stereo models. RobustSpring applies 20 different image corruptions, including noise, blur, color changes, quality degradations, and weather distortions, in a time-, stereo-, and depth-consistent manner to the high-resolution Spring dataset, creating a suite of 20,000 corrupted images that reflect challenging conditions. RobustSpring enables comparisons of model robustness via a new corruption robustness metric. Integration with the Spring benchmark enables public two-axis evaluations of both accuracy and robustness. We benchmark a curated selection of initial models, observing that accurate models are not necessarily robust and that robustness varies widely by corruption type. RobustSpring is a new computer vision benchmark that treats robustness as a first-class citizen to foster models that combine accuracy with resilience. It will be available at spring-benchmark.org.
49
Paper
D2
N. P. Walter, J. Vreeken, and J. Fischer
“Now You See Me! A Framework for Obtaining Class-relevant Saliency Maps,” 2025. [Online]. Available: https://arxiv.org/abs/2503.07346.
50
Paper
D2RG3
Y. Wang, S. Rao, J.-U. Lee, M. Jobanputra, and V. Demberg
“B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability,” 2025. [Online]. Available: https://arxiv.org/abs/2502.12992.
mehr
Abstract
Post-hoc explanation methods for black-box models often struggle with
faithfulness and human interpretability due to the lack of explainability in
current neural models. Meanwhile, B-cos networks have been introduced to
improve model explainability through architectural and computational
adaptations, but their application has so far been limited to computer vision
models and their associated training pipelines. In this work, we introduce
B-cos LMs, i.e., B-cos networks empowered for NLP tasks. Our approach directly
transforms pre-trained language models into B-cos LMs by combining B-cos
conversion and task fine-tuning, improving efficiency compared to previous
B-cos methods. Our automatic and human evaluation results demonstrate that
B-cos LMs produce more faithful and human interpretable explanations than post
hoc methods, while maintaining task performance comparable to conventional
fine-tuning. Our in-depth analysis explores how B-cos LMs differ from
conventionally fine-tuned models in their learning processes and explanation
patterns. Finally, we provide practical guidelines for effectively building
B-cos LMs based on our findings. Our code is available at
anonymous.4open.science/r/bcos_lm.
51
Paper
D2
C. Wewer, B. Pogodzinski, B. Schiele, and J. E. Lenssen
“Spatial Reasoning with Denoising Models,” 2025. [Online]. Available: https://www.arxiv.org/abs/2502.21075.
mehr
Abstract
We introduce Spatial Reasoning Models (SRMs), a framework to perform
reasoning over sets of continuous variables via denoising generative models.
SRMs infer continuous representations on a set of unobserved variables, given
observations on observed variables. Current generative models on spatial
domains, such as diffusion and flow matching models, often collapse to
hallucination in case of complex distributions. To measure this, we introduce a
set of benchmark tasks that test the quality of complex reasoning in generative
models and can quantify hallucination. The SRM framework allows to report key
findings about importance of sequentialization in generation, the associated
order, as well as the sampling strategies during training. It demonstrates, for
the first time, that order of generation can successfully be predicted by the
denoising network itself. Using these findings, we can increase the accuracy of
specific reasoning tasks from 1% to >50%.
52
Paper
D2
Y. Wu, Z. Li, X. Hu, X. Ye, X. Zeng, G. Yu, W. Zhu, B. Schiele, M.-H. Yang, and X. Yang
“KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models,” 2025. [Online]. Available: https://arxiv.org/abs/2505.16707.
mehr
Abstract
Recent advances in multi-modal generative models have enabled significant
progress in instruction-based image editing. However, while these models
produce visually plausible outputs, their capacity for knowledge-based
reasoning editing tasks remains under-explored. In this paper, we introduce
KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a
diagnostic benchmark designed to assess models through a cognitively informed
lens. Drawing from educational theory, KRIS-Bench categorizes editing tasks
across three foundational knowledge types: Factual, Conceptual, and Procedural.
Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning
dimensions and release 1,267 high-quality annotated editing instances. To
support fine-grained evaluation, we propose a comprehensive protocol that
incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints
and calibrated through human studies. Empirical results on 10 state-of-the-art
models reveal significant gaps in reasoning performance, highlighting the need
for knowledge-centric benchmarks to advance the development of intelligent
image editing systems.
53
Paper
D2
J. Xu, O. Kao, and M. Keuper
“Informed Mixing -- Improving Open Set Recognition via Attribution-based Augmentation,” 2025. .
mehr
Abstract
Open set recognition (OSR) is devised to address the problem of detecting novel classes during model inference. Even in recent vision models, this remains an open issue which is receiving increasing attention. Thereby, a crucial challenge is to learn features that are relevant for unseen categories from given data, for which these features might not be discriminative. To facilitate this process and "optimize to learn" more diverse features, we propose GradMix, a data augmentation method that dynamically leverages gradient-based attribution maps of the model during training to mask out already learned concepts. Thus GradMix encourages the model to learn a more complete set of representative features from the same data source. Extensive experiments on open set recognition, close set classification, and out-of-distribution detection reveal that our method can often outperform the state-of-the-art. GradMix can further increase model robustness to corruptions as well as downstream classification performance for self-supervised learning, indicating its benefit for model generalization.

Abstract

Abstract

Abstract

Abstract

Abstract

Abstract

Abstract

Abstract

Abstract

Abstract

Abstract

Abstract

Abstract

Abstract

Abstract

Abstract

Abstract

Abstract

Abstract

Abstract

Abstract

Abstract

Abstract

Abstract