Publications - Last Year
2024
- “CloSe: A 3D Clothing Segmentation Dataset and Model,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
- “Interaction Replica: Tracking Human–Object Interaction and Scene Changes From Human Motion,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
- “GAN-Avatar: Controllable Personalized GAN-based Human Head Avatar,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
- “Generating Continual Human Motion in Diverse 3D Scenes,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
- “B-cosification: Transforming Deep Neural Networks to be Inherently Interpretable,” in Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Vancouver, Canada, 2024.
Abstract
B-cos Networks have been shown to be effective for obtaining highly
human-interpretable explanations of model decisions by architecturally enforcing
stronger alignment between inputs and weights. B-cos variants of convolutional
networks (CNNs) and vision transformers (ViTs), which primarily replace linear
layers with B-cos transformations, perform competitively to their respective
standard variants while also yielding explanations that are faithful by design.
However, it has so far been necessary to train these models from scratch, which
is increasingly infeasible in the era of large, pre-trained foundation models.
In this work, inspired by the architectural similarities in standard DNNs and
B-cos networks, we propose 'B-cosification', a novel approach to transform
existing pre-trained models to become inherently interpretable. We perform a
thorough study of design choices to perform this conversion, both for
convolutional neural networks and vision transformers. We find that
B-cosification can yield models that are on par with B-cos models trained from
scratch in terms of interpretability, while often outperforming them in terms
of classification performance at a fraction of the training cost. Subsequently,
we apply B-cosification to a pretrained CLIP model, and show that, even with
limited data and compute cost, we obtain a B-cosified version that is highly
interpretable and competitive on zero-shot performance across a variety of
datasets. We release our code and pre-trained model weights at
github.com/shrebox/B-cosification.
- “Pruning Neural Network Models for Gene Regulatory Dynamics Using Data and Domain Knowledge,” in Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Vancouver, Canada, 2024.
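For context on the mechanism described in the B-cosification abstract above: B-cos networks replace linear layers with transformations that scale each linear response by the alignment between input and weight vector. The sketch below is my own minimal NumPy illustration of that idea (function name and simplifications are mine; the published models add normalization and architectural details omitted here):

```python
import numpy as np

def bcos_transform(x, W, B=2.0, eps=1e-8):
    """Minimal B-cos transform: scale each linear response <w, x> by the
    alignment |cos(x, w)|^(B-1) between input x and weight row w.
    For B=1 this reduces to an ordinary linear layer."""
    W = W / (np.linalg.norm(W, axis=1, keepdims=True) + eps)  # unit-norm rows
    lin = W @ x                                               # <w, x> per unit
    cos = lin / (np.linalg.norm(x) + eps)                     # cos(x, w), since ||w|| = 1
    return np.abs(cos) ** (B - 1.0) * lin
```

With B > 1 the factor |cos|^(B-1) suppresses responses of poorly aligned weight vectors, which is what makes the resulting input-weight contributions interpretable by design.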
- “Recent Trends in 3D Reconstruction of General Non-Rigid Scenes,” Computer Graphics Forum (Proc. EUROGRAPHICS 2024), vol. 43, no. 2, 2024.
- “Improving Feature Stability during Upsampling - Spectral Artifacts and the Importance of Spatial Context,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “Good Teachers Explain: Explanation-Enhanced Knowledge Distillation,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Appearance Graphs,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “HowToCaption: Prompting LLMs to Transform Video Annotations at Scale,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “GiT: Towards Generalist Vision Transformer through Universal Language Interface,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “LatentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “Improving 2D Feature Representations by 3D-Aware Fine-Tuning,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “Sp2360: Sparse-view 360° Scene Reconstruction using Cascaded 2D Diffusion Priors,” in ECCV 2024 Workshop on Wild 3D (ECCV 2024 Wild3D), Milan, Italy, 2024.
- “Domain-Aware Fine-Tuning of Foundation Models,” in ICML 2024 Workshop on Foundation Models in the Wild (ICML 2024 FM-Wild Workshop), Vienna, Austria, 2024.
- “OrCo: Towards Better Generalization via Orthogonality and Contrast for Few-Shot Class-Incremental Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “NRDF: Neural Riemannian Distance Fields for Learning Articulated Pose Priors,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Training Vision Transformers for Semi-Supervised Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Open-Vocabulary 3D Semantic Segmentation with Foundation Models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Neural Point Cloud Diffusion for Disentangled 3D Shape and Appearance Generation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Point Transformer V3: Simpler, Faster, Stronger,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “GEARS: Local Geometry-aware Hand-object Interaction Synthesis,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Task Driven Sensor Layouts - Joint Optimization of Pixel Layout and Network Parameters,” in IEEE International Conference on Computational Photography (ICCP 2024), Lausanne, Switzerland, 2024.
- “Automated Dominative Subspace Mining for Efficient Neural Architecture Search,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 10, 2024.
- “Enhanced Long-Tailed Recognition With Contrastive CutMix Augmentation,” IEEE Transactions on Image Processing, vol. 33, 2024.
- “Better Understanding Differences in Attribution Methods via Systematic Evaluations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 6, 2024.
- “MTR++: Multi-Agent Motion Prediction With Symmetric Scene Modeling and Guided Intention Querying,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 5, 2024.
- “Intra- & Extra-Source Exemplar-Based Style Synthesis for Improved Domain Generalization,” International Journal of Computer Vision, vol. 132, 2024.
- “An Evaluation of Zero-Cost Proxies - From Neural Architecture Performance Prediction to Model Robustness,” International Journal of Computer Vision, 2024.
- “How Do Training Methods Influence the Utilization of Vision Models?,” in Interpretable AI: Past, Present and Future (IAI Workshop @ NeurIPS 2024), Vancouver, Canada, 2024.
- “Toward a Diffusion-Based Generalist for Dense Vision Tasks,” in MMFM2, The 2nd Workshop on What is Next in Multimodal Foundation Models?, Seattle, WA, USA, 2024.
- “CosPGD: An Efficient White-Box Adversarial Attack for Pixel-Wise Prediction Tasks,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- “Adaptive Hierarchical Certification for Segmentation using Randomized Smoothing,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- “Implicit Representations for Constrained Image Segmentation,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- “MultiMax: Sparse and Multi-Modal Attention Learning,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- “Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive,” in The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 2024.
- “On Adversarial Training without Perturbing all Examples,” in The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 2024.
- “Learning the essential in less than 2k additional weights - a simple approach to improve image classification stability under corruptions,” Transactions on Machine Learning Research, vol. 2024, no. 6, 2024.
- “As large as it gets - Studying Infinitely Large Convolutions via Neural Implicit Frequency Filters,” Transactions on Machine Learning Research, vol. 2024, 2024.
- “Wakening Past Concepts without Past Data: Class-Incremental Learning from Online Placebos,” in WACV 2024, IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2024.
- “Efficient and Differentiable Combinatorial Optimization for Visual Computing,” Universität des Saarlandes, Saarbrücken, 2024.
- “Beware of Aliases -- Signal Preservation is Crucial for Robust Image Restoration,” 2024. [Online]. Available: https://arxiv.org/abs/2406.07435.
Abstract
Image restoration networks usually comprise an encoder and a decoder,
responsible for aggregating image content from noisy, distorted data and for
restoring clean, undistorted images, respectively. Data aggregation as well as
high-resolution image generation both usually come at the risk of involving
aliases, i.e., standard architectures put their ability to reconstruct the model
input in jeopardy to reach high PSNR values on validation data. The price to be
paid is low model robustness. In this work, we show that simply providing
alias-free paths in state-of-the-art reconstruction transformers supports
improved model robustness at low costs on the restoration performance. We do so
by proposing BOA-Restormer, a transformer-based image restoration model that
executes downsampling and upsampling operations partly in the frequency domain
to ensure alias-free paths along the entire model while potentially preserving
all relevant high-frequency information.
- “Towards Designing Inherently Interpretable Deep Neural Networks for Image Classification,” Universität des Saarlandes, Saarbrücken, 2024.
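The BOA-Restormer abstract above executes resampling partly in the frequency domain to keep paths alias-free. As a hedged 1-D illustration of why spectral resampling avoids aliasing (my own sketch, not the paper's code; even-length Nyquist handling is simplified):

```python
import numpy as np

def fft_downsample(x, factor=2):
    """Alias-free 1-D downsampling by spectral cropping: keep only the low
    frequencies the smaller grid can represent (an ideal low-pass), then
    invert. No frequency folds back onto another, so no aliasing occurs."""
    n = len(x)
    m = n // factor
    X = np.fft.fftshift(np.fft.fft(x))     # spectrum with DC in the center
    lo = (n - m) // 2
    Xc = X[lo:lo + m]                      # centered crop of the spectrum
    return np.fft.ifft(np.fft.ifftshift(Xc)).real / factor
```

Naive strided subsampling would instead fold high frequencies back into the kept band; cropping the spectrum discards them outright, which is the property the alias-free paths rely on.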
- “Scribbles for All: Benchmarking Scribble Supervised Segmentation Across Datasets,” 2024. [Online]. Available: https://arxiv.org/abs/2408.12489.
Abstract
In this work, we introduce Scribbles for All, a label and training data
generation algorithm for semantic segmentation trained on scribble labels.
Training or fine-tuning semantic segmentation models with weak supervision has
become an important topic recently and was subject to significant advances in
model quality. In this setting, scribbles are a promising label type to achieve
high quality segmentation results while requiring a much lower annotation
effort than usual pixel-wise dense semantic segmentation annotations. The main
limitation of scribbles as a source for weak supervision is the lack of
challenging datasets for scribble segmentation, which hinders the development
of novel methods and conclusive evaluations. To overcome this limitation,
Scribbles for All provides scribble labels for several popular segmentation
datasets and provides an algorithm to automatically generate scribble labels
for any dataset with dense annotations, paving the way for new insights and
model advancements in the field of weakly supervised segmentation. In addition
to providing the datasets and algorithm, we evaluate state-of-the-art segmentation
models on our datasets and show that models trained with our synthetic labels
perform competitively with respect to models trained on manual labels. Thus,
our datasets enable state-of-the-art research into methods for scribble-labeled
semantic segmentation. The datasets, scribble generation algorithm, and
baselines are publicly available at github.com/wbkit/Scribbles4All.
- “Sailing in High-dimensional Spaces: Low-dimensional Embeddings through Angle Preservation,” 2024. [Online]. Available: https://arxiv.org/abs/2406.09876.
Abstract
Low-dimensional embeddings (LDEs) of high-dimensional data are ubiquitous in
science and engineering. They allow us to quickly understand the main
properties of the data, identify outliers and processing errors, and inform the
next steps of data analysis. As such, LDEs have to be faithful to the original
high-dimensional data, i.e., they should represent the relationships that are
encoded in the data, both at a local as well as a global scale. The current
generation of LDE approaches focuses on reconstructing local distances between
any pair of samples correctly, often outperforming traditional approaches
aiming at all distances. For these approaches, however, global relationships are
usually strongly distorted, which is often argued to be an inherent trade-off
between local and global structure learning for embeddings. We suggest a new
perspective on LDE learning, reconstructing angles between data points. We show
that this approach, Mercat, yields good reconstruction across a diverse set of
experiments and metrics, and preserves structures well across all scales.
Compared to existing work, our approach also has a simple formulation,
facilitating future theoretical analysis and algorithmic improvements.
- “Are Vision Language Models Texture or Shape Biased and Can We Steer Them?,” 2024. [Online]. Available: https://arxiv.org/abs/2403.09193.
Abstract
Vision language models (VLMs) have drastically changed the computer vision
model landscape in only a few years, opening an exciting array of new
applications, from zero-shot image classification to image captioning and
visual question answering. Unlike pure vision models, they offer an intuitive
way to access visual content through language prompting. The wide applicability
of such models encourages us to ask whether they also align with human vision -
specifically, how far they adopt human-induced visual biases through multimodal
fusion, or whether they simply inherit biases from pure vision models. One
important visual bias is the texture vs. shape bias, or the dominance of local
over global information. In this paper, we study this bias in a wide range of
popular VLMs. Interestingly, we find that VLMs are often more shape-biased than
their vision encoders, indicating that visual biases are modulated to some
extent through text in multimodal models. If text does indeed influence visual
biases, this suggests that we may be able to steer visual biases not just
through visual input but also through language: a hypothesis that we confirm
through extensive experiments. For instance, we are able to steer shape bias
from as low as 49% to as high as 72% through prompting alone. For now, the
strong human bias towards shape (96%) remains out of reach for all tested VLMs.
- “Advancing Image and Video Recognition with Less Supervision,” Universität des Saarlandes, Saarbrücken, 2024.
Abstract
Deep learning is increasingly relevant in our daily lives, as it simplifies tedious tasks and enhances quality of life across various domains such as entertainment, learning, automatic assistance, and autonomous driving. However, the demand for more data to train models for emerging tasks is increasing dramatically. Deep learning models heavily depend on the quality and quantity of data, necessitating high-quality labeled datasets. Yet, each task requires different types of annotations for training and evaluation, posing challenges in obtaining comprehensive supervision. The acquisition of annotations is not only resource-intensive in terms of time and cost but also introduces biases, such as granularity in classification, where distinctions like specific breeds versus generic categories may arise. Furthermore, the dynamic nature of the world means that previously annotated data can become irrelevant, and new categories and rare occurrences continually emerge, making it impossible to label every aspect of the world.
Therefore, this thesis aims to explore various supervision scenarios to mitigate the need for full supervision and reduce data acquisition costs. Specifically, we investigate learning without labels, referred to as self-supervised and unsupervised methods, to better understand video and image representations. To learn from data without labels, we leverage injected priors such as motion speed, direction, action order in videos, or semantic information granularity to obtain powerful data representations. Further, we study scenarios involving reduced supervision levels. To reduce annotation costs, first, we propose to omit precise annotations for one modality in multimodal learning, namely in text-video and image-video settings, and transfer available knowledge to large corpora of video data. Second, we study semi-supervised learning scenarios, where only a subset of annotated data alongside unlabeled data is available, and propose to revisit regularization constraints and improve generalization to unlabeled data. Additionally, we address scenarios where parts of the available data are inherently limited due to privacy and security reasons or naturally rare events, which not only restrict annotations but also limit the overall data volume. For these scenarios, we propose methods that carefully balance between previously obtained knowledge and incoming limited data by introducing a calibration method or combining a space reservation technique with orthogonality constraints. Finally, we explore multimodal and unimodal open-world scenarios where the model is asked to generalize beyond the given set of object or action classes. Specifically, we propose a new challenging setting on multimodal egocentric videos and an adaptation method for vision-language models to generalize to the egocentric domain.
Moreover, we study unimodal image recognition in an open-set setting and propose to disentangle the open-set detection and image classification tasks, which effectively improves generalization across different settings.
In summary, this thesis investigates challenges arising when full supervision for training models is not available. We develop methods to understand learning dynamics and the role of biases in data, while also proposing novel setups to advance training with less supervision.
- “Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports,” 2024. [Online]. Available: https://arxiv.org/abs/2401.01505.
Abstract
Reasoning over sports videos for question answering is an important task with
numerous applications, such as player training and information retrieval.
However, this task has not been explored due to the lack of relevant datasets
and the challenging nature it presents. Most datasets for video question
answering (VideoQA) focus mainly on general and coarse-grained understanding of
daily-life videos, which is not applicable to sports scenarios requiring
professional action understanding and fine-grained motion analysis. In this
paper, we introduce the first dataset, named Sports-QA, specifically designed
for the sports VideoQA task. The Sports-QA dataset includes various types of
questions, such as descriptions, chronologies, causalities, and counterfactual
conditions, covering multiple sports. Furthermore, to address the
characteristics of the sports VideoQA task, we propose a new Auto-Focus
Transformer (AFT) capable of automatically focusing on particular scales of
temporal information for question answering. We conduct extensive experiments
on Sports-QA, including baseline studies and the evaluation of different
methods. The results demonstrate that our AFT achieves state-of-the-art
performance.
- “VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis,” 2024. [Online]. Available: https://arxiv.org/abs/2403.13501.
Abstract
Despite tremendous progress in the field of text-to-video (T2V) synthesis,
open-sourced T2V diffusion models struggle to generate longer videos with
dynamically varying and evolving content. They tend to synthesize quasi-static
videos, ignoring the necessary visual change-over-time implied in the text
prompt. At the same time, scaling these models to enable longer, more dynamic
video synthesis often remains computationally intractable. To address this
challenge, we introduce the concept of Generative Temporal Nursing (GTN), where
we aim to alter the generative process on the fly during inference to improve
control over the temporal dynamics and enable generation of longer videos. We
propose a method for GTN, dubbed VSTAR, which consists of two key ingredients:
1) Video Synopsis Prompting (VSP) - automatic generation of a video synopsis
based on the original single prompt leveraging LLMs, which gives accurate
textual guidance to different visual states of longer videos, and 2) Temporal
Attention Regularization (TAR) - a regularization technique to refine the
temporal attention units of the pre-trained T2V diffusion models, which enables
control over the video dynamics. We experimentally showcase the superiority of
the proposed approach in generating longer, visually appealing videos over
existing open-sourced T2V models. We additionally analyze the temporal
attention maps realized with and without VSTAR, demonstrating the importance of
applying our method to mitigate neglect of the desired visual change over time.
- “FAIR-TAT: Improving Model Fairness Using Targeted Adversarial Training,” 2024. [Online]. Available: https://arxiv.org/abs/2410.23142.
Abstract
Deep neural networks are susceptible to adversarial attacks and common
corruptions, which undermine their robustness. In order to enhance model
resilience against such challenges, Adversarial Training (AT) has emerged as a
prominent solution. Nevertheless, adversarial robustness is often attained at
the expense of model fairness during AT, i.e., disparity in class-wise
robustness of the model. While distinctive classes become more robust towards
such adversaries, hard-to-detect classes suffer. Recently, research has focused
on improving model fairness specifically for perturbed images, overlooking the
accuracy of the most likely non-perturbed data. Additionally, despite their
robustness against the adversaries encountered during model training,
state-of-the-art adversarial trained models have difficulty maintaining
robustness and fairness when confronted with diverse adversarial threats or
common corruptions. In this work, we address the above concerns by introducing
a novel approach called Fair Targeted Adversarial Training (FAIR-TAT). We show
that using targeted adversarial attacks for adversarial training (instead of
untargeted attacks) can allow for more favorable trade-offs with respect to
adversarial fairness. Empirical results validate the efficacy of our approach. - “Samba: Synchronized Set-of-Sequences Modeling for Multiple Object Tracking,” 2024. [Online]. Available: https://arxiv.org/abs/2410.01806.more
Abstract
Multiple object tracking in complex scenarios - such as coordinated dance
performances, team sports, or dynamic animal groups - presents unique
challenges. In these settings, objects frequently move in coordinated patterns,
occlude each other, and exhibit long-term dependencies in their trajectories.
However, it remains a key open research question how to model long-range
dependencies within tracklets, interdependencies among tracklets, and the
associated temporal occlusions. To this end, we introduce Samba, a novel
linear-time set-of-sequences model designed to jointly process multiple
tracklets by synchronizing the multiple selective state-spaces used to model
each tracklet. Samba autoregressively predicts the future track query for each
sequence while maintaining synchronized long-term memory representations across
tracklets. By integrating Samba into a tracking-by-propagation framework, we
propose SambaMOTR, the first tracker effectively addressing the aforementioned
issues, including long-range dependencies, tracklet interdependencies, and
temporal occlusions. Additionally, we introduce an effective technique for
dealing with uncertain observations (MaskObs) and an efficient training recipe
to scale SambaMOTR to longer sequences. By modeling long-range dependencies and
interactions among tracked objects, SambaMOTR implicitly learns to track
objects accurately through occlusions without any hand-crafted heuristics. Our
approach significantly surpasses prior state-of-the-art on the DanceTrack, BFT,
and SportsMOT datasets.
- “TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters,” 2024. [Online]. Available: https://arxiv.org/abs/2410.23168.
Abstract
Transformers have become the predominant architecture in foundation models
due to their excellent performance across various domains. However, the
substantial cost of scaling these models remains a significant concern. This
problem arises primarily from their dependence on a fixed number of parameters
within linear projections. When architectural modifications (e.g., channel
dimensions) are introduced, the entire model typically requires retraining from
scratch. As model sizes continue growing, this strategy results in increasingly
high computational costs and becomes unsustainable. To overcome this problem,
we introduce TokenFormer, a natively scalable architecture that leverages the
attention mechanism not only for computations among input tokens but also for
interactions between tokens and model parameters, thereby enhancing
architectural flexibility. By treating model parameters as tokens, we replace
all the linear projections in Transformers with our token-parameter attention
layer, where input tokens act as queries and model parameters as keys and
values. This reformulation allows for progressive and efficient scaling without
necessitating retraining from scratch. Our model scales from 124M to 1.4B
parameters by incrementally adding new key-value parameter pairs, achieving
performance comparable to Transformers trained from scratch while greatly
reducing training costs. Code and models are available at
https://github.com/Haiyang-W/TokenFormer.
- “FaceGPT: Self-supervised Learning to Chat about 3D Human Faces,” 2024. [Online]. Available: https://arxiv.org/abs/2406.07163.
Abstract
We introduce FaceGPT, a self-supervised learning framework for Large
Vision-Language Models (VLMs) to reason about 3D human faces from images and
text. Typical 3D face reconstruction methods are specialized algorithms that
lack semantic reasoning capabilities. FaceGPT overcomes this limitation by
embedding the parameters of a 3D morphable face model (3DMM) into the token
space of a VLM, enabling the generation of 3D faces from both textual and
visual inputs. FaceGPT is trained in a self-supervised manner as a model-based
autoencoder from in-the-wild images. In particular, the hidden state of the LLM
is projected into 3DMM parameters and subsequently rendered as a 2D face image
to guide the self-supervised learning process via image-based reconstruction.
Without relying on expensive 3D annotations of human faces, FaceGPT obtains a
detailed understanding of 3D human faces, while preserving the capacity to
understand general user instructions. Our experiments demonstrate that FaceGPT
not only achieves high-quality 3D face reconstructions but also retains the
ability for general-purpose visual instruction following. Furthermore, FaceGPT
learns fully self-supervised to generate 3D faces based on complex textual
inputs, which opens a new direction in human face analysis.
- “Number it: Temporal Grounding Videos like Flipping Manga,” 2024. [Online]. Available: https://arxiv.org/abs/2411.10332.
Abstract
Video Large Language Models (Vid-LLMs) have made remarkable advancements in
comprehending video content for QA dialogue. However, they struggle to extend
this visual understanding to tasks requiring precise temporal localization,
known as Video Temporal Grounding (VTG). To address this gap, we introduce
Number-Prompt (NumPro), a novel method that empowers Vid-LLMs to bridge visual
comprehension with temporal grounding by adding unique numerical identifiers to
each video frame. Treating a video as a sequence of numbered frame images,
NumPro transforms VTG into an intuitive process: flipping through manga panels
in sequence. This allows Vid-LLMs to "read" event timelines, accurately linking
visual content with corresponding temporal information. Our experiments
demonstrate that NumPro significantly boosts VTG performance of top-tier
Vid-LLMs without additional computational cost. Furthermore, fine-tuning on a
NumPro-enhanced dataset defines a new state-of-the-art for VTG, surpassing
previous top-performing methods by up to 6.9% in mIoU for moment retrieval and
8.5% in mAP for highlight detection. The code will be available at
github.com/yongliang-wu/NumPro.
- “Balancing Diversity and Risk in LLM Sampling: How to Select Your Method and Parameter for Open-Ended Text Generation,” 2024. [Online]. Available: https://arxiv.org/abs/2408.13586.
Abstract
Sampling-based decoding strategies have been widely adopted for Large
Language Models (LLMs) in numerous applications, which target a balance between
diversity and quality via temperature tuning and tail truncation (e.g., top-k
and top-p sampling). Considering the high dynamic range of the candidate
next-token distribution given different prefixes, recent studies propose to adaptively
truncate the tail of the LLM's predicted distribution. Although improved results
have been reported with these methods on open-ended text generation tasks, the
results are highly dependent on the curated truncation parameters and exemplar
text. In this paper, we propose a systematic way to estimate the intrinsic
capacity of a truncation sampling method by considering the trade-off between
diversity and risk at each decoding step, based on our collected prefix tree
which preserves the context of a full sentence. Our work provides a
comprehensive comparison of existing truncation sampling methods, as well as
their recommended parameters, as a guideline for users.
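For reference, the temperature, top-k, and top-p mechanisms this abstract compares follow a standard recipe; the sketch below is a generic NumPy illustration (function name and defaults are mine), not the estimation method proposed in the paper:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Temperature scaling followed by top-k and top-p (nucleus) truncation.

    logits: 1-D array of unnormalized next-token scores.
    top_k=0 disables top-k; top_p=1.0 disables nucleus truncation.
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())            # stable softmax
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]                  # tokens by probability, descending
    sorted_probs = probs[order]

    keep = np.ones_like(sorted_probs, dtype=bool)
    if top_k > 0:
        keep[top_k:] = False                         # keep only the k most likely tokens
    if top_p < 1.0:
        cum = np.cumsum(sorted_probs)
        # keep the smallest prefix whose cumulative mass reaches top_p
        cutoff = int(np.searchsorted(cum, top_p)) + 1
        keep[cutoff:] = False

    truncated = np.where(keep, sorted_probs, 0.0)
    truncated /= truncated.sum()                     # renormalize surviving mass
    return int(order[rng.choice(len(order), p=truncated)])
```

Lowering the temperature or tightening top-k/top-p reduces risk at the cost of diversity, which is exactly the per-step trade-off the paper proposes to quantify.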