Publications - Last Year
2024
- “CloSe: A 3D Clothing Segmentation Dataset and Model,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
- “Interaction Replica: Tracking Human–Object Interaction and Scene Changes From Human Motion,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
- “GAN-Avatar: Controllable Personalized GAN-based Human Head Avatar,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
- “Generating Continual Human Motion in Diverse 3D Scenes,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
- “B-cosification: Transforming Deep Neural Networks to be Inherently Interpretable,” in Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Vancouver, Canada, 2024.
Abstract
B-cos Networks have been shown to be effective for obtaining highly
human-interpretable explanations of model decisions by architecturally enforcing
stronger alignment between inputs and weights. B-cos variants of convolutional
networks (CNNs) and vision transformers (ViTs), which primarily replace linear
layers with B-cos transformations, perform competitively to their respective
standard variants while also yielding explanations that are faithful by design.
However, it has so far been necessary to train these models from scratch, which
is increasingly infeasible in the era of large, pre-trained foundation models.
In this work, inspired by the architectural similarities in standard DNNs and
B-cos networks, we propose 'B-cosification', a novel approach to transform
existing pre-trained models to become inherently interpretable. We perform a
thorough study of design choices for this conversion, both for
convolutional neural networks and vision transformers. We find that
B-cosification can yield models that are on par with B-cos models trained from
scratch in terms of interpretability, while often outperforming them in terms
of classification performance at a fraction of the training cost. Subsequently,
we apply B-cosification to a pretrained CLIP model, and show that, even with
limited data and compute cost, we obtain a B-cosified version that is highly
interpretable and competitive in zero-shot performance across a variety of
datasets. We release our code and pre-trained model weights at
github.com/shrebox/B-cosification.
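For readers unfamiliar with the B-cos transform mentioned in the abstract above, the following minimal PyTorch sketch illustrates the core idea of rescaling a linear response by input-weight alignment. The class name, the parameter b, and the normalization details are illustrative choices based on the published B-cos formulation, not code released with this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BcosLinear(nn.Module):
    """Sketch of a B-cos-style linear layer: the linear response is rescaled
    by |cos(x, w)|^(B-1), which rewards alignment between input and weights."""
    def __init__(self, in_features, out_features, b=2.0):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=False)
        self.b = b

    def forward(self, x):
        w = F.normalize(self.linear.weight, dim=1)    # unit-norm weight rows
        out = F.linear(x, w)                          # w_hat^T x
        norm = x.norm(dim=-1, keepdim=True).clamp_min(1e-6)
        cos = (out / norm).abs().clamp_min(1e-6)      # |cos(angle(x, w))|
        return out * cos.pow(self.b - 1)              # B-cos response
```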
- “From Similarity to Superiority: Channel Clustering for Time Series Forecasting,” in Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Vancouver, Canada, 2024.
- “Pruning Neural Network Models for Gene Regulatory Dynamics Using Data and Domain Knowledge,” in Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Vancouver, Canada, 2024.
- “RelBench: A Benchmark for Deep Learning on Relational Databases,” in Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Vancouver, Canada, 2024.
- “Neural Localizer Fields for Continuous 3D Human Pose and Shape Estimation,” in Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Vancouver, Canada, 2024.
- “Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models,” in Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Vancouver, Canada, 2024.
- “Recent Trends in 3D Reconstruction of General Non-Rigid Scenes,” Computer Graphics Forum (Proc. EUROGRAPHICS 2024), vol. 43, no. 2, 2024.
- “Improving Feature Stability during Upsampling - Spectral Artifacts and the Importance of Spatial Context,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “NICP: Neural ICP for 3D Human Registration at Scale,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “Good Teachers Explain: Explanation-Enhanced Knowledge Distillation,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “DiscoMatch: Fast Discrete Optimisation for Geometrically Consistent 3D Shape Matching,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Appearance Graphs,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “HowToCaption: Prompting LLMs to Transform Video Annotations at Scale,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “GiT: Towards Generalist Vision Transformer through Universal Language Interface,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “Improving 2D Feature Representations by 3D-Aware Fine-Tuning,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “Sp2360: Sparse-view 360° Scene Reconstruction using Cascaded 2D Diffusion Priors,” in ECCV 2024 Workshop on Wild 3D (ECCV 2024 Wild3D), Milan, Italy, 2024.
- “Domain-Aware Fine-Tuning of Foundation Models,” in ICML 2024 Workshop on Foundation Models in the Wild (ICML 2024 FM-Wild Workshop), Vienna, Austria, 2024.
- “OrCo: Towards Better Generalization via Orthogonality and Contrast for Few-Shot Class-Incremental Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “NRDF: Neural Riemannian Distance Fields for Learning Articulated Pose Priors,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Training Vision Transformers for Semi-Supervised Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Open-Vocabulary 3D Semantic Segmentation with Foundation Models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Neural Point Cloud Diffusion for Disentangled 3D Shape and Appearance Generation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Point Transformer V3: Simpler, Faster, Stronger,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “GEARS: Local Geometry-aware Hand-object Interaction Synthesis,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Task Driven Sensor Layouts - Joint Optimization of Pixel Layout and Network Parameters,” in IEEE International Conference on Computational Photography (ICCP 2024), Lausanne, Switzerland, 2024.
- “Automated Dominative Subspace Mining for Efficient Neural Architecture Search,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 10, 2024.
- “Enhanced Long-Tailed Recognition With Contrastive CutMix Augmentation,” IEEE Transactions on Image Processing, vol. 33, 2024.
- “Semi-Supervised and Unsupervised Deep Visual Learning: A Survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 3, 2024.
- “Better Understanding Differences in Attribution Methods via Systematic Evaluations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 6, 2024.
- “MTR++: Multi-Agent Motion Prediction With Symmetric Scene Modeling and Guided Intention Querying,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 5, 2024.
- “Intra- & Extra-Source Exemplar-Based Style Synthesis for Improved Domain Generalization,” International Journal of Computer Vision, vol. 132, 2024.
- “An Evaluation of Zero-Cost Proxies - From Neural Architecture Performance Prediction to Model Robustness,” International Journal of Computer Vision, 2024.
- “How Do Training Methods Influence the Utilization of Vision Models?,” in Interpretable AI: Past, Present and Future (IAI Workshop @ NeurIPS 2024), Vancouver, Canada, 2024.
- “Toward a Diffusion-Based Generalist for Dense Vision Tasks,” in MMFM2, The 2nd Workshop on What is Next in Multimodal Foundation Models?, Seattle, WA, USA, 2024.
- “DOGE-Train: Discrete Optimization on GPU with End-to-End Training,” in Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, Canada, 2024.
- “CosPGD: An Efficient White-Box Adversarial Attack for Pixel-Wise Prediction Tasks,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- “Adaptive Hierarchical Certification for Segmentation using Randomized Smoothing,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- “Position: Relational Deep Learning - Graph Representation Learning on Relational Databases,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- “Implicit Representations for Constrained Image Segmentation,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- “MultiMax: Sparse and Multi-Modal Attention Learning,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- “Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive,” in The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 2024.
- “On Adversarial Training without Perturbing all Examples,” in The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 2024.
- “Learning the essential in less than 2k additional weights - a simple approach to improve image classification stability under corruptions,” Transactions on Machine Learning Research, vol. 2024, no. 6, 2024.
- “As large as it gets - Studying Infinitely Large Convolutions via Neural Implicit Frequency Filters,” Transactions on Machine Learning Research, vol. 2024, 2024.
- “Wakening Past Concepts without Past Data: Class-Incremental Learning from Online Placebos,” in WACV 2024, IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2024.
- “Efficient and Differentiable Combinatorial Optimization for Visual Computing,” Universität des Saarlandes, Saarbrücken, 2024.
- “Beware of Aliases -- Signal Preservation is Crucial for Robust Image Restoration,” 2024. [Online]. Available: https://arxiv.org/abs/2406.07435.
Abstract
Image restoration networks are usually comprised of an encoder and a decoder,
responsible for aggregating image content from noisy, distorted data and for
restoring clean, undistorted images, respectively. Data aggregation as well as
high-resolution image generation both usually come at the risk of involving
aliases, i.e., standard architectures put their ability to reconstruct the model
input in jeopardy to reach high PSNR values on validation data. The price to be
paid is low model robustness. In this work, we show that simply providing
alias-free paths in state-of-the-art reconstruction transformers supports
improved model robustness at a low cost in restoration performance. We do so
by proposing BOA-Restormer, a transformer-based image restoration model that
executes downsampling and upsampling operations partly in the frequency domain
to ensure alias-free paths along the entire model while potentially preserving
all relevant high-frequency information.
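As a rough illustration of the general idea behind alias-free frequency-domain downsampling described above (not the exact BOA-Restormer operator), the sketch below crops the centered spectrum, an ideal low-pass, before transforming back; the function name and crop strategy are assumptions for illustration.

```python
import torch

def fft_downsample(x: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Alias-free downsampling sketch: keep only the low frequencies by
    cropping the centered 2D spectrum, then transform back to image space."""
    b, c, h, w = x.shape
    spec = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"), dim=(-2, -1))
    nh, nw = h // factor, w // factor
    top, left = (h - nh) // 2, (w - nw) // 2
    cropped = spec[..., top:top + nh, left:left + nw]
    out = torch.fft.ifft2(torch.fft.ifftshift(cropped, dim=(-2, -1)), norm="ortho")
    return out.real

x = torch.randn(1, 3, 64, 64)
print(fft_downsample(x).shape)  # torch.Size([1, 3, 32, 32])
```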
- “Towards Designing Inherently Interpretable Deep Neural Networks for Image Classification,” Universität des Saarlandes, Saarbrücken, 2024.
- “Scribbles for All: Benchmarking Scribble Supervised Segmentation Across Datasets,” 2024. [Online]. Available: https://arxiv.org/abs/2408.12489.
Abstract
In this work, we introduce Scribbles for All, a label and training data
generation algorithm for semantic segmentation trained on scribble labels.
Training or fine-tuning semantic segmentation models with weak supervision has
become an important topic recently and has seen significant advances in
model quality. In this setting, scribbles are a promising label type to achieve
high quality segmentation results while requiring a much lower annotation
effort than usual pixel-wise dense semantic segmentation annotations. The main
limitation of scribbles as a source of weak supervision is the lack of
challenging datasets for scribble segmentation, which hinders the development
of novel methods and conclusive evaluations. To overcome this limitation,
Scribbles for All provides scribble labels for several popular segmentation
datasets and provides an algorithm to automatically generate scribble labels
for any dataset with dense annotations, paving the way for new insights and
model advancements in the field of weakly supervised segmentation. In addition
to providing the datasets and the algorithm, we evaluate state-of-the-art segmentation
models on our datasets and show that models trained with our synthetic labels
perform competitively with respect to models trained on manual labels. Thus,
our datasets enable state-of-the-art research into methods for scribble-labeled
semantic segmentation. The datasets, scribble generation algorithm, and
baselines are publicly available at github.com/wbkit/Scribbles4All
- “SLayR: Scene Layout Generation with Rectified Flow,” 2024. [Online]. Available: https://arxiv.org/abs/2412.05003.
Abstract
We introduce SLayR, Scene Layout Generation with Rectified flow.
State-of-the-art text-to-image models achieve impressive results. However, they
generate images end-to-end, exposing no fine-grained control over the process.
SLayR presents a novel transformer-based rectified flow model for layout
generation over a token space that can be decoded into bounding boxes and
corresponding labels, which can then be transformed into images using existing
models. We show that established metrics for generated images are inconclusive
for evaluating their underlying scene layout, and introduce a new benchmark
suite, including a carefully designed repeatable human-evaluation procedure
that assesses the plausibility and variety of generated layouts. In contrast to
previous works, which perform well in either high variety or plausibility, we
show that our approach performs well on both of these axes at the same time. It
is also at least 5x smaller in the number of parameters and 37% faster
than the baselines. Our complete text-to-image pipeline demonstrates the added
benefits of an interpretable and editable intermediate representation.
- “Sailing in High-dimensional Spaces: Low-dimensional Embeddings through Angle Preservation,” 2024. [Online]. Available: https://arxiv.org/abs/2406.09876.
Abstract
Low-dimensional embeddings (LDEs) of high-dimensional data are ubiquitous in
science and engineering. They allow us to quickly understand the main
properties of the data, identify outliers and processing errors, and inform the
next steps of data analysis. As such, LDEs have to be faithful to the original
high-dimensional data, i.e., they should represent the relationships that are
encoded in the data, both at a local as well as global scale. The current
generation of LDE approaches focuses on reconstructing local distances between
any pair of samples correctly, often outperforming traditional approaches
aiming at all distances. For these approaches, global relationships are,
however, usually strongly distorted, often argued to be an inherent trade-off
between local and global structure learning for embeddings. We suggest a new
perspective on LDE learning, reconstructing angles between data points. We show
that this approach, Mercat, yields good reconstruction across a diverse set of
experiments and metrics, and preserves structures well across all scales.
Compared to existing work, our approach also has a simple formulation,
facilitating future theoretical analysis and algorithmic improvements.
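As a generic illustration of what reconstructing angles between data points can look like in practice (not the exact Mercat objective), the sketch below samples triplets and penalizes discrepancies between the angle cosines measured in the high-dimensional data and in the embedding; the function name and the random triplet sampling are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def angle_loss(X_high, X_low, n_triplets=1024):
    """Angle-preservation sketch: for random triplets (i, j, k), compare the
    cosine of the angle at vertex i between (x_j - x_i) and (x_k - x_i)
    in the original space and in the low-dimensional embedding."""
    n = X_high.shape[0]
    idx = torch.randint(0, n, (n_triplets, 3))
    i, j, k = idx[:, 0], idx[:, 1], idx[:, 2]

    def cos_at_vertex(X):
        u = X[j] - X[i]
        v = X[k] - X[i]
        return F.cosine_similarity(u, v, dim=-1, eps=1e-8)

    return ((cos_at_vertex(X_high) - cos_at_vertex(X_low)) ** 2).mean()
```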
- “Are Vision Language Models Texture or Shape Biased and Can We Steer Them?,” 2024. [Online]. Available: https://arxiv.org/abs/2403.09193.
Abstract
Vision language models (VLMs) have drastically changed the computer vision
model landscape in only a few years, opening an exciting array of new
applications, from zero-shot image classification to image captioning and
visual question answering. Unlike pure vision models, they offer an intuitive
way to access visual content through language prompting. The wide applicability
of such models encourages us to ask whether they also align with human vision -
specifically, how far they adopt human-induced visual biases through multimodal
fusion, or whether they simply inherit biases from pure vision models. One
important visual bias is the texture vs. shape bias, or the dominance of local
over global information. In this paper, we study this bias in a wide range of
popular VLMs. Interestingly, we find that VLMs are often more shape-biased than
their vision encoders, indicating that visual biases are modulated to some
extent through text in multimodal models. If text does indeed influence visual
biases, this suggests that we may be able to steer visual biases not just
through visual input but also through language: a hypothesis that we confirm
through extensive experiments. For instance, we are able to steer shape bias
from as low as 49% to as high as 72% through prompting alone. For now, the
strong human bias towards shape (96%) remains out of reach for all tested VLMs.
- “blendify – Python rendering framework for Blender,” 2024. [Online]. Available: https://arxiv.org/abs/2410.17858.
Abstract
With the rapid growth of the volume of research fields like computer vision
and computer graphics, researchers require effective and user-friendly
rendering tools to visualize results. While advanced tools like Blender offer
powerful capabilities, they also require a significant effort to master. This
technical report introduces Blendify, a lightweight Python-based framework that
seamlessly integrates with Blender, providing a high-level API for scene
creation and rendering. Blendify reduces the complexity of working with
Blender's native API by automating object creation, handling color and
material linking, and implementing features such as shadow-catcher objects
while maintaining support for high-quality ray-tracing rendering output. With a
focus on usability, Blendify enables an efficient and flexible rendering workflow
for common computer vision and computer graphics use cases. The
code is available at github.com/ptrvilya/blendify
- “Are Pose Estimators Ready for the Open World? STAGE: Synthetic Data Generation Toolkit for Auditing 3D Human Pose Estimators,” 2024. [Online]. Available: https://arxiv.org/abs/2408.16536.
Abstract
The estimation of 3D human poses from images has progressed tremendously over
the last few years as measured on standard benchmarks. However, performance in
the open world remains underexplored, as current benchmarks cannot capture its
full extent. Especially in safety-critical systems, it is crucial that 3D pose
estimators are audited before deployment, and their sensitivity towards single
factors or attributes occurring in the operational domain is thoroughly
examined. Nevertheless, we currently lack a benchmark that would enable such
fine-grained analysis. We thus present STAGE, a GenAI data toolkit for auditing
3D human pose estimators. We enable a text-to-image model to control the 3D
human body pose in the generated image. This allows us to create customized
annotated data covering a wide range of open-world attributes. We leverage
STAGE and generate a series of benchmarks to audit the sensitivity of popular
pose estimators towards attributes such as gender, ethnicity, age, clothing,
location, and weather. Our results show that the presence of such naturally
occurring attributes can cause severe degradation in the performance of pose
estimators and leads us to question if they are ready for open-world
deployment.
- “Advancing Image and Video Recognition with Less Supervision,” Universität des Saarlandes, Saarbrücken, 2024.
Abstract
Deep learning is increasingly relevant in our daily lives, as it simplifies tedious tasks and enhances quality of life across various domains such as entertainment, learning, automatic assistance, and autonomous driving. However, the demand for more data to train models for emerging tasks is increasing dramatically. Deep learning models heavily depend on the quality and quantity of data, necessitating high-quality labeled datasets. Yet, each task requires different types of annotations for training and evaluation, posing challenges in obtaining comprehensive supervision. The acquisition of annotations is not only resource-intensive in terms of time and cost but also introduces biases, such as granularity in classification, where distinctions like specific breeds versus generic categories may arise. Furthermore, the dynamic nature of the world causes the challenge that previously annotated data becomes potentially irrelevant, and new categories and rare occurrences continually emerge, making it impossible to label every aspect of the world.
Therefore, this thesis aims to explore various supervision scenarios to mitigate the need for full supervision and reduce data acquisition costs. Specifically, we investigate learning without labels, referred to as self-supervised and unsupervised methods, to better understand video and image representations. To learn from data without labels, we leverage injected priors such as motion speed, direction, action order in videos, or semantic information granularity to obtain powerful data representations. Further, we study scenarios involving reduced supervision levels. To reduce annotation costs, first, we propose to omit precise annotations for one modality in multimodal learning, namely in text-video and image-video settings, and transfer available knowledge to large corpora of video data. Second, we study semi-supervised learning scenarios, where only a subset of annotated data alongside unlabeled data is available, and propose to revisit regularization constraints and improve generalization to unlabeled data. Additionally, we address scenarios where parts of the available data are inherently limited due to privacy and security reasons or naturally rare events, which not only restrict annotations but also limit the overall data volume. For these scenarios, we propose methods that carefully balance between previously obtained knowledge and incoming limited data by introducing a calibration method or combining a space reservation technique with orthogonality constraints. Finally, we explore multimodal and unimodal open-world scenarios where the model is asked to generalize beyond the given set of object or action classes. Specifically, we propose a new challenging setting on multimodal egocentric videos and propose an adaptation method for vision-language models to generalize to the egocentric domain. Moreover, we study unimodal image recognition in an open-set setting and propose to disentangle open-set detection and image classification tasks, which effectively improves generalization in different settings.
In summary, this thesis investigates challenges arising when full supervision for training models is not available. We develop methods to understand learning dynamics and the role of biases in data, while also proposing novel setups to advance training with less supervision.
- “Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports,” 2024. [Online]. Available: https://arxiv.org/abs/2401.01505.
Abstract
Reasoning over sports videos for question answering is an important task with
numerous applications, such as player training and information retrieval.
However, this task has not been explored due to the lack of relevant datasets
and the challenging nature it presents. Most datasets for video question
answering (VideoQA) focus mainly on general and coarse-grained understanding of
daily-life videos, which is not applicable to sports scenarios requiring
professional action understanding and fine-grained motion analysis. In this
paper, we introduce the first dataset, named Sports-QA, specifically designed
for the sports VideoQA task. The Sports-QA dataset includes various types of
questions, such as descriptions, chronologies, causalities, and counterfactual
conditions, covering multiple sports. Furthermore, to address the
characteristics of the sports VideoQA task, we propose a new Auto-Focus
Transformer (AFT) capable of automatically focusing on particular scales of
temporal information for question answering. We conduct extensive experiments
on Sports-QA, including baseline studies and the evaluation of different
methods. The results demonstrate that our AFT achieves state-of-the-art
performance.
- “VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis,” 2024. [Online]. Available: https://arxiv.org/abs/2403.13501.
Abstract
Despite tremendous progress in the field of text-to-video (T2V) synthesis,
open-sourced T2V diffusion models struggle to generate longer videos with
dynamically varying and evolving content. They tend to synthesize quasi-static
videos, ignoring the necessary visual change-over-time implied in the text
prompt. At the same time, scaling these models to enable longer, more dynamic
video synthesis often remains computationally intractable. To address this
challenge, we introduce the concept of Generative Temporal Nursing (GTN), where
we aim to alter the generative process on the fly during inference to improve
control over the temporal dynamics and enable generation of longer videos. We
propose a method for GTN, dubbed VSTAR, which consists of two key ingredients:
1) Video Synopsis Prompting (VSP) - automatic generation of a video synopsis
based on the original single prompt leveraging LLMs, which gives accurate
textual guidance to different visual states of longer videos, and 2) Temporal
Attention Regularization (TAR) - a regularization technique to refine the
temporal attention units of the pre-trained T2V diffusion models, which enables
control over the video dynamics. We experimentally showcase the superiority of
the proposed approach in generating longer, visually appealing videos over
existing open-sourced T2V models. We additionally analyze the temporal
attention maps realized with and without VSTAR, demonstrating the importance of
applying our method to mitigate neglect of the desired visual change over time.
- “3D-WAG: Hierarchical Wavelet-Guided Autoregressive Generation for High-Fidelity 3D Shapes,” 2024. [Online]. Available: https://arxiv.org/abs/2411.19037.
Abstract
Autoregressive (AR) models have achieved remarkable success in natural
language and image generation, but their application to 3D shape modeling
remains largely unexplored. Unlike diffusion models, AR models enable more
efficient and controllable generation with faster inference times, making them
especially suitable for data-intensive domains. Traditional 3D generative
models using AR approaches often rely on "next-token" predictions at the voxel
or point level. While effective for certain applications, these methods can be
restrictive and computationally expensive when dealing with large-scale 3D
data. To tackle these challenges, we introduce 3D-WAG, an AR model for 3D
implicit distance fields that can perform unconditional shape generation,
class-conditioned and also text-conditioned shape generation. Our key idea is
to encode shapes as multi-scale wavelet token maps and use a Transformer to
predict the "next higher-resolution token map" in an autoregressive manner. By
redefining the 3D AR generation task as "next-scale" prediction, we reduce the
computational cost of generation compared to traditional "next-token"
prediction models, while preserving essential geometric details of 3D shapes in
a more structured and hierarchical manner. We evaluate 3D-WAG to showcase its
benefit by quantitative and qualitative comparisons with state-of-the-art
methods on widely used benchmarks. Our results show 3D-WAG achieves superior
performance in key metrics like Coverage and MMD, generating high-fidelity 3D
shapes that closely match the real data distribution.
- “Towards Class-wise Robustness Analysis,” 2024. [Online]. Available: https://arxiv.org/abs/2411.19853.
Abstract
While being very successful in solving many downstream tasks, the application
of deep neural networks is limited in real-life scenarios because of their
susceptibility to domain shifts such as common corruptions, and adversarial
attacks. The existence of adversarial examples and data corruption
significantly reduces the performance of deep classification models.
Researchers have made strides in developing robust neural architectures to
bolster decisions of deep classifiers. However, most of these works rely on
effective adversarial training methods, and predominantly focus on overall
model robustness, disregarding class-wise differences in robustness, which are
critical. Exploiting weakly robust classes is a potential avenue for attackers
to fool the image recognition models. Therefore, this study investigates
class-to-class biases across adversarially trained robust classification models
to understand their latent space structures and analyze their strong and weak
class-wise properties. We further assess the robustness of classes against
common corruptions and adversarial attacks, recognizing that class
vulnerability extends beyond the number of correct classifications for a
specific class. We find that the number of false positives of classes as
specific target classes significantly impacts their vulnerability to attacks.
Through our analysis of the Class False Positive Score, we provide a fair
evaluation of how susceptible each class is to misclassification.
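A minimal sketch of counting class-wise false positives, in the spirit of the Class False Positive Score mentioned above (the paper's exact definition may differ); the function name and the toy example are illustrative.

```python
import numpy as np

def class_false_positive_counts(y_true, y_pred, num_classes):
    """Count, for each class c, how often c is predicted for a sample whose
    true label is different, i.e. how often c attracts misclassifications."""
    fp = np.zeros(num_classes, dtype=int)
    for t, p in zip(y_true, y_pred):
        if t != p:
            fp[p] += 1
    return fp

y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 2, 2, 0, 1])
print(class_false_positive_counts(y_true, y_pred, 3))  # [1 0 1]
```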
- “FAIR-TAT: Improving Model Fairness Using Targeted Adversarial Training,” 2024. [Online]. Available: https://arxiv.org/abs/2410.23142.
Abstract
Deep neural networks are susceptible to adversarial attacks and common
corruptions, which undermine their robustness. In order to enhance model
resilience against such challenges, Adversarial Training (AT) has emerged as a
prominent solution. Nevertheless, adversarial robustness is often attained at
the expense of model fairness during AT, i.e., disparity in class-wise
robustness of the model. While distinctive classes become more robust towards
such adversaries, hard-to-detect classes suffer. Recently, research has focused
on improving model fairness specifically for perturbed images, overlooking the
accuracy of the most likely non-perturbed data. Additionally, despite their
robustness against the adversaries encountered during model training,
state-of-the-art adversarially trained models have difficulty maintaining
robustness and fairness when confronted with diverse adversarial threats or
common corruptions. In this work, we address the above concerns by introducing
a novel approach called Fair Targeted Adversarial Training (FAIR-TAT). We show
that using targeted adversarial attacks for adversarial training (instead of
untargeted attacks) can allow for more favorable trade-offs with respect to
adversarial fairness. Empirical results validate the efficacy of our approach.
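For context, the sketch below shows a standard targeted L-infinity PGD attack, the kind of targeted adversarial example generation that FAIR-TAT's training builds on; the function name, step sizes, and budget are generic assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def targeted_pgd(model, x, target, eps=8/255, alpha=2/255, steps=10):
    """Targeted L-inf PGD sketch: push the input towards a chosen target class
    by descending on the targeted cross-entropy loss within an eps-ball."""
    x = x.detach()
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), target)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # step towards the target class, then project back into the eps-ball
        x_adv = x_adv.detach() - alpha * grad.sign()
        x_adv = (x + (x_adv - x).clamp(-eps, eps)).clamp(0, 1)
    return x_adv.detach()
```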
- “Samba: Synchronized Set-of-Sequences Modeling for Multiple Object Tracking,” 2024. [Online]. Available: https://arxiv.org/abs/2410.01806.
Abstract
Multiple object tracking in complex scenarios - such as coordinated dance
performances, team sports, or dynamic animal groups - presents unique
challenges. In these settings, objects frequently move in coordinated patterns,
occlude each other, and exhibit long-term dependencies in their trajectories.
However, it remains a key open research question how to model long-range
dependencies within tracklets, interdependencies among tracklets, and the
associated temporal occlusions. To this end, we introduce Samba, a novel
linear-time set-of-sequences model designed to jointly process multiple
tracklets by synchronizing the multiple selective state-spaces used to model
each tracklet. Samba autoregressively predicts the future track query for each
sequence while maintaining synchronized long-term memory representations across
tracklets. By integrating Samba into a tracking-by-propagation framework, we
propose SambaMOTR, the first tracker effectively addressing the aforementioned
issues, including long-range dependencies, tracklet interdependencies, and
temporal occlusions. Additionally, we introduce an effective technique for
dealing with uncertain observations (MaskObs) and an efficient training recipe
to scale SambaMOTR to longer sequences. By modeling long-range dependencies and
interactions among tracked objects, SambaMOTR implicitly learns to track
objects accurately through occlusions without any hand-crafted heuristics. Our
approach significantly surpasses prior state-of-the-art on the DanceTrack, BFT,
and SportsMOT datasets.
- “FaceGPT: Self-supervised Learning to Chat about 3D Human Faces,” 2024. [Online]. Available: https://arxiv.org/abs/2406.07163.
Abstract
We introduce FaceGPT, a self-supervised learning framework for Large
Vision-Language Models (VLMs) to reason about 3D human faces from images and
text. Typical 3D face reconstruction methods are specialized algorithms that
lack semantic reasoning capabilities. FaceGPT overcomes this limitation by
embedding the parameters of a 3D morphable face model (3DMM) into the token
space of a VLM, enabling the generation of 3D faces from both textual and
visual inputs. FaceGPT is trained in a self-supervised manner as a model-based
autoencoder from in-the-wild images. In particular, the hidden state of the LLM
is projected into 3DMM parameters and subsequently rendered as a 2D face image to
guide the self-supervised learning process via image-based reconstruction.
Without relying on expensive 3D annotations of human faces, FaceGPT obtains a
detailed understanding of 3D human faces, while preserving the capacity to
understand general user instructions. Our experiments demonstrate that FaceGPT
not only achieves high-quality 3D face reconstructions but also retains the
ability for general-purpose visual instruction following. Furthermore, FaceGPT
learns in a fully self-supervised manner to generate 3D faces based on complex textual
inputs, which opens a new direction in human face analysis.
- “Number it: Temporal Grounding Videos like Flipping Manga,” 2024. [Online]. Available: https://arxiv.org/abs/2411.10332.
Abstract
Video Large Language Models (Vid-LLMs) have made remarkable advancements in
comprehending video content for QA dialogue. However, they struggle to extend
this visual understanding to tasks requiring precise temporal localization,
known as Video Temporal Grounding (VTG). To address this gap, we introduce
Number-Prompt (NumPro), a novel method that empowers Vid-LLMs to bridge visual
comprehension with temporal grounding by adding unique numerical identifiers to
each video frame. Treating a video as a sequence of numbered frame images,
NumPro transforms VTG into an intuitive process: flipping through manga panels
in sequence. This allows Vid-LLMs to "read" event timelines, accurately linking
visual content with corresponding temporal information. Our experiments
demonstrate that NumPro significantly boosts VTG performance of top-tier
Vid-LLMs without additional computational cost. Furthermore, fine-tuning on a
NumPro-enhanced dataset defines a new state-of-the-art for VTG, surpassing
previous top-performing methods by up to 6.9% in mIoU for moment retrieval and
8.5% in mAP for highlight detection. The code will be available at
github.com/yongliang-wu/NumPro.
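A minimal sketch of the frame-numbering idea behind NumPro, overlaying an index on each video frame with PIL; the position, color, and font choices are arbitrary assumptions rather than the paper's exact rendering settings.

```python
from PIL import Image, ImageDraw

def number_frames(frames):
    """Overlay a running frame index on every frame so a Vid-LLM can refer to
    moments by number, in the spirit of NumPro's numbered frames."""
    numbered = []
    for idx, frame in enumerate(frames, start=1):
        frame = frame.copy()
        draw = ImageDraw.Draw(frame)
        draw.text((10, 10), str(idx), fill=(255, 0, 0))  # default PIL font
        numbered.append(frame)
    return numbered

frames = [Image.new("RGB", (320, 240), "black") for _ in range(8)]
numbered = number_frames(frames)
```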
- “SCENIC: Scene-aware Semantic Navigation with Instruction-guided Control,” 2024. [Online]. Available: https://arxiv.org/abs/2412.15664.
Abstract
Synthesizing natural human motion that adapts to complex environments while
allowing creative control remains a fundamental challenge in motion synthesis.
Existing models often fall short, either by assuming flat terrain or lacking
the ability to control motion semantics through text. To address these
limitations, we introduce SCENIC, a diffusion model designed to generate human
motion that adapts to dynamic terrains within virtual scenes while enabling
semantic control through natural language. The key technical challenge lies in
simultaneously reasoning about complex scene geometry while maintaining text
control. This requires understanding both high-level navigation goals and
fine-grained environmental constraints. The model must ensure physical
plausibility and precise navigation across varied terrain, while also
preserving user-specified text control, such as "carefully stepping over
obstacles" or "walking upstairs like a zombie." Our solution introduces a
hierarchical scene reasoning approach. At its core is a novel scene-dependent,
goal-centric canonicalization that handles high-level goal constraint, and is
complemented by an ego-centric distance field that captures local geometric
details. This dual representation enables our model to generate physically
plausible motion across diverse 3D scenes. By implementing frame-wise text
alignment, our system achieves seamless transitions between different motion
styles while maintaining scene constraints. Experiments demonstrate our novel
diffusion model generates arbitrarily long human motions that both adapt to
complex scenes with varying terrain surfaces and respond to textual prompts.
Additionally, we show SCENIC can generalize to four real-scene datasets. Our
code, dataset, and models will be released at
https://virtualhumans.mpi-inf.mpg.de/scenic/.
- “Balancing Diversity and Risk in LLM Sampling: How to Select Your Method and Parameter for Open-Ended Text Generation,” 2024. [Online]. Available: https://arxiv.org/abs/2408.13586.
Abstract
Sampling-based decoding strategies have been widely adopted for Large
Language Models (LLMs) in numerous applications, which target a balance between
diversity and quality via temperature tuning and tail truncation (e.g., top-k
and top-p sampling). Considering the high dynamic range of the candidate
next-token given different prefixes, recent studies propose to adaptively
truncate the tail of the LLM's predicted distribution. Although improved results
have been reported with these methods on open-ended text generation tasks, the
results are highly dependent on the curated truncation parameters and exemplar
text. In this paper, we propose a systematic way to estimate the intrinsic
capacity of a truncation sampling method by considering the trade-off between
diversity and risk at each decoding step, based on our collected prefix tree
which preserves the context of a full sentence. Our work provides a
comprehensive comparison between existing truncation sampling methods, as well
as their recommended parameters as a guideline for users.
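For reference, the sketch below implements standard temperature scaling with top-p (nucleus) truncation, the kind of truncation sampling analyzed above; it is a textbook implementation, not the paper's evaluation code, and the function name and defaults are illustrative.

```python
import torch

def sample_top_p(logits, temperature=1.0, top_p=0.9):
    """Temperature + top-p (nucleus) sampling sketch: keep the smallest set of
    tokens whose cumulative probability reaches top_p, renormalize, and sample."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # drop tokens outside the nucleus (the top-1 token is always kept)
    cutoff = cumulative - sorted_probs >= top_p
    sorted_probs[cutoff] = 0.0
    sorted_probs /= sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice]

logits = torch.randn(50257)  # e.g. a GPT-2-sized vocabulary
print(sample_top_p(logits, temperature=0.8, top_p=0.95))
```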