Publications - Current Year
2025
- “HMD2: Environment-aware Motion Generation from Single Egocentric Head-Mounted Device,” in 3DV 2025, 12th International Conference on 3D Vision, Singapore.
- “Unimotion: Unifying 3D Human Motion Synthesis and Understanding,” in 3DV 2025, 12th International Conference on 3D Vision, Singapore.
- “Spurfies: Sparse-view Surface Reconstruction using Local Geometry Priors,” in 3DV 2025, International Conference on 3D Vision, Singapore.
- “Gaussians-to-Life: Text-Driven Animation of 3D Gaussian Splatting Scenes,” in 3DV 2025, International Conference on 3D Vision, Singapore.
Abstract
State-of-the-art novel view synthesis methods achieve impressive results for
multi-view captures of static 3D scenes. However, the reconstructed scenes
still lack "liveliness," a key component for creating engaging 3D experiences.
Recently, novel video diffusion models have made it possible to generate
realistic videos with complex motion and to animate 2D images; however, they
cannot naively be used to animate 3D scenes, as they lack multi-view
consistency. To breathe life into
the static world, we propose Gaussians2Life, a method for animating parts of
high-quality 3D scenes in a Gaussian Splatting representation. Our key idea is
to leverage powerful video diffusion models as the generative component of our
model and to combine these with a robust technique to lift 2D videos into
meaningful 3D motion. We find that, in contrast to prior work, this enables
realistic animations of complex, pre-existing 3D scenes and further enables the
animation of a large variety of object classes, while related work is mostly
focused on prior-based character animation or single 3D objects. Our model
enables the creation of consistent, immersive 3D experiences for arbitrary
scenes.
- “InterTrack: Tracking Human Object Interaction without Object Templates,” in 3DV 2025, International Conference on 3D Vision, Singapore.
- “FORCE: Dataset and Method for Intuitive Physics Guided Human-object Interaction,” in 3DV 2025, International Conference on 3D Vision, Singapore.
- “Corrigendum to ‘A polyhedral study of lifted multicuts’ [Discrete Optim. 47 (2023) 100757],” Discrete Optimization, vol. 55, 2025.
- “EgoLM: Multi-Modal Language Model of Egocentric Motions,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
- “Test-Time Visual In-Context Tuning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
- “TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters,” in Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore.
- “ContextGNN: Beyond Two-Tower Recommendation Systems,” in Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore.
- “MEt3R: Measuring Multi-View Consistency in Generated Images,” 2025. [Online].
Abstract
We introduce MEt3R, a metric for multi-view consistency in generated images.
Large-scale generative models for multi-view image generation are rapidly
advancing the field of 3D inference from sparse observations. However, due to
the nature of generative modeling, traditional reconstruction metrics are not
suitable for measuring the quality of generated outputs, and metrics that are
independent of the sampling procedure are desperately needed. In this work, we
specifically address the aspect of consistency between generated multi-view
images, which can be evaluated independently of the specific scene. Our
approach uses DUSt3R to obtain dense 3D reconstructions from image pairs in a
feed-forward manner, which are used to warp image contents from one view into
the other. Then, feature maps of these images are compared to obtain a
similarity score that is invariant to view-dependent effects. Using MEt3R, we
evaluate the consistency of a large set of previous methods for novel view and
video generation, including our open, multi-view latent diffusion model.
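The snippet below is only a rough, hypothetical rendering of the pairwise consistency check the MEt3R abstract describes (feed-forward two-view reconstruction, warping one view into the other, and comparing feature maps). The helpers `estimate_pointmap` and `extract_features` stand in for a DUSt3R-like reconstructor and a feature backbone; they are assumptions for illustration, not the actual MEt3R or DUSt3R API.

```python
import torch
import torch.nn.functional as F

def met3r_style_consistency(img_a, img_b, estimate_pointmap, extract_features,
                            intrinsics_b):
    """Hypothetical sketch of a MEt3R-style pairwise consistency score.

    img_a, img_b:       (3, H, W) tensors of a generated image pair.
    estimate_pointmap:  placeholder for a feed-forward two-view reconstructor
                        (e.g. DUSt3R-like); assumed to return, for every pixel
                        of view A, a 3D point expressed in view B's camera frame.
    extract_features:   placeholder feature backbone returning (C, H, W) maps.
    intrinsics_b:       (3, 3) camera intrinsics assumed for view B.
    """
    H, W = img_a.shape[1:]

    # 1) Dense two-view reconstruction: 3D point per pixel of A, in B's frame.
    points_a_in_b = estimate_pointmap(img_a, img_b)            # (H, W, 3)

    # 2) Project into view B and sample B there (backward warp B -> A).
    proj = points_a_in_b @ intrinsics_b.T
    uv = proj[..., :2] / proj[..., 2:3].clamp(min=1e-6)        # pixel coordinates in B
    grid = torch.stack([2 * uv[..., 0] / (W - 1) - 1,          # normalize to [-1, 1]
                        2 * uv[..., 1] / (H - 1) - 1], dim=-1)
    warped_b = F.grid_sample(img_b[None], grid[None], align_corners=True)[0]

    # 3) Compare feature maps rather than raw pixels, so the score is less
    #    sensitive to view-dependent appearance changes.
    feat_a = F.normalize(extract_features(img_a), dim=0)
    feat_w = F.normalize(extract_features(warped_b), dim=0)
    return (feat_a * feat_w).sum(0).mean()                     # mean cosine similarity
```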
- “Identifying Sex Differences in Lung Adenocarcinoma Using Multi-Omics Integrative Protein Signaling Networks,” 2025.
- “How to Probe: Simple Yet Effective Techniques for Improving Post-hoc Explanations,” 2025. [Online]. Available: https://www.arxiv.org/abs/2503.00641.
Abstract
Post-hoc importance attribution methods are a popular tool for "explaining"
Deep Neural Networks (DNNs) and are inherently based on the assumption that the
explanations can be applied independently of how the models were trained.
In contrast, in this work we bring forward empirical evidence that challenges
this very notion. Surprisingly, we discover a strong dependency on the training
details of a pre-trained model’s classification layer (less than 10 percent of
the model parameters) and demonstrate that they play a crucial role, much more
so than the pre-training scheme itself. This is of high practical relevance: (1)
as techniques for pre-training models are becoming increasingly diverse,
understanding the interplay between these techniques and attribution methods is
critical; (2) it sheds light on an important yet overlooked assumption of
post-hoc attribution methods, which can drastically impact model explanations
and how they are ultimately interpreted. Based on these findings, we also present
simple yet effective adjustments to the classification layers that can
significantly enhance the quality of model explanations. We validate our
findings across several visual pre-training frameworks (fully-supervised,
self-supervised, contrastive vision-language training) and analyse how they
impact explanations for a wide range of attribution methods on a diverse set of
evaluation metrics.
- “PersonaHOI: Effortlessly Improving Personalized Face with Human-Object Interaction Generation,” 2025. [Online]. Available: https://arxiv.org/abs/2501.05823.
Abstract
We introduce PersonaHOI, a training- and tuning-free framework that fuses a
general StableDiffusion model with a personalized face diffusion (PFD) model to
generate identity-consistent human-object interaction (HOI) images. While
existing PFD models have advanced significantly, they often overemphasize
facial features at the expense of full-body coherence. To address this, PersonaHOI introduces an
additional StableDiffusion (SD) branch guided by HOI-oriented text inputs. By
incorporating cross-attention constraints in the PFD branch and spatial merging
at both latent and residual levels, PersonaHOI preserves personalized facial
details while ensuring interactive non-facial regions. Experiments, validated
by a novel interaction alignment metric, demonstrate the superior realism and
scalability of PersonaHOI, establishing a new standard for practical
personalized face with HOI generation. Our code will be available at
github.com/JoyHuYY1412/PersonaHOI.
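To make the idea of "spatial merging at the latent level" concrete, here is a loose, hypothetical sketch of how two diffusion branches could be fused at each denoising step using a facial-region mask. The mask source, the merge rule, and the schematic sampling loop are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def merge_branch_latents(latent_sd, latent_pfd, face_mask):
    """Hypothetical latent-level spatial merge of two diffusion branches.

    latent_sd:  (B, C, h, w) latent from the general StableDiffusion (SD) branch,
                responsible for the overall human-object interaction.
    latent_pfd: (B, C, h, w) latent from the personalized face diffusion (PFD)
                branch, responsible for identity-specific facial detail.
    face_mask:  (B, 1, h, w) soft facial-region mask at latent resolution
                (how this mask is obtained is an assumption of this sketch,
                e.g. from a face detector).
    """
    # Keep PFD content where the face is, SD content everywhere else.
    return face_mask * latent_pfd + (1.0 - face_mask) * latent_sd


# Schematic usage: both branches denoise in parallel and their latents
# are re-merged after every sampling step (illustrative pseudostructure).
# for t in scheduler.timesteps:
#     latent_sd  = sd_branch.denoise_step(latent, t)
#     latent_pfd = pfd_branch.denoise_step(latent, t)
#     latent     = merge_branch_latents(latent_sd, latent_pfd, face_mask)
```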
- “Escaping Plato’s Cave: Robust Conceptual Reasoning through Interpretable 3D Neural Object Volumes,” 2025. [Online]. Available: https://arxiv.org/abs/2503.13429.
Abstract
With the rise of neural networks, especially in high-stakes applications,
these networks need two properties to ensure their safety: (i) robustness and
(ii) interpretability. Recent advances in classifiers with 3D volumetric object
representations have demonstrated greatly enhanced robustness on
out-of-distribution data. However, these 3D-aware classifiers have not been
studied from the perspective of interpretability. We introduce CAVE - Concept
Aware Volumes for Explanations - a new direction that unifies interpretability
and robustness in image classification. We design an inherently-interpretable
and robust classifier by extending existing 3D-aware classifiers with concepts
extracted from their volumetric representations for classification. In an array
of quantitative metrics for interpretability, we compare against different
concept-based approaches across the explainable AI literature and show that
CAVE discovers well-grounded concepts that are used consistently across images,
while achieving superior robustness.
- “DCBM: Data-Efficient Visual Concept Bottleneck Models,” 2025. [Online]. Available: https://arxiv.org/abs/2412.11576.
Abstract
Concept Bottleneck Models (CBMs) enhance the interpretability of neural
networks by basing predictions on human-understandable concepts. However,
current CBMs typically rely on concept sets extracted from large language
models or extensive image corpora, limiting their effectiveness in data-sparse
scenarios. We propose Data-efficient CBMs (DCBMs), which reduce the need for
large sample sizes during concept generation while preserving interpretability.
DCBMs define concepts as image regions detected by segmentation or detection
foundation models, allowing each image to generate multiple concepts across
different granularities. This removes reliance on textual descriptions and
large-scale pre-training, making DCBMs applicable for fine-grained
classification and out-of-distribution tasks. Attribution analysis using
Grad-CAM demonstrates that DCBMs deliver visual concepts that can be localized
in test images. By leveraging dataset-specific concepts instead of predefined
ones, DCBMs enhance adaptability to new domains.
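A condensed, hypothetical sketch of the kind of pipeline the DCBM abstract outlines: propose image regions with a segmentation or detection foundation model, embed them to form a concept bank, and train a linear head on concept similarities. The helper names (`propose_regions`, `embed`) and the simple clustering step are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

def build_concept_bank(images, propose_regions, embed, num_concepts=128):
    """Illustrative concept discovery: crop proposed regions, embed them, and
    cluster the embeddings; cluster centroids act as visual concepts."""
    crops = [crop for img in images for crop in propose_regions(img)]  # region proposals
    feats = torch.stack([embed(c) for c in crops])                     # (N, D) region embeddings
    # Simple k-means-style clustering as a stand-in for concept selection.
    centroids = feats[torch.randperm(len(feats))[:num_concepts]]
    for _ in range(10):
        assign = torch.cdist(feats, centroids).argmin(dim=1)
        for k in range(len(centroids)):
            members = feats[assign == k]
            if len(members) > 0:
                centroids[k] = members.mean(dim=0)
    return centroids                                                   # (K, D) concept bank


class DataEfficientCBM(nn.Module):
    """Concept bottleneck: image features -> concept similarities -> linear head."""

    def __init__(self, concept_bank, num_classes):
        super().__init__()
        self.register_buffer("concepts", nn.functional.normalize(concept_bank, dim=1))
        self.head = nn.Linear(concept_bank.shape[0], num_classes)

    def forward(self, image_features):                  # (B, D) global image embeddings
        feats = nn.functional.normalize(image_features, dim=1)
        concept_scores = feats @ self.concepts.T        # (B, K) bottleneck activations
        return self.head(concept_scores)                # interpretable linear decision
```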
- “B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability,” 2025. [Online]. Available: https://arxiv.org/abs/2502.12992.
Abstract
Post-hoc explanation methods for black-box models often struggle with
faithfulness and human interpretability due to the lack of explainability in
current neural models. Meanwhile, B-cos networks have been introduced to
improve model explainability through architectural and computational
adaptations, but their application has so far been limited to computer vision
models and their associated training pipelines. In this work, we introduce
B-cos LMs, i.e., B-cos networks empowered for NLP tasks. Our approach directly
transforms pre-trained language models into B-cos LMs by combining B-cos
conversion and task fine-tuning, improving efficiency compared to previous
B-cos methods. Our automatic and human evaluation results demonstrate that
B-cos LMs produce more faithful and human-interpretable explanations than
post-hoc methods, while maintaining task performance comparable to conventional
fine-tuning. Our in-depth analysis explores how B-cos LMs differ from
conventionally fine-tuned models in their learning processes and explanation
patterns. Finally, we provide practical guidelines for effectively building
B-cos LMs based on our findings. Our code is available at
anonymous.4open.science/r/bcos_lm.
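As background for the "B-cos conversion" mentioned above, the sketch below shows a B-cos-style linear layer, which scales each output by a power of the cosine similarity between input and weight, together with an illustrative drop-in replacement of a pre-trained linear layer. It follows the published B-cos formulation only in spirit; the conversion helper is an assumption for illustration, not the authors' procedure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BcosLinear(nn.Module):
    """Sketch of a B-cos linear layer (B-cos transform per output unit).

    out_j = |cos(x, w_j)|^(B-1) * (w_hat_j . x), with w_hat_j the unit-norm weight.
    For B = 1 this reduces to an ordinary (bias-free) linear layer.
    """

    def __init__(self, in_features, out_features, b=2.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.b = b

    def forward(self, x):                                    # x: (..., in_features)
        w_hat = F.normalize(self.weight, dim=1)              # unit-norm weight rows
        linear = x @ w_hat.T                                 # (..., out_features)
        x_norm = x.norm(dim=-1, keepdim=True).clamp(min=1e-6)
        cos = linear / x_norm                                # cosine similarity per unit
        return cos.abs().pow(self.b - 1) * linear            # dampen off-axis directions


def convert_linear_to_bcos(linear: nn.Linear, b=2.0) -> BcosLinear:
    """Illustrative 'conversion' of a pre-trained linear layer: reuse its weights
    and fine-tune afterwards (biases are dropped, as B-cos layers are bias-free)."""
    bcos = BcosLinear(linear.in_features, linear.out_features, b=b)
    with torch.no_grad():
        bcos.weight.copy_(linear.weight)
    return bcos
```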
- “Spatial Reasoning with Denoising Models,” 2025. [Online]. Available: https://www.arxiv.org/abs/2502.21075.
Abstract
We introduce Spatial Reasoning Models (SRMs), a framework to perform
reasoning over sets of continuous variables via denoising generative models.
SRMs infer continuous representations on a set of unobserved variables, given
observations on observed variables. Current generative models on spatial
domains, such as diffusion and flow matching models, often collapse to
hallucination in the case of complex distributions. To measure this, we introduce a
set of benchmark tasks that test the quality of complex reasoning in generative
models and can quantify hallucination. The SRM framework allows us to report
key findings about the importance of sequentialization in generation, the
associated generation order, and the sampling strategies used during training.
It demonstrates, for the first time, that the order of generation can
successfully be predicted by the denoising network itself. Using these
findings, we can increase the accuracy of specific reasoning tasks from 1% to
>50%.
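A minimal sketch of the general idea of inferring unobserved continuous variables with a denoising model while keeping observed variables fixed. The `denoiser` callable, the normalized time variable, and the simplified deterministic update are generic placeholders; this is not the SRM training or sampling code.

```python
import torch

@torch.no_grad()
def denoise_unobserved(denoiser, x_obs, observed_mask, num_steps=50):
    """Schematic conditional sampling: only unobserved entries are denoised.

    denoiser:      placeholder model predicting the clean signal x0 from (x_t, t).
    x_obs:         (B, N, D) variables; observed entries hold their known values.
    observed_mask: (B, N, 1) with 1 where a variable is observed, 0 otherwise.
    """
    x = torch.randn_like(x_obs)                            # unobserved entries start from noise
    x = observed_mask * x_obs + (1 - observed_mask) * x

    for step in range(num_steps, 0, -1):
        t = torch.full((x.shape[0],), step / num_steps, device=x.device)
        x0_pred = denoiser(x, t)                           # predict clean variables

        # Simplified deterministic update toward the prediction (DDIM-like stand-in).
        alpha = (step - 1) / num_steps
        x_next = alpha * x + (1 - alpha) * x0_pred

        # Clamp observed variables back to their known values after every step.
        x = observed_mask * x_obs + (1 - observed_mask) * x_next

    return x
```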