Publications - Current Year
2025
- “HMD2: Environment-aware Motion Generation from Single Egocentric Head-Mounted Device,” in 3DV 2025, 12th International Conference on 3D Vision, Singapore.
- “Unimotion: Unifying 3D Human Motion Synthesis and Understanding,” in 3DV 2025, 12th International Conference on 3D Vision, Singapore.
- “Spurfies: Sparse-view Surface Reconstruction using Local Geometry Priors,” in 3DV 2025, 12th International Conference on 3D Vision, Singapore.
- “InterTrack: Tracking Human Object Interaction without Object Templates,” in 3DV 2025, 12th International Conference on 3D Vision, Singapore.
- “FORCE: Dataset and Method for Intuitive Physics Guided Human-object Interaction,” in 3DV 2025, 12th International Conference on 3D Vision, Singapore.
- “Corrigendum to ‘A polyhedral study of lifted multicuts’ [Discrete Optim. 47 (2023) 100757],” Discrete Optimization, vol. 55, 2025.
- “TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters,” in Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore.
- “ContextGNN: Beyond Two-Tower Recommendation Systems,” in Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore.
- “MEt3R: Measuring Multi-View Consistency in Generated Images,” 2025.
Abstract
We introduce MEt3R, a metric for multi-view consistency in generated images.
Large-scale generative models for multi-view image generation are rapidly
advancing the field of 3D inference from sparse observations. However, due to
the nature of generative modeling, traditional reconstruction metrics are not
suitable to measure the quality of generated outputs and metrics that are
independent of the sampling procedure are desperately needed. In this work, we
specifically address the aspect of consistency between generated multi-view
images, which can be evaluated independently of the specific scene. Our
approach uses DUSt3R to obtain dense 3D reconstructions from image pairs in a
feed-forward manner, which are used to warp image contents from one view into
the other. Then, feature maps of these images are compared to obtain a
similarity score that is invariant to view-dependent effects. Using MEt3R, we
evaluate the consistency of a large set of previous methods for novel view and
video generation, including our open, multi-view latent diffusion model. - “PersonaHOI: Effortlessly Improving Personalized Face with Human-Object Interaction Generation,” 2025. [Online]. Available: https://arxiv.org/abs/2501.05823.mehr
- “PersonaHOI: Effortlessly Improving Personalized Face with Human-Object Interaction Generation,” 2025. [Online]. Available: https://arxiv.org/abs/2501.05823.
Abstract
We introduce PersonaHOI, a training- and tuning-free framework that fuses a
general StableDiffusion model with a personalized face diffusion (PFD) model to
generate identity-consistent human-object interaction (HOI) images. While
existing PFD models have advanced significantly, they often overemphasize
facial features at the expense of full-body coherence. PersonaHOI introduces an
additional StableDiffusion (SD) branch guided by HOI-oriented text inputs. By
incorporating cross-attention constraints in the PFD branch and spatial merging
at both latent and residual levels, PersonaHOI preserves personalized facial
details while keeping non-facial regions consistent with the intended
interaction. Experiments, validated
by a novel interaction alignment metric, demonstrate the superior realism and
scalability of PersonaHOI, establishing a new standard for practical
personalized face with HOI generation. Our code will be available at
github.com/JoyHuYY1412/PersonaHOI.
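The spatial-merging idea can be sketched in a few lines; the face mask source and the simple convex blend below are illustrative assumptions rather than the paper's exact latent- and residual-level scheme.

    # Illustrative latent-level spatial merging between the personalized-face
    # diffusion (PFD) branch and the HOI-guided StableDiffusion branch.
    # The face mask and the convex blend are assumptions for illustration.
    def merge_latents(latent_pfd, latent_sd, face_mask):
        """latent_pfd, latent_sd: (B, C, h, w) branch latents at one denoising step.
        face_mask: (B, 1, h, w) tensor, ~1 inside the face region, ~0 elsewhere."""
        # Keep identity-bearing facial details from the PFD branch and let the
        # HOI-guided branch generate the interacting, non-facial regions.
        return face_mask * latent_pfd + (1.0 - face_mask) * latent_sd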
- “DCBM: Data-Efficient Visual Concept Bottleneck Models,” 2025. [Online]. Available: https://arxiv.org/abs/2412.11576.
Abstract
Concept Bottleneck Models (CBMs) enhance the interpretability of neural
networks by basing predictions on human-understandable concepts. However,
current CBMs typically rely on concept sets extracted from large language
models or extensive image corpora, limiting their effectiveness in data-sparse
scenarios. We propose Data-efficient CBMs (DCBMs), which reduce the need for
large sample sizes during concept generation while preserving interpretability.
DCBMs define concepts as image regions detected by segmentation or detection
foundation models, allowing each image to generate multiple concepts across
different granularities. This removes reliance on textual descriptions and
large-scale pre-training, making DCBMs applicable for fine-grained
classification and out-of-distribution tasks. Attribution analysis using
Grad-CAM demonstrates that DCBMs deliver visual concepts that can be localized
in test images. By leveraging dataset-specific concepts instead of predefined
ones, DCBMs enhance adaptability to new domains.
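As a rough sketch of this recipe, the snippet below builds a concept bank from region crops and classifies through it; propose_regions and encode_crop stand in for an unspecified segmentation/detection foundation model and vision encoder, and the overall design is an assumption for illustration, not the authors' implementation.

    # Rough sketch of a data-efficient concept bottleneck: region crops from a
    # segmentation/detection foundation model serve as visual concepts, and a
    # linear layer over concept similarities gives the prediction.
    # propose_regions and encode_crop are hypothetical placeholders.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def build_concept_bank(train_images, propose_regions, encode_crop):
        # Each training image contributes several region crops (multiple
        # granularities), so few images already yield many visual concepts.
        crops = [c for img in train_images for c in propose_regions(img)]
        return torch.stack([encode_crop(c) for c in crops])  # (K, D)

    class RegionConceptBottleneck(nn.Module):
        def __init__(self, concept_bank, num_classes):
            super().__init__()
            self.concept_bank = nn.Parameter(concept_bank, requires_grad=False)  # (K, D)
            self.classifier = nn.Linear(concept_bank.shape[0], num_classes)

        def forward(self, image_embedding):  # image_embedding: (D,)
            # Concept scores: similarity of the image embedding to each region
            # concept -- this is the interpretable bottleneck.
            scores = F.cosine_similarity(image_embedding[None, :], self.concept_bank, dim=-1)
            # Linear classifier on concept scores yields the class prediction.
            return self.classifier(scores)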
- “B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability,” 2025. [Online]. Available: https://arxiv.org/abs/2502.12992.
Abstract
Post-hoc explanation methods for black-box models often struggle with
faithfulness and human interpretability due to the lack of explainability in
current neural models. Meanwhile, B-cos networks have been introduced to
improve model explainability through architectural and computational
adaptations, but their application has so far been limited to computer vision
models and their associated training pipelines. In this work, we introduce
B-cos LMs, i.e., B-cos networks empowered for NLP tasks. Our approach directly
transforms pre-trained language models into B-cos LMs by combining B-cos
conversion and task fine-tuning, improving efficiency compared to previous
B-cos methods. Our automatic and human evaluation results demonstrate that
B-cos LMs produce more faithful and human-interpretable explanations than
post-hoc methods, while maintaining task performance comparable to conventional
fine-tuning. Our in-depth analysis explores how B-cos LMs differ from
conventionally fine-tuned models in their learning processes and explanation
patterns. Finally, we provide practical guidelines for effectively building
B-cos LMs based on our findings. Our code is available at
anonymous.4open.science/r/bcos_lm.
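To illustrate what such a conversion involves, the sketch below implements a B-cos linear layer (output scaled by |cos(x, w)|^(B-1) with unit-norm weight rows) and initializes it from a pre-trained nn.Linear; bias handling, the choice of B, and other details are assumptions and may differ from the paper.

    # Sketch of a B-cos linear layer and of converting a pre-trained nn.Linear.
    # Exact parameterization (bias handling, value of B, extra scaling) is an
    # assumption here and may differ from the paper.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BcosLinear(nn.Module):
        """out_j = |cos(x, w_j)|**(B-1) * (w_hat_j . x), with unit-norm rows w_hat_j."""
        def __init__(self, in_features, out_features, b=2.0):
            super().__init__()
            self.weight = nn.Parameter(torch.empty(out_features, in_features))
            nn.init.xavier_uniform_(self.weight)
            self.b = b

        def forward(self, x):
            w_hat = F.normalize(self.weight, dim=-1)       # unit-norm weight rows
            linear = F.linear(x, w_hat)                    # w_hat . x
            cos = linear / x.norm(dim=-1, keepdim=True).clamp(min=1e-6)
            # Down-weight outputs whose inputs are poorly aligned with the weights;
            # this alignment pressure is what the explanations are built on.
            return cos.abs().pow(self.b - 1) * linear

        @classmethod
        def from_linear(cls, linear_layer, b=2.0):
            # "B-cos conversion": reuse the pre-trained weights (bias dropped),
            # then fine-tune the converted model on the downstream task.
            layer = cls(linear_layer.in_features, linear_layer.out_features, b=b)
            with torch.no_grad():
                layer.weight.copy_(linear_layer.weight)
            return layer

In a full conversion, every linear module of the pre-trained language model would be replaced in this fashion before task fine-tuning.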