Publications - Last Year
2024
- “CloSe: A 3D Clothing Segmentation Dataset and Model,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
- “Interaction Replica: Tracking Human–Object Interaction and Scene Changes From Human Motion,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
- “GAN-Avatar: Controllable Personalized GAN-based Human Head Avatar,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
- “Generating Continual Human Motion in Diverse 3D Scenes,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
- “B-cosification: Transforming Deep Neural Networks to be Inherently Interpretable,” in Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Vancouver, Canada, 2024.
Abstract
B-cos Networks have been shown to be effective for obtaining highly
human-interpretable explanations of model decisions by architecturally enforcing
stronger alignment between inputs and weights. B-cos variants of convolutional
networks (CNNs) and vision transformers (ViTs), which primarily replace linear
layers with B-cos transformations, perform competitively to their respective
standard variants while also yielding explanations that are faithful by design.
However, it has so far been necessary to train these models from scratch, which
is increasingly infeasible in the era of large, pre-trained foundation models.
In this work, inspired by the architectural similarities in standard DNNs and
B-cos networks, we propose 'B-cosification', a novel approach to transform
existing pre-trained models to become inherently interpretable. We perform a
thorough study of design choices to perform this conversion, both for
convolutional neural networks and vision transformers. We find that
B-cosification can yield models that are on par with B-cos models trained from
scratch in terms of interpretability, while often outperforming them in terms
of classification performance at a fraction of the training cost. Subsequently,
we apply B-cosification to a pretrained CLIP model, and show that, even with
limited data and compute cost, we obtain a B-cosified version that is highly
interpretable and competitive on zero-shot performance across a variety of
datasets. We release our code and pre-trained model weights at
github.com/shrebox/B-cosification.
- “Pruning Neural Network Models for Gene Regulatory Dynamics Using Data and Domain Knowledge,” in Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Vancouver, Canada, 2024.
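For context on the mechanism described in the B-cosification abstract above: B-cos networks replace linear layers with transformations that scale each linear response by the alignment between input and weight vector. The sketch below is my own minimal NumPy illustration of that idea (function name and simplifications are mine; the published models add normalization and architectural details omitted here):

```python
import numpy as np

def bcos_transform(x, W, B=2.0, eps=1e-8):
    """Minimal B-cos transform: scale each linear response <w, x> by the
    alignment |cos(x, w)|^(B-1) between input x and weight row w.
    For B=1 this reduces to an ordinary linear layer."""
    W = W / (np.linalg.norm(W, axis=1, keepdims=True) + eps)  # unit-norm rows
    lin = W @ x                                               # <w, x> per unit
    cos = lin / (np.linalg.norm(x) + eps)                     # cos(x, w), since ||w|| = 1
    return np.abs(cos) ** (B - 1.0) * lin
```

With B > 1 the factor |cos|^(B-1) suppresses responses of poorly aligned weight vectors, which is what makes the resulting input-weight contributions interpretable by design.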
- “Recent Trends in 3D Reconstruction of General Non-Rigid Scenes,” Computer Graphics Forum (Proc. EUROGRAPHICS 2024), vol. 43, no. 2, 2024.
- “Improving Feature Stability during Upsampling - Spectral Artifacts and the Importance of Spatial Context,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “Good Teachers Explain: Explanation-Enhanced Knowledge Distillation,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Appearance Graphs,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “HowToCaption: Prompting LLMs to Transform Video Annotations at Scale,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “GiT: Towards Generalist Vision Transformer through Universal Language Interface,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “LatentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “Improving 2D Feature Representations by 3D-Aware Fine-Tuning,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “Sp2360: Sparse-view 360° Scene Reconstruction using Cascaded 2D Diffusion Priors,” in ECCV 2024 Workshop on Wild 3D (ECCV 2024 Wild3D), Milan, Italy, 2024.
- “Domain-Aware Fine-Tuning of Foundation Models,” in ICML 2024 Workshop on Foundation Models in the Wild (ICML 2024 FM-Wild Workshop), Vienna, Austria, 2024.
- “OrCo: Towards Better Generalization via Orthogonality and Contrast for Few-Shot Class-Incremental Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “NRDF: Neural Riemannian Distance Fields for Learning Articulated Pose Priors,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Training Vision Transformers for Semi-Supervised Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Open-Vocabulary 3D Semantic Segmentation with Foundation Models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Neural Point Cloud Diffusion for Disentangled 3D Shape and Appearance Generation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Point Transformer V3: Simpler, Faster, Stronger,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “GEARS: Local Geometry-aware Hand-object Interaction Synthesis,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Task Driven Sensor Layouts - Joint Optimization of Pixel Layout and Network Parameters,” in IEEE International Conference on Computational Photography (ICCP 2024), Lausanne, Switzerland, 2024.
- “Automated Dominative Subspace Mining for Efficient Neural Architecture Search,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 10, 2024.
- “Enhanced Long-Tailed Recognition With Contrastive CutMix Augmentation,” IEEE Transactions on Image Processing, vol. 33, 2024.
- “Better Understanding Differences in Attribution Methods via Systematic Evaluations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 6, 2024.
- “MTR++: Multi-Agent Motion Prediction With Symmetric Scene Modeling and Guided Intention Querying,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 5, 2024.
- “Intra- & Extra-Source Exemplar-Based Style Synthesis for Improved Domain Generalization,” International Journal of Computer Vision, vol. 132, 2024.
- “An Evaluation of Zero-Cost Proxies - From Neural Architecture Performance Prediction to Model Robustness,” International Journal of Computer Vision, 2024.
- “How Do Training Methods Influence the Utilization of Vision Models?,” in Interpretable AI: Past, Present and Future (IAI Workshop @ NeurIPS 2024), Vancouver, Canada, 2024.
- “Toward a Diffusion-Based Generalist for Dense Vision Tasks,” in MMFM2, The 2nd Workshop on What is Next in Multimodal Foundation Models?, Seattle, WA, USA, 2024.
- “CosPGD: An Efficient White-Box Adversarial Attack for Pixel-Wise Prediction Tasks,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- “Adaptive Hierarchical Certification for Segmentation using Randomized Smoothing,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- “Implicit Representations for Constrained Image Segmentation,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- “MultiMax: Sparse and Multi-Modal Attention Learning,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- “Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive,” in The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 2024.
- “On Adversarial Training without Perturbing all Examples,” in The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 2024.
- “Learning the essential in less than 2k additional weights - a simple approach to improve image classification stability under corruptions,” Transactions on Machine Learning Research, vol. 2024, no. 6, 2024.
- “As large as it gets - Studying Infinitely Large Convolutions via Neural Implicit Frequency Filters,” Transactions on Machine Learning Research, vol. 2024, 2024.
- “Wakening Past Concepts without Past Data: Class-Incremental Learning from Online Placebos,” in WACV 2024, IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2024.
- “Efficient and Differentiable Combinatorial Optimization for Visual Computing,” Universität des Saarlandes, Saarbrücken, 2024.
- “Beware of Aliases -- Signal Preservation is Crucial for Robust Image Restoration,” 2024. [Online]. Available: https://arxiv.org/abs/2406.07435.
Abstract
Image restoration networks usually comprise an encoder and a decoder,
responsible for aggregating image content from noisy, distorted data and for
restoring clean, undistorted images, respectively. Data aggregation as well as
high-resolution image generation both usually come at the risk of involving
aliases, i.e., standard architectures put their ability to reconstruct the model
input in jeopardy to reach high PSNR values on validation data. The price to be
paid is low model robustness. In this work, we show that simply providing
alias-free paths in state-of-the-art reconstruction transformers supports
improved model robustness at low costs on the restoration performance. We do so
by proposing BOA-Restormer, a transformer-based image restoration model that
executes downsampling and upsampling operations partly in the frequency domain
to ensure alias-free paths along the entire model while potentially preserving
all relevant high-frequency information.
- “Towards Designing Inherently Interpretable Deep Neural Networks for Image Classification,” Universität des Saarlandes, Saarbrücken, 2024.
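The BOA-Restormer abstract above executes resampling partly in the frequency domain to keep paths alias-free. As a hedged 1-D illustration of why spectral resampling avoids aliasing (my own sketch, not the paper's code; even-length Nyquist handling is simplified):

```python
import numpy as np

def fft_downsample(x, factor=2):
    """Alias-free 1-D downsampling by spectral cropping: keep only the low
    frequencies the smaller grid can represent (an ideal low-pass), then
    invert. No frequency folds back onto another, so no aliasing occurs."""
    n = len(x)
    m = n // factor
    X = np.fft.fftshift(np.fft.fft(x))     # spectrum with DC in the center
    lo = (n - m) // 2
    Xc = X[lo:lo + m]                      # centered crop of the spectrum
    return np.fft.ifft(np.fft.ifftshift(Xc)).real / factor
```

Naive strided subsampling would instead fold high frequencies back into the kept band; cropping the spectrum discards them outright, which is the property the alias-free paths rely on.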
- “Scribbles for All: Benchmarking Scribble Supervised Segmentation Across Datasets,” 2024. [Online]. Available: https://arxiv.org/abs/2408.12489.
Abstract
In this work, we introduce Scribbles for All, a label and training data
generation algorithm for semantic segmentation trained on scribble labels.
Training or fine-tuning semantic segmentation models with weak supervision has
become an important topic recently and was subject to significant advances in
model quality. In this setting, scribbles are a promising label type to achieve
high quality segmentation results while requiring a much lower annotation
effort than usual pixel-wise dense semantic segmentation annotations. The main
limitation of scribbles as a source for weak supervision is the lack of
challenging datasets for scribble segmentation, which hinders the development
of novel methods and conclusive evaluations. To overcome this limitation,
Scribbles for All provides scribble labels for several popular segmentation
datasets and provides an algorithm to automatically generate scribble labels
for any dataset with dense annotations, paving the way for new insights and
model advancements in the field of weakly supervised segmentation. In addition
to providing the datasets and algorithm, we evaluate state-of-the-art segmentation
models on our datasets and show that models trained with our synthetic labels
perform competitively with respect to models trained on manual labels. Thus,
our datasets enable state-of-the-art research into methods for scribble-labeled
semantic segmentation. The datasets, scribble generation algorithm, and
baselines are publicly available at github.com/wbkit/Scribbles4All.
- “Sailing in High-dimensional Spaces: Low-dimensional Embeddings through Angle Preservation,” 2024. [Online]. Available: https://arxiv.org/abs/2406.09876.
Abstract
Low-dimensional embeddings (LDEs) of high-dimensional data are ubiquitous in
science and engineering. They allow us to quickly understand the main
properties of the data, identify outliers and processing errors, and inform the
next steps of data analysis. As such, LDEs have to be faithful to the original
high-dimensional data, i.e., they should represent the relationships that are
encoded in the data, both at a local as well as a global scale. The current
generation of LDE approaches focuses on reconstructing local distances between
any pair of samples correctly, often outperforming traditional approaches
aiming at all distances. For these approaches, however, global relationships are
usually strongly distorted, which is often argued to be an inherent trade-off
between local and global structure learning for embeddings. We suggest a new
perspective on LDE learning, reconstructing angles between data points. We show
that this approach, Mercat, yields good reconstruction across a diverse set of
experiments and metrics, and preserves structures well across all scales.
Compared to existing work, our approach also has a simple formulation,
facilitating future theoretical analysis and algorithmic improvements.
- “Are Vision Language Models Texture or Shape Biased and Can We Steer Them?,” 2024. [Online]. Available: https://arxiv.org/abs/2403.09193.
Abstract
Vision language models (VLMs) have drastically changed the computer vision
model landscape in only a few years, opening an exciting array of new
applications, from zero-shot image classification to image captioning and
visual question answering. Unlike pure vision models, they offer an intuitive
way to access visual content through language prompting. The wide applicability
of such models encourages us to ask whether they also align with human vision -
specifically, how far they adopt human-induced visual biases through multimodal
fusion, or whether they simply inherit biases from pure vision models. One
important visual bias is the texture vs. shape bias, or the dominance of local
over global information. In this paper, we study this bias in a wide range of
popular VLMs. Interestingly, we find that VLMs are often more shape-biased than
their vision encoders, indicating that visual biases are modulated to some
extent through text in multimodal models. If text does indeed influence visual
biases, this suggests that we may be able to steer visual biases not just
through visual input but also through language: a hypothesis that we confirm
through extensive experiments. For instance, we are able to steer shape bias
from as low as 49% to as high as 72% through prompting alone. For now, the
strong human bias towards shape (96%) remains out of reach for all tested VLMs.
- “Advancing Image and Video Recognition with Less Supervision,” Universität des Saarlandes, Saarbrücken, 2024.
Abstract
Deep learning is increasingly relevant in our daily lives, as it simplifies tedious tasks and enhances quality of life across various domains such as entertainment, learning, automatic assistance, and autonomous driving. However, the demand for more data to train models for emerging tasks is increasing dramatically. Deep learning models heavily depend on the quality and quantity of data, necessitating high-quality labeled datasets. Yet, each task requires different types of annotations for training and evaluation, posing challenges in obtaining comprehensive supervision. The acquisition of annotations is not only resource-intensive in terms of time and cost but also introduces biases, such as granularity in classification, where distinctions like specific breeds versus generic categories may arise. Furthermore, the dynamic nature of the world means that previously annotated data can become irrelevant, and new categories and rare occurrences continually emerge, making it impossible to label every aspect of the world.
Therefore, this thesis aims to explore various supervision scenarios to mitigate the need for full supervision and reduce data acquisition costs. Specifically, we investigate learning without labels, referred to as self-supervised and unsupervised methods, to better understand video and image representations. To learn from data without labels, we leverage injected priors such as motion speed, direction, action order in videos, or semantic information granularity to obtain powerful data representations. Further, we study scenarios involving reduced supervision levels. To reduce annotation costs, first, we propose to omit precise annotations for one modality in multimodal learning, namely in text-video and image-video settings, and transfer available knowledge to large corpora of video data. Second, we study semi-supervised learning scenarios, where only a subset of annotated data alongside unlabeled data is available, and propose to revisit regularization constraints and improve generalization to unlabeled data. Additionally, we address scenarios where parts of the available data are inherently limited due to privacy and security reasons or naturally rare events, which not only restrict annotations but also limit the overall data volume. For these scenarios, we propose methods that carefully balance between previously obtained knowledge and incoming limited data by introducing a calibration method or combining a space reservation technique with orthogonality constraints. Finally, we explore multimodal and unimodal open-world scenarios where the model is asked to generalize beyond the given set of object or action classes. Specifically, we propose a new challenging setting on multimodal egocentric videos and an adaptation method for vision-language models to generalize to the egocentric domain.
Moreover, we study unimodal image recognition in an open-set setting and propose to disentangle the open-set detection and image classification tasks, which effectively improves generalization across different settings.
In summary, this thesis investigates challenges arising when full supervision for training models is not available. We develop methods to understand learning dynamics and the role of biases in data, while also proposing novel setups to advance training with less supervision.
- “Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports,” 2024. [Online]. Available: https://arxiv.org/abs/2401.01505.
Abstract
Reasoning over sports videos for question answering is an important task with
numerous applications, such as player training and information retrieval.
However, this task has not been explored due to the lack of relevant datasets
and the challenging nature it presents. Most datasets for video question
answering (VideoQA) focus mainly on general and coarse-grained understanding of
daily-life videos, which is not applicable to sports scenarios requiring
professional action understanding and fine-grained motion analysis. In this
paper, we introduce the first dataset, named Sports-QA, specifically designed
for the sports VideoQA task. The Sports-QA dataset includes various types of
questions, such as descriptions, chronologies, causalities, and counterfactual
conditions, covering multiple sports. Furthermore, to address the
characteristics of the sports VideoQA task, we propose a new Auto-Focus
Transformer (AFT) capable of automatically focusing on particular scales of
temporal information for question answering. We conduct extensive experiments
on Sports-QA, including baseline studies and the evaluation of different
methods. The results demonstrate that our AFT achieves state-of-the-art
performance.
- “VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis,” 2024. [Online]. Available: https://arxiv.org/abs/2403.13501.
Abstract
Despite tremendous progress in the field of text-to-video (T2V) synthesis,
open-sourced T2V diffusion models struggle to generate longer videos with
dynamically varying and evolving content. They tend to synthesize quasi-static
videos, ignoring the necessary visual change-over-time implied in the text
prompt. At the same time, scaling these models to enable longer, more dynamic
video synthesis often remains computationally intractable. To address this
challenge, we introduce the concept of Generative Temporal Nursing (GTN), where
we aim to alter the generative process on the fly during inference to improve
control over the temporal dynamics and enable generation of longer videos. We
propose a method for GTN, dubbed VSTAR, which consists of two key ingredients:
1) Video Synopsis Prompting (VSP) - automatic generation of a video synopsis
based on the original single prompt leveraging LLMs, which gives accurate
textual guidance to different visual states of longer videos, and 2) Temporal
Attention Regularization (TAR) - a regularization technique to refine the
temporal attention units of the pre-trained T2V diffusion models, which enables
control over the video dynamics. We experimentally showcase the superiority of
the proposed approach in generating longer, visually appealing videos over
existing open-sourced T2V models. We additionally analyze the temporal
attention maps realized with and without VSTAR, demonstrating the importance of
applying our method to mitigate neglect of the desired visual change over time.
- “FAIR-TAT: Improving Model Fairness Using Targeted Adversarial Training,” 2024. [Online]. Available: https://arxiv.org/abs/2410.23142.
Abstract
Deep neural networks are susceptible to adversarial attacks and common
corruptions, which undermine their robustness. In order to enhance model
resilience against such challenges, Adversarial Training (AT) has emerged as a
prominent solution. Nevertheless, adversarial robustness is often attained at
the expense of model fairness during AT, i.e., disparity in class-wise
robustness of the model. While distinctive classes become more robust towards
such adversaries, hard-to-detect classes suffer. Recently, research has focused
on improving model fairness specifically for perturbed images, overlooking the
accuracy of the most likely non-perturbed data. Additionally, despite their
robustness against the adversaries encountered during model training,
state-of-the-art adversarial trained models have difficulty maintaining
robustness and fairness when confronted with diverse adversarial threats or
common corruptions. In this work, we address the above concerns by introducing
a novel approach called Fair Targeted Adversarial Training (FAIR-TAT). We show
that using targeted adversarial attacks for adversarial training (instead of
untargeted attacks) can allow for more favorable trade-offs with respect to
adversarial fairness. Empirical results validate the efficacy of our approach. - “Samba: Synchronized Set-of-Sequences Modeling for Multiple Object Tracking,” 2024. [Online]. Available: https://arxiv.org/abs/2410.01806.more
Abstract
Multiple object tracking in complex scenarios - such as coordinated dance
performances, team sports, or dynamic animal groups - presents unique
challenges. In these settings, objects frequently move in coordinated patterns,
occlude each other, and exhibit long-term dependencies in their trajectories.
However, it remains a key open research question how to model long-range
dependencies within tracklets, interdependencies among tracklets, and the
associated temporal occlusions. To this end, we introduce Samba, a novel
linear-time set-of-sequences model designed to jointly process multiple
tracklets by synchronizing the multiple selective state-spaces used to model
each tracklet. Samba autoregressively predicts the future track query for each
sequence while maintaining synchronized long-term memory representations across
tracklets. By integrating Samba into a tracking-by-propagation framework, we
propose SambaMOTR, the first tracker effectively addressing the aforementioned
issues, including long-range dependencies, tracklet interdependencies, and
temporal occlusions. Additionally, we introduce an effective technique for
dealing with uncertain observations (MaskObs) and an efficient training recipe
to scale SambaMOTR to longer sequences. By modeling long-range dependencies and
interactions among tracked objects, SambaMOTR implicitly learns to track
objects accurately through occlusions without any hand-crafted heuristics. Our
approach significantly surpasses prior state-of-the-art on the DanceTrack, BFT,
and SportsMOT datasets.
- “TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters,” 2024. [Online]. Available: https://arxiv.org/abs/2410.23168.
Abstract
Transformers have become the predominant architecture in foundation models
due to their excellent performance across various domains. However, the
substantial cost of scaling these models remains a significant concern. This
problem arises primarily from their dependence on a fixed number of parameters
within linear projections. When architectural modifications (e.g., channel
dimensions) are introduced, the entire model typically requires retraining from
scratch. As model sizes continue growing, this strategy results in increasingly
high computational costs and becomes unsustainable. To overcome this problem,
we introduce TokenFormer, a natively scalable architecture that leverages the
attention mechanism not only for computations among input tokens but also for
interactions between tokens and model parameters, thereby enhancing
architectural flexibility. By treating model parameters as tokens, we replace
all the linear projections in Transformers with our token-parameter attention
layer, where input tokens act as queries and model parameters as keys and
values. This reformulation allows for progressive and efficient scaling without
necessitating retraining from scratch. Our model scales from 124M to 1.4B
parameters by incrementally adding new key-value parameter pairs, achieving
performance comparable to Transformers trained from scratch while greatly
reducing training costs. Code and models are available at
https://github.com/Haiyang-W/TokenFormer.
- “FaceGPT: Self-supervised Learning to Chat about 3D Human Faces,” 2024. [Online]. Available: https://arxiv.org/abs/2406.07163.
Abstract
We introduce FaceGPT, a self-supervised learning framework for Large
Vision-Language Models (VLMs) to reason about 3D human faces from images and
text. Typical 3D face reconstruction methods are specialized algorithms that
lack semantic reasoning capabilities. FaceGPT overcomes this limitation by
embedding the parameters of a 3D morphable face model (3DMM) into the token
space of a VLM, enabling the generation of 3D faces from both textual and
visual inputs. FaceGPT is trained in a self-supervised manner as a model-based
autoencoder from in-the-wild images. In particular, the hidden state of the LLM
is projected into 3DMM parameters and subsequently rendered as a 2D face image
to guide the self-supervised learning process via image-based reconstruction.
Without relying on expensive 3D annotations of human faces, FaceGPT obtains a
detailed understanding of 3D human faces, while preserving the capacity to
understand general user instructions. Our experiments demonstrate that FaceGPT
not only achieves high-quality 3D face reconstructions but also retains the
ability for general-purpose visual instruction following. Furthermore, FaceGPT
learns fully self-supervised to generate 3D faces based on complex textual
inputs, which opens a new direction in human face analysis.
- “Number it: Temporal Grounding Videos like Flipping Manga,” 2024. [Online]. Available: https://arxiv.org/abs/2411.10332.
Abstract
Video Large Language Models (Vid-LLMs) have made remarkable advancements in
comprehending video content for QA dialogue. However, they struggle to extend
this visual understanding to tasks requiring precise temporal localization,
known as Video Temporal Grounding (VTG). To address this gap, we introduce
Number-Prompt (NumPro), a novel method that empowers Vid-LLMs to bridge visual
comprehension with temporal grounding by adding unique numerical identifiers to
each video frame. Treating a video as a sequence of numbered frame images,
NumPro transforms VTG into an intuitive process: flipping through manga panels
in sequence. This allows Vid-LLMs to "read" event timelines, accurately linking
visual content with corresponding temporal information. Our experiments
demonstrate that NumPro significantly boosts VTG performance of top-tier
Vid-LLMs without additional computational cost. Furthermore, fine-tuning on a
NumPro-enhanced dataset defines a new state-of-the-art for VTG, surpassing
previous top-performing methods by up to 6.9% in mIoU for moment retrieval and
8.5% in mAP for highlight detection. The code will be available at
github.com/yongliang-wu/NumPro.
- “Balancing Diversity and Risk in LLM Sampling: How to Select Your Method and Parameter for Open-Ended Text Generation,” 2024. [Online]. Available: https://arxiv.org/abs/2408.13586.
Abstract
Sampling-based decoding strategies have been widely adopted for Large
Language Models (LLMs) in numerous applications, which target a balance between
diversity and quality via temperature tuning and tail truncation (e.g., top-k
and top-p sampling). Considering the high dynamic range of the candidate
next-token distribution given different prefixes, recent studies propose to adaptively
truncate the tail of the LLM's predicted distribution. Although improved results
have been reported with these methods on open-ended text generation tasks, the
results are highly dependent on the curated truncation parameters and exemplar
text. In this paper, we propose a systematic way to estimate the intrinsic
capacity of a truncation sampling method by considering the trade-off between
diversity and risk at each decoding step, based on our collected prefix tree
which preserves the context of a full sentence. Our work provides a
comprehensive comparison of existing truncation sampling methods, as well as
their recommended parameters, as a guideline for users.
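For reference, the temperature, top-k, and top-p mechanisms this abstract compares follow a standard recipe; the sketch below is a generic NumPy illustration (function name and defaults are mine), not the estimation method proposed in the paper:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Temperature scaling followed by top-k and top-p (nucleus) truncation.

    logits: 1-D array of unnormalized next-token scores.
    top_k=0 disables top-k; top_p=1.0 disables nucleus truncation.
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())            # stable softmax
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]                  # tokens by probability, descending
    sorted_probs = probs[order]

    keep = np.ones_like(sorted_probs, dtype=bool)
    if top_k > 0:
        keep[top_k:] = False                         # keep only the k most likely tokens
    if top_p < 1.0:
        cum = np.cumsum(sorted_probs)
        # keep the smallest prefix whose cumulative mass reaches top_p
        cutoff = int(np.searchsorted(cum, top_p)) + 1
        keep[cutoff:] = False

    truncated = np.where(keep, sorted_probs, 0.0)
    truncated /= truncated.sum()                     # renormalize surviving mass
    return int(order[rng.choice(len(order), p=truncated)])
```

Lowering the temperature or tightening top-k/top-p reduces risk at the cost of diversity, which is exactly the per-step trade-off the paper proposes to quantify.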