Publications - The Year Before Last
2023
- “Implicit Representations for Image Segmentation,” in UniReps: The First Workshop on Unifying Representations in Neural Models, New Orleans, LA, USA, 2022.
2022
- “Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval,” in 33rd British Machine Vision Conference (BMVC 2022), London, UK, 2022.
- “Distilling Knowledge from Self-Supervised Teacher by Embedding Graph Alignment,” in 33rd British Machine Vision Conference (BMVC 2022), London, UK, 2022.
- “SP-ViT: Learning 2D Spatial Priors for Vision Transformers,” in 33rd British Machine Vision Conference (BMVC 2022), London, UK, 2022.
- “Relational Proxies: Emergent Relationships as Fine-Grained Discriminators,” in Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- “Robust Models are less Over-Confident,” in Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- “Trading off Image Quality for Robustness is not Necessary with Regularized Deterministic Autoencoders,” in Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- “Motion Transformer with Global Intention Localization and Local Movement Refinement,” in Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- “CAGroup3D: Class-Aware Grouping for 3D Object Detection on Point Clouds,” in Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- “USB: A Unified Semi-supervised Learning Benchmark for Classification,” in Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- “Towards Efficient 3D Object Detection with Knowledge Distillation,” in Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- “Abstracting Sketches Through Simple Primitives,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “MPPNet: Multi-frame Feature Intertwining with Proxy Points for 3D Temporal Object Detection,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “Box2Mask: Weakly Supervised 3D Semantic Instance Segmentation using Bounding Boxes,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “Learned Vertex Descent: A New Direction for 3D Human Model Fitting,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “DODA: Data-Oriented Sim-to-Real Domain Adaptation for 3D Semantic Segmentation,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “TACS: Taxonomy Adaptive Cross-Domain Semantic Segmentation,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “Class-Agnostic Object Counting Robust to Intraclass Diversity,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “FrequencyLowCut Pooling - Plug & Play against Catastrophic Overfitting,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “Improving Robustness by Enhancing Weak Subnets,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “A Comparative Study of Graph Matching Algorithms in Computer Vision,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “HRDA: Context-Aware High-Resolution Domain-Adaptive Semantic Segmentation,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “Skeleton-Free Pose Transfer for Stylized 3D Characters,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “CycDA: Unsupervised Cycle Domain Adaptation to Learn from Image to Video,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “Learning Where To Look - Generative NAS is Surprisingly Efficient,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “Temporal and Cross-modal Attention for Audio-Visual Zero-Shot Learning,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “HULC: 3D HUman Motion Capture with Pose Manifold SampLing and Dense Contact Guidance,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “Pose-NDF: Modeling Human Pose Manifolds with Neural Distance Fields,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “CHORE: Contact, Human and Object Reconstruction from a Single RGB Image,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “COUCH: Towards Controllable Human-Chair Interactions,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “TOCH: Spatio-Temporal Object Correspondence to Hand for Motion Refinement,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “Advancing Translational Research in Neuroscience through Multi-task Learning,” Frontiers in Psychiatry, vol. 13, 2022.
- “Semantic Image Synthesis with Semantically Coupled VQ-Model,” in ICLR Workshop on Deep Generative Models for Highly Structured Data (ICLR 2022 DGM4HSD), Virtual, 2022.
- “RAMA: A Rapid Multicut Algorithm on GPU,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “FastDOG: Fast Discrete Optimization on GPU,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “BEHAVE: Dataset and Method for Tracking Human Object Interactions,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “B-cos Networks: Alignment is All We Need for Interpretability,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “Pix2NeRF: Unsupervised Conditional Pi-GAN for Single Image to Neural Radiance Fields Translation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “Decoupling Zero-Shot Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.more
Abstract
Zero-shot semantic segmentation (ZS3) aims to segment the novel categories
that have not been seen in the training. Existing works formulate ZS3 as a
pixel-level zero-shot classification problem, and transfer semantic knowledge
from seen classes to unseen ones with the help of language models pre-trained
only with texts. While simple, the pixel-level ZS3 formulation shows the
limited capability to integrate vision-language models that are often
pre-trained with image-text pairs and currently demonstrate great potential for
vision tasks. Inspired by the observation that humans often perform
segment-level semantic labeling, we propose to decouple the ZS3 into two
sub-tasks: 1) a class-agnostic grouping task to group the pixels into segments.
2) a zero-shot classification task on segments. The former sub-task does not
involve category information and can be directly transferred to group pixels
for unseen classes. The latter subtask performs at segment-level and provides a
natural way to leverage large-scale vision-language models pre-trained with
image-text pairs (e.g. CLIP) for ZS3. Based on the decoupling formulation, we
propose a simple and effective zero-shot semantic segmentation model, called
ZegFormer, which outperforms the previous methods on ZS3 standard benchmarks by
large margins, e.g., 35 points on the PASCAL VOC and 3 points on the COCO-Stuff
in terms of mIoU for unseen classes. Code will be released at
github.com/dingjiansw101/ZegFormer. - “PoseTrack21: A Dataset for Person Search, Multi-Object Tracking and Multi-Person Pose Tracking,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “CoSSL: Co-Learning of Representation and Classifier for Imbalanced Semi-Supervised Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.more
Abstract
In this paper, we propose a novel co-learning framework (CoSSL) with
decoupled representation learning and classifier learning for imbalanced SSL.
To handle the data imbalance, we devise Tail-class Feature Enhancement (TFE)
for classifier learning. Furthermore, the current evaluation protocol for
imbalanced SSL focuses only on balanced test sets, which has limited
practicality in real-world scenarios. Therefore, we further conduct a
comprehensive evaluation under various shifted test distributions. In
experiments, we show that our approach outperforms other methods over a large
range of shifted distributions, achieving state-of-the-art performance on
benchmark datasets ranging from CIFAR-10, CIFAR-100, ImageNet, to Food-101. Our
code will be made publicly available. - “Bi-level Alignment for Cross-Domain Crowd Counting,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “LiDAR Snowfall Simulation for Robust 3D Object Detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.more
Abstract
As acquiring pixel-wise annotations of real-world images for semantic
segmentation is a costly process, a model can instead be trained with more
accessible synthetic data and adapted to real images without requiring their
annotations. This process is studied in unsupervised domain adaptation (UDA).
Even though a large number of methods propose new adaptation strategies, they
are mostly based on outdated network architectures. As the influence of recent
network architectures has not been systematically studied, we first benchmark
different network architectures for UDA and then propose a novel UDA method,
DAFormer, based on the benchmark results. The DAFormer network consists of a
Transformer encoder and a multi-level context-aware feature fusion decoder. It
is enabled by three simple but crucial training strategies to stabilize the
training and to avoid overfitting DAFormer to the source domain: While the Rare
Class Sampling on the source domain improves the quality of pseudo-labels by
mitigating the confirmation bias of self-training towards common classes, the
Thing-Class ImageNet Feature Distance and a learning rate warmup promote
feature transfer from ImageNet pretraining. DAFormer significantly improves the
state-of-the-art performance by 10.8 mIoU for GTA->Cityscapes and 5.4 mIoU for
Synthia->Cityscapes and enables learning even difficult classes such as train,
bus, and truck well. The implementation is available at
github.com/lhoyer/DAFormer. - “Large Loss Matters in Weakly Supervised Multi-Label Classification,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “Stratified Transformer for 3D Point Cloud Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “Both Style and Fog Matter: Cumulative Domain Adaptation for Semantic Foggy Scene Understanding,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.more
Abstract
Although considerable progress has been made in semantic scene understanding
under clear weather, it is still a tough problem under adverse weather
conditions, such as dense fog, due to the uncertainty caused by imperfect
observations. Besides, difficulties in collecting and labeling foggy images
hinder the progress of this field. Considering the success in semantic scene
understanding under clear weather, we think it is reasonable to transfer
knowledge learned from clear images to the foggy domain. As such, the problem
becomes to bridge the domain gap between clear images and foggy images. Unlike
previous methods that mainly focus on closing the domain gap caused by fog --
defogging the foggy images or fogging the clear images, we propose to alleviate
the domain gap by considering fog influence and style variation simultaneously.
The motivation is based on our finding that the style-related gap and the
fog-related gap can be divided and closed respectively, by adding an
intermediate domain. Thus, we propose a new pipeline to cumulatively adapt
style, fog and the dual-factor (style and fog). Specifically, we devise a
unified framework to disentangle the style factor and the fog factor
separately, and then the dual-factor from images in different domains.
Furthermore, we collaborate the disentanglement of three factors with a novel
cumulative loss to thoroughly disentangle these three factors. Our method
achieves the state-of-the-art performance on three benchmarks and shows
generalization ability in rainy and snowy scenes. - “Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “LMGP: Lifted Multicut Meets Geometry Projections for Multi-Camera Multi-Object Tracking,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.more
Abstract
Multi-Camera Multi-Object Tracking is currently drawing attention in the
computer vision field due to its superior performance in real-world
applications such as video surveillance with crowded scenes or in vast space.
In this work, we propose a mathematically elegant multi-camera multiple object
tracking approach based on a spatial-temporal lifted multicut formulation. Our
model utilizes state-of-the-art tracklets produced by single-camera trackers as
proposals. As these tracklets may contain ID-Switch errors, we refine them
through a novel pre-clustering obtained from 3D geometry projections. As a
result, we derive a better tracking graph without ID switches and more precise
affinity costs for the data association phase. Tracklets are then matched to
multi-camera trajectories by solving a global lifted multicut formulation that
incorporates short and long-range temporal interactions on tracklets located in
the same camera as well as inter-camera ones. Experimental results on the
WildTrack dataset yield near-perfect result, outperforming state-of-the-art
trackers on Campus while being on par on the PETS-09 dataset. We will make our
implementations available upon acceptance of the paper. - “Towards Better Understanding Attribution Methods,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “A Scalable Combinatorial Solver for Elastic Geometrically Consistent 3D Shape Matching,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “Generalized Few-shot Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “Scribble-Supervised LiDAR Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “Sound and Visual Representation Learning with Multiple Pretraining Tasks,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “RBGNet: Ray-based Grouping for 3D Object Detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “Continual Test-Time Domain Adaptation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “A Unified Query-based Paradigm for Point Cloud Understanding,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “Adiabatic Quantum Computing for Multi Object Tracking,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “Multi-Scale Interaction for Real-Time LiDAR Data Segmentation on an Embedded Platform,” IEEE Robotics and Automation Letters, vol. 7, no. 2, 2022.
- “Improving Depth Estimation Using Map-Based Depth Priors,” IEEE Robotics and Automation Letters, vol. 7, no. 2, 2022.
- “End-to-End Optimization of LiDAR Beam Configuration for 3D Object Detection and Localization,” IEEE Robotics and Automation Letters, vol. 7, no. 2, 2022.
- “Learnable Online Graph Representations for 3D Multi-Object Tracking,” IEEE Robotics and Automation Letters, vol. 7, no. 2, 2022.
- “Semi-Supervised and Unsupervised Deep Visual Learning: A Survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- “DWDN: Deep Wiener Deconvolution Network for Non-Blind Image Deblurring,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 12, 2022.
- “Meta-Transfer Learning through Hard Tasks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 3, 2022.
- “Generalized Few-Shot Video Classification With Video Retrieval and Feature Generation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 12, 2022.
- “Hyperspectral Image Super-Resolution with RGB Image Super-Resolution as an Auxiliary Task,” in 2022 IEEE Winter Conference on Applications of Computer Vision (WACV 2022), Waikoloa Village, HI, USA, 2022.
- “ASMCNN: An Efficient Brain Extraction Using Active Shape Model and Convolutional Neural Networks,” Information Sciences, vol. 591, 2022.
- “MoCapDeform: Monocular 3D Human Motion Capture in Deformable Scenes,” in International Conference on 3D Vision, Hybrid / Prague, Czechia, 2022.more
Abstract
3D human motion capture from monocular RGB images respecting interactions of
a subject with complex and possibly deformable environments is a very
challenging, ill-posed and under-explored problem. Existing methods address it
only weakly and do not model possible surface deformations often occurring when
humans interact with scene surfaces. In contrast, this paper proposes
MoCapDeform, i.e., a new framework for monocular 3D human motion capture that
is the first to explicitly model non-rigid deformations of a 3D scene for
improved 3D human pose estimation and deformable environment reconstruction.
MoCapDeform accepts a monocular RGB video and a 3D scene mesh aligned in the
camera space. It first localises a subject in the input monocular video along
with dense contact labels using a new raycasting based strategy. Next, our
human-environment interaction constraints are leveraged to jointly optimise
global 3D human poses and non-rigid surface deformations. MoCapDeform achieves
superior accuracy than competing methods on several datasets, including our
newly recorded one with deforming background scenes. - “PV-RCNN++: Point-Voxel Feature Set Abstraction With Local Vector Representation for 3D Object Detection,” International Journal of Computer Vision, vol. 131, 2022.
- “OASIS: Only Adversarial Supervision for Semantic Image Synthesis,” International Journal of Computer Vision, vol. 130, 2022.
- “Attribute Prototype Network for Any-Shot Learning,” International Journal of Computer Vision, vol. 130, 2022.
- “DPER: Direct Parameter Estimation for Randomly Missing Data,” Knowledge-Based Systems, vol. 240, 2022.
- “Aliasing and Adversarial Robust Generalization of CNNs,” Machine Learning, vol. 111, 2022.
- “Learning to solve Minimum Cost Multicuts efficiently using Edge-Weighted Graph Convolutional Neural Networks,” in Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2022), Grenoble, France, 2022.more
Abstract
The minimum cost multicut problem is the NP-hard/APX-hard combinatorial
optimization problem of partitioning a real-valued edge-weighted graph such as
to minimize the total cost of the partition. While graph convolutional neural
networks (GNN) have proven to be promising in the context of combinatorial
optimization, most of them are only tailored to or tested on positive-valued
edge weights, i.e. they do not comply to the nature of the multicut problem. We
therefore adapt various GNN architectures including Graph Convolutional
Networks, Signed Graph Convolutional Networks and Graph Isomorphic Networks to
facilitate the efficient encoding of real-valued edge costs. Moreover, we
employ a reformulation of the multicut ILP constraints to a polynomial program
as loss function that allows to learn feasible multicut solutions in a scalable
way. Thus, we provide the first approach towards end-to-end trainable
multicuts. Our findings support that GNN approaches can produce good solutions
in practice while providing lower computation times and largely improved
scalability compared to LP solvers and optimized heuristics, especially when
considering large instances. - “TATL: Task Agnostic Transfer Learning for Skin Attributes Detection,” Medical Image Analysis, vol. 78, 2022.
- “Impact of Realistic Properties of the Point Spread Function on Classification Tasks to Reveal a Possible Distribution Shift,” in NeurIPS 2022 Workshop on Distribution Shifts: Connecting Methods and Applications (NeurIPS 2022 Workshop DistShift), New Orelans, LA, USA, 2022.
- “Optimizing Edge Detection for Image Segmentation with Multicut Penalties,” in Pattern Recognition (DAGM GCPR 2022), Konstanz, Germany, 2022.more
Abstract
The Minimum Cost Multicut Problem (MP) is a popular way for obtaining a graph
decomposition by optimizing binary edge labels over edge costs. While the
formulation of a MP from independently estimated costs per edge is highly
flexible and intuitive, solving the MP is NP-hard and time-expensive. As a
remedy, recent work proposed to predict edge probabilities with awareness to
potential conflicts by incorporating cycle constraints in the prediction
process. We argue that such formulation, while providing a first step towards
end-to-end learnable edge weights, is suboptimal, since it is built upon a
loose relaxation of the MP. We therefore propose an adaptive CRF that allows to
progressively consider more violated constraints and, in consequence, to issue
solutions with higher validity. Experiments on the BSDS500 benchmark for
natural image segmentation as well as on electron microscopic recordings show
that our approach yields more precise edge detection and image segmentation. - “Keypoint Message Passing for Video-Based Person Re-identification,” in Proceedings of the 36th AAAI Conference on Artificial Intelligence, Virtual Conference, 2022.
- “PlanT: Explainable Planning Transformers via Object-Level Representations,” in Proceedings of the 6th Annual Conference on Robot Learning (CoRL 2022), Auckland, New Zealand, 2022.more
Abstract
Planning an optimal route in a complex environment requires efficient
reasoning about the surrounding scene. While human drivers prioritize important
objects and ignore details not relevant to the decision, learning-based
planners typically extract features from dense, high-dimensional grid
representations containing all vehicle and road context information. In this
paper, we propose PlanT, a novel approach for planning in the context of
self-driving that uses a standard transformer architecture. PlanT is based on
imitation learning with a compact object-level input representation. On the
Longest6 benchmark for CARLA, PlanT outperforms all prior methods (matching the
driving score of the expert) while being 5.3x faster than equivalent
pixel-based planning baselines during inference. Combining PlanT with an
off-the-shelf perception module provides a sensor-based driving system that is
more than 10 points better in terms of driving score than the existing state of
the art. Furthermore, we propose an evaluation protocol to quantify the ability
of planners to identify relevant objects, providing insights regarding their
decision-making. Our results indicate that PlanT can focus on the most relevant
object in the scene, even when this object is geometrically distant. - “Two-Stage Movie Script Summarization: An Efficient Method For Low-Resource Long Document Summarization,” in Proceedings of The Workshop on Automatic Summarization for Creative Writing (COLING 2022), Gyeongju, Republic of Korea, 2022.
- “An Embarrassingly Simple Baseline for Imbalanced Semi-Supervised Learning,” 2022. [Online]. Available: https://arxiv.org/abs/2211.11086.more
Abstract
Semi-supervised learning (SSL) has shown great promise in leveraging
unlabeled data to improve model performance. While standard SSL assumes uniform
data distribution, we consider a more realistic and challenging setting called
imbalanced SSL, where imbalanced class distributions occur in both labeled and
unlabeled data. Although there are existing endeavors to tackle this challenge,
their performance degenerates when facing severe imbalance since they can not
reduce the class imbalance sufficiently and effectively. In this paper, we
study a simple yet overlooked baseline -- SimiS -- which tackles data imbalance
by simply supplementing labeled data with pseudo-labels, according to the
difference in class distribution from the most frequent class. Such a simple
baseline turns out to be highly effective in reducing class imbalance. It
outperforms existing methods by a significant margin, e.g., 12.8%, 13.6%, and
16.7% over previous SOTA on CIFAR100-LT, FOOD101-LT, and ImageNet127
respectively. The reduced imbalance results in faster convergence and better
pseudo-label accuracy of SimiS. The simplicity of our method also makes it
possible to be combined with other re-balancing techniques to improve the
performance further. Moreover, our method shows great robustness to a wide
range of data distributions, which holds enormous potential in practice. Code
will be publicly available. - “Leveraging Self-Supervised Training for Unintentional Action Recognition,” 2022. [Online]. Available: https://arxiv.org/abs/2209.11870.more
Abstract
Unintentional actions are rare occurrences that are difficult to define
precisely and that are highly dependent on the temporal context of the action.
In this work, we explore such actions and seek to identify the points in videos
where the actions transition from intentional to unintentional. We propose a
multi-stage framework that exploits inherent biases such as motion speed,
motion direction, and order to recognize unintentional actions. To enhance
representations via self-supervised training for the task of unintentional
action recognition we propose temporal transformations, called Temporal
Transformations of Inherent Biases of Unintentional Actions (T2IBUA). The
multi-stage approach models the temporal information on both the level of
individual frames and full clips. These enhanced representations show strong
performance for unintentional action recognition tasks. We provide an extensive
ablation study of our framework and report results that significantly improve
over the state-of-the-art. - “Normalization Perturbation: A Simple Domain Generalization Method for Real-World Domain Shifts,” 2022. [Online]. Available: https://arxiv.org/abs/2211.04393.more
Abstract
Improving model's generalizability against domain shifts is crucial,
especially for safety-critical applications such as autonomous driving.
Real-world domain styles can vary substantially due to environment changes and
sensor noises, but deep models only know the training domain style. Such domain
style gap impedes model generalization on diverse real-world domains. Our
proposed Normalization Perturbation (NP) can effectively overcome this domain
style overfitting problem. We observe that this problem is mainly caused by the
biased distribution of low-level features learned in shallow CNN layers. Thus,
we propose to perturb the channel statistics of source domain features to
synthesize various latent styles, so that the trained deep model can perceive
diverse potential domains and generalizes well even without observations of
target domain data in training. We further explore the style-sensitive channels
for effective style synthesis. Normalization Perturbation only relies on a
single source domain and is surprisingly effective and extremely easy to
implement. Extensive experiments verify the effectiveness of our method for
generalizing models under real-world domain shifts. - “Visually Plausible Human-Object Interaction Capture from Wearable Sensors,” 2022. [Online]. Available: https://arxiv.org/abs/2205.02830.more
Abstract
In everyday lives, humans naturally modify the surrounding environment
through interactions, e.g., moving a chair to sit on it. To reproduce such
interactions in virtual spaces (e.g., metaverse), we need to be able to capture
and model them, including changes in the scene geometry, ideally from
ego-centric input alone (head camera and body-worn inertial sensors). This is
an extremely hard problem, especially since the object/scene might not be
visible from the head camera (e.g., a human not looking at a chair while
sitting down, or not looking at the door handle while opening a door). In this
paper, we present HOPS, the first method to capture interactions such as
dragging objects and opening doors from ego-centric data alone. Central to our
method is reasoning about human-object interactions, allowing to track objects
even when they are not visible from the head camera. HOPS localizes and
registers both the human and the dynamic object in a pre-scanned static scene.
HOPS is an important first step towards advanced AR/VR applications based on
immersive virtual universes, and can provide human-centric training data to
teach machines to interact with their surroundings. The supplementary video,
data, and code will be available on our project page at
virtualhumans.mpi-inf.mpg.de/hops/ - “Lifted Edges as Connectivity Priors for Multicut and Disjoint Paths,” Universität des Saarlandes, Saarbrücken, 2022.
- “Deep Gradient Learning for Efficient Camouflaged Object Detection,” 2022. [Online]. Available: https://arxiv.org/pdf/2205.12853.pdf.more
Abstract
This paper introduces DGNet, a novel deep framework that exploits object
gradient supervision for camouflaged object detection (COD). It decouples the
task into two connected branches, i.e., a context and a texture encoder. The
essential connection is the gradient-induced transition, representing a soft
grouping between context and texture features. Benefiting from the simple but
efficient framework, DGNet outperforms existing state-of-the-art COD models by
a large margin. Notably, our efficient version, DGNet-S, runs in real-time (80
fps) and achieves comparable results to the cutting-edge model
JCSOD-CVPR$_{21}$ with only 6.82% parameters. Application results also show
that the proposed DGNet performs well in polyp segmentation, defect detection,
and transparent object segmentation tasks. Codes will be made available at
github.com/GewelsJI/DGNet. - “MTR-A: 1st Place Solution for 2022 Waymo Open Dataset Challenge -- Motion Prediction,” 2022. [Online]. Available: https://arxiv.org/abs/2209.10033.more
Abstract
In this report, we present the 1st place solution for motion prediction track
in 2022 Waymo Open Dataset Challenges. We propose a novel Motion Transformer
framework for multimodal motion prediction, which introduces a small set of
novel motion query pairs for generating better multimodal future trajectories
by jointly performing the intention localization and iterative motion
refinement. A simple model ensemble strategy with non-maximum-suppression is
adopted to further boost the final performance. Our approach achieves the 1st
place on the motion prediction leaderboard of 2022 Waymo Open Dataset
Challenges, outperforming other methods with remarkable margins. Code will be
available at github.com/sshaoshuai/MTR. - “Understanding and Improving Robustness and Uncertainty Estimation in Deep Learning,” Universität des Saarlandes, Saarbrücken, 2022.more
Abstract
Deep learning is becoming increasingly relevant for many high-stakes applications such as autonomous driving or medical diagnosis where wrong decisions can have massive impact on human lives. Unfortunately, deep neural networks are typically assessed solely based on generalization, e.g., accuracy on a fixed test set. However, this is clearly insufficient for safe deployment as potential malicious actors and distribution shifts or the effects of quantization and unreliable hardware are disregarded. Thus, recent work additionally evaluates performance on potentially manipulated or corrupted inputs as well as after quantization and deployment on specialized hardware. In such settings, it is also important to obtain reasonable estimates of the model's confidence alongside its predictions. This thesis studies robustness and uncertainty estimation in deep learning along three main directions: First, we consider so-called adversarial examples, slightly perturbed inputs causing severe drops in accuracy. Second, we study weight perturbations, focusing particularly on bit errors in quantized weights. This is relevant for deploying models on special-purpose hardware for efficient inference, so-called accelerators. Finally, we address uncertainty estimation to improve robustness and provide meaningful statistical performance guarantees for safe deployment. In detail, we study the existence of adversarial examples with respect to the underlying data manifold. In this context, we also investigate adversarial training which improves robustness by augmenting training with adversarial examples at the cost of reduced accuracy. We show that regular adversarial examples leave the data manifold in an almost orthogonal direction. While we find no inherent trade-off between robustness and accuracy, this contributes to a higher sample complexity as well as severe overfitting of adversarial training. Using a novel measure of flatness in the robust loss landscape with respect to weight changes, we also show that robust overfitting is caused by converging to particularly sharp minima. In fact, we find a clear correlation between flatness and good robust generalization. Further, we study random and adversarial bit errors in quantized weights. In accelerators, random bit errors occur in the memory when reducing voltage with the goal of improving energy-efficiency. Here, we consider a robust quantization scheme, use weight clipping as regularization and perform random bit error training to improve bit error robustness, allowing considerable energy savings without requiring hardware changes. In contrast, adversarial bit errors are maliciously introduced through hardware- or software-based attacks on the memory, with severe consequences on performance. We propose a novel adversarial bit error attack to study this threat and use adversarial bit error training to improve robustness and thereby also the accelerator's security. Finally, we view robustness in the context of uncertainty estimation. By encouraging low-confidence predictions on adversarial examples, our confidence-calibrated adversarial training successfully rejects adversarial, corrupted as well as out-of-distribution examples at test time. Thereby, we are also able to improve the robustness-accuracy trade-off compared to regular adversarial training. However, even robust models do not provide any guarantee for safe deployment. To address this problem, conformal prediction allows the model to predict confidence sets with user-specified guarantee of including the true label. Unfortunately, as conformal prediction is usually applied after training, the model is trained without taking this calibration step into account. To address this limitation, we propose conformal training which allows training conformal predictors end-to-end with the underlying model. This not only improves the obtained uncertainty estimates but also enables optimizing application-specific objectives without losing the provided guarantee. Besides our work on robustness or uncertainty, we also address the problem of 3D shape completion of partially observed point clouds. Specifically, we consider an autonomous driving or robotics setting where vehicles are commonly equipped with LiDAR or depth sensors and obtaining a complete 3D representation of the environment is crucial. However, ground truth shapes that are essential for applying deep learning techniques are extremely difficult to obtain. Thus, we propose a weakly-supervised approach that can be trained on the incomplete point clouds while offering efficient inference. In summary, this thesis contributes to our understanding of robustness against both input and weight perturbations. To this end, we also develop methods to improve robustness alongside uncertainty estimation for safe deployment of deep learning methods in high-stakes applications. In the particular context of autonomous driving, we also address 3D shape completion of sparse point clouds.
- “Structured Prediction Problem Archive,” 2022. [Online]. Available: https://arxiv.org/abs/2202.03574.more
Abstract
Structured prediction problems are one of the fundamental tools in machine
learning. In order to facilitate algorithm development for their numerical
solution, we collect in one place a large number of datasets in easy to read
formats for a diverse set of problem classes. We provide archival links to
datasets, description of the considered problems and problem formats, and a
short summary of problem characteristics including size, number of instances
etc. For reference we also give a non-exhaustive selection of algorithms
proposed in the literature for their solution. We hope that this central
repository will make benchmarking and comparison to established works easier.
We welcome submission of interesting new datasets and algorithms for inclusion
in our archive. - “On Fragile Features and Batch Normalization in Adversarial Training,” 2022. [Online]. Available: https://arxiv.org/abs/2204.12393.more
Abstract
Modern deep learning architecture utilize batch normalization (BN) to
stabilize training and improve accuracy. It has been shown that the BN layers
alone are surprisingly expressive. In the context of robustness against
adversarial examples, however, BN is argued to increase vulnerability. That is,
BN helps to learn fragile features. Nevertheless, BN is still used in
adversarial training, which is the de-facto standard to learn robust features.
In order to shed light on the role of BN in adversarial training, we
investigate to what extent the expressiveness of BN can be used to robustify
fragile features in comparison to random features. On CIFAR10, we find that
adversarially fine-tuning just the BN layers can result in non-trivial
adversarial robustness. Adversarially training only the BN layers from scratch,
in contrast, is not able to convey meaningful adversarial robustness. Our
results indicate that fragile features can be used to learn models with
moderate adversarial robustness, while random features cannot - “Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in Driving Scenes,” 2022. [Online]. Available: https://arxiv.org/abs/2208.08621.more
Abstract
Current efficient LiDAR-based detection frameworks are lacking in exploiting
object relations, which naturally present in both spatial and temporal manners.
To this end, we introduce a simple, efficient, and effective two-stage
detector, termed as Ret3D. At the core of Ret3D is the utilization of novel
intra-frame and inter-frame relation modules to capture the spatial and
temporal relations accordingly. More Specifically, intra-frame relation module
(IntraRM) encapsulates the intra-frame objects into a sparse graph and thus
allows us to refine the object features through efficient message passing. On
the other hand, inter-frame relation module (InterRM) densely connects each
object in its corresponding tracked sequences dynamically, and leverages such
temporal information to further enhance its representations efficiently through
a lightweight transformer network. We instantiate our novel designs of IntraRM
and InterRM with general center-based or anchor-based detectors and evaluate
them on Waymo Open Dataset (WOD). With negligible extra overhead, Ret3D
achieves the state-of-the-art performance, being 5.5% and 3.2% higher than the
recent competitor in terms of the LEVEL 1 and LEVEL 2 mAPH metrics on vehicle
detection, respectively. - “TOCH: Spatio-Temporal Object Correspondence to Hand for Motion Refinement,” 2022. [Online]. Available: https://arxiv.org/abs/2205.07982.more
Abstract
We present TOCH, a method for refining incorrect 3D hand-object interaction
sequences using a data prior. Existing hand trackers, especially those that
rely on very few cameras, often produce visually unrealistic results with
hand-object intersection or missing contacts. Although correcting such errors
requires reasoning about temporal aspects of interaction, most previous work
focus on static grasps and contacts. The core of our method are TOCH fields, a
novel spatio-temporal representation for modeling correspondences between hands
and objects during interaction. The key component is a point-wise
object-centric representation which encodes the hand position relative to the
object. Leveraging this novel representation, we learn a latent manifold of
plausible TOCH fields with a temporal denoising auto-encoder. Experiments
demonstrate that TOCH outperforms state-of-the-art (SOTA) 3D hand-object
interaction models, which are limited to static grasps and contacts. More
importantly, our method produces smooth interactions even before and after
contact. Using a single trained TOCH model, we quantitatively and qualitatively
demonstrate its usefulness for 1) correcting erroneous reconstruction results
from off-the-shelf RGB/RGB-D hand-object reconstruction methods, 2) de-noising,
and 3) grasp transfer across objects. We will release our code and trained
model on our project page at virtualhumans.mpi-inf.mpg.de/toch/ - “Hypergraph Transformer for Skeleton-based Action Recognition,” 2022. [Online]. Available: https://arxiv.org/abs/2211.09590.more
Abstract
Skeleton-based action recognition aims to predict human actions given human
joint coordinates with skeletal interconnections. To model such off-grid data
points and their co-occurrences, Transformer-based formulations would be a
natural choice. However, Transformers still lag behind state-of-the-art methods
using graph convolutional networks (GCNs). Transformers assume that the input
is permutation-invariant and homogeneous (partially alleviated by positional
encoding), which ignores an important characteristic of skeleton data, i.e.,
bone connectivity. Furthermore, each type of body joint has a clear physical
meaning in human motion, i.e., motion retains an intrinsic relationship
regardless of the joint coordinates, which is not explored in Transformers. In
fact, certain re-occurring groups of body joints are often involved in specific
actions, such as the subconscious hand movement for keeping balance. Vanilla
attention is incapable of describing such underlying relations that are
persistent and beyond pair-wise. In this work, we aim to exploit these unique
aspects of skeleton data to close the performance gap between Transformers and
GCNs. Specifically, we propose a new self-attention (SA) extension, named
Hypergraph Self-Attention (HyperSA), to incorporate inherently higher-order
relations into the model. The K-hop relative positional embeddings are also
employed to take bone connectivity into account. We name the resulting model
Hyperformer, and it achieves comparable or better performance w.r.t. accuracy
and efficiency than state-of-the-art GCN architectures on NTU RGB+D, NTU RGB+D
120, and Northwestern-UCLA datasets. On the largest NTU RGB+D 120 dataset, the
significantly improved performance reached by our Hyperformer demonstrates the
underestimated potential of Transformer models in this field.
2021
- “(SP)2Net for Generalized Zero-Label Semantic Segmentation,” in Pattern Recognition (GCPR 2021), Bonn, Germany, 2022.
- “Revisiting Consistency Regularization for Semi-supervised Learning,” in Pattern Recognition (GCPR 2021), Bonn, Germany, 2022.
- “Compositional Mixture Representations for Vision and Text,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2022), New Orleans, LA, USA, 2022.
- “Probabilistic Compositional Embeddings for Multimodal Image Retrieval,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2022), New Orleans, LA, USA, 2022.
2020
- “CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations,” in xxAI -- Beyond Explainable AI (xxAI @ICML 2020), Vienna, Austria (Virtually), 2022.
- “CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations,” in xxAI -- Beyond Explainable AI (XXAI @ICML 2020), Vienna, Austria (Virtually), 2022.