Publications

2025

1
Conference paper
D2
V. Guzov, Y. Jiang, F. Hong, G. Pons-Moll, R. Newcombe, C. K. Liu, Y. Ye, and L. Ma
“HMD2: Environment-aware Motion Generation from Single Egocentric Head-Mounted Device,” in 3DV 2025, 12th International Conference on 3D Vision, Singapore.
2
Conference paper
D2
C. Li, J. Chibane, Y. He, N. Pearl, A. Geiger, and G. Pons-Moll
“Unimotion: Unifying 3D Human Motion Synthesis and Understanding,” in 3DV 2025, 12th International Conference on 3D Vision, Singapore.
3
Conference paper
D2
K. Raj, C. Wewer, R. Yunus, E. Ilg, and J. E. Lenssen
“Spurfies: Sparse-view Surface Reconstruction using Local Geometry Priors,” in 3DV 2025, 12th International Conference on 3D Vision, Singapore.
4
Conference paper
D2
T. Wimmer, M. Oechsle, M. Niemeyer, and F. Tombari
“Gaussians-to-Life: Text-Driven Animation of 3D Gaussian Splatting Scenes,” in 3DV 2025, 12th International Conference on 3D Vision, Singapore.
Abstract
State-of-the-art novel view synthesis methods achieve impressive results for
multi-view captures of static 3D scenes. However, the reconstructed scenes
still lack "liveliness," a key component for creating engaging 3D experiences.
Recently, novel video diffusion models generate realistic videos with complex
motion and enable animations of 2D images; however, they cannot naively be used
to animate 3D scenes as they lack multi-view consistency. To breathe life into
the static world, we propose Gaussians2Life, a method for animating parts of
high-quality 3D scenes in a Gaussian Splatting representation. Our key idea is
to leverage powerful video diffusion models as the generative component of our
model and to combine these with a robust technique to lift 2D videos into
meaningful 3D motion. We find that, in contrast to prior work, this enables
realistic animations of complex, pre-existing 3D scenes and further enables the
animation of a large variety of object classes, while related work is mostly
focused on prior-based character animation, or single 3D objects. Our model
enables the creation of consistent, immersive 3D experiences for arbitrary
scenes.
5
Conference paper
D2
X. Xie, J. E. Lenssen, and G. Pons-Moll
“InterTrack: Tracking Human Object Interaction without Object Templates,” in 3DV 2025, 12th International Conference on 3D Vision, Singapore.
6
Conference paper
D2
X. Zhang, B. L. Bhatnagar, S. Starke, I. A. Petrov, V. Guzov, H. Dhamo, E. Pérez Pellitero, and G. Pons-Moll
“FORCE: Dataset and Method for Intuitive Physics Guided Human-object Interaction,” in 3DV 2025, 12th International Conference on 3D Vision, Singapore.
7
Article
D2
B. Andres, S. Di Gregorio, J. Irmai, and J.-H. Lange
“Corrigendum to ‘A polyhedral study of lifted multicuts’ [Discrete Optim. 47 (2023) 100757],” Discrete Optimization, vol. 55, 2025.
8
Conference paper
D2
M. Asim, C. Wewer, T. Wimmer, B. Schiele, and J. E. Lenssen
“MEt3R: Measuring Multi-View Consistency in Generated Images,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
9
Conference paper
D2
P. Flotho, M. Piening, A. Kukleva, and G. Steidl
“T-FAKE: Synthesizing Thermal Images for Facial Landmarking,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
10
Conference paper
D2
F. Hong, V. Guzov, H. J. Kim, Y. Ye, R. Newcombe, Z. Liu, and L. Ma
“EgoLM: Multi-Modal Language Model of Egocentric Motions,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
11
Conference paper
D2
X. Hu, H. Wang, J. E. Lenssen, and B. Schiele
“PersonaHOI: Effortlessly Improving Personalized Face with Human-Object Interaction Generation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
Abstract
We introduce PersonaHOI, a training- and tuning-free framework that fuses a
general StableDiffusion model with a personalized face diffusion (PFD) model to
generate identity-consistent human-object interaction (HOI) images. While
existing PFD models have advanced significantly, they often overemphasize
facial features at the expense of full-body coherence. PersonaHOI introduces an
additional StableDiffusion (SD) branch guided by HOI-oriented text inputs. By
incorporating cross-attention constraints in the PFD branch and spatial merging
at both latent and residual levels, PersonaHOI preserves personalized facial
details while ensuring interactive non-facial regions. Experiments, validated
by a novel interaction alignment metric, demonstrate the superior realism and
scalability of PersonaHOI, establishing a new standard for practical
personalized face with HOI generation. Our code will be available at
github.com/JoyHuYY1412/PersonaHOI
12
Conference paper
D2
N. Shvetsova, A. Nagrani, B. Schiele, H. Kuehne, and C. Rupprecht
“Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
13
Conference paper
D2
F. Vogel, W. Bousselham, A. Kukleva, N. Shvetsova, and H. Kuehne
“VideoGEM: Training-free Action Grounding in Videos,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
14
Conference paper
D2
J. Xie, A. Tonioni, N. Rauschmayr, F. Tombari, and B. Schiele
“Test-Time Visual In-Context Tuning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
15
Conference paper
D2
I. Hossain, J. Fischer, R. Burkholz, and J. Quackenbush
“Pruning Neural Network Models for Gene Regulatory Dynamics Using Data and Domain Knowledge,” in The Second Conference on Parsimony and Learning Recent Spotlight Track (CPAL 2025), Stanford, CA, USA, 2025.
16
Conference paper
D2
S. Gairola, M. Böhle, F. Locatello, and B. Schiele
“How to Probe: Simple Yet Effective Techniques for Improving Post-hoc Explanations,” in The Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore, 2025.
17
Conference paper
D2
P. Gavrikov, J. Lukasik, S. Jung, R. Geirhos, M. J. Mirza, M. Keuper, and J. Keuper
“Can We Talk Models Into Seeing the World Differently?,” in The Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore, 2025.
18
Conference paper
D2
H. Wang, Y. Fan, M. F. Naeem, Y. Xian, J. E. Lenssen, L. Wang, F. Tombari, and B. Schiele
“TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters,” in Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore.
19
Conference paper
D2
Y. Yuan, Z. Zhang, X. He, A. Nitta, W. Hu, D. Wang, M. Shah, S. Huang, B. Stojanovič, A. Krumholz, J. E. Lenssen, J. Leskovec, and M. Fey
“ContextGNN: Beyond Two-Tower Recommendation Systems,” in Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore.
20
Paper
D2
C. Chen, E. Saha, J. Fischer, M. B. Guebila, V. Fanfani, K. H. Shutta, M. Padi, K. Glass, D. L. DeMeo, C. M. Lopes-Ramos, and J. Quackenbush
“Identifying Sex Differences in Lung Adenocarcinoma Using Multi-Omics Integrative Protein Signaling Networks.” 2025.
21
Paper
D2
A. Görgün, B. Schiele, and J. Fischer
“VITAL: More Understandable Feature Visualization through Distribution Alignment and Relevant Information Flow,” 2025. [Online]. Available: https://arxiv.org/abs/2503.22399.
Abstract
Neural networks are widely adopted to solve complex and challenging tasks.
Especially in high-stakes decision-making, understanding their reasoning
process is crucial, yet proves challenging for modern deep networks. Feature
visualization (FV) is a powerful tool to decode what information neurons are
responding to and hence to better understand the reasoning behind such
networks. In particular, in FV we generate human-understandable images that
reflect the information detected by neurons of interest. However, current
methods often yield unrecognizable visualizations, exhibiting repetitive
patterns and visual artifacts that are hard to understand for a human. To
address these problems, we propose to guide FV through statistics of real image
features combined with measures of relevant network flow to generate
prototypical images. Our approach yields human-understandable visualizations
that both qualitatively and quantitatively improve over state-of-the-art FVs
across various architectures. As such, it can be used to decode which
information the network uses, complementing mechanistic circuits that identify
where it is encoded. Code is available at: github.com/adagorgun/VITAL
22
Paper
D2
R. Hesse, D. Bağcı, B. Schiele, S. Schaub-Meyer, and S. Roth
“Beyond Accuracy: What Matters in Designing Well-Behaved Models?,” 2025. [Online]. Available: https://arxiv.org/abs/2503.17110.
Abstract
Deep learning has become an essential part of computer vision, with deep
neural networks (DNNs) excelling in predictive performance. However, they often
fall short in other critical quality dimensions, such as robustness,
calibration, or fairness. While existing studies have focused on a subset of
these quality dimensions, none have explored a more general form of
"well-behavedness" of DNNs. With this work, we address this gap by
simultaneously studying nine different quality dimensions for image
classification. Through a large-scale study, we provide a bird's-eye view by
analyzing 326 backbone models and how different training paradigms and model
architectures affect the quality dimensions. We reveal various new insights
such as that (i) vision-language models exhibit high fairness on ImageNet-1k
classification and strong robustness against domain changes; (ii)
self-supervised learning is an effective training paradigm to improve almost
all considered quality dimensions; and (iii) the training dataset size is a
major driver for most of the quality dimensions. We conclude our study by
introducing the QUBA score (Quality Understanding Beyond Accuracy), a novel
metric that ranks models across multiple dimensions of quality, enabling
tailored recommendations based on specific user needs.
23
Paper
D2D6
N. Pham, B. Schiele, A. Kortylewski, and J. Fischer
“Escaping Plato’s Cave: Robust Conceptual Reasoning through Interpretable 3D Neural Object Volumes,” 2025. [Online]. Available: https://arxiv.org/abs/2503.13429.
Abstract
With the rise of neural networks, especially in high-stakes applications,
these networks need two properties, (i) robustness and (ii) interpretability, to
ensure their safety. Recent advances in classifiers with 3D volumetric object
representations have demonstrated a greatly enhanced robustness in
out-of-distribution data. However, these 3D-aware classifiers have not been
studied from the perspective of interpretability. We introduce CAVE - Concept
Aware Volumes for Explanations - a new direction that unifies interpretability
and robustness in image classification. We design an inherently-interpretable
and robust classifier by extending existing 3D-aware classifiers with concepts
extracted from their volumetric representations for classification. In an array
of quantitative metrics for interpretability, we compare against different
concept-based approaches across the explainable AI literature and show that
CAVE discovers well-grounded concepts that are used consistently across images,
while achieving superior robustness.
24
Paper
D2
L. Piccinelli, C. Sakaridis, Y.-H. Yang, M. Segu, S. Li, W. Abbeloos, and L. Van Gool
“UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler,” 2025. [Online]. Available: https://arxiv.org/abs/2502.20110.
Abstract
Accurate monocular metric depth estimation (MMDE) is crucial to solving
downstream tasks in 3D perception and modeling. However, the remarkable
accuracy of recent MMDE methods is confined to their training domains. These
methods fail to generalize to unseen domains even in the presence of moderate
domain gaps, which hinders their practical applicability. We propose a new
model, UniDepthV2, capable of reconstructing metric 3D scenes from solely
single images across domains. Departing from the existing MMDE paradigm,
UniDepthV2 directly predicts metric 3D points from the input image at inference
time without any additional information, striving for a universal and flexible
MMDE solution. In particular, UniDepthV2 implements a self-promptable camera
module predicting a dense camera representation to condition depth features.
Our model exploits a pseudo-spherical output representation, which disentangles
the camera and depth representations. In addition, we propose a geometric
invariance loss that promotes the invariance of camera-prompted depth features.
UniDepthV2 improves its predecessor UniDepth model via a new edge-guided loss
which enhances the localization and sharpness of edges in the metric depth
outputs, a revisited, simplified and more efficient architectural design, and
an additional uncertainty-level output which enables downstream tasks requiring
confidence. Thorough evaluations on ten depth datasets in a zero-shot regime
consistently demonstrate the superior performance and generalization of
UniDepthV2. Code and models are available at
github.com/lpiccinelli-eth/UniDepth
25
Paper
D2
K. Prasse, P. Knab, S. Marton, C. Bartelt, and M. Keuper
“DCBM: Data-Efficient Visual Concept Bottleneck Models,” 2025. [Online]. Available: https://arxiv.org/abs/2412.11576.
Abstract
Concept Bottleneck Models (CBMs) enhance the interpretability of neural
networks by basing predictions on human-understandable concepts. However,
current CBMs typically rely on concept sets extracted from large language
models or extensive image corpora, limiting their effectiveness in data-sparse
scenarios. We propose Data-efficient CBMs (DCBMs), which reduce the need for
large sample sizes during concept generation while preserving interpretability.
DCBMs define concepts as image regions detected by segmentation or detection
foundation models, allowing each image to generate multiple concepts across
different granularities. This removes reliance on textual descriptions and
large-scale pre-training, making DCBMs applicable for fine-grained
classification and out-of-distribution tasks. Attribution analysis using
Grad-CAM demonstrates that DCBMs deliver visual concepts that can be localized
in test images. By leveraging dataset-specific concepts instead of predefined
ones, DCBMs enhance adaptability to new domains.
26
Paper
D2
F. Sammani, J. Fischer, and N. Deligiannis
“Unlocking Open-Set Language Accessibility in Vision Models,” 2025. [Online]. Available: https://arxiv.org/abs/2503.10981.
27
Paper
D2
N. P. Walter, J. Vreeken, and J. Fischer
“Now You See Me! A Framework for Obtaining Class-relevant Saliency Maps,” 2025. [Online]. Available: https://arxiv.org/abs/2503.07346.
28
Paper
D2RG3
Y. Wang, S. Rao, J.-U. Lee, M. Jobanputra, and V. Demberg
“B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability,” 2025. [Online]. Available: https://arxiv.org/abs/2502.12992.
Abstract
Post-hoc explanation methods for black-box models often struggle with
faithfulness and human interpretability due to the lack of explainability in
current neural models. Meanwhile, B-cos networks have been introduced to
improve model explainability through architectural and computational
adaptations, but their application has so far been limited to computer vision
models and their associated training pipelines. In this work, we introduce
B-cos LMs, i.e., B-cos networks empowered for NLP tasks. Our approach directly
transforms pre-trained language models into B-cos LMs by combining B-cos
conversion and task fine-tuning, improving efficiency compared to previous
B-cos methods. Our automatic and human evaluation results demonstrate that
B-cos LMs produce more faithful and human interpretable explanations than
post-hoc methods, while maintaining task performance comparable to conventional
fine-tuning. Our in-depth analysis explores how B-cos LMs differ from
conventionally fine-tuned models in their learning processes and explanation
patterns. Finally, we provide practical guidelines for effectively building
B-cos LMs based on our findings. Our code is available at
anonymous.4open.science/r/bcos_lm.
29
Paper
D2
C. Wewer, B. Pogodzinski, B. Schiele, and J. E. Lenssen
“Spatial Reasoning with Denoising Models,” 2025. [Online]. Available: https://www.arxiv.org/abs/2502.21075.
Abstract
We introduce Spatial Reasoning Models (SRMs), a framework to perform
reasoning over sets of continuous variables via denoising generative models.
SRMs infer continuous representations on a set of unobserved variables, given
observations on observed variables. Current generative models on spatial
domains, such as diffusion and flow matching models, often collapse to
hallucination in case of complex distributions. To measure this, we introduce a
set of benchmark tasks that test the quality of complex reasoning in generative
models and can quantify hallucination. The SRM framework allows us to report key
findings about the importance of sequentialization in generation, the associated
order, as well as the sampling strategies during training. It demonstrates, for
the first time, that the order of generation can successfully be predicted by the
denoising network itself. Using these findings, we can increase the accuracy of
specific reasoning tasks from 1% to >50%.

2024

30
Conference paper
D2
D. Antić, G. Tiwari, B. Ozcomlekci, R. Marin, and G. Pons-Moll
“CloSe: A 3D Clothing Segmentation Dataset and Model,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
31
Conference paper
D2
V. Guzov, J. Chibane, R. Marin, Y. He, Y. Saracoglu, T. Sattler, and G. Pons-Moll
“Interaction Replica: Tracking Human–Object Interaction and Scene Changes From Human Motion,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
32
Conference paper
D2
B. Kabadayi, W. Zielonka, B. L. Bhatnagar, G. Pons-Moll, and J. Thies
“GAN-Avatar: Controllable Personalized GAN-based Human Head Avatar,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
33
Conference paper
D2
A. Mir, X. Puig, A. Kanazawa, and G. Pons-Moll
“Generating Continual Human Motion in Diverse 3D Scenes,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
34
Conference paper
D2
S. Arya, S. Rao, M. Boehle, and B. Schiele
“B-cosification: Transforming Deep Neural Networks to be Inherently Interpretable,” in Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Vancouver, Canada, 2024.
Abstract
B-cos Networks have been shown to be effective for obtaining highly human
interpretable explanations of model decisions by architecturally enforcing
stronger alignment between inputs and weight. B-cos variants of convolutional
networks (CNNs) and vision transformers (ViTs), which primarily replace linear
layers with B-cos transformations, perform competitively to their respective
standard variants while also yielding explanations that are faithful by design.
However, it has so far been necessary to train these models from scratch, which
is increasingly infeasible in the era of large, pre-trained foundation models.
In this work, inspired by the architectural similarities in standard DNNs and
B-cos networks, we propose 'B-cosification', a novel approach to transform
existing pre-trained models to become inherently interpretable. We perform a
thorough study of design choices to perform this conversion, both for
convolutional neural networks and vision transformers. We find that
B-cosification can yield models that are on par with B-cos models trained from
scratch in terms of interpretability, while often outperforming them in terms
of classification performance at a fraction of the training cost. Subsequently,
we apply B-cosification to a pretrained CLIP model, and show that, even with
limited data and compute cost, we obtain a B-cosified version that is highly
interpretable and competitive on zero shot performance across a variety of
datasets. We release our code and pre-trained model weights at
github.com/shrebox/B-cosification.
35
Conference paper
D2
W. Böttcher, L. Hoyer, O. Unal, J. E. Lenssen, and B. Schiele
“Scribbles for All: Benchmarking Scribble Supervised Segmentation Across Datasets,” in Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Vancouver, Canada, 2024.
36
Conference paper
D2
J. Chen, J. E. Lenssen, A. Feng, W. Hu, M. Fey, L. Tassiulas, J. Leskovec, and R. Ying
“From Similarity to Superiority: Channel Clustering for Time Series Forecasting,” in Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Vancouver, Canada, 2024.
37
Conference paper
D2
I. Hossain, J. Fischer, R. Burkholz, and J. Quackenbush
“Pruning Neural Network Models for Gene Regulatory Dynamics Using Data and Domain Knowledge,” in Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Vancouver, Canada, 2024.
38
Conference paper
D2
J. Robinson, R. Ranjan, W. Hu, K. Huang, J. Han, A. Dobles, M. Fey, J. E. Lenssen, Y. Yuan, Z. Zhang, X. He, and J. Leskovec
“RelBench: A Benchmark for Deep Learning on Relational Databases,” in Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Vancouver, Canada, 2024.
39
Conference paper
D2
I. Sárándi and G. Pons-Moll
“Neural Localizer Fields for Continuous 3D Human Pose and Shape Estimation,” in Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Vancouver, Canada, 2024.
40
Conference paper
D2
Y. Xue, X. Xie, R. Marin, and G. Pons-Moll
“Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models,” in Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Vancouver, Canada, 2024.
41
Article
D2D6
R. Yunus, J. E. Lenssen, M. Niemeyer, Y. Liao, C. Rupprecht, C. Theobalt, G. Pons-Moll, J.-B. Huang, V. Golyanik, and E. Ilg
“Recent Trends in 3D Reconstruction of General Non-Rigid Scenes,” Computer Graphics Forum (Proc. EUROGRAPHICS 2024), vol. 43, no. 2, 2024.
42
Conference paper
D2
S. Agnihotri, J. Grabinski, and M. Keuper
“Improving Feature Stability during Upsampling - Spectral Artifacts and the Importance of Spatial Context,” in Computer Vision – ECCV 2024, Milano, Italy, 2024.
43
Conference paper
D2
A. Das, X. Hu, L. Jiang, and B. Schiele
“MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment,” in Computer Vision – ECCV 2024, Milano, Italy, 2024.
44
Conference paper
D2
L. Ma, Y. Ye, F. Hong, V. Guzov, Y. Jiang, R. Postyeni, L. Pesqueira, A. Gamino, V. Baiyya, H. J. Kim, K. Bailey, D. S. Fosas, C. K. Liu, Z. Liu, J. Engel, R. De Nardi, and R. Newcombe
“Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild,” in Computer Vision – ECCV 2024, Milano, Italy, 2024.
45
Conference paper
D2
R. Marin, E. Corona, and G. Pons-Moll
“NICP: Neural ICP for 3D Human Registration at Scale,” in Computer Vision – ECCV 2024, Milano, Italy, 2024.
46
Conference paper
D2
A. Parchami-Araghi, M. Böhle, S. S. Rao, and B. Schiele
“Good Teachers Explain: Explanation-Enhanced Knowledge Distillation,” in Computer Vision – ECCV 2024, Milano, Italy, 2024.
47
Conference paper
D2
S. Rao, S. Mahajan, M. Böhle, and B. Schiele
“Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery,” in Computer Vision – ECCV 2024, Milano, Italy, 2024.
48
Conference paper
D2
P. Roetzer, A. Abbas, D. Cao, F. Bernard, and P. Swoboda
“DiscoMatch: Fast Discrete Optimisation for Geometrically Consistent 3D Shape Matching,” in Computer Vision – ECCV 2024, Milano, Italy, 2024.
49
Conference paper
D2
M. Segu, L. Piccinelli, S. Li, L. V. Gool, F. Yu, and B. Schiele
“Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Appearance Graphs,” in Computer Vision – ECCV 2024, Milano, Italy, 2024.
50
Conference paper
D2
N. Shvetsova, A. Kukleva, X. Hong, C. Rupprecht, B. Schiele, and H. Kuehne
“HowToCaption: Prompting LLMs to Transform Video Annotations at Scale,” in Computer Vision – ECCV 2024, Milano, Italy, 2024.
51
Conference paper
D2
H. Wang, H. Tang, L. Jiang, S. Shi, M. F. Naeem, H. Li, B. Schiele, and L. Wang
“GiT: Towards Generalist Vision Transformer through Universal Language Interface,” in Computer Vision – ECCV 2024, Milano, Italy, 2024.
52
Conference paper
D2
C. Wewer, K. Raj, E. Ilg, B. Schiele, and J. E. Lenssen
“latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction,” in Computer Vision – ECCV 2024, Milano, Italy, 2024.
53
Conference paper
D2
Y. Yue, A. Das, F. Engelmann, S. Tang, and J. E. Lenssen
“Improving 2D Feature Representations by 3D-Aware Fine-Tuning,” in Computer Vision – ECCV 2024, Milano, Italy, 2024.
54
Conference paper
D2
S. Paul, C. Wewer, B. Schiele, and J. E. Lenssen
“Sp2360: Sparse-view 360° Scene Reconstruction using Cascaded 2D Diffusion Priors,” in ECCV 2024 Workshop on Wild 3D (ECCV 2024 Wild3D), Milan, Italy, 2024.
55
Conference paper
D2
U. A. Kaplan, Y. Li, M. Keuper, A. Khoreva, and D. Zhang
“Domain-Aware Fine-Tuning of Foundation Models,” in ICML 2024 Workshop on Foundation Models in the Wild (ICML 2024 FM-Wild Workshop), Vienna, Austria, 2024.
56
Conference paper
D2
N. Ahmed, A. Kukleva, and B. Schiele
“OrCo: Towards Better Generalization via Orthogonality and Contrast for Few-Shot Class-Incremental Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
57
Conference paper
D2
D. Das, C. Wewer, R. Yunus, E. Ilg, and J. E. Lenssen
“Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
58
Conference paper
D2
Y. He, G. Tiwari, T. Birdal, J. E. Lenssen, and G. Pons-Moll
“NRDF: Neural Riemannian Distance Fields for Learning Articulated Pose Priors,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
59
Conference paper
D2
X. Hu, L. Jiang, and B. Schiele
“Training Vision Transformers for Semi-Supervised Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
60
Conference paper
D2
L. Jiang, S. Shi, and B. Schiele
“Open-Vocabulary 3D Semantic Segmentation with Foundation Models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
61
Conference paper
D2
A. Kukleva, F. Sener, E. Remelli, B. Tekin, E. Sauser, B. Schiele, and S. Ma
“X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
62
Conference paper
D2
P. Schröppel, C. Wewer, J. E. Lenssen, E. Ilg, and T. Brox
“Neural Point Cloud Diffusion for Disentangled 3D Shape and Appearance Generation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
63
Conference paper
D2
X. Wu, L. Jiang, P.-S. Wang, Z. Liu, X. Liu, Y. Qiao, W. Ouyang, T. He, and H. Zhao
“Point Transformer V3: Simpler, Faster, Stronger,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
64
Conference paper
D2
X. Xie, B. L. Bhatnagar, J. E. Lenssen, and G. Pons-Moll
“Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
65
Conference paper
D2
K. Youwang, T.-H. Oh, and G. Pons-Moll
“Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
66
Conference paper
D2
K. Zhou, B. L. Bhatnagar, J. E. Lenssen, and G. Pons-Moll
“GEARS: Local Geometry-aware Hand-object Interaction Synthesis,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
67
Conference paper
D2
H. Sommerhoff, S. Agnihotri, M. Saleh, M. Moeller, M. Keuper, and B. Choubey
“Task Driven Sensor Layouts - Joint Optimization of Pixel Layout and Network Parameters,” in IEEE International Conference on Computational Photography (ICCP 2024), Lausanne, Switzerland, 2024.
68
Article
D2
Y. Chen, Y. Guo, D. Liao, F. Lv, H. Song, and J. T.-Y. Kwok
“Automated Dominative Subspace Mining for Efficient Neural Architecture Search,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 10, 2024.
69
Article
D2
H. Pan, Y. Guo, M. Yu, and J. Chen
“Enhanced Long-Tailed Recognition With Contrastive CutMix Augmentation,” IEEE Transactions on Image Processing, vol. 33, 2024.
70
Article
D2
M. Böhle, N. Singh, M. Fritz, and B. Schiele
“B-Cos Alignment for Inherently Interpretable CNNs and Vision Transformers,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 6, 2024.
71
Article
D2
Y. Chen, M. Mancini, X. Zhu, and Z. Akata
“Semi-Supervised and Unsupervised Deep Visual Learning: A Survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 3, 2024.
72
Article
D2
S. Rao, M. Boehle, and B. Schiele
“Better Understanding Differences in Attribution Methods via Systematic Evaluations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 6, 2024.
73
Article
D2
S. Shi, L. Jiang, D. Dai, and B. Schiele
“MTR++: Multi-Agent Motion Prediction With Symmetric Scene Modeling and Guided Intention Querying,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 5, 2024.
74
Article
D2
Y. Li, D. Zhang, M. Keuper, and A. Khoreva
“Intra- & Extra-Source Exemplar-Based Style Synthesis for Improved Domain Generalization,” International Journal of Computer Vision, vol. 132, 2024.
75
Article
D2
J. Lukasik, M. Moeller, and M. Keuper
“An Evaluation of Zero-Cost Proxies - From Neural Architecture Performance Prediction to Model Robustness,” International Journal of Computer Vision, 2024.
76
Conference paper
D2
P. Gavrikov, S. Agnihotri, M. Keuper, and J. Keuper
“How Do Training Methods Influence the Utilization of Vision Models?,” in Interpretable AI: Past, Present and Future (IAI Workshop @ NeurIPS 2024), Vancouver, Canada, 2024.
77
Conference paper
D2
Y. Fan, Y. Xian, X. Zhai, A. Kolesnikov, M. F. Naeem, B. Schiele, and F. Tombari
“Toward a Diffusion-Based Generalist for Dense Vision Tasks,” in MMFM2, The 2nd Workshop on What is Next in Multimodal Foundation Models?, Seattle, WA, USA, 2024.
78
Conference paper
D2
A. Abbas and P. Swoboda
“DOGE-Train: Discrete Optimization on GPU with End-to-End Training,” in Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, Canada, 2024.
79
Conference paper
D2
S. Agnihotri, S. Jung, and M. Keuper
“CosPGD: An Efficient White-Box Adversarial Attack for Pixel-Wise Prediction Tasks,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
80
Conference paper
D2
A. Anani, T. Lorenz, B. Schiele, and M. Fritz
“Adaptive Hierarchical Certification for Segmentation using Randomized Smoothing,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- PuRe
- BibTeX
81
Conference paper
D2
M. Fey, W. Hu, K. Huang, J. E. Lenssen, R. Ranjan, J. Robinson, R. Ying, J. You, and J. Leskovec
“Position: Relational Deep Learning - Graph Representation Learning on Relational Databases,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- PuRe
- BibTeX
82
Conference paper
D2
J. P. Schneider, M. Fatima, J. Lukasik, A. Kolb, M. Keuper, and M. Moeller
“Implicit Representations for Constrained Image Segmentation,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- PuRe
- BibTeX
83
Conference paper
D2
Y. Zhou, M. Fritz, and M. Keuper
“MultiMax: Sparse and Multi-Modal Attention Learning,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- PuRe
- BibTeX
84
Conference paper
D2
Y. Li, M. Keuper, D. Zhang, and A. Khoreva
“Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive,” in The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 2024.
- PuRe
- BibTeX
85
Conference paper
D2
M. Losch, M. Omran, D. Stutz, M. Fritz, and B. Schiele
“On Adversarial Training without Perturbing all Examples,” in The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 2024.
- PuRe
- BibTeX
86
Article
D2
K. Bäuerle, P. Müller, S. M. Kazim, I. Ihrke, and M. Keuper
“Learning the essential in less than 2k additional weights - a simple approach to improve image classification stability under corruptions,” Transactions on Machine Learning Research, vol. 2024, no. 6, 2024.
- PuRe
- BibTeX
87
Article
D2
J. Grabinski, J. Keuper, and M. Keuper
“As large as it gets - Studying Infinitely Large Convolutions via Neural Implicit Frequency Filters,” Transactions on Machine Learning Research, vol. 2024, 2024.
- PuRe
- BibTeX
88
Conference paper
D2
Y. Liu, Y. Li, B. Schiele, and Q. Sun
“Wakening Past Concepts without Past Data: Class-Incremental Learning from Online Placebos,” in WACV 2024, IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2024.
89
Conference paper
D2
N. Pham and M. Schott
“H-POPE: Hierarchical Polling-based Probing Evaluation of Hallucinations in Large Vision-Language Models,” in Workshop: Statistical Foundations of LLMs and Foundation Models (SFLLM 2024), Vancouver, Canada, 2024.
- PuRe
- BibTeX
90
Thesis
D2, IMPR-CS
A. Abbas
“Efficient and Differentiable Combinatorial Optimization for Visual Computing,” Universität des Saarlandes, Saarbrücken, 2024.
91
Paper
D2
S. Agnihotri, J. Grabinski, J. Keuper, and M. Keuper
“Beware of Aliases -- Signal Preservation is Crucial for Robust Image Restoration,” 2024. [Online]. Available: https://arxiv.org/abs/2406.07435.
Abstract
Image restoration networks are usually comprised of an encoder and a decoder,
responsible for aggregating image content from noisy, distorted data and
restoring clean, undistorted images, respectively. Data aggregation as well as
high-resolution image generation both usually come at the risk of involving
aliases, i.e., standard architectures put their ability to reconstruct the model
input in jeopardy to reach high PSNR values on validation data. The price to be
paid is low model robustness. In this work, we show that simply providing
alias-free paths in state-of-the-art reconstruction transformers supports
improved model robustness at low costs on the restoration performance. We do so
by proposing BOA-Restormer, a transformer-based image restoration model that
executes downsampling and upsampling operations partly in the frequency domain
to ensure alias-free paths along the entire model while potentially preserving
all relevant high-frequency information.
- PuRe
- BibTeX
92
Thesis
D2, IMPR-CS
M. Böhle
“Towards Designing Inherently Interpretable Deep Neural Networks for Image Classification,” Universität des Saarlandes, Saarbrücken, 2024.
93
Paper
D2
C. Braunstein, H. Petekkaya, J. E. Lenssen, M. Toneva, and E. Ilg
“SLayR: Scene Layout Generation with Rectified Flow,” 2024. [Online]. Available: https://arxiv.org/abs/2412.05003.
Abstract
We introduce SLayR, Scene Layout Generation with Rectified flow.
State-of-the-art text-to-image models achieve impressive results. However, they
generate images end-to-end, exposing no fine-grained control over the process.
SLayR presents a novel transformer-based rectified flow model for layout
generation over a token space that can be decoded into bounding boxes and
corresponding labels, which can then be transformed into images using existing
models. We show that established metrics for generated images are inconclusive
for evaluating their underlying scene layout, and introduce a new benchmark
suite, including a carefully designed repeatable human-evaluation procedure
that assesses the plausibility and variety of generated layouts. In contrast to
previous works, which perform well in either high variety or plausibility, we
show that our approach performs well on both of these axes at the same time. It
is also at least 5x smaller in the number of parameters and 37% faster
than the baselines. Our complete text-to-image pipeline demonstrates the added
benefits of an interpretable and editable intermediate representation.
- PuRe
- BibTeX
94
Paper
D2
J. Fischer and R. Ma
“Sailing in High-dimensional Spaces: Low-dimensional Embeddings through Angle Preservation,” 2024. [Online]. Available: https://arxiv.org/abs/2406.09876.
Abstract
Low-dimensional embeddings (LDEs) of high-dimensional data are ubiquitous in
science and engineering. They allow us to quickly understand the main
properties of the data, identify outliers and processing errors, and inform the
next steps of data analysis. As such, LDEs have to be faithful to the original
high-dimensional data, i.e., they should represent the relationships that are
encoded in the data, both at a local as well as global scale. The current
generation of LDE approaches focuses on reconstructing local distances between
any pair of samples correctly, often out-performing traditional approaches
aiming at all distances. For these approaches, global relationships are,
however, usually strongly distorted, often argued to be an inherent trade-off
between local and global structure learning for embeddings. We suggest a new
perspective on LDE learning, reconstructing angles between data points. We show
that this approach, Mercat, yields good reconstruction across a diverse set of
experiments and metrics, and preserves structures well across all scales.
Compared to existing work, our approach also has a simple formulation,
facilitating future theoretical analysis and algorithmic improvements.
- PuRe
- BibTeX
95
Paper
D2
P. Gavrikov, J. Lukasik, S. Jung, R. Geirhos, B. Lamm, M. J. Mirza, M. Keuper, and J. Keuper
“Are Vision Language Models Texture or Shape Biased and Can We Steer Them?,” 2024. [Online]. Available: https://arxiv.org/abs/2403.09193.
Abstract
Vision language models (VLMs) have drastically changed the computer vision
model landscape in only a few years, opening an exciting array of new
applications, from zero-shot image classification to image captioning and
visual question answering. Unlike pure vision models, they offer an intuitive
way to access visual content through language prompting. The wide applicability
of such models encourages us to ask whether they also align with human vision -
specifically, how far they adopt human-induced visual biases through multimodal
fusion, or whether they simply inherit biases from pure vision models. One
important visual bias is the texture vs. shape bias, or the dominance of local
over global information. In this paper, we study this bias in a wide range of
popular VLMs. Interestingly, we find that VLMs are often more shape-biased than
their vision encoders, indicating that visual biases are modulated to some
extent through text in multimodal models. If text does indeed influence visual
biases, this suggests that we may be able to steer visual biases not just
through visual input but also through language: a hypothesis that we confirm
through extensive experiments. For instance, we are able to steer shape bias
from as low as 49% to as high as 72% through prompting alone. For now, the
strong human bias towards shape (96%) remains out of reach for all tested VLMs.
- PuRe
- BibTeX
96
Paper
D2
V. Guzov, I. A. Petrov, and G. Pons-Moll
“blendify – Python rendering framework for Blender,” 2024. [Online]. Available: https://arxiv.org/abs/2410.17858.
Abstract
With the rapid growth of the volume of research fields like computer vision
and computer graphics, researchers require effective and user-friendly
rendering tools to visualize results. While advanced tools like Blender offer
powerful capabilities, they also require a significant effort to master. This
technical report introduces Blendify, a lightweight Python-based framework that
seamlessly integrates with Blender, providing a high-level API for scene
creation and rendering. Blendify reduces the complexity of working with
Blender's native API by automating object creation, handling the colors and
material linking, and implementing features such as shadow-catcher objects
while maintaining support for high-quality ray-tracing rendering output. With a
focus on usability, Blendify enables efficient and flexible rendering workflows
for common computer vision and computer graphics use cases. The
code is available at github.com/ptrvilya/blendify.
- PuRe
- BibTeX
97
Paper
D2
N. Kister, I. Sárándi, A. Khoreva, and G. Pons-Moll
“Are Pose Estimators Ready for the Open World? STAGE: Synthetic Data Generation Toolkit for Auditing 3D Human Pose Estimators,” 2024. [Online]. Available: https://arxiv.org/abs/2408.16536.
Abstract
The estimation of 3D human poses from images has progressed tremendously over
the last few years as measured on standard benchmarks. However, performance in
the open world remains underexplored, as current benchmarks cannot capture its
full extent. Especially in safety-critical systems, it is crucial that 3D pose
estimators are audited before deployment, and their sensitivity towards single
factors or attributes occurring in the operational domain is thoroughly
examined. Nevertheless, we currently lack a benchmark that would enable such
fine-grained analysis. We thus present STAGE, a GenAI data toolkit for auditing
3D human pose estimators. We enable a text-to-image model to control the 3D
human body pose in the generated image. This allows us to create customized
annotated data covering a wide range of open-world attributes. We leverage
STAGE and generate a series of benchmarks to audit the sensitivity of popular
pose estimators towards attributes such as gender, ethnicity, age, clothing,
location, and weather. Our results show that the presence of such naturally
occurring attributes can cause severe degradation in the performance of pose
estimators and leads us to question if they are ready for open-world
deployment.
- PuRe
- BibTeX
98
Thesis
D2, IMPR-CS
A. Kukleva
“Advancing Image and Video Recognition with Less Supervision,” Universität des Saarlandes, Saarbrücken, 2024.
Abstract
Deep learning is increasingly relevant in our daily lives, as it simplifies tedious tasks and enhances quality of life across various domains such as entertainment, learning, automatic assistance, and autonomous driving. However, the demand for more data to train models for emerging tasks is increasing dramatically. Deep learning models heavily depend on the quality and quantity of data, necessitating high-quality labeled datasets. Yet, each task requires different types of annotations for training and evaluation, posing challenges in obtaining comprehensive supervision. The acquisition of annotations is not only resource-intensive in terms of time and cost but also introduces biases, such as granularity in classification, where distinctions like specific breeds versus generic categories may arise. Furthermore, the dynamic nature of the world causes the challenge that previously annotated data becomes potentially irrelevant, and new categories and rare occurrences continually emerge, making it impossible to label every aspect of the world.
Therefore, this thesis aims to explore various supervision scenarios to mitigate the need for full supervision and reduce data acquisition costs. Specifically, we investigate learning without labels, referred to as self-supervised and unsupervised methods, to better understand video and image representations. To learn from data without labels, we leverage injected priors such as motion speed, direction, action order in videos, or semantic information granularity to obtain powerful data representations. Further, we study scenarios involving reduced supervision levels. To reduce annotation costs, first, we propose to omit precise annotations for one modality in multimodal learning, namely in text-video and image-video settings, and transfer available knowledge to large corpora of video data. Second, we study semi-supervised learning scenarios, where only a subset of annotated data alongside unlabeled data is available, and propose to revisit regularization constraints and improve generalization to unlabeled data. Additionally, we address scenarios where parts of the available data are inherently limited due to privacy and security reasons or naturally rare events, which not only restrict annotations but also limit the overall data volume. For these scenarios, we propose methods that carefully balance between previously obtained knowledge and incoming limited data by introducing a calibration method or combining a space reservation technique with orthogonality constraints. Finally, we explore multimodal and unimodal open-world scenarios where the model is asked to generalize beyond the given set of object or action classes. Specifically, we propose a new challenging setting on multimodal egocentric videos and propose an adaptation method for vision-language models to generalize to the egocentric domain. Moreover, we study unimodal image recognition in an open-set setting and propose to disentangle the open-set detection and image classification tasks, which effectively improves generalization in different settings.
In summary, this thesis investigates challenges arising when full supervision for training models is not available. We develop methods to understand learning dynamics and the role of biases in data, while also proposing novel setups to advance training with less supervision.
99
Paper
D2
H. Li, A. Deng, Q. Ke, J. Liu, H. Rahmani, Y. Guo, B. Schiele, and C. Chen
“Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports,” 2024. [Online]. Available: https://arxiv.org/abs/2401.01505.
Abstract
Reasoning over sports videos for question answering is an important task with
numerous applications, such as player training and information retrieval.
However, this task has not been explored due to the lack of relevant datasets
and the challenging nature it presents. Most datasets for video question
answering (VideoQA) focus mainly on general and coarse-grained understanding of
daily-life videos, which is not applicable to sports scenarios requiring
professional action understanding and fine-grained motion analysis. In this
paper, we introduce the first dataset, named Sports-QA, specifically designed
for the sports VideoQA task. The Sports-QA dataset includes various types of
questions, such as descriptions, chronologies, causalities, and counterfactual
conditions, covering multiple sports. Furthermore, to address the
characteristics of the sports VideoQA task, we propose a new Auto-Focus
Transformer (AFT) capable of automatically focusing on particular scales of
temporal information for question answering. We conduct extensive experiments
on Sports-QA, including baseline studies and the evaluation of different
methods. The results demonstrate that our AFT achieves state-of-the-art
performance.
- PuRe
- BibTeX
100
Paper
D2
Y. Li, W. Beluch, M. Keuper, D. Zhang, and A. Khoreva
“VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis,” 2024. [Online]. Available: https://arxiv.org/abs/2403.13501.
Abstract
Despite tremendous progress in the field of text-to-video (T2V) synthesis,
open-sourced T2V diffusion models struggle to generate longer videos with
dynamically varying and evolving content. They tend to synthesize quasi-static
videos, ignoring the necessary visual change-over-time implied in the text
prompt. At the same time, scaling these models to enable longer, more dynamic
video synthesis often remains computationally intractable. To address this
challenge, we introduce the concept of Generative Temporal Nursing (GTN), where
we aim to alter the generative process on the fly during inference to improve
control over the temporal dynamics and enable generation of longer videos. We
propose a method for GTN, dubbed VSTAR, which consists of two key ingredients:
1) Video Synopsis Prompting (VSP) - automatic generation of a video synopsis
based on the original single prompt leveraging LLMs, which gives accurate
textual guidance to different visual states of longer videos, and 2) Temporal
Attention Regularization (TAR) - a regularization technique to refine the
temporal attention units of the pre-trained T2V diffusion models, which enables
control over the video dynamics. We experimentally showcase the superiority of
the proposed approach in generating longer, visually appealing videos over
existing open-sourced T2V models. We additionally analyze the temporal
attention maps realized with and without VSTAR, demonstrating the importance of
applying our method to mitigate neglect of the desired visual change over time.
- PuRe
- BibTeX
101
Paper
D2
T. Medi, A. Rampini, P. Reddy, P. K. Jayaraman, and M. Keuper
“3D-WAG: Hierarchical Wavelet-Guided Autoregressive Generation for High-Fidelity 3D Shapes,” 2024. [Online]. Available: https://arxiv.org/abs/2411.19037.
Abstract
Autoregressive (AR) models have achieved remarkable success in natural
language and image generation, but their application to 3D shape modeling
remains largely unexplored. Unlike diffusion models, AR models enable more
efficient and controllable generation with faster inference times, making them
especially suitable for data-intensive domains. Traditional 3D generative
models using AR approaches often rely on "next-token" predictions at the voxel
or point level. While effective for certain applications, these methods can be
restrictive and computationally expensive when dealing with large-scale 3D
data. To tackle these challenges, we introduce 3D-WAG, an AR model for 3D
implicit distance fields that can perform unconditional shape generation,
class-conditioned and also text-conditioned shape generation. Our key idea is
to encode shapes as multi-scale wavelet token maps and use a Transformer to
predict the "next higher-resolution token map" in an autoregressive manner. By
redefining the 3D AR generation task as "next-scale" prediction, we reduce the
computational cost of generation compared to traditional "next-token"
prediction models, while preserving essential geometric details of 3D shapes in
a more structured and hierarchical manner. We evaluate 3D-WAG to showcase its
benefit by quantitative and qualitative comparisons with state-of-the-art
methods on widely used benchmarks. Our results show 3D-WAG achieves superior
performance in key metrics like Coverage and MMD, generating high-fidelity 3D
shapes that closely match the real data distribution.
- PuRe
- BibTeX
102
Paper
D2
T. Medi, J. Grabinski, and M. Keuper
“Towards Class-wise Robustness Analysis,” 2024. [Online]. Available: https://arxiv.org/abs/2411.19853.
Abstract
While being very successful in solving many downstream tasks, the application
of deep neural networks is limited in real-life scenarios because of their
susceptibility to domain shifts such as common corruptions, and adversarial
attacks. The existence of adversarial examples and data corruption
significantly reduces the performance of deep classification models.
Researchers have made strides in developing robust neural architectures to
bolster decisions of deep classifiers. However, most of these works rely on
effective adversarial training methods, and predominantly focus on overall
model robustness, disregarding class-wise differences in robustness, which are
critical. Exploiting weakly robust classes is a potential avenue for attackers
to fool the image recognition models. Therefore, this study investigates
class-to-class biases across adversarially trained robust classification models
to understand their latent space structures and analyze their strong and weak
class-wise properties. We further assess the robustness of classes against
common corruptions and adversarial attacks, recognizing that class
vulnerability extends beyond the number of correct classifications for a
specific class. We find that the number of false positives of classes as
specific target classes significantly impacts their vulnerability to attacks.
Through our analysis of the Class False Positive Score, we provide a fair
evaluation of how susceptible each class is to misclassification.
- PuRe
- BibTeX
103
Paper
D2
T. Medi, S. Jung, and M. Keuper
“FAIR-TAT: Improving Model Fairness Using Targeted Adversarial Training,” 2024. [Online]. Available: https://arxiv.org/abs/2410.23142.
Abstract
Deep neural networks are susceptible to adversarial attacks and common
corruptions, which undermine their robustness. In order to enhance model
resilience against such challenges, Adversarial Training (AT) has emerged as a
prominent solution. Nevertheless, adversarial robustness is often attained at
the expense of model fairness during AT, i.e., disparity in class-wise
robustness of the model. While distinctive classes become more robust towards
such adversaries, hard-to-detect classes suffer. Recently, research has focused
on improving model fairness specifically for perturbed images, overlooking the
accuracy of the most likely non-perturbed data. Additionally, despite their
robustness against the adversaries encountered during model training,
state-of-the-art adversarially trained models have difficulty maintaining
robustness and fairness when confronted with diverse adversarial threats or
common corruptions. In this work, we address the above concerns by introducing
a novel approach called Fair Targeted Adversarial Training (FAIR-TAT). We show
that using targeted adversarial attacks for adversarial training (instead of
untargeted attacks) can allow for more favorable trade-offs with respect to
adversarial fairness. Empirical results validate the efficacy of our approach.
- PuRe
- BibTeX
104
Paper
D2
M. Segu, L. Piccinelli, S. Li, Y.-H. Yang, B. Schiele, and L. Van Gool
“Samba: Synchronized Set-of-Sequences Modeling for Multiple Object Tracking,” 2024. [Online]. Available: https://arxiv.org/abs/2410.01806.
Abstract
Multiple object tracking in complex scenarios - such as coordinated dance
performances, team sports, or dynamic animal groups - presents unique
challenges. In these settings, objects frequently move in coordinated patterns,
occlude each other, and exhibit long-term dependencies in their trajectories.
However, it remains a key open research question on how to model long-range
dependencies within tracklets, interdependencies among tracklets, and the
associated temporal occlusions. To this end, we introduce Samba, a novel
linear-time set-of-sequences model designed to jointly process multiple
tracklets by synchronizing the multiple selective state-spaces used to model
each tracklet. Samba autoregressively predicts the future track query for each
sequence while maintaining synchronized long-term memory representations across
tracklets. By integrating Samba into a tracking-by-propagation framework, we
propose SambaMOTR, the first tracker effectively addressing the aforementioned
issues, including long-range dependencies, tracklet interdependencies, and
temporal occlusions. Additionally, we introduce an effective technique for
dealing with uncertain observations (MaskObs) and an efficient training recipe
to scale SambaMOTR to longer sequences. By modeling long-range dependencies and
interactions among tracked objects, SambaMOTR implicitly learns to track
objects accurately through occlusions without any hand-crafted heuristics. Our
approach significantly surpasses prior state-of-the-art on the DanceTrack, BFT,
and SportsMOT datasets.
- PuRe
- BibTeX
105
Paper
D2, D6
H. Wang, M. Mendiratta, C. Theobalt, and A. Kortylewski
“FaceGPT: Self-supervised Learning to Chat about 3D Human Faces,” 2024. [Online]. Available: https://arxiv.org/abs/2406.07163.
Abstract
We introduce FaceGPT, a self-supervised learning framework for Large
Vision-Language Models (VLMs) to reason about 3D human faces from images and
text. Typical 3D face reconstruction methods are specialized algorithms that
lack semantic reasoning capabilities. FaceGPT overcomes this limitation by
embedding the parameters of a 3D morphable face model (3DMM) into the token
space of a VLM, enabling the generation of 3D faces from both textual and
visual inputs. FaceGPT is trained in a self-supervised manner as a model-based
autoencoder from in-the-wild images. In particular, the hidden state of the LLM is
projected into 3DMM parameters and subsequently rendered as a 2D face image to
guide the self-supervised learning process via image-based reconstruction.
Without relying on expensive 3D annotations of human faces, FaceGPT obtains a
detailed understanding about 3D human faces, while preserving the capacity to
understand general user instructions. Our experiments demonstrate that FaceGPT
not only achieves high-quality 3D face reconstructions but also retains the
ability for general-purpose visual instruction following. Furthermore, FaceGPT
learns fully self-supervised to generate 3D faces based on complex textual
inputs, which opens a new direction in human face analysis.
- PuRe
- BibTeX
106
Paper
D2
Y. Wu, X. Hu, Y. Sun, Y. Zhou, W. Zhu, F. Rao, B. Schiele, and X. Yang
“Number it: Temporal Grounding Videos like Flipping Manga,” 2024. [Online]. Available: https://arxiv.org/abs/2411.10332.
Abstract
Video Large Language Models (Vid-LLMs) have made remarkable advancements in
comprehending video content for QA dialogue. However, they struggle to extend
this visual understanding to tasks requiring precise temporal localization,
known as Video Temporal Grounding (VTG). To address this gap, we introduce
Number-Prompt (NumPro), a novel method that empowers Vid-LLMs to bridge visual
comprehension with temporal grounding by adding unique numerical identifiers to
each video frame. Treating a video as a sequence of numbered frame images,
NumPro transforms VTG into an intuitive process: flipping through manga panels
in sequence. This allows Vid-LLMs to "read" event timelines, accurately linking
visual content with corresponding temporal information. Our experiments
demonstrate that NumPro significantly boosts VTG performance of top-tier
Vid-LLMs without additional computational cost. Furthermore, fine-tuning on a
NumPro-enhanced dataset defines a new state-of-the-art for VTG, surpassing
previous top-performing methods by up to 6.9% in mIoU for moment retrieval and
8.5% in mAP for highlight detection. The code will be available at
github.com/yongliang-wu/NumPro.
- PuRe
- BibTeX
107
Paper
D2
X. Zhang, S. Starke, V. Guzov, Z. Zhang, E. P. Pellitero, and G. Pons-Moll
“SCENIC: Scene-aware Semantic Navigation with Instruction-guided Control,” 2024. [Online]. Available: https://arxiv.org/abs/2412.15664.
Abstract
Synthesizing natural human motion that adapts to complex environments while
allowing creative control remains a fundamental challenge in motion synthesis.
Existing models often fall short, either by assuming flat terrain or lacking
the ability to control motion semantics through text. To address these
limitations, we introduce SCENIC, a diffusion model designed to generate human
motion that adapts to dynamic terrains within virtual scenes while enabling
semantic control through natural language. The key technical challenge lies in
simultaneously reasoning about complex scene geometry while maintaining text
control. This requires understanding both high-level navigation goals and
fine-grained environmental constraints. The model must ensure physical
plausibility and precise navigation across varied terrain, while also
preserving user-specified text control, such as "carefully stepping over
obstacles" or "walking upstairs like a zombie." Our solution introduces a
hierarchical scene reasoning approach. At its core is a novel scene-dependent,
goal-centric canonicalization that handles high-level goal constraints and is
complemented by an ego-centric distance field that captures local geometric
details. This dual representation enables our model to generate physically
plausible motion across diverse 3D scenes. By implementing frame-wise text
alignment, our system achieves seamless transitions between different motion
styles while maintaining scene constraints. Experiments demonstrate our novel
diffusion model generates arbitrarily long human motions that both adapt to
complex scenes with varying terrain surfaces and respond to textual prompts.
Additionally, we show SCENIC can generalize to four real-scene datasets. Our
code, dataset, and models will be released at
https://virtualhumans.mpi-inf.mpg.de/scenic/.
- PuRe
- BibTeX
108
Paper
D2
Y. Zhou, M. Keuper, and M. Fritz
“Balancing Diversity and Risk in LLM Sampling: How to Select Your Method and Parameter for Open-Ended Text Generation,” 2024. [Online]. Available: https://arxiv.org/abs/2408.13586.
Abstract
Sampling-based decoding strategies have been widely adopted for Large
Language Models (LLMs) in numerous applications, which target a balance between
diversity and quality via temperature tuning and tail truncation (e.g., top-k
and top-p sampling). Considering the high dynamic range of the candidate
next-token given different prefixes, recent studies propose to adaptively
truncate the tail of the LLM's predicted distribution. Although improved results
have been reported with these methods on open-ended text generation tasks, the
results are highly dependent on the curated truncation parameters and exemplar
text. In this paper, we propose a systematic way to estimate the intrinsic
capacity of a truncation sampling method by considering the trade-off between
diversity and risk at each decoding step, based on our collected prefix tree
which preserves the context of a full sentence. Our work provides a
comprehensive comparison between existing truncation sampling methods, as well
as their recommended parameters as a guideline for users.
- PuRe
- BibTeX

2023

109
Conference paper
D2
Y. Li, M. Keuper, D. Zhang, and A. Khoreva
“Divide & Bind Your Attention for Improved Generative Semantic Nursing,” in 34th British Machine Vision Conference (BMVC 2023), Aberdeen, UK, 2023.
- PuRe
- BibTeX
110
Conference paper
D2
Z. Luo, Y. Liu, B. Schiele, and Q. Sun
“Class-Incremental Exemplar Compression for Class-Incremental Learning,” in 36th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
111
Conference paper
D2
A. Chaudhuri, M. Mancini, Z. Akata, and A. Dutta
“Transitivity Recovering Decompositions: Interpretable and Robust Fine-Grained Relationships,” in Advances in Neural Information Processing Systems 36 (NeurIPS 2023), New Orleans, LA, USA, 2023.
- PuRe
- BibTeX
112
Conference paper
D2
D. M. H. Nguyen, H. Nguyen, N. Diep, T. N. Pham, T. Cao, B. Nguyen, P. Swoboda, N. Ho, S. Albarqouni, P. Xie, D. Sonntag, and M. Niepert
“LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching,” in Advances in Neural Information Processing Systems 36 (NeurIPS 2023), New Orleans, LA, USA, 2023.
- PuRe
- BibTeX
113
Conference paper
D2
J. Lukasik, J. Geiping, M. Moeller, and M. Keuper
“Differentiable Architecture Search: a One-Shot Method?,” in AutoML Conference 2023, Potsdam/Berlin, Germany, 2023.
- PuRe
- BibTeX
114
Article
D2
B. Andres, S. Di Gregorio, J. Irmai, and J.-H. Lange
“A Polyhedral Study of Lifted Multicuts,” Discrete Optimization, vol. 47, 2023.
115
Conference paper
D2
H. Chen, R. Tao, Y. Fan, Y. Wang, J. Wang, B. Schiele, X. Xie, B. Raj, and M. Savvides
“SoftMatch: Addressing the Quantity-Quality Tradeoff in Semi-supervised Learning,” in Eleventh International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 2023.
- PuRe
- BibTeX
116
Conference paper
D2
Q. Fan, M. Segu, Y.-W. Tai, F. Yu, C.-K. Tang, B. Schiele, and D. Dai
“Towards Robust Object Detection Invariant to Real-World Domain Shifts,” in Eleventh International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 2023.
- PuRe
- BibTeX
117
Conference paper
D2
S. Jung, J. Lukasik, and M. Keuper
“Neural Architecture Design and Robustness: A Dataset,” in Eleventh International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 2023.
Abstract
Deep learning models have proven to be successful in a wide
range of machine learning tasks. Yet, they are often highly sensitive to
perturbations on the input data which can lead to incorrect decisions
with high confidence, hampering their deployment for practical
use-cases. Thus, finding architectures that are (more) robust against
perturbations has received much attention in recent years. Just like the
search for well-performing architectures in terms of clean accuracy,
this usually involves a tedious trial-and-error process with one
additional challenge: the evaluation of a network's robustness is
significantly more expensive than its evaluation for clean accuracy.
Thus, the aim of this paper is to facilitate better streamlined research
on architectural design choices with respect to their impact on
robustness as well as, for example, the evaluation of surrogate measures
for robustness. We therefore borrow one of the most commonly considered
search spaces for neural architecture search for image classification,
NAS-Bench-201, which contains a manageable size of 6466 non-isomorphic
network designs. We evaluate all these networks on a range of common
adversarial attacks and corruption types and introduce a database on
neural architecture design and robustness evaluations. We further
present three exemplary use cases of this dataset, in which we (i)
benchmark robustness measurements based on Jacobian and Hessian matrices
for their robustness predictability, (ii) perform neural architecture
search on robust accuracies, and (iii) provide an initial analysis of
how architectural design choices affect robustness. We find that
carefully crafting the topology of a network can have substantial impact
on its robustness, where networks with the same parameter count range in
mean adversarial robust accuracy from 20% to 41%.
- PuRe
- BibTeX
118
Conference paper
D2
A. Kukleva, M. Böhle, B. Schiele, H. Kuehne, and C. Rupprecht
“Temperature Schedules for Self-Supervised Contrastive Methods on Long-Tail Data,” in Eleventh International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 2023.
- PuRe
- BibTeX
119
Conference paper
D2
Y. Wang, H. Chen, Q. Heng, W. Hou, Y. Fan, Z. Wu, J. Wang, M. Savvides, T. Shinozaki, B. Raj, B. Schiele, and X. Xie
“FreeMatch: Self-adaptive Thresholding for Semi-supervised Learning,” in Eleventh International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 2023.
- PuRe
- BibTeX
120
Conference paper
D2
X. Hong, V. Demberg, A. Sayeed, Q. Zheng, and B. Schiele
“Visual Coherence Loss for Coherent and Visually Grounded Story Generation,” in Findings of the Association for Computational Linguistics (ACL 2023), Toronto, Canada, 2023.
Abstract
Local coherence is essential for long-form text generation models. We identify two important aspects of local coherence within the visual storytelling task: (1) the model needs to represent re-occurrences of characters within the image sequence in order to mention them correctly in the story; (2) character representations should enable us to find instances of the same characters and distinguish different characters. In this paper, we propose a loss function inspired by a linguistic theory of coherence for self-supervised learning for image sequence representations. We further propose combining features from an object and a face detector to construct stronger character features. To evaluate input-output relevance that current reference-based metrics don't measure, we propose a character matching metric to check whether the models generate referring expressions correctly for characters in input image sequences. Experiments on a visual story generation dataset show that our proposed features and loss function are effective for generating more coherent and visually grounded stories.
121
Conference paper
D2
A. Das, Y. Xian, D. Dai, and B. Schiele
“Weakly-Supervised Domain Adaptive Semantic Segmentation With Prototypical Contrastive Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
122
Conference paper
D2
J. Ding, N. Xue, G.-S. Xia, B. Schiele, and D. Dai
“HGFormer: Hierarchical Grouping Transformer for Domain Generalized Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
123
Conference paper
D2
J. Dong, D. Zhang, Y. Cong, W. Cong, H. Ding, and D. Dai
“Federated Incremental Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
124
Conference paper
D2
R. Gong, Q. Wang, M. Danelljan, D. Dai, and L. Van Gool
“Continuous Pseudo-Label Rectified Domain Adaptive Semantic Segmentation With Implicit Neural Representations,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
125
Conference paper
D2
Y. Guo, D. Stutz, and B. Schiele
“Improving Robustness of Vision Transformers by Reducing Sensitivity To Patch Corruptions,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
126
Conference paper
D2
L. Hoyer, D. Dai, H. Wang, and L. Van Gool
“MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
127
Conference paper
D2
A. Jain, G. Swaminathan, P. Favaro, H. Yang, A. Ravichandran, H. Harutyunyan, A. Achille, O. Dabeer, B. Schiele, A. Swaminathan, and S. Soatto
“A Meta-Learning Approach to Predicting Performance and Data Requirements,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
128
Conference paper
D2D6
L. Jiang, Z. Yang, S. Shi, V. Golyanik, D. Dai, and B. Schiele
“Self-Supervised Pre-Training With Masked Shape Prediction for 3D Scene Understanding,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
129
Conference paper
D2
Y. Liu, B. Schiele, A. Vedaldi, and C. Rupprecht
“Continual Detection Transformer for Incremental Object Detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
130
Conference paper
D2
I. A. Petrov, R. Marin, J. Chibane, and G. Pons-Moll
“Object Pop-Up: Can We Infer 3D Objects and their Poses from Human Interactions Alone?,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
131
Conference paper
D2
H. Wang, C. Shi, S. Shi, M. Lei, S. Wang, D. He, B. Schiele, and L. Wang
“DSVT: Dynamic Sparse Voxel Transformer With Rotated Sets,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
132
Conference paper
D2
H. Wu, C. Wen, S. Shi, X. Li, and C. Wang
“Virtual Sparse Convolution for Multimodal 3D Object Detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
133
Conference paper
D2
X. Xie, B. L. Bhatnagar, and G. Pons-Moll
“Visibility Aware Human-Object Interaction Tracking from Single RGB Camera,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
134
Conference paper
D2
B. Zhu, Z. Wang, S. Shi, H. Xu, L. Hong, and H. Li
“ConQueR: Query Contrast Voxel-DETR for 3D Object Detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
135
Conference paper
D2
X. Chen, S. Shi, C. Zhang, B. Zhu, Q. Wang, K. C. Cheung, S. See, and H. Li
“TrajectoryFormer: 3D Object Tracking Transformer with Predictive Trajectory Hypotheses,” in IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 2023.
136
Conference paper
D2
Y. Fan, A. Kukleva, D. Dai, and B. Schiele
“SSB: Simple but Strong Baseline for Boosting Performance of Open-Set Semi-Supervised Learning,” in IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 2023.
137
Conference paper
D2
Y. Guo, D. Stutz, and B. Schiele
“Robustifying Token Attention for Vision Transformers,” in IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 2023.
138
Conference paper
D2
S. Rao, M. Böhle, A. Parchami-Araghi, and B. Schiele
“Studying How to Efficiently and Effectively Guide Models with Explanations,” in IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 2023.
139
Conference paper
D2
M. Segu, B. Schiele, and F. Yu
“DARTH: Holistic Test-time Adaptation for Multiple Object Tracking,” in IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 2023.
140
Conference paper
D2
N. Shvetsova, A. Kukleva, B. Schiele, and H. Kuehne
“In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval,” in IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 2023.
141
Conference paper
D2
N. Shvetsova, F. Petersen, A. Kukleva, B. Schiele, and H. Kuehne
“Learning by Sorting: Self-supervised Learning with Group Ordering Constraints,” in IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 2023.
142
Conference paper
D2
H. Wang, H. Tang, S. Shi, A. Li, Z. Li, B. Schiele, and L. Wang
“UniTR: A Unified and Efficient Multi-Modal Transformer for Bird’s-Eye-View Representation,” in IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 2023.
143
Conference paper
D2
C. Wewer, E. Ilg, B. Schiele, and J. E. Lenssen
“SimNP: Learning Self-Similarity Priors Between Neural Points,” in IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 2023.
144
Conference paper
D2
Y. Xue, B. L. Bhatnagar, R. Marin, N. Sarafianos, Y. Xu, G. Pons-Moll, and T. Tung
“NSF: Neural Surface Fields for Human Modeling from Monocular Depth,” in IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 2023.
145
Conference paper
D2
S. Agnihotri, K. V. Gandikota, J. Grabinski, P. Chandramouli, and M. Keuper
“On the Unreasonable Vulnerability of Transformers for Image Restoration – and an Easy Fix,” in IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2023), Paris, France, 2023.
146
Conference paper
D2
P. Müller, A. Braun, and M. Keuper
“Classification Robustness to Common Optical Aberrations,” in IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2023), Paris, France, 2023.
147
Conference paper
D2
T. Broedermann, C. Sakaridis, D. Dai, and L. Van Gool
“HRFuser: A Multi-resolution Sensor Fusion Architecture for 2D Object Detection,” in IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), Bilbao, Spain, 2023.
148
Conference paper
D2
Z. Li, S. Shi, B. Schiele, and D. Dai
“Test-time Domain Adaptation for Monocular Depth Estimation,” in IEEE International Conference on Robotics and Automation (ICRA 2023), London, UK, 2023.
149
Conference paper
D2
Z. Zhang, A. Liniger, D. Dai, F. Yu, and L. Van Gool
“TrafficBots: Towards World Models for Autonomous Driving Simulation and Motion Prediction,” in IEEE International Conference on Robotics and Automation (ICRA 2023), London, UK, 2023.
150
Article
D2
M. Böhle, M. Fritz, and B. Schiele
“Optimising for Interpretability: Convolutional Dynamic Alignment Networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 6, 2023.
151
Article
D2
E. Corona, G. Alenyà, G. Pons-Moll, and F. Moreno-Noguer
“LayerNet: High-Resolution Semantic 3D Reconstruction of Clothed People,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 2, 2023.
152
Article
D2
D. Dai, A. B. Vasudevan, J. Matas, and L. Van Gool
“Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, 2023.
153
Article
D6D4D2
M. Habermann, W. Xu, M. Zollhöfer, G. Pons-Moll, and C. Theobalt
“A Deeper Look into DeepCap,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 4, 2023.
Abstract
Human performance capture is a highly important computer vision problem with
many applications in movie production and virtual/augmented reality. Many
previous performance capture approaches either required expensive multi-view
setups or did not recover dense space-time coherent geometry with
frame-to-frame correspondences. We propose a novel deep learning approach for
monocular dense human performance capture. Our method is trained in a weakly
supervised manner based on multi-view supervision completely removing the need
for training data with 3D ground truth annotations. The network architecture is
based on two separate networks that disentangle the task into a pose estimation
and a non-rigid surface deformation step. Extensive qualitative and
quantitative evaluations show that our approach outperforms the state of the
art in terms of quality and robustness. This work is an extended version of
DeepCap where we provide more detailed explanations, comparisons and results as
well as applications.
154
Article
D2
E. Levinkov, A. Kardoost, B. Andres, and M. Keuper
“Higher-Order Multicuts for Geometric Model Fitting and Motion Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, 2023.
Abstract
The minimum cost lifted multicut problem is a generalization of the multicut problem and a means of optimizing a decomposition of a graph w.r.t. both positive and negative edge costs. Its main advantage is that multicut-based formulations do not require the number of components to be given a priori; instead, it is deduced from the solution. However, the standard multicut cost function is limited to pairwise relationships between nodes, while several important applications either require or can benefit from a higher-order cost function, i.e. hyper-edges. In this paper, we propose a pseudo-boolean formulation for a multiple model fitting problem. It is based on a formulation of any-order minimum cost lifted multicuts, which allows partitioning an undirected graph with pairwise connectivity so as to minimize costs defined over any set of hyper-edges. As the proposed formulation is NP-hard and the branch-and-bound algorithm is too slow in practice, we propose an efficient local search algorithm for inference in the resulting problems. We demonstrate the versatility and effectiveness of our approach in several applications: geometric multiple model fitting, homography and motion estimation, and motion segmentation.
155
Article
D2
D. Stutz, N. Chandramoorthy, M. Hein, and B. Schiele
“Random and Adversarial Bit Error Robustness: Energy-Efficient and Secure DNN Accelerators,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, 2023.
156
Article
D2
D. Tome, T. Alldieck, P. Peluse, G. Pons-Moll, L. Agapito, H. Badino, and F. de la Torre
“SelfPose: 3D Egocentric Pose Estimation from a Headset Mounted Camera,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 6, 2023.
157
Conference paper
D2
A. Das, Y. Xian, Y. He, Z. Akata, and B. Schiele
“Urban Scene Semantic Segmentation With Low-Cost Coarse Annotation,” in 2023 IEEE Winter Conference on Applications of Computer Vision (WACV 2023), Waikoloa, HI, USA, 2023.
158
Conference paper
D2
V. Lazova, V. Guzov, K. Olszewski, S. Tulyakov, and G. Pons-Moll
“Control-NeRF: Editable Feature Volumes for Scene Rendering and Manipulation,” in 2023 IEEE Winter Conference on Applications of Computer Vision (WACV 2023), Waikoloa, HI, USA, 2023.
159
Conference paper
D2
K. Li, D. Dai, and L. Van Gool
“Jointly Learning Band Selection and Filter Array Design for Hyperspectral Imaging,” in 2023 IEEE Winter Conference on Applications of Computer Vision (WACV 2023), Waikoloa, HI, USA, 2023.
160
Conference paper
D2
Y. Li, D. Zhang, M. Keuper, and A. Khoreva
“Intra-Source Style Augmentation for Improved Domain Generalization,” in 2023 IEEE Winter Conference on Applications of Computer Vision (WACV 2023), Waikoloa, HI, USA, 2023.
161
Article
D2
Y. Fan, A. Kukleva, D. Dai, and B. Schiele
“Revisiting Consistency Regularization for Semi-supervised Learning,” International Journal of Computer Vision, vol. 131, 2023.
162
Article
D2
L. Hoyer, D. Dai, Q. Wang, Y. Chen, and L. Van Gool
“Improving Semi-Supervised and Domain-Adaptive Semantic Segmentation with Self-Supervised Depth Estimation,” International Journal of Computer Vision, vol. 131, 2023.
163
Article
D2
J. Mao, S. Shi, X. Wang, and H. Li
“3D Object Detection for Autonomous Driving: A Comprehensive Survey,” International Journal of Computer Vision, vol. 131, 2023.
164
Article
D2
V. Kostyukhin, M. Keuper, I. Ibragimov, N. Owtscharenko, and M. Cristinziani
“Improving Primary-Vertex Reconstruction with a Minimum-Cost Lifted Multicut Graph Partitioning Algorithm,” Journal of Instrumentation, vol. 18, 2023.
165
Conference paper
D2
K. Prasse, S. Jung, I. B. Bravo, S. Walter, and M. Keuper
“Towards Understanding Climate Change Perceptions: A Social Media Dataset,” in NeurIPS 2023 Workshop on Tackling Climate Change with Machine Learning, New Orleans, LA, USA, 2023.
- PuRe
- BibTeX
166
Article
D2
J. Xi, J. Huang, S. Zheng, Q. Zhou, B. Schiele, X.-S. Hua, and Q. Sun
“Learning Comprehensive Global Features in Person Re-identification: Ensuring Discriminativeness of more Local Regions,” Pattern Recognition, vol. 134, 2023.
167
Conference paper
D2
M. Losch, D. Stutz, B. Schiele, and M. Fritz
“Certified Robust Models with Slack Control and Large Lipschitz Constants,” in Pattern Recognition (DAGM GCPR 2023), Heidelberg, Germany, 2023.
168
Conference paper
D2
J. Lukasik, M. Moeller, and M. Keuper
“An Evaluation of Zero-Cost Proxies - From Neural Architecture Performance Prediction to Model Robustness,” in Pattern Recognition (DAGM GCPR 2023), Heidelberg, Germany, 2023.
169
Conference paper
D2
T. Medi, J. Tayyub, M. Sarmad, F. Lindseth, and M. Keuper
“FullFormer: Generating Shapes Inside Shapes,” in Pattern Recognition (DAGM GCPR 2023), Heidelberg, Germany, 2023.
170
Conference paper
D2
P. Lorenz, M. Keuper, and J. Keuper
“Unfolding Local Growth Rate Estimates for (Almost) Perfect Adversarial Detection,” in Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications. - Vol. 5, VISAPP (VISIGRAPP 2023), Lisbon, Portugal, 2023.
171
Conference paper
D2
Y. Liu, Y. Li, B. Schiele, and Q. Sun
“Online Hyperparameter Optimization for Class-Incremental Learning,” in Proceedings of the 37th AAAI Conference on Artificial Intelligence, Washington, DC, USA, 2023.
172
Conference paper
D2
D. M. H. Nguyen, H. Nguyen, M. T. N. Truong, T. Cao, B. T. Nguyen, N. Ho, P. Swoboda, S. Albarqouni, P. Xie, and D. Sonntag
“Joint Self-Supervised Image-Volume Representation Learning with Intra-Inter Contrastive Clustering,” in Proceedings of the 37th AAAI Conference on Artificial Intelligence, Washington, DC, USA, 2023.
173
Conference paper
D2
Z. Tian, J. Cui, L. Jiang, X. Qi, X. Lai, Y. Chen, S. Liu, and J. Jia
“Learning Context-Aware Classifier for Semantic Segmentation,” in Proceedings of the 37th AAAI Conference on Artificial Intelligence, Washington, DC, USA, 2023.
174
Conference paper
D2
A. Abbas and P. Swoboda
“ClusterFuG: Clustering Fully connected Graphs by Multicut,” in Proceedings of the 40th International Conference on Machine Learning (ICML 2023), Honolulu, Hawaii, USA, 2023.
- PuRe
- BibTeX
175
Conference paper
D2
P. Gavrikov, J. Keuper, and M. Keuper
“An Extended Study of Human-like Behavior under Adversarial Training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2023), Vancouver, Canada, 2023.
176
Conference paper
D2
E. Schönfeld, J. Borges, V. Sushko, B. Schiele, and A. Khoreva
“Discovering Class-Specific GAN Controls for Semantic Image Synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2023), Vancouver, Canada, 2023.
177
Article
D2
X. Hong, A. Sayeed, K. Mehra, V. Demberg, and B. Schiele
“Visual Writing Prompts: Character-Grounded Story Generation with Curated Image Sequences,” Transactions of the Association for Computational Linguistics, vol. 11, 2023.
178
Article
D2
J. Lukasik, P. Gavrikov, J. Keuper, and M. Keuper
“Improving Native CNN Robustness with Filter Frequency Regularization,” Transactions on Machine Learning Research, vol. 2023, 2023.
- PuRe
- BibTeX
179
Conference paper
D2
J. P. Schneider, F. Mishal, J. Lukasik, A. Kolb, M. Keuper, and M. Moeller
“Implicit Representations for Image Segmentation,” in UniReps: The First Workshop on Unifying Representations in Neural Models, New Orleans, LA, USA, 2023.
- PuRe
- BibTeX
180
Conference paper
D2D6
S. Jung, J. C. Schwedhelm, C. Schillings, and M. Keuper
“Happy People -- Image Synthesis as Black-Box Optimization Problem in the Discrete Latent Space of Deep Generative Models,” in Workshop Generative Models for Computer Vision, Vancouver, Canada, 2023.
- PuRe
- BibTeX
181
Thesis
D2D6
B. L. Bhatnagar
“Modelling 3D Humans: Pose, Shape, Clothing and Interactions,” Universität des Saarlandes, Saarbrücken, 2023.
182
Paper
D2
M. Böhle, M. Fritz, and B. Schiele
“Holistically Explainable Vision Transformers,” 2023. [Online]. Available: https://arxiv.org/abs/2301.08669.
Abstract
Transformers increasingly dominate the machine learning landscape across many
tasks and domains, which increases the importance for understanding their
outputs. While their attention modules provide partial insight into their inner
workings, the attention scores have been shown to be insufficient for
explaining the models as a whole. To address this, we propose B-cos
transformers, which inherently provide holistic explanations for their
decisions. Specifically, we formulate each model component - such as the
multi-layer perceptrons, attention layers, and the tokenisation module - to be
dynamic linear, which allows us to faithfully summarise the entire transformer
via a single linear transform. We apply our proposed design to Vision
Transformers (ViTs) and show that the resulting models, dubbed Bcos-ViTs, are
highly interpretable and perform competitively to baseline ViTs on ImageNet.
Code will be made available soon.
- PuRe
- BibTeX
183
Paper
D2
M. Fey, W. Hu, K. Huang, J. E. Lenssen, R. Ranjan, J. Robinson, R. Ying, J. You, and J. Leskovec
“Relational Deep Learning: Graph Representation Learning on Relational Databases,” 2023. [Online]. Available: https://arxiv.org/abs/2312.04615.
Abstract
Much of the world's most valued data is stored in relational databases and
data warehouses, where the data is organized into many tables connected by
primary-foreign key relations. However, building machine learning models using
this data is both challenging and time consuming. The core problem is that no
machine learning method is capable of learning on multiple tables
interconnected by primary-foreign key relations. Current methods can only learn
from a single table, so the data must first be manually joined and aggregated
into a single training table, the process known as feature engineering. Feature
engineering is slow, error prone and leads to suboptimal models. Here we
introduce an end-to-end deep representation learning approach to directly learn
on data laid out across multiple tables. We name our approach Relational Deep
Learning (RDL). The core idea is to view relational databases as a temporal,
heterogeneous graph, with a node for each row in each table, and edges
specified by primary-foreign key links. Message Passing Graph Neural Networks
can then automatically learn across the graph to extract representations that
leverage all input data, without any manual feature engineering. Relational
Deep Learning leads to more accurate models that can be built much faster. To
facilitate research in this area, we develop RelBench, a set of benchmark
datasets and an implementation of Relational Deep Learning. The data covers a
wide spectrum, from discussions on Stack Exchange to book reviews on the Amazon
Product Catalog. Overall, we define a new research area that generalizes graph
machine learning and broadens its applicability to a wide set of AI use cases.
- PuRe
- BibTeX
184
Paper
D2
J. Grabinski, J. Keuper, and M. Keuper
“Fix your downsampling ASAP! Be natively more robust via Aliasing and Spectral Artifact free Pooling,” 2023. [Online]. Available: https://arxiv.org/abs/2307.09804.
Abstract
Convolutional neural networks encode images through a sequence of
convolutions, normalizations and non-linearities as well as downsampling
operations into potentially strong semantic embeddings. Yet, previous work
showed that even slight mistakes during sampling, leading to aliasing, can be
directly attributed to the networks' lack of robustness. To address such issues
and facilitate simpler and faster adversarial training, [12] recently proposed
FLC pooling, a method for provably alias-free downsampling - in theory. In this
work, we conduct a further analysis through the lens of signal processing and
find that such current pooling methods, which address aliasing in the frequency
domain, are still prone to spectral leakage artifacts. Hence, we propose
aliasing and spectral artifact-free pooling, ASAP for short. While only introducing
a few modifications to FLC pooling, networks using ASAP as a downsampling method
exhibit higher native robustness against common corruptions, a property that
FLC pooling was missing. ASAP also increases native robustness against
adversarial attacks on high and low resolution data while maintaining similar
clean accuracy or even outperforming the baseline.
- PuRe
- BibTeX
185
Thesis
D2IMPR-CS
Y. Liu
“Learning from Imperfect Data: Incremental Learning and Few-shot Learning,” Universität des Saarlandes, Saarbrücken, 2023.
186
Paper
D2
Y. Li, M. Keuper, D. Zhang, and A. Khoreva
“Divide & Bind Your Attention for Improved Generative Semantic Nursing,” 2023. [Online]. Available: https://arxiv.org/abs/2307.10864.
Abstract
Emerging large-scale text-to-image generative models, e.g., Stable Diffusion
(SD), have exhibited overwhelming results with high fidelity. Despite the
magnificent progress, current state-of-the-art models still struggle to
generate images fully adhering to the input prompt. Prior work, Attend &
Excite, has introduced the concept of Generative Semantic Nursing (GSN), aiming
to optimize cross-attention during inference time to better incorporate the
semantics. It demonstrates promising results in generating simple prompts,
e.g., "a cat and a dog". However, its efficacy declines when dealing with more
complex prompts, and it does not explicitly address the problem of improper
attribute binding. To address the challenges posed by complex prompts or
scenarios involving multiple entities and to achieve improved attribute
binding, we propose Divide & Bind. We introduce two novel loss objectives for
GSN: a novel attendance loss and a binding loss. Our approach stands out in its
ability to faithfully synthesize desired objects with improved attribute
alignment from complex prompts and exhibits superior performance across
multiple evaluation benchmarks.
- PuRe
- BibTeX
187
Paper
D2
K. Prasse, S. Jung, Y. Zhou, and M. Keuper
“Local Spherical Harmonics Improve Skeleton-Based Hand Action Recognition,” 2023. [Online]. Available: https://arxiv.org/abs/2308.10557.
Abstract
Hand action recognition is essential. Communication, human-robot
interactions, and gesture control are dependent on it. Skeleton-based action
recognition traditionally includes hands, which belong to the classes which
remain challenging to correctly recognize to date. We propose a method
specifically designed for hand action recognition which uses relative angular
embeddings and local Spherical Harmonics to create novel hand representations.
The use of Spherical Harmonics creates rotation-invariant representations which
make hand action recognition even more robust against inter-subject differences
and viewpoint changes. We conduct extensive experiments on the hand joints in
the First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose
Annotations, and on the NTU RGB+D 120 dataset, demonstrating the benefit of
using Local Spherical Harmonics Representations. Our code is available at
github.com/KathPra/LSHR_LSHT.
- PuRe
- BibTeX
188
Thesis
D2
E. Schönfeld
“Improving Quality and Controllability in GAN-based Image Synthesis,” Universität des Saarlandes, Saarbrücken, 2023.

2022

189
Conference paper
D2
A. Chaudhuri, M. Mancini, Y. Chen, Z. Akata, and A. Dutta
“Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval,” in 33rd British Machine Vision Conference (BMVC 2022), London, UK, 2022.
- PuRe
- BibTeX
190
Conference paper
D2
Y. Ma, Y. Chen, and Z. Akata
“Distilling Knowledge from Self-Supervised Teacher by Embedding Graph Alignment,” in 33rd British Machine Vision Conference (BMVC 2022), London, UK, 2022.
- PuRe
- BibTeX
191
Conference paper
D2
Y. Zhou, W. Xiang, C. Li, B. Wang, X. Wei, L. Zhang, M. Keuper, and X. Hua
“SP-ViT: Learning 2D Spatial Priors for Vision Transformers,” in 33rd British Machine Vision Conference (BMVC 2022), London, UK, 2022.
- PuRe
- BibTeX
192
Conference paper
D2
A. Chaudhuri, M. Mancini, Z. Akata, and A. Dutta
“Relational Proxies: Emergent Relationships as Fine-Grained Discriminators,” in Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- PuRe
- BibTeX
193
Conference paper
D2
J. Grabinski, P. Gavrikov, J. Keuper, and M. Keuper
“Robust Models are less Over-Confident,” in Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- PuRe
- BibTeX
194
Conference paper
D2
A. Saseendran, K. Skubch, and M. Keuper
“Trading off Image Quality for Robustness is not Necessary with Regularized Deterministic Autoencoders,” in Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- PuRe
- BibTeX
195
Conference paper
D2
S. Shi, L. Jiang, D. Dai, and B. Schiele
“Motion Transformer with Global Intention Localization and Local Movement Refinement,” in Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- PuRe
- BibTeX
196
Conference paper
D2
H. Wang, L. Ding, S. Dong, S. Shi, A. Li, J. Li, Z. Li, and L. Wang
“CAGroup3D: Class-Aware Grouping for 3D Object Detection on Point Clouds,” in Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- PuRe
- BibTeX
197
Conference paper
D4D2
Y. Wang, H. Chen, Y. Fan, W. Sun, R. Tao, W. Hou, R. Wang, L. Yang, Z. Zhou, L.-Z. Guo, H. Qi, Z. Wu, Y.-F. Li, S. Nakamura, W. Ye, M. Savvides, B. Raj, T. Shinozaki, B. Schiele, J. Wang, X. Xie, and Y. Zhang
“USB: A Unified Semi-supervised Learning Benchmark for Classification,” in Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- PuRe
- BibTeX
198
Conference paper
D2
J. Yang, S. Shi, R. Ding, Z. Wang, and X. Qi
“Towards Efficient 3D Object Detection with Knowledge Distillation,” in Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- PuRe
- BibTeX
199
Conference paper
D2
S. Alaniz, M. Mancini, A. Dutta, D. Marcos, and Z. Akata
“Abstracting Sketches Through Simple Primitives,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
200
Conference paper
D2
X. Chen, S. Shi, B. Zhu, K. C. Cheung, H. Xu, and H. Li
“MPPNet: Multi-frame Feature Intertwining with Proxy Points for 3D Temporal Object Detection,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
201
Conference paper
D2
J. Chibane, F. Engelmann, A. T. Tran, and G. Pons-Moll
“Box2Mask: Weakly Supervised 3D Semantic Instance Segmentation using Bounding Boxes,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
202
Conference paper
D2
E. Corona, G. Pons-Moll, G. Alenyà, and F. Moreno-Noguer
“Learned Vertex Descent: A New Direction for 3D Human Model Fitting,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
203
Conference paper
D2
R. Ding, J. Yang, L. Jiang, and X. Qi
“DODA: Data-Oriented Sim-to-Real Domain Adaptation for 3D Semantic Segmentation,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
204
Conference paper
D2
R. Gong, M. Danelljan, D. Dai, D. P. Paudel, A. Chhatkuli, F. Yu, and L. Van Gool
“TACS: Taxonomy Adaptive Cross-Domain Semantic Segmentation,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
205
Conference paper
D2
S. Gong, S. Zhang, J. Yang, D. Dai, and B. Schiele
“Class-Agnostic Object Counting Robust to Intraclass Diversity,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
206
Conference paper
D2
J. Grabinski, S. Jung, J. Keuper, and M. Keuper
“FrequencyLowCut Pooling - Plug & Play against Catastrophic Overfitting,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
207
Conference paper
D2
Y. Guo, D. Stutz, and B. Schiele
“Improving Robustness by Enhancing Weak Subnets,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
208
Conference paper
D2
S. Haller, L. Feineis, L. Hutschenreiter, F. Bernard, C. Rother, D. Kainmüller, P. Swoboda, and B. Savchynskyy
“A Comparative Study of Graph Matching Algorithms in Computer Vision,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
209
Conference paper
D2
L. Hoyer, D. Dai, and L. Van Gool
“HRDA: Context-Aware High-Resolution Domain-Adaptive Semantic Segmentation,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
210
Conference paper
D2
Z. Liao, J. Yang, J. Saito, G. Pons-Moll, and Y. Zhou
“Skeleton-Free Pose Transfer for Stylized 3D Characters,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
211
Conference paper
D2
W. Lin, A. Kukleva, K. Sun, H. Possegger, H. Kuehne, and H. Bischof
“CycDA: Unsupervised Cycle Domain Adaptation to Learn from Image to Video,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
212
Conference paper
D2
J. Lukasik, S. Jung, and M. Keuper
“Learning Where To Look - Generative NAS is Surprisingly Efficient,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
213
Conference paper
D2
O.-B. Mercea, T. Hummel, A. S. Koepke, and Z. Akata
“Temporal and Cross-modal Attention for Audio-Visual Zero-Shot Learning,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
214
Conference paper
D6, D2, D4
S. Shimada, V. Golyanik, Z. Li, P. Pérez, W. Xu, and C. Theobalt
“HULC: 3D HUman Motion Capture with Pose Manifold SampLing and Dense Contact Guidance,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
215
Conference paper
D2
G. Tiwari, D. Antic, J. E. Lenssen, N. Sarafianos, T. Tung, and G. Pons-Moll
“Pose-NDF: Modeling Human Pose Manifolds with Neural Distance Fields,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
216
Conference paper
D2
X. Xie, B. L. Bhatnagar, and G. Pons-Moll
“CHORE: Contact, Human and Object Reconstruction from a Single RGB Image,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
217
Conference paper
D2
X. Zhang, B. L. Bhatnagar, S. Starke, V. Guzov, and G. Pons-Moll
“COUCH: Towards Controllable Human-Chair Interactions,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
218
Conference paper
D2
K. Zhou, B. L. Bhatnagar, J. E. Lenssen, and G. Pons-Moll
“TOCH: Spatio-Temporal Object Correspondence to Hand for Motion Refinement,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
219
Article
D2
H. Cao, X. Hong, H. Tost, A. Meyer-Lindenberg, and E. Schwarz
“Advancing Translational Research in Neuroscience through Multi-task Learning,” Frontiers in Psychiatry, vol. 13, 2022.
220
Conference paper
D2
S. Alaniz, T. Hummel, and Z. Akata
“Semantic Image Synthesis with Semantically Coupled VQ-Model,” in ICLR Workshop on Deep Generative Models for Highly Structured Data (ICLR 2022 DGM4HSD), Virtual, 2022.
- PuRe
- BibTeX
221
Conference paper
D2
A. Abbas and P. Swoboda
“RAMA: A Rapid Multicut Algorithm on GPU,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
222
Conference paper
D2
A. Abbas and P. Swoboda
“FastDOG: Fast Discrete Optimization on GPU,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
223
Conference paper
D2, D6
B. L. Bhatnagar, X. Xie, I. Petrov, C. Sminchisescu, C. Theobalt, and G. Pons-Moll
“BEHAVE: Dataset and Method for Tracking Human Object Interactions,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
224
Conference paper
D2
M. Böhle, M. Fritz, and B. Schiele
“B-cos Networks: Alignment is All We Need for Interpretability,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
225
Conference paper
D2
S. Cai, A. Obukhov, D. Dai, and L. Van Gool
“Pix2NeRF: Unsupervised Conditional Pi-GAN for Single Image to Neural Radiance Fields Translation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
226
Conference paper
D2
J. Ding, N. Xue, G.-S. Xia, and D. Dai
“Decoupling Zero-Shot Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
Abstract
Zero-shot semantic segmentation (ZS3) aims to segment novel categories
that have not been seen during training. Existing works formulate ZS3 as a
pixel-level zero-shot classification problem, and transfer semantic knowledge
from seen classes to unseen ones with the help of language models pre-trained
only with texts. While simple, the pixel-level ZS3 formulation shows the
limited capability to integrate vision-language models that are often
pre-trained with image-text pairs and currently demonstrate great potential for
vision tasks. Inspired by the observation that humans often perform
segment-level semantic labeling, we propose to decouple ZS3 into two
sub-tasks: 1) a class-agnostic grouping task to group the pixels into segments, and
2) a zero-shot classification task on segments. The former sub-task does not
involve category information and can be directly transferred to group pixels
for unseen classes. The latter sub-task operates at the segment level and provides a
natural way to leverage large-scale vision-language models pre-trained with
image-text pairs (e.g. CLIP) for ZS3. Based on the decoupling formulation, we
propose a simple and effective zero-shot semantic segmentation model, called
ZegFormer, which outperforms the previous methods on ZS3 standard benchmarks by
large margins, e.g., 35 points on the PASCAL VOC and 3 points on the COCO-Stuff
in terms of mIoU for unseen classes. Code will be released at
github.com/dingjiansw101/ZegFormer.
227
Conference paper
D2
A. Doering, D. Chen, S. Zhang, B. Schiele, and J. Gall
“PoseTrack21: A Dataset for Person Search, Multi-Object Tracking and Multi-Person Pose Tracking,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
228
Conference paper
D2
Y. Fan, D. Dai, and B. Schiele
“CoSSL: Co-Learning of Representation and Classifier for Imbalanced Semi-Supervised Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
Abstract
In this paper, we propose a novel co-learning framework (CoSSL) with
decoupled representation learning and classifier learning for imbalanced SSL.
To handle the data imbalance, we devise Tail-class Feature Enhancement (TFE)
for classifier learning. Furthermore, the current evaluation protocol for
imbalanced SSL focuses only on balanced test sets, which has limited
practicality in real-world scenarios. Therefore, we further conduct a
comprehensive evaluation under various shifted test distributions. In
experiments, we show that our approach outperforms other methods over a large
range of shifted distributions, achieving state-of-the-art performance on
benchmark datasets ranging from CIFAR-10, CIFAR-100, ImageNet, to Food-101. Our
code will be made publicly available.
229
Conference paper
D2
S. Gong, S. Zhang, J. Yang, D. Dai, and B. Schiele
“Bi-level Alignment for Cross-Domain Crowd Counting,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
230
Conference paper
D2
M. Hahner, C. Sakaridis, M. Bijelic, F. Heide, F. Yu, D. Dai, and L. Van Gool
“LiDAR Snowfall Simulation for Robust 3D Object Detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
231
Conference paper
D2
L. Hoyer, D. Dai, and L. Van Gool
“DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
Abstract
As acquiring pixel-wise annotations of real-world images for semantic
segmentation is a costly process, a model can instead be trained with more
accessible synthetic data and adapted to real images without requiring their
annotations. This process is studied in unsupervised domain adaptation (UDA).
Even though a large number of methods propose new adaptation strategies, they
are mostly based on outdated network architectures. As the influence of recent
network architectures has not been systematically studied, we first benchmark
different network architectures for UDA and then propose a novel UDA method,
DAFormer, based on the benchmark results. The DAFormer network consists of a
Transformer encoder and a multi-level context-aware feature fusion decoder. It
is enabled by three simple but crucial training strategies to stabilize the
training and to avoid overfitting DAFormer to the source domain: While the Rare
Class Sampling on the source domain improves the quality of pseudo-labels by
mitigating the confirmation bias of self-training towards common classes, the
Thing-Class ImageNet Feature Distance and a learning rate warmup promote
feature transfer from ImageNet pretraining. DAFormer significantly improves the
state-of-the-art performance by 10.8 mIoU for GTA->Cityscapes and 5.4 mIoU for
Synthia->Cityscapes and enables learning even difficult classes such as train,
bus, and truck well. The implementation is available at
github.com/lhoyer/DAFormer.
232
Conference paper
D2
Y. Kim, J. M. Kim, Z. Akata, and J. Lee
“Large Loss Matters in Weakly Supervised Multi-Label Classification,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
233
Conference paper
D2
X. Lai, J. Liu, L. Jiang, L. Wang, H. Zhao, S. Liu, X. Qi, and J. Jia
“Stratified Transformer for 3D Point Cloud Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
234
Conference paper
D2
X. Ma, Z. Wang, Y. Zhan, Y. Zheng, Z. Wang, D. Dai, and C.-W. Lin
“Both Style and Fog Matter: Cumulative Domain Adaptation for Semantic Foggy Scene Understanding,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
Abstract
Although considerable progress has been made in semantic scene understanding
under clear weather, it is still a tough problem under adverse weather
conditions, such as dense fog, due to the uncertainty caused by imperfect
observations. Besides, difficulties in collecting and labeling foggy images
hinder the progress of this field. Considering the success in semantic scene
understanding under clear weather, we think it is reasonable to transfer
knowledge learned from clear images to the foggy domain. As such, the problem
becomes one of bridging the domain gap between clear images and foggy images. Unlike
previous methods that mainly focus on closing the domain gap caused by fog --
defogging the foggy images or fogging the clear images, we propose to alleviate
the domain gap by considering fog influence and style variation simultaneously.
The motivation is based on our finding that the style-related gap and the
fog-related gap can be divided and closed respectively, by adding an
intermediate domain. Thus, we propose a new pipeline to cumulatively adapt
style, fog and the dual-factor (style and fog). Specifically, we devise a
unified framework to disentangle the style factor and the fog factor
separately, and then the dual-factor from images in different domains.
Furthermore, we combine the disentanglement of the three factors with a novel
cumulative loss to thoroughly disentangle them. Our method
achieves state-of-the-art performance on three benchmarks and shows
generalization ability in rainy and snowy scenes.
235
Conference paper
D2
O.-B. Mercea, L. Riesch, A. S. Koepke, and Z. Akata
“Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
236
Conference paper
D2
D. H. M. Nguyen, R. Henschel, B. Rosenhahn, D. Sonntag, and P. Swoboda
“LMGP: Lifted Multicut Meets Geometry Projections for Multi-Camera Multi-Object Tracking,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
Abstract
Multi-Camera Multi-Object Tracking is currently drawing attention in the
computer vision field due to its superior performance in real-world
applications such as video surveillance with crowded scenes or in vast space.
In this work, we propose a mathematically elegant multi-camera multiple object
tracking approach based on a spatial-temporal lifted multicut formulation. Our
model utilizes state-of-the-art tracklets produced by single-camera trackers as
proposals. As these tracklets may contain ID-Switch errors, we refine them
through a novel pre-clustering obtained from 3D geometry projections. As a
result, we derive a better tracking graph without ID switches and more precise
affinity costs for the data association phase. Tracklets are then matched to
multi-camera trajectories by solving a global lifted multicut formulation that
incorporates short and long-range temporal interactions on tracklets located in
the same camera as well as inter-camera ones. Experimental results on the
WildTrack dataset yield near-perfect results, outperforming state-of-the-art
trackers on Campus while being on par on the PETS-09 dataset. We will make our
implementations available upon acceptance of the paper.
237
Conference paper
D4, D2
S. Rao, M. Böhle, and B. Schiele
“Towards Better Understanding Attribution Methods,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
238
Conference paper
D2
P. Roetzer, P. Swoboda, D. Cremers, and F. Bernard
“A Scalable Combinatorial Solver for Elastic Geometrically Consistent 3D Shape Matching,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
239
Conference paper
D2
T. Sun, M. Segù, J. Postels, Y. Wang, L. Van Gool, B. Schiele, F. Tombari, and F. Yu
“SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
240
Conference paper
D2
Z. Tian, X. Lai, L. Jiang, S. Liu, M. Shu, H. Zhao, and J. Jia
“Generalized Few-shot Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
241
Conference paper
D2
O. Unal, D. Dai, and L. Van Gool
“Scribble-Supervised LiDAR Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
242
Conference paper
D2
A. B. Vasudevan, D. Dai, and L. Van Gool
“Sound and Visual Representation Learning with Multiple Pretraining Tasks,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
243
Conference paper
D2
H. Wang, S. Shi, Z. Yang, R. Fang, Q. Qian, H. Li, B. Schiele, and L. Wang
“RBGNet: Ray-based Grouping for 3D Object Detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
244
Conference paper
D2
Q. Wang, O. Fink, L. Van Gool, and D. Dai
“Continual Test-Time Domain Adaptation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
245
Conference paper
D2
W. Xu, Y. Xian, J. Wang, B. Schiele, and Z. Akata
“VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
246
Conference paper
D2
Z. Yang, L. Jiang, Y. Sun, B. Schiele, and J. Jia
“A Unified Query-based Paradigm for Point Cloud Understanding,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
247
Conference paper
D2
J.-N. Zaech, A. Liniger, M. Danelljan, D. Dai, and L. Van Gool
“Adiabatic Quantum Computing for Multi Object Tracking,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
248
Article
D2
S. Li, X. Chen, Y. Liu, D. Dai, C. Stachniss, and J. Gall
“Multi-Scale Interaction for Real-Time LiDAR Data Segmentation on an Embedded Platform,” IEEE Robotics and Automation Letters, vol. 7, no. 2, 2022.
249
Article
D2
V. Patil, A. Liniger, D. Dai, and L. Van Gool
“Improving Depth Estimation Using Map-Based Depth Priors,” IEEE Robotics and Automation Letters, vol. 7, no. 2, 2022.
250
Article
D2
N. Vödisch, O. Unal, K. Li, L. Van Gool, and D. Dai
“End-to-End Optimization of LiDAR Beam Configuration for 3D Object Detection and Localization,” IEEE Robotics and Automation Letters, vol. 7, no. 2, 2022.
251
Article
D2
J.-N. Zaech, D. Dai, A. Liniger, M. Danelljan, and L. Van Gool
“Learnable Online Graph Representations for 3D Multi-Object Tracking,” IEEE Robotics and Automation Letters, vol. 7, no. 2, 2022.
252
Article
D2
J. Dong, S. Roth, and B. Schiele
“DWDN: Deep Wiener Deconvolution Network for Non-Blind Image Deblurring,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 12, 2022.
253
Article
D2
Q. Sun, Y. Liu, Z. Chen, T.-S. Chua, and B. Schiele
“Meta-Transfer Learning through Hard Tasks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 3, 2022.
254
Article
D2
Y. Xian, B. Korbar, M. Douze, L. Torresani, B. Schiele, and Z. Akata
“Generalized Few-Shot Video Classification With Video Retrieval and Feature Generation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 12, 2022.
255
Conference paper
D2
K. Li, D. Dai, and L. Van Gool
“Hyperspectral Image Super-Resolution with RGB Image Super-Resolution as an Auxiliary Task,” in 2022 IEEE Winter Conference on Applications of Computer Vision (WACV 2022), Waikoloa Village, HI, USA, 2022.
256
Article
D2
D. H. M. Nguyen, D. M. Nguyen, T. T. N. Mai, T. Nguyen, K. T. Tran, A. T. Nguyen, B. T. Pham, and B. T. Nguyen
“ASMCNN: An Efficient Brain Extraction Using Active Shape Model and Convolutional Neural Networks,” Information Sciences, vol. 591, 2022.
257
Conference paper
D2, D6
Z. Li, S. Shimada, B. Schiele, C. Theobalt, and V. Golyanik
“MoCapDeform: Monocular 3D Human Motion Capture in Deformable Scenes,” in International Conference on 3D Vision, Hybrid / Prague, Czechia, 2022.
Abstract
3D human motion capture from monocular RGB images respecting interactions of
a subject with complex and possibly deformable environments is a very
challenging, ill-posed and under-explored problem. Existing methods address it
only weakly and do not model possible surface deformations often occurring when
humans interact with scene surfaces. In contrast, this paper proposes
MoCapDeform, i.e., a new framework for monocular 3D human motion capture that
is the first to explicitly model non-rigid deformations of a 3D scene for
improved 3D human pose estimation and deformable environment reconstruction.
MoCapDeform accepts a monocular RGB video and a 3D scene mesh aligned in the
camera space. It first localises a subject in the input monocular video along
with dense contact labels using a new raycasting based strategy. Next, our
human-environment interaction constraints are leveraged to jointly optimise
global 3D human poses and non-rigid surface deformations. MoCapDeform achieves
higher accuracy than competing methods on several datasets, including our
newly recorded one with deforming background scenes.
258
Article
D2
S. Shi, L. Jiang, J. Deng, Z. Wang, C. Guo, J. Shi, X. Wang, and H. Li
“PV-RCNN++: Point-Voxel Feature Set Abstraction With Local Vector Representation for 3D Object Detection,” International Journal of Computer Vision, vol. 131, 2022.
259
Article
D2
V. Sushko, E. Schönfeld, D. Zhang, J. Gall, B. Schiele, and A. Khoreva
“OASIS: Only Adversarial Supervision for Semantic Image Synthesis,” International Journal of Computer Vision, vol. 130, 2022.
260
Article
D2
W. Xu, Y. Xian, J. Wang, B. Schiele, and Z. Akata
“Attribute Prototype Network for Any-Shot Learning,” International Journal of Computer Vision, vol. 130, 2022.
261
Article
D2
T. T. Nguyen, K. M. Nguyen-Duy, D. H. M. Nguyen, B. T. Nguyen, and B. A. Wade
“DPER: Direct Parameter Estimation for Randomly Missing Data,” Knowledge-Based Systems, vol. 240, 2022.
262
Article
D2
J. Grabinski, J. Keuper, and M. Keuper
“Aliasing and Adversarial Robust Generalization of CNNs,” Machine Learning, vol. 111, 2022.
263
Conference paper
D2
S. Jung and M. Keuper
“Learning to solve Minimum Cost Multicuts efficiently using Edge-Weighted Graph Convolutional Neural Networks,” in Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2022), Grenoble, France, 2022.
Abstract
The minimum cost multicut problem is the NP-hard/APX-hard combinatorial
optimization problem of partitioning a real-valued edge-weighted graph so as
to minimize the total cost of the partition. While graph convolutional neural
networks (GNN) have proven to be promising in the context of combinatorial
optimization, most of them are only tailored to or tested on positive-valued
edge weights, i.e., they do not comply with the nature of the multicut problem. We
therefore adapt various GNN architectures including Graph Convolutional
Networks, Signed Graph Convolutional Networks and Graph Isomorphic Networks to
facilitate the efficient encoding of real-valued edge costs. Moreover, we
employ a reformulation of the multicut ILP constraints to a polynomial program
as a loss function that allows learning feasible multicut solutions in a scalable
way. Thus, we provide the first approach towards end-to-end trainable
multicuts. Our findings support that GNN approaches can produce good solutions
in practice while providing lower computation times and largely improved
scalability compared to LP solvers and optimized heuristics, especially when
considering large instances.
- PuRe
- BibTeX
264
Article
D2
D. H. M. Nguyen, T. T. Nguyen, H. Vu, Q. Pham, B. T. Nguyen, D. Sonntag, and M.-D. Nguyen
“TATL: Task Agnostic Transfer Learning for Skin Attributes Detection,” Medical Image Analysis, vol. 78, 2022.
265
Conference paper
D2
P. Müller, A. Braun, and M. Keuper
“Impact of Realistic Properties of the Point Spread Function on Classification Tasks to Reveal a Possible Distribution Shift,” in NeurIPS 2022 Workshop on Distribution Shifts: Connecting Methods and Applications (NeurIPS 2022 Workshop DistShift), New Orleans, LA, USA, 2022.
- PuRe
- BibTeX
266
Conference paper
D2
S. Jung, S. Ziegler, A. Kardoost, and M. Keuper
“Optimizing Edge Detection for Image Segmentation with Multicut Penalties,” in Pattern Recognition (DAGM GCPR 2022), Konstanz, Germany, 2022.
Abstract
The Minimum Cost Multicut Problem (MP) is a popular way of obtaining a graph
decomposition by optimizing binary edge labels over edge costs. While the
formulation of an MP from independently estimated costs per edge is highly
flexible and intuitive, solving the MP is NP-hard and time-consuming. As a
remedy, recent work proposed to predict edge probabilities with awareness of
potential conflicts by incorporating cycle constraints in the prediction
process. We argue that such a formulation, while providing a first step towards
end-to-end learnable edge weights, is suboptimal, since it is built upon a
loose relaxation of the MP. We therefore propose an adaptive CRF that allows us to
progressively consider more violated constraints and, in consequence, to issue
solutions with higher validity. Experiments on the BSDS500 benchmark for
natural image segmentation as well as on electron microscopic recordings show
that our approach yields more precise edge detection and image segmentation.
267
Conference paper
D2
D. Chen, A. Doering, S. Zhang, J. Yang, J. Gall, and B. Schiele
“Keypoint Message Passing for Video-Based Person Re-identification,” in Proceedings of the 36th AAAI Conference on Artificial Intelligence, Virtual Conference, 2022.
268
Conference paper
D2
K. Renz, K. Chitta, O.-B. Mercea, A. S. Koepke, Z. Akata, and A. Geiger
“PlanT: Explainable Planning Transformers via Object-Level Representations,” in Proceedings of the 6th Annual Conference on Robot Learning (CoRL 2022), Auckland, New Zealand, 2022.
Abstract
Planning an optimal route in a complex environment requires efficient
reasoning about the surrounding scene. While human drivers prioritize important
objects and ignore details not relevant to the decision, learning-based
planners typically extract features from dense, high-dimensional grid
representations containing all vehicle and road context information. In this
paper, we propose PlanT, a novel approach for planning in the context of
self-driving that uses a standard transformer architecture. PlanT is based on
imitation learning with a compact object-level input representation. On the
Longest6 benchmark for CARLA, PlanT outperforms all prior methods (matching the
driving score of the expert) while being 5.3x faster than equivalent
pixel-based planning baselines during inference. Combining PlanT with an
off-the-shelf perception module provides a sensor-based driving system that is
more than 10 points better in terms of driving score than the existing state of
the art. Furthermore, we propose an evaluation protocol to quantify the ability
of planners to identify relevant objects, providing insights regarding their
decision-making. Our results indicate that PlanT can focus on the most relevant
object in the scene, even when this object is geometrically distant.
- PuRe
- BibTeX
269
Conference paper
D2
D. Pu, X. Hong, P.-J. Lin, E. Chang, and V. Demberg
“Two-Stage Movie Script Summarization: An Efficient Method For Low-Resource Long Document Summarization,” in Proceedings of The Workshop on Automatic Summarization for Creative Writing (COLING 2022), Gyeongju, Republic of Korea, 2022.
- PuRe
- BibTeX
270
Paper
D2
H. Chen, Y. Fan, Y. Wang, J. Wang, B. Schiele, X. Xie, M. Savvides, and B. Raj
“An Embarrassingly Simple Baseline for Imbalanced Semi-Supervised Learning,” 2022. [Online]. Available: https://arxiv.org/abs/2211.11086.
Abstract
Semi-supervised learning (SSL) has shown great promise in leveraging
unlabeled data to improve model performance. While standard SSL assumes uniform
data distribution, we consider a more realistic and challenging setting called
imbalanced SSL, where imbalanced class distributions occur in both labeled and
unlabeled data. Although there are existing endeavors to tackle this challenge,
their performance degenerates when facing severe imbalance since they cannot
reduce the class imbalance sufficiently and effectively. In this paper, we
study a simple yet overlooked baseline -- SimiS -- which tackles data imbalance
by simply supplementing labeled data with pseudo-labels, according to the
difference in class distribution from the most frequent class. Such a simple
baseline turns out to be highly effective in reducing class imbalance. It
outperforms existing methods by a significant margin, e.g., 12.8%, 13.6%, and
16.7% over previous SOTA on CIFAR100-LT, FOOD101-LT, and ImageNet127
respectively. The reduced imbalance results in faster convergence and better
pseudo-label accuracy of SimiS. The simplicity of our method also makes it
possible to be combined with other re-balancing techniques to improve the
performance further. Moreover, our method shows great robustness to a wide
range of data distributions, which holds enormous potential in practice. Code
will be publicly available.
- PuRe
- BibTeX
271
Paper
D2
E. Duka, A. Kukleva, and B. Schiele
“Leveraging Self-Supervised Training for Unintentional Action Recognition,” 2022. [Online]. Available: https://arxiv.org/abs/2209.11870.
Abstract
Unintentional actions are rare occurrences that are difficult to define
precisely and that are highly dependent on the temporal context of the action.
In this work, we explore such actions and seek to identify the points in videos
where the actions transition from intentional to unintentional. We propose a
multi-stage framework that exploits inherent biases such as motion speed,
motion direction, and order to recognize unintentional actions. To enhance
representations via self-supervised training for the task of unintentional
action recognition we propose temporal transformations, called Temporal
Transformations of Inherent Biases of Unintentional Actions (T2IBUA). The
multi-stage approach models the temporal information on both the level of
individual frames and full clips. These enhanced representations show strong
performance for unintentional action recognition tasks. We provide an extensive
ablation study of our framework and report results that significantly improve
over the state-of-the-art.
- PuRe
- BibTeX
272
Paper
D2
Q. Fan, M. Segu, Y.-W. Tai, F. Yu, C.-K. Tang, B. Schiele, and D. Dai
“Normalization Perturbation: A Simple Domain Generalization Method for Real-World Domain Shifts,” 2022. [Online]. Available: https://arxiv.org/abs/2211.04393.
Abstract
Improving a model's generalizability against domain shifts is crucial,
especially for safety-critical applications such as autonomous driving.
Real-world domain styles can vary substantially due to environment changes and
sensor noise, but deep models only know the training domain style. Such a domain
style gap impedes model generalization on diverse real-world domains. Our
proposed Normalization Perturbation (NP) can effectively overcome this domain
style overfitting problem. We observe that this problem is mainly caused by the
biased distribution of low-level features learned in shallow CNN layers. Thus,
we propose to perturb the channel statistics of source domain features to
synthesize various latent styles, so that the trained deep model can perceive
diverse potential domains and generalize well even without observations of
target domain data in training. We further explore the style-sensitive channels
for effective style synthesis. Normalization Perturbation only relies on a
single source domain and is surprisingly effective and extremely easy to
implement. Extensive experiments verify the effectiveness of our method for
generalizing models under real-world domain shifts.
273
Paper
D2
V. Guzov, T. Sattler, and G. Pons-Moll
“Visually Plausible Human-Object Interaction Capture from Wearable Sensors,” 2022. [Online]. Available: https://arxiv.org/abs/2205.02830.
Abstract
In their everyday lives, humans naturally modify the surrounding environment
through interactions, e.g., moving a chair to sit on it. To reproduce such
interactions in virtual spaces (e.g., metaverse), we need to be able to capture
and model them, including changes in the scene geometry, ideally from
ego-centric input alone (head camera and body-worn inertial sensors). This is
an extremely hard problem, especially since the object/scene might not be
visible from the head camera (e.g., a human not looking at a chair while
sitting down, or not looking at the door handle while opening a door). In this
paper, we present HOPS, the first method to capture interactions such as
dragging objects and opening doors from ego-centric data alone. Central to our
method is reasoning about human-object interactions, allowing us to track objects
even when they are not visible from the head camera. HOPS localizes and
registers both the human and the dynamic object in a pre-scanned static scene.
HOPS is an important first step towards advanced AR/VR applications based on
immersive virtual universes, and can provide human-centric training data to
teach machines to interact with their surroundings. The supplementary video,
data, and code will be available on our project page at
virtualhumans.mpi-inf.mpg.de/hops/
274
Thesis
D2, IMPR-CS
A. Horňáková
“Lifted Edges as Connectivity Priors for Multicut and Disjoint Paths,” Universität des Saarlandes, Saarbrücken, 2022.
275
Paper
D2
G.-P. Ji, D.-P. Fan, Y.-C. Chou, D. Dai, A. Liniger, and L. Van Gool
“Deep Gradient Learning for Efficient Camouflaged Object Detection,” 2022. [Online]. Available: https://arxiv.org/pdf/2205.12853.pdf.
Abstract
This paper introduces DGNet, a novel deep framework that exploits object
gradient supervision for camouflaged object detection (COD). It decouples the
task into two connected branches, i.e., a context and a texture encoder. The
essential connection is the gradient-induced transition, representing a soft
grouping between context and texture features. Benefiting from the simple but
efficient framework, DGNet outperforms existing state-of-the-art COD models by
a large margin. Notably, our efficient version, DGNet-S, runs in real time (80
fps) and achieves comparable results to the cutting-edge model
JCSOD-CVPR$_{21}$ with only 6.82% of the parameters. Application results also show
that the proposed DGNet performs well in polyp segmentation, defect detection,
and transparent object segmentation tasks. Codes will be made available at
github.com/GewelsJI/DGNet.
276
Paper
D2
S. Shi, L. Jiang, D. Dai, and B. Schiele
“MTR-A: 1st Place Solution for 2022 Waymo Open Dataset Challenge -- Motion Prediction,” 2022. [Online]. Available: https://arxiv.org/abs/2209.10033.
Abstract
In this report, we present the 1st place solution for the motion prediction track
in the 2022 Waymo Open Dataset Challenges. We propose a novel Motion Transformer
framework for multimodal motion prediction, which introduces a small set of
novel motion query pairs for generating better multimodal future trajectories
by jointly performing the intention localization and iterative motion
refinement. A simple model ensemble strategy with non-maximum-suppression is
adopted to further boost the final performance. Our approach achieves the 1st
place on the motion prediction leaderboard of 2022 Waymo Open Dataset
Challenges, outperforming other methods with remarkable margins. Code will be
available at github.com/sshaoshuai/MTR.
277
Thesis
D2, IMPR-CS
D. Stutz
“Understanding and Improving Robustness and Uncertainty Estimation in Deep Learning,” Universität des Saarlandes, Saarbrücken, 2022.
Abstract
Deep learning is becoming increasingly relevant for many high-stakes applications such as autonomous driving or medical diagnosis where wrong decisions can have massive impact on human lives. Unfortunately, deep neural networks are typically assessed solely based on generalization, e.g., accuracy on a fixed test set. However, this is clearly insufficient for safe deployment as potential malicious actors and distribution shifts or the effects of quantization and unreliable hardware are disregarded. Thus, recent work additionally evaluates performance on potentially manipulated or corrupted inputs as well as after quantization and deployment on specialized hardware. In such settings, it is also important to obtain reasonable estimates of the model's confidence alongside its predictions. This thesis studies robustness and uncertainty estimation in deep learning along three main directions: First, we consider so-called adversarial examples, slightly perturbed inputs causing severe drops in accuracy. Second, we study weight perturbations, focusing particularly on bit errors in quantized weights. This is relevant for deploying models on special-purpose hardware for efficient inference, so-called accelerators. Finally, we address uncertainty estimation to improve robustness and provide meaningful statistical performance guarantees for safe deployment. In detail, we study the existence of adversarial examples with respect to the underlying data manifold. In this context, we also investigate adversarial training which improves robustness by augmenting training with adversarial examples at the cost of reduced accuracy. We show that regular adversarial examples leave the data manifold in an almost orthogonal direction. While we find no inherent trade-off between robustness and accuracy, this contributes to a higher sample complexity as well as severe overfitting of adversarial training. 
Using a novel measure of flatness in the robust loss landscape with respect to weight changes, we also show that robust overfitting is caused by converging to particularly sharp minima. In fact, we find a clear correlation between flatness and good robust generalization. Further, we study random and adversarial bit errors in quantized weights. In accelerators, random bit errors occur in the memory when reducing voltage with the goal of improving energy-efficiency. Here, we consider a robust quantization scheme, use weight clipping as regularization and perform random bit error training to improve bit error robustness, allowing considerable energy savings without requiring hardware changes. In contrast, adversarial bit errors are maliciously introduced through hardware- or software-based attacks on the memory, with severe consequences on performance. We propose a novel adversarial bit error attack to study this threat and use adversarial bit error training to improve robustness and thereby also the accelerator's security. Finally, we view robustness in the context of uncertainty estimation. By encouraging low-confidence predictions on adversarial examples, our confidence-calibrated adversarial training successfully rejects adversarial, corrupted as well as out-of-distribution examples at test time. Thereby, we are also able to improve the robustness-accuracy trade-off compared to regular adversarial training. However, even robust models do not provide any guarantee for safe deployment. To address this problem, conformal prediction allows the model to predict confidence sets with user-specified guarantee of including the true label. Unfortunately, as conformal prediction is usually applied after training, the model is trained without taking this calibration step into account. To address this limitation, we propose conformal training which allows training conformal predictors end-to-end with the underlying model. 
This not only improves the obtained uncertainty estimates but also enables optimizing application-specific objectives without losing the provided guarantee. Besides our work on robustness or uncertainty, we also address the problem of 3D shape completion of partially observed point clouds. Specifically, we consider an autonomous driving or robotics setting where vehicles are commonly equipped with LiDAR or depth sensors and obtaining a complete 3D representation of the environment is crucial. However, ground truth shapes that are essential for applying deep learning techniques are extremely difficult to obtain. Thus, we propose a weakly-supervised approach that can be trained on the incomplete point clouds while offering efficient inference. In summary, this thesis contributes to our understanding of robustness against both input and weight perturbations. To this end, we also develop methods to improve robustness alongside uncertainty estimation for safe deployment of deep learning methods in high-stakes applications. In the particular context of autonomous driving, we also address 3D shape completion of sparse point clouds.
278
Paper
D2
P. Swoboda, A. Horňáková, P. Rötzer, B. Savchynskyy, and A. Abbas
“Structured Prediction Problem Archive,” 2022. [Online]. Available: https://arxiv.org/abs/2202.03574.
Abstract
Structured prediction problems are one of the fundamental tools in machine
learning. In order to facilitate algorithm development for their numerical
solution, we collect in one place a large number of datasets in easy-to-read
formats for a diverse set of problem classes. We provide archival links to
datasets, descriptions of the considered problems and problem formats, and a
short summary of problem characteristics including size, number of instances,
etc. For reference we also give a non-exhaustive selection of algorithms
proposed in the literature for their solution. We hope that this central
repository will make benchmarking and comparison to established works easier.
We welcome submission of interesting new datasets and algorithms for inclusion
in our archive.
279
Paper
D2
N. P. Walter, D. Stutz, and B. Schiele
“On Fragile Features and Batch Normalization in Adversarial Training,” 2022. [Online]. Available: https://arxiv.org/abs/2204.12393.
Abstract
Modern deep learning architectures utilize batch normalization (BN) to
stabilize training and improve accuracy. It has been shown that the BN layers
alone are surprisingly expressive. In the context of robustness against
adversarial examples, however, BN is argued to increase vulnerability. That is,
BN helps to learn fragile features. Nevertheless, BN is still used in
adversarial training, which is the de-facto standard to learn robust features.
In order to shed light on the role of BN in adversarial training, we
investigate to what extent the expressiveness of BN can be used to robustify
fragile features in comparison to random features. On CIFAR10, we find that
adversarially fine-tuning just the BN layers can result in non-trivial
adversarial robustness. Adversarially training only the BN layers from scratch,
in contrast, is not able to convey meaningful adversarial robustness. Our
results indicate that fragile features can be used to learn models with
moderate adversarial robustness, while random features cannot.
280
Paper
D2
Y.-H. Wu, D. Zhang, L. Zhang, X. Zhan, D. Dai, Y. Liu, and M.-M. Cheng
“Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in Driving Scenes,” 2022. [Online]. Available: https://arxiv.org/abs/2208.08621.
Abstract
Current efficient LiDAR-based detection frameworks are lacking in exploiting
object relations, which are naturally present in both spatial and temporal manners.
To this end, we introduce a simple, efficient, and effective two-stage
detector, termed Ret3D. At the core of Ret3D is the utilization of novel
intra-frame and inter-frame relation modules to capture the spatial and
temporal relations accordingly. More specifically, the intra-frame relation module
(IntraRM) encapsulates the intra-frame objects into a sparse graph and thus
allows us to refine the object features through efficient message passing. On
the other hand, the inter-frame relation module (InterRM) densely connects each
object in its corresponding tracked sequences dynamically, and leverages such
temporal information to further enhance its representations efficiently through
a lightweight transformer network. We instantiate our novel designs of IntraRM
and InterRM with general center-based or anchor-based detectors and evaluate
them on the Waymo Open Dataset (WOD). With negligible extra overhead, Ret3D
achieves state-of-the-art performance, being 5.5% and 3.2% higher than the
recent competitor in terms of the LEVEL 1 and LEVEL 2 mAPH metrics on vehicle
detection, respectively.
281
Paper
D2
K. Zhou, B. Lal Bhatnagar, J. E. Lenssen, and G. Pons-Moll
“TOCH: Spatio-Temporal Object Correspondence to Hand for Motion Refinement,” 2022. [Online]. Available: https://arxiv.org/abs/2205.07982.
Abstract
We present TOCH, a method for refining incorrect 3D hand-object interaction
sequences using a data prior. Existing hand trackers, especially those that
rely on very few cameras, often produce visually unrealistic results with
hand-object intersection or missing contacts. Although correcting such errors
requires reasoning about temporal aspects of interaction, most previous work
focuses on static grasps and contacts. At the core of our method are TOCH fields, a
novel spatio-temporal representation for modeling correspondences between hands
and objects during interaction. The key component is a point-wise
object-centric representation which encodes the hand position relative to the
object. Leveraging this novel representation, we learn a latent manifold of
plausible TOCH fields with a temporal denoising auto-encoder. Experiments
demonstrate that TOCH outperforms state-of-the-art (SOTA) 3D hand-object
interaction models, which are limited to static grasps and contacts. More
importantly, our method produces smooth interactions even before and after
contact. Using a single trained TOCH model, we quantitatively and qualitatively
demonstrate its usefulness for 1) correcting erroneous reconstruction results
from off-the-shelf RGB/RGB-D hand-object reconstruction methods, 2) de-noising,
and 3) grasp transfer across objects. We will release our code and trained
model on our project page at virtualhumans.mpi-inf.mpg.de/toch/
282
Paper
D2
Y. Zhou, C. Li, Z.-Q. Cheng, Y. Geng, X. Xie, and M. Keuper
“Hypergraph Transformer for Skeleton-based Action Recognition,” 2022. [Online]. Available: https://arxiv.org/abs/2211.09590.
Abstract
Skeleton-based action recognition aims to predict human actions given human
joint coordinates with skeletal interconnections. To model such off-grid data
points and their co-occurrences, Transformer-based formulations would be a
natural choice. However, Transformers still lag behind state-of-the-art methods
using graph convolutional networks (GCNs). Transformers assume that the input
is permutation-invariant and homogeneous (partially alleviated by positional
encoding), which ignores an important characteristic of skeleton data, i.e.,
bone connectivity. Furthermore, each type of body joint has a clear physical
meaning in human motion, i.e., motion retains an intrinsic relationship
regardless of the joint coordinates, which is not explored in Transformers. In
fact, certain re-occurring groups of body joints are often involved in specific
actions, such as the subconscious hand movement for keeping balance. Vanilla
attention is incapable of describing such underlying relations that are
persistent and beyond pair-wise. In this work, we aim to exploit these unique
aspects of skeleton data to close the performance gap between Transformers and
GCNs. Specifically, we propose a new self-attention (SA) extension, named
Hypergraph Self-Attention (HyperSA), to incorporate inherently higher-order
relations into the model. The K-hop relative positional embeddings are also
employed to take bone connectivity into account. We name the resulting model
Hyperformer, and it achieves comparable or better performance w.r.t. accuracy
and efficiency than state-of-the-art GCN architectures on NTU RGB+D, NTU RGB+D
120, and Northwestern-UCLA datasets. On the largest NTU RGB+D 120 dataset, the
significantly improved performance reached by our Hyperformer demonstrates the
underestimated potential of Transformer models in this field.

2021

283
Article
D6, D2
M. Habermann, L. Liu, W. Xu, M. Zollhöfer, G. Pons-Moll, and C. Theobalt
“Real-time Deep Dynamic Characters,” ACM Transactions on Graphics (Proc. ACM SIGGRAPH 2021), vol. 40, no. 4, 2021.
284
Conference paper
D2
A. Abbas and P. Swoboda
“Combinatorial Optimization for Panoptic Segmentation: A Fully Differentiable Approach,” in Advances in Neural Information Processing Systems 34 (NeurIPS 2021), Virtual, 2021.
285
Conference paper
D2
S. Badirli, Z. Akata, G. Mohler, C. Picard, and M. M. Dundar
“Fine-Grained Zero-Shot Learning with DNA as Side Information,” in Advances in Neural Information Processing Systems 34 (NeurIPS 2021), Virtual, 2021.
286
Conference paper
D2
Y. Liu, B. Schiele, and Q. Sun
“RMM: Reinforced Memory Management for Class-Incremental Learning,” in Advances in Neural Information Processing Systems 34 (NeurIPS 2021), Virtual, 2021.
287
Conference paper
D2
A. Saseendran, K. Skubch, S. Falkner, and M. Keuper
“Shape your Space: A Gaussian Mixture Regularization Approach to Deterministic Autoencoders,” in Advances in Neural Information Processing Systems 34 pre-proceedings (NeurIPS 2021), Virtual Event, 2021.
288
Article
D2
Y. Guo, L. Ma, Z. Li, X. Wang, and F. Wang
“Monocular 3D Multi-Person Pose Estimation via Predicting Factorized Correction Factors,” Computer Vision and Image Understanding, vol. 213, 2021.
289
Article
D2
X. Li, J. Huang, Y. Liu, Q. Zhou, S. Zheng, B. Schiele, and Q. Sun
“Learning to Teach and Learn for Semi-supervised Few-shot Image Classification,” Computer Vision and Image Understanding, vol. 212, 2021.
290
Conference paper
D2
R. Gong, D. Dai, Y. Chen, W. Li, and L. Van Gool
“mDALU: Multi-Source Domain Adaptation and Label Unification with Partial Datasets,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
291
Conference paper
D2
M. Hahner, C. Sakaridis, D. Dai, and L. Van Gool
“Fog Simulation on Real LiDAR Point Clouds for 3D Object Detection in Adverse Weather,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
292
Conference paper
D2
A. Horňáková, T. Kaiser, P. Swoboda, M. Rolinek, B. Rosenhahn, and R. Henschel
“Making Higher Order MOT Scalable: An Efficient Approximate Solver for Lifted Disjoint Paths,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
293
Conference paper
D2
M. Kayser, O.-M. Camburu, L. Salewski, C. Emde, V. Do, Z. Akata, and T. Lukasiewicz
“e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
294
Conference paper
D2
J. M. Kim, J. Choe, Z. Akata, and S. J. Oh
“Keep CALM and Improve Visual Feature Attribution,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
295
Conference paper
D2
A. Kukleva, H. Kuehne, and B. Schiele
“Generalized and Incremental Few-Shot Learning by Explicit Learning and Calibration without Forgetting,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
296
Conference paper
D2
F. Rezaeianaran, R. Shetty, R. Aljundi, D. O. Reino, S. Zhang, and B. Schiele
“Seeking Similarities over Differences: Similarity-based Domain Alignment for Adaptive Object Detection,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
297
Conference paper
D2
C. Sakaridis, D. Dai, and L. Van Gool
“ACDC: The Adverse Conditions Dataset with Correspondences for Semantic Driving Scene Understanding,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
298
Conference paper
D2
D. Stutz, M. Hein, and B. Schiele
“Relating Adversarially Robust Generalization to Flat Minima,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
299
Conference paper
D2
G. Sun, T. Probst, D. P. Paudel, N. Popovic, M. Kanakis, J. Patel, D. Dai, and L. Van Gool
“Task Switching Network for Multi-task Learning,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
300
Conference paper
D2
G. Tiwari, N. Sarafianos, T. Tung, and G. Pons-Moll
“Neural-GIF: Neural Generalized Implicit Functions for Animating People in Clothing,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
301
Conference paper
D2
Q. Wang, D. Dai, L. Hoyer, L. Van Gool, and O. Fink
“Domain Adaptive Semantic Segmentation with Self-Supervised Depth Estimation,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
302
Conference paper
D2
N. Yu, V. Skripniuk, S. Abdelnabi, and M. Fritz
“Artificial Fingerprinting for Generative Models: Rooting Deepfake Attribution in Training Data,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
303
Conference paper
D2
N. Yu, G. Liu, A. Dundar, A. Tao, B. Catanzaro, L. Davis, and M. Fritz
“Dual Contrastive Loss and Attention for GANs,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
304
Conference paper
D2
Z. Zhang, A. Liniger, D. Dai, F. Yu, and L. Van Gool
“End-to-End Urban Driving by Imitating a Reinforcement Learning Coach,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
305
Conference paper
D2
S. Alaniz, D. Marcos, B. Schiele, and Z. Akata
“Learning Decision Trees Recurrently Through Communication,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA (Virtual), 2021.
306
Conference paper
D2
A. Bhattacharyya, D. O. Reino, M. Fritz, and B. Schiele
“Euro-PVI: Pedestrian Vehicle Interactions in Dense Urban Centers,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA (Virtual), 2021.
307
Conference paper
D2
M. Böhle, M. Fritz, and B. Schiele
“Convolutional Dynamic Alignment Networks for Interpretable Classifications,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA (Virtual), 2021.
308
Conference paper
D2
Y. Chen, Y. Xian, A. S. Koepke, and Z. Akata
“Distilling Audio-Visual Knowledge by Compositional Contrastive Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Virtual Conference, 2021.
309
Conference paper
D2
J. Chibane, A. Bansal, V. Lazova, and G. Pons-Moll
“Stereo Radiance Fields (SRF): Learning View Synthesis from Sparse Views of Novel Scenes,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA (Virtual), 2021.
310
Conference paper
D2
J. Dong, S. Roth, and B. Schiele
“Learning Spatially-Variant MAP Models for Non-blind Image Deblurring,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA (Virtual), 2021.
311
Conference paper
D2
V. Guzov, A. Mir, T. Sattler, and G. Pons-Moll
“Human POSEitioning System (HPS): 3D Human Pose Estimation and Self-localization in Large Scenes from Body-Mounted Sensors,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA (Virtual), 2021.
312
Conference paper
D2
Y. Liu, B. Schiele, and Q. Sun
“Adaptive Aggregation Networks for Class-Incremental Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA (Virtual), 2021.
313
Conference paper
D2
M. Mancini, M. F. Naeem, Y. Xian, and Z. Akata
“Open World Compositional Zero-Shot Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Virtual Conference, 2021.
314
Conference paper
D2
M. F. Naeem, Y. Xian, F. Tombari, and Z. Akata
“Learning Graph Embeddings for Compositional Zero-shot Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA (Virtual), 2021.
315
Conference paper
D2
G. Pons-Moll, F. Moreno-Noguer, E. Corona, A. Pumarola, and G. Alenyà
“SMPLicit: Topology-aware Generative Model for Clothed People,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Virtual Conference, 2021.
316
Conference paper
D2
A. Pumarola, E. Corona, G. Pons-Moll, and F. Moreno-Noguer
“D-NeRF: Neural Radiance Fields for Dynamic Scenes,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA (Virtual), 2021.
317
Conference paper
D2
H.-P. Wang, N. Yu, and M. Fritz
“Hijack-GAN: Unintended-Use of Pretrained, Black-Box GANs,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Virtual Conference, 2021.
318
Article
D2
J. Dong and J. Pan
“Deep Outlier Handling for Image Deblurring,” IEEE Transactions on Image Processing, vol. 30, 2021.
319
Article
D2
Y. Liu, Q. Sun, X. He, A.-A. Liu, Y. Su, and T.-S. Chua
“Generating Face Images With Attributes for Free,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 6, 2021.
320
Conference paper
D2
Q. Ke, M. Fritz, and B. Schiele
“Future Moment Assessment for Action Query,” in IEEE Winter Conference on Applications of Computer Vision (WACV 2021), Virtual Event, 2021.
321
Conference paper
D2
R. G. VidalMata, W. J. Scheirer, A. Kukleva, D. Cox, and H. Kuehne
“Joint Visual-Temporal Embedding for Unsupervised Learning of Actions in Untrimmed Sequences,” in IEEE Winter Conference on Applications of Computer Vision (WACV 2021), Virtual, 2021.
322
Article
D2
T. Nguyen, D. H. M. Nguyen, H. Nguyen, B. T. Nguyen, and B. A. Wade
“EPEM: Efficient Parameter Estimation for Multiple Class Monotone Missing Data,” Information Sciences, vol. 567, 2021.
323
Conference paper
D2
E. Schönfeld, V. Sushko, D. Zhang, J. Gall, B. Schiele, and A. Khoreva
“You Only Need Adversarial Supervision for Semantic Image Synthesis,” in International Conference on Learning Representations (ICLR 2021), Vienna, Austria (Virtual), 2021.
324
Article
D2
D. Chen, S. Zhang, J. Yang, and B. Schiele
“Norm-Aware Embedding for Efficient Person Search and Tracking,” International Journal of Computer Vision, vol. 129, 2021.
325
Article
D2
D. Dai, R. T. Tan, V. Patel, J. Matas, B. Schiele, and L. Van Gool
“Guest Editorial: Special Issue on ‘Computer Vision for All Seasons: Adverse Weather and Lighting Conditions,’” International Journal of Computer Vision, vol. 129, 2021.
326
Article
D2
R. Gong, W. Li, Y. Chen, D. Dai, and L. Van Gool
“DLOW: Domain Flow and Applications,” International Journal of Computer Vision, vol. 129, 2021.
327
Article
D2
M. Losch, M. Fritz, and B. Schiele
“Semantic Bottlenecks: Quantifying and Improving Inspectability of Deep Representations,” International Journal of Computer Vision, vol. 129, 2021.
328
Article
D2
S. Zhang, D. Chen, J. Yang, and B. Schiele
“Guided Attention in CNNs for Occluded Pedestrian Detection and Re-identification,” International Journal of Computer Vision, vol. 129, 2021.
329
Conference paper
D2
H. Hajipour, A. Bhattacharyya, C.-A. Staicu, and M. Fritz
“SampleFix: Learning to Correct Programs by Sampling Diverse Fixes,” in Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2021), Virtual Event, 2021.
330
Conference paper
D2
J. Geiping, J. Lukasik, M. Keuper, and M. Moeller
“DARTS for Inverse Problems: a Study on Stability,” in NeurIPS 2021 Workshop on Deep Learning and Inverse Problems (NeurIPS 2021 Deep Inverse Workshop), Virtual, 2021.
331
Conference paper
D2
S. Jung and M. Keuper
“Internalized Biases in Fréchet Inception Distance,” in NeurIPS 2021 Workshop on Distribution Shifts: Connecting Methods and Applications (NeurIPS 2021 Workshop DistShift), Virtual, 2021.
332
Conference paper
D2
A. Das, Y. Xian, Y. He, B. Schiele, and Z. Akata
“(SP)²Net for Generalized Zero-Label Semantic Segmentation,” in Pattern Recognition (GCPR 2021), Bonn, Germany, 2022.
333
Conference paper
D2
Y. Fan, A. Kukleva, and B. Schiele
“Revisiting Consistency Regularization for Semi-supervised Learning,” in Pattern Recognition (GCPR 2021), Bonn, Germany, 2022.
334
Conference paper
D2
J.-H. Lange and P. Swoboda
“Efficient Message Passing for 0–1 ILPs with Binary Decision Diagrams,” in Proceedings of the 38th International Conference on Machine Learning (ICML 2021), Virtual Event, 2021.
335
Conference paper
D2
D. Stutz, N. Chandramoorthy, M. Hein, and B. Schiele
“Bit Error Robustness for Energy-Efficient DNN Accelerators,” in Proceedings of the 4th MLSys Conference, Virtual Conference, 2021.
Abstract
Deep neural network (DNN) accelerators have received considerable attention in
recent years due to their energy savings compared to mainstream hardware. Low-voltage
operation of DNN accelerators further reduces energy consumption
significantly, but it causes bit-level failures in the memory storing the
quantized DNN weights. In this paper, we show that a combination of robust
fixed-point quantization, weight clipping, and random bit error training
(RandBET) improves robustness against random bit errors in (quantized) DNN
weights significantly. This leads to high energy savings from both low-voltage
operation as well as low-precision quantization. Our approach generalizes
across operating voltages and accelerators, as demonstrated on bit errors from
profiled SRAM arrays. We also discuss why weight clipping alone is already a
quite effective way to achieve robustness against bit errors. Moreover, we
specifically discuss the involved trade-offs regarding accuracy, robustness and
precision: Without losing more than 1% in accuracy compared to a normally
trained 8-bit DNN, we can reduce energy consumption on CIFAR-10 by 20%. Higher
energy savings of, e.g., 30%, are possible at the cost of 2.5% accuracy, even
for 4-bit DNNs.
336
Conference paper
D2
S. Alaniz, M. Federici, and Z. Akata
“Compositional Mixture Representations for Vision and Text,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2022), New Orleans, LA, USA, 2022.
337
Conference paper
D2
A. Neculai, Y. Chen, and Z. Akata
“Probabilistic Compositional Embeddings for Multimodal Image Retrieval,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2022), New Orleans, LA, USA, 2022.
338
Conference paper
D2
G. Pastore, F. Cermelli, Y. Xian, M. Mancini, Z. Akata, and B. Caputo
“A Closer Look at Self-training for Zero-Label Semantic Segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2021), Virtual Workshop, 2021.
339
Conference paper
D2
H.-P. Wang, T. Orekondy, and M. Fritz
“InfoScrub: Towards Attribute Privacy by Targeted Obfuscation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2021), Virtual Workshop, 2021.
340
Conference paper
D2
Y. He, N. Yu, M. Keuper, and M. Fritz
“Beyond the Spectrum: Detecting Deepfakes via Re-Synthesis,” in Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI 2021), Montreal, Canada, 2021.
341
Conference paper
D2
S. Jung and M. Keuper
“Spectral Distribution Aware Image Generation,” in Thirty-Fifth AAAI Conference on Artificial Intelligence Technical Tracks 2, Virtual Conference, 2021.
342
Paper
D2
A. Abbas and P. Swoboda
“FastDOG: Fast Discrete Optimization on GPU,” 2021. [Online]. Available: https://arxiv.org/abs/2111.10270.
Abstract
We present a massively parallel Lagrange decomposition method for solving 0-1
integer linear programs occurring in structured prediction. We propose a new
iterative update scheme for solving the Lagrangean dual and a perturbation
technique for decoding primal solutions. For representing subproblems we follow
Lange et al. (2021) and use binary decision diagrams (BDDs). Our primal and
dual algorithms require little synchronization between subproblems and
optimization over BDDs needs only elementary operations without complicated
control flow. This allows us to exploit the parallelism offered by GPUs for all
components of our method. We present experimental results on combinatorial
problems from MAP inference for Markov Random Fields, quadratic assignment and
cell tracking for developmental biology. Our highly parallel GPU implementation
improves upon the running times of the algorithms from Lange et al. (2021) by
up to an order of magnitude. In particular, we come close to or outperform some
state-of-the-art specialized heuristics while being problem agnostic.
343
Thesis
D2, IMPR-CS
A. Bhattacharyya
“Long-term future prediction under uncertainty and multi-modality,” Universität des Saarlandes, Saarbrücken, 2021.
344
Paper
D2
Y. Chen, T. Hummel, A. S. Koepke, and Z. Akata
“Where and When: Space-Time Attention for Audio-Visual Explanations,” 2021. [Online]. Available: https://arxiv.org/abs/2105.01517.
Abstract
Explaining the decision of a multi-modal decision-maker requires determining
the evidence from both modalities. Recent advances in XAI provide explanations
for models trained on still images. However, when it comes to modeling multiple
sensory modalities in a dynamic world, it remains underexplored how to
demystify the mysterious dynamics of a complex multi-modal model. In this work,
we take a crucial step forward and explore learnable explanations for
audio-visual recognition. Specifically, we propose a novel space-time attention
network that uncovers the synergistic dynamics of audio and visual data over
both space and time. Our model is capable of predicting the audio-visual video
events, while justifying its decision by localizing where the relevant visual
cues appear, and when the predicted sounds occur in videos. We benchmark our
model on three audio-visual video event datasets, comparing extensively to
multiple recent multi-modal representation learners and intrinsic explanation
models. Experimental results demonstrate the clearly superior performance of our
model over the existing methods on audio-visual video event recognition.
Moreover, we conduct an in-depth study to analyze the explainability of our
model based on robustness analysis via perturbation tests and pointing games
using human annotations.
345
Paper
D2
R. Gong, M. Danelljan, D. Dai, W. Wang, D. P. Paudel, A. Chhatkuli, F. Yu, and L. Van Gool
“TADA: Taxonomy Adaptive Domain Adaptation,” 2021. [Online]. Available: https://arxiv.org/abs/2109.04813.
Abstract
Traditional domain adaptation addresses the task of adapting a model to a
novel target domain under limited or no additional supervision. While tackling
the input domain gap, the standard domain adaptation settings assume no domain
change in the output space. In semantic prediction tasks, different datasets
are often labeled according to different semantic taxonomies. In many
real-world settings, the target domain task requires a different taxonomy than
the one imposed by the source domain. We therefore introduce the more general
taxonomy adaptive domain adaptation (TADA) problem, allowing for inconsistent
taxonomies between the two domains. We further propose an approach that jointly
addresses the image-level and label-level domain adaptation. On the
label-level, we employ a bilateral mixed sampling strategy to augment the
target domain, and a relabelling method to unify and align the label spaces. We
address the image-level domain gap by proposing an uncertainty-rectified
contrastive learning method, leading to more domain-invariant and class
discriminative features. We extensively evaluate the effectiveness of our
framework under different TADA settings: open taxonomy, coarse-to-fine
taxonomy, and partially-overlapping taxonomy. Our framework outperforms
previous state-of-the-art by a large margin, while being capable of adapting to
target taxonomies.
346
Paper
D2
M. Mancini, M. F. Naeem, Y. Xian, and Z. Akata
“Learning Graph Embeddings for Open World Compositional Zero-Shot Learning,” 2021. [Online]. Available: https://arxiv.org/abs/2105.01017.
Abstract
Compositional Zero-Shot learning (CZSL) aims to recognize unseen compositions
of state and object visual primitives seen during training. A problem with
standard CZSL is the assumption of knowing which unseen compositions will be
available at test time. In this work, we overcome this assumption by operating in
the open world setting, where no limit is imposed on the compositional space at
test time, and the search space contains a large number of unseen compositions.
To address this problem, we propose a new approach, Compositional Cosine Graph
Embeddings (Co-CGE), based on two principles. First, Co-CGE models the
dependency between states, objects and their compositions through a graph
convolutional neural network. The graph propagates information from seen to
unseen concepts, improving their representations. Second, since not all unseen
compositions are equally feasible, and less feasible ones may damage the
learned representations, Co-CGE estimates a feasibility score for each unseen
composition, using the scores as margins in a cosine similarity-based loss and
as weights in the adjacency matrix of the graphs. Experiments show that our
approach achieves state-of-the-art performances in standard CZSL while
outperforming previous methods in the open world scenario.
347
Thesis
D2, IMPR-CS, D4
M. Omran
“From Pixels to People,” Universität des Saarlandes, Saarbrücken, 2021.
Abstract
Humans are at the centre of a significant amount of research in computer vision.
Endowing machines with the ability to perceive people from visual data is an immense
scientific challenge with a high degree of direct practical relevance. Success in automatic
perception can be measured at different levels of abstraction, and this will depend on
which intelligent behaviour we are trying to replicate: the ability to localise persons in
an image or in the environment, understanding how persons are moving at the skeleton
and at the surface level, interpreting their interactions with the environment including
with other people, and perhaps even anticipating future actions. In this thesis we tackle
different sub-problems of the broad research area referred to as "looking at people",
aiming to perceive humans in images at different levels of granularity.
We start with bounding box-level pedestrian detection: We present a retrospective
analysis of methods published in the decade preceding our work, identifying various
strands of research that have advanced the state of the art. With quantitative
experiments, we demonstrate the critical role of developing better feature representations
and having the right training distribution. We then contribute two methods based
on the insights derived from our analysis: one that combines the strongest aspects of
past detectors and another that focuses purely on learning representations. The latter
method outperforms more complicated approaches, especially those based on
hand-crafted features. We conclude our work on pedestrian detection with a forward-looking
analysis that maps out potential avenues for future research.
We then turn to pixel-level methods: Perceiving humans requires us to both separate
them precisely from the background and identify their surroundings. To this end, we
introduce Cityscapes, a large-scale dataset for street scene understanding. This has since
established itself as a go-to benchmark for segmentation and detection. We additionally
develop methods that relax the requirement for expensive pixel-level annotations, focusing
on the task of boundary detection, i.e. identifying the outlines of relevant objects and
surfaces. Next, we make the jump from pixels to 3D surfaces, from localising and
labelling to fine-grained spatial understanding. We contribute a method for recovering
3D human shape and pose, which marries the advantages of learning-based and
model-based approaches.
We conclude the thesis with a detailed discussion of benchmarking practices in
computer vision. Among other things, we argue that the design of future datasets
should be driven by the general goal of combinatorial robustness besides task-specific
considerations.
348
Thesis
D2, IMPR-CS
R. Shetty
“Adversarial Content Manipulation for Analyzing and Improving Model Robustness,” Universität des Saarlandes, Saarbrücken, 2021.
349
Paper
D2
K. Zhou, B. L. Bhatnagar, B. Schiele, and G. Pons-Moll
“Adjoint Rigid Transform Network: Task-conditioned Alignment of 3D Shapes,” 2021. [Online]. Available: https://arxiv.org/abs/2102.01161.
Abstract
Most learning methods for 3D data (point clouds, meshes) suffer significant
performance drops when the data is not carefully aligned to a canonical
orientation. Aligning real world 3D data collected from different sources is
non-trivial and requires manual intervention. In this paper, we propose the
Adjoint Rigid Transform (ART) Network, a neural module which can be integrated
with a variety of 3D networks to significantly boost their performance. ART
learns to rotate input shapes to a learned canonical orientation, which is
crucial for a lot of tasks such as shape reconstruction, interpolation,
non-rigid registration, and latent disentanglement. ART achieves this with
self-supervision and a rotation equivariance constraint on predicted rotations.
The remarkable result is that with only self-supervision, ART facilitates
learning a unique canonical orientation for both rigid and nonrigid shapes,
which leads to a notable boost in performance of aforementioned tasks. We will
release our code and pre-trained models for further research.

2020

350
Conference paper
D2
D. Chen, S. Zhang, W. Ouyang, J. Yang, and B. Schiele
“Hierarchical Online Instance Matching for Person Search,” in AAAI Technical Track: Vision, New York, NY, USA, 2020.
351
Article
D2
L. Karacan, Z. Akata, A. Erdem, and E. Erdem
“Manipulating Attributes of Natural Scenes via Hallucination,” ACM Transactions on Graphics, vol. 39, no. 1, 2020.
352
Article
D4, D2
D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, M. Elgharib, P. Fua, H.-P. Seidel, H. Rhodin, G. Pons-Moll, and C. Theobalt
“XNect: Real-time Multi-person 3D Motion Capture with a Single RGB Camera,” ACM Transactions on Graphics, vol. 39, no. 4, 2020.
353
Article
D4, D2
D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, M. Elgharib, P. Fua, H.-P. Seidel, H. Rhodin, G. Pons-Moll, and C. Theobalt
“XNect: Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera,” ACM Transactions on Graphics (Proc. ACM SIGGRAPH 2020), vol. 39, no. 4, 2020.
354
Conference paper
D2, D4
B. L. Bhatnagar, C. Sminchisescu, C. Theobalt, and G. Pons-Moll
“LoopReg: Self-supervised Learning of Implicit Surface Correspondences, Pose and Shape for 3D Human Mesh Registration,” in Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual Event, 2020.
355
Conference paper
D2
D. Chen, T. Orekondy, and M. Fritz
“GS-WGAN: A Gradient-Sanitized Approach for Learning Differentially Private Generators,” in Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual Event, 2020.
356
Conference paper
D2
J. Chibane, A. Mir, and G. Pons-Moll
“Neural Unsigned Distance Fields for Implicit Function Learning,” in Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual Event, 2020.
357
Conference paper
D2
J. Dong, S. Roth, and B. Schiele
“Deep Wiener Deconvolution: Wiener Meets Deep Learning for Image Deblurring,” in Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual Event, 2020.
358
Conference paper
D2
W. Xu, Y. Xian, J. Wang, B. Schiele, and Z. Akata
“Attribute Prototype Network for Zero-Shot Learning,” in Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual Event, 2020.
359
Conference paper
D2
D. Chen, N. Yu, Y. Zhang, and M. Fritz
“GAN-Leaks: A Taxonomy of Membership Inference Attacks against GANs,” in CCS ’20, ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, USA, 2020.
360
Conference paper
D2, D4
B. L. Bhatnagar, C. Sminchisescu, C. Theobalt, and G. Pons-Moll
“Combining Implicit Function Learning and Parametric Models for 3D Human Reconstruction,” in Computer Vision -- ECCV 2020, Glasgow, UK, 2020.
361
Conference paper
D2
G. Brazil, G. Pons-Moll, X. Liu, and B. Schiele
“Kinematic 3D Object Detection in Monocular Video,” in Computer Vision -- ECCV 2020, Glasgow, UK, 2020.
362
Conference paper
D2
B. Deng, J. P. Lewis, T. Jeruzalski, G. Pons-Moll, G. Hinton, M. Norouzi, and A. Tagliasacchi
“NASA: Neural Articulated Shape Approximation,” in Computer Vision -- ECCV 2020, Glasgow, UK, 2020.
363
Conference paper
D2
Y. He, S. Rahimian, B. Schiele, and M. Fritz
“Segmentations-Leak: Membership Inference Attacks and Defenses in Semantic Image Segmentation,” in Computer Vision -- ECCV 2020, Glasgow, UK, 2020.
364
Conference paper
D2
Y. Liu, B. Schiele, and Q. Sun
“An Ensemble of Epoch-wise Empirical Bayes for Few-shot Learning,” in Computer Vision -- ECCV 2020, Glasgow, UK, 2020.
365
Conference paper
D2
M. Mancini, Z. Akata, E. Ricci, and B. Caputo
“Towards Recognizing Unseen Categories in Unseen Domains,” in Computer Vision -- ECCV 2020, Glasgow, UK, 2020.
366
Conference paper
D2
M. Rolínek, P. Swoboda, D. Zietlow, A. Paulus, V. Musil, and G. Martius
“Deep Graph Matching via Blackbox Differentiation of Combinatorial Solvers,” in Computer Vision -- ECCV 2020, Glasgow, UK, 2020.
367
Conference paper
D2
R. Shetty, M. Fritz, and B. Schiele
“Towards Automated Testing and Robustification by Semantic Adversarial Data Generation,” in Computer Vision -- ECCV 2020, Glasgow, UK, 2020.
368
Conference paper
D2
G. Tiwari, B. L. Bhatnagar, T. Tung, and G. Pons-Moll
“SIZER: A Dataset and Model for Parsing 3D Clothing and Learning Size Sensitive 3D Clothing,” in Computer Vision -- ECCV 2020, Glasgow, UK, 2020.
369
Conference paper
D2
N. Yu, K. Li, P. Zhou, J. Malik, L. Davis, and M. Fritz
“Inclusive GAN: Improving Data and Minority Coverage in Generative Models,” in Computer Vision -- ECCV 2020, Glasgow, UK, 2020.
370
Conference paper
D2
K. Zhou, B. L. Bhatnagar, and G. Pons-Moll
“Unsupervised Shape and Pose Disentanglement for 3D Meshes,” in Computer Vision -- ECCV 2020, Glasgow, UK, 2020.
371
Conference paper
D2
J. Chibane and G. Pons-Moll
“Implicit Feature Networks for Texture Completion from Partial 3D Data,” in Computer Vision -- ECCV Workshops 2020, Glasgow, UK, 2020.
372
Conference paper
D2
Y. He, B. Schiele, and M. Fritz
“Synthetic Convolutional Features for Improved Semantic Segmentation,” in Computer Vision -- ECCV Workshops 2020, Glasgow, UK, 2021.
373
Conference paper
D4, D2
S. Rao, D. Stutz, and B. Schiele
“Adversarial Training Against Location-Optimized Adversarial Patches,” in Computer Vision -- ECCV Workshops 2020, Glasgow, UK, 2021.
374
Conference paper
D2
A. Saint, A. Kacem, K. Cherenkova, K. Papadopoulos, J. Chibane, G. Pons-Moll, G. Gusev, D. Fofi, D. Aouada, and B. Ottersten
“SHARP 2020: The 1st Shape Recovery from Partial Textured 3D Scans Challenge Results,” in Computer Vision -- ECCV Workshops 2020, Glasgow, UK, 2021.
375
Conference paper
D2
H. Sattar, K. Krombholz, G. Pons-Moll, and M. Fritz
“Body Shape Privacy in Images: Understanding Privacy and Preventing Automatic Shape Extraction,” in Computer Vision -- ECCV Workshops 2020, Glasgow, UK, 2021.
Abstract
Modern approaches to pose and body shape estimation have recently achieved
strong performance even under challenging real-world conditions. Even from a
single image of a clothed person, a realistic looking body shape can be
inferred that captures a user's weight group and body shape type well. This
opens up a whole spectrum of applications -- in particular in fashion -- where
virtual try-on and recommendation systems can make use of these new and
automatized cues. However, a realistic depiction of the undressed body is
regarded as highly private and therefore might not be consented to by most people.
Hence, we ask if the automatic extraction of such information can be
effectively evaded. While adversarial perturbations have been shown to be
effective for manipulating the output of machine learning models -- in
particular, end-to-end deep learning approaches -- state-of-the-art shape
estimation methods are composed of multiple stages. We perform the first
investigation of different strategies that can be used to effectively
manipulate the automatic shape estimation while preserving the overall
appearance of the original image.
376
Conference paper
D2
Y. Xian, B. Korbar, M. Douze, B. Schiele, Z. Akata, and L. Torresani
“Generalized Many-Way Few-Shot Video Classification,” in Computer Vision -- ECCV Workshops 2020, Glasgow, UK, 2020.
377
Article
D2
J.-H. Lange, M. E. Pfetsch, B. M. Seib, and A. M. Tillmann
“Sparse Recovery with Integrality Constraints,” Discrete Applied Mathematics, vol. 283, 2020.
378
Conference paper
D2
V. Agarwal, R. Shetty, and M. Fritz
“Towards Causal VQA: Revealing and Reducing Spurious Correlations by Invariant and Covariant Semantic Editing,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
379
Conference paper
D2
A. Bhattacharyya, S. Mahajan, M. Fritz, B. Schiele, and S. Roth
“Normalizing Flows With Multi-Scale Autoregressive Priors,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
380
Conference paper
D2
D. Chen, S. Zhang, J. Yang, and B. Schiele
“Norm-Aware Embedding for Efficient Person Search,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
381
Conference paper
D2
J. Chibane, T. Alldieck, and G. Pons-Moll
“Implicit Functions in Feature Space for 3D Shape Reconstruction and Completion,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
382
Conference paper
D2
J. Choe, S. J. Oh, S. Lee, S. Chun, Z. Akata, and H. Shim
“Evaluating Weakly Supervised Object Localization Methods Right,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
383
Conference paper
D4, D2
M. Habermann, W. Xu, M. Zollhöfer, G. Pons-Moll, and C. Theobalt
“DeepCap: Monocular Human Performance Capture Using Weak Supervision,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
384
Conference paper
D2
A. Kukleva, M. Tapaswi, and I. Laptev
“Learning Interactions and Relationships between Movie Characters,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
385
Conference paper
D2
Y. Liu, Y. Su, A.-A. Liu, B. Schiele, and Q. Sun
“Mnemonics Training: Multi-Class Incremental Learning Without Forgetting,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
386
Conference paper
D2
Q. Ma, J. Yang, A. Ranjan, S. Pujades, G. Pons-Moll, S. Tang, and M. J. Black
“Learning to Dress 3D People in Generative Clothing,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
387
Conference paper
D2
A. Mir, T. Alldieck, and G. Pons-Moll
“Learning to Transfer Texture from Clothing Images to 3D Humans,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
388
Conference paper
D2
C. Patel, Z. Liao, and G. Pons-Moll
“TailorNet: Predicting Clothing in 3D as a Function of Human Pose, Shape and Garment Style,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
389
Conference paper
D2
E. Schönfeld, B. Schiele, and A. Khoreva
“A U-Net Based Discriminator for Generative Adversarial Networks,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
390
Article
D2
M. Keuper, S. Tang, B. Andres, T. Brox, and B. Schiele
“Motion Segmentation & Multiple Object Tracking by Correlation Co-Clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 1, 2020.
391
Article
D2
S. J. Oh, R. Benenson, M. Fritz, and B. Schiele
“Person Recognition in Personal Photo Collections,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 1, 2020.
392
Article
D2
T. Yu, Z. Zheng, K. Guo, J. Zhao, Q. Dai, H. Li, G. Pons-Moll, and Y. Liu
“DoubleFusion: Real-time Capture of Human Performances with Inner Body Shapes from a Single Depth Sensor,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 10, 2020.
393
Conference paper
D2
M. Federici, A. Dutta, P. Forré, N. Kushman, and Z. Akata
“Learning Robust Representations via Multi-View Information Bottleneck,” in International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia, 2020.
394
Conference paper
D2
T. Orekondy, B. Schiele, and M. Fritz
“Prediction Poisoning: Towards Defenses Against DNN Model Stealing Attacks,” in International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia, 2020.
395
Article
D2
A. Dutta and Z. Akata
“Semantically Tied Paired Cycle Consistency for Any-Shot Sketch-based Image Retrieval,” International Journal of Computer Vision, vol. 128, 2020.
396
Article
D2
H. Sattar, M. Fritz, and A. Bulling
“Deep Gaze Pooling: Inferring and Visually Decoding Search Intents from Human Gaze Fixations,” Neurocomputing, vol. 387, 2020.
397
Conference paper
D2
A. Bhattacharyya, C.-N. Straehle, M. Fritz, and B. Schiele
“Haar Wavelet based Block Autoregressive Flows for Trajectories,” in Pattern Recognition (GCPR 2020), Tübingen, Germany, 2021.
398
Conference paper
D2
Y. Fan, Y. Xian, M. M. Losch, and B. Schiele
“Analyzing the Dependency of ConvNets on Spatial Information,” in Pattern Recognition (GCPR 2020), Tübingen, Germany, 2021.
399
Conference paper
D2
Y. A. Farha, Q. Ke, B. Schiele, and J. Gall
“Long-Term Anticipation of Activities with Cycle Consistency,” in Pattern Recognition (GCPR 2020), Tübingen, Germany, 2021.
400
Conference paper
D2
J.-H. Lange and B. Andres
“On the Lifted Multicut Polytope for Trees,” in Pattern Recognition (GCPR 2020), Tübingen, Germany, 2021.
401
Conference paper
D2
M. Losch, M. Fritz, and B. Schiele
“Semantic Bottlenecks: Quantifying & Improving Inspectability of Deep Representations,” in Pattern Recognition (GCPR 2020), Tübingen, Germany, 2021.
402
Conference paper
D2
S. Sharma, N. Yu, M. Fritz, and B. Schiele
“Long-Tailed Recognition Using Class-Balanced Experts,” in Pattern Recognition (GCPR 2020), Tübingen, Germany, 2021.
403
Conference paper
D2
P. Müller, E. Sood, and A. Bulling
“Anticipating Averted Gaze in Dyadic Interactions,” in Proceedings ETRA 2020 Full Papers, Stuttgart, Germany, 2020.
404
Conference paper
D2
X. Hong, R. Shetty, A. Sayeed, K. Mehra, V. Demberg, and B. Schiele
“Diverse and Relevant Visual Storytelling with Scene Graph Embeddings,” in Proceedings of the 24th Conference on Computational Natural Language Learning (CoNLL 2020), Online, 2020.
405
Conference paper
D2
A. M. G. Salem, A. Bhattacharyya, M. Backes, M. Fritz, and Y. Zhang
“Updates-Leak: Data Set Inference and Reconstruction Attacks in Online Learning,” in Proceedings of the 29th USENIX Security Symposium, Virtual Event, 2020.
406
Conference paper
D2
A. Horňáková, R. Henschel, B. Rosenhahn, and P. Swoboda
“Lifted Disjoint Paths with Application in Multiple Object Tracking,” in Proceedings of the 37th International Conference on Machine Learning (ICML 2020), Virtual Conference, 2020.
407
Conference paper
D2
D. Stutz, M. Hein, and B. Schiele
“Confidence-Calibrated Adversarial Training: Generalizing to Unseen Attacks,” in Proceedings of the 37th International Conference on Machine Learning (ICML 2020), Virtual Conference, 2020.
408
Conference paper
D2
S. Haller, M. Prakash, L. Hutschenreiter, T. Pietzsch, C. Rother, F. Jug, P. Swoboda, and B. Savchynskyy
“A Primal-Dual Solver for Large-Scale Tracking-by-Assignment,” in Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics (AISTATS 2020), Virtual Conference, 2020.
409
Conference paper
D2
L. Salewski, A. S. Koepke, H. P. A. Lensch, and Z. Akata
“CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations,” in xxAI -- Beyond Explainable AI (xxAI @ICML 2020), Vienna, Austria (Virtually), 2022.
410
Conference paper
D2
L. Salewski, A. S. Koepke, H. P. A. Lensch, and Z. Akata
“CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations,” in xxAI -- Beyond Explainable AI (XXAI @ICML 2020), Vienna, Austria (Virtually), 2022.
411
Paper
D2
A. Doering, D. Chen, S. Zhang, B. Schiele, and J. Gall
“PoseTrackReID: Dataset Description,” 2020. [Online]. Available: https://arxiv.org/abs/2011.06243.
Abstract
Current datasets for video-based person re-identification (re-ID) do not
include structural knowledge in the form of human pose annotations for the persons
of interest. Nonetheless, pose information is very helpful to disentangle
useful feature information from background or occlusion noise. Especially
real-world scenarios, such as surveillance, contain a lot of occlusions in
human crowds or by obstacles. On the other hand, video-based person re-ID can
benefit other tasks such as multi-person pose tracking in terms of robust
feature matching. For that reason, we present PoseTrackReID, a large-scale
dataset for multi-person pose tracking and video-based person re-ID. With
PoseTrackReID, we want to bridge the gap between person re-ID and multi-person
pose tracking. Additionally, this dataset provides a good benchmark for current
state-of-the-art methods on multi-frame person re-ID.
412
Paper
D2
Y. Fan, Y. Xian, M. M. Losch, and B. Schiele
“Analyzing the Dependency of ConvNets on Spatial Information,” 2020. [Online]. Available: https://arxiv.org/abs/2002.01827.
Abstract
Intuitively, image classification should profit from using spatial
information. Recent work, however, suggests that this might be overrated in
standard CNNs. In this paper, we are pushing the envelope and aim to further
investigate the reliance on spatial information. We propose spatial shuffling
and GAP+FC to destroy spatial information during both training and testing
phases. Interestingly, we observe that spatial information can be deleted from
later layers with small performance drops, which indicates spatial information
at later layers is not necessary for good performance. For example, test
accuracy of VGG-16 only drops by 0.03% and 2.66% with spatial information
completely removed from the last 30% and 53% layers on CIFAR100, respectively.
Evaluation on several object recognition datasets (CIFAR100, Small-ImageNet,
ImageNet) with a wide range of CNN architectures (VGG16, ResNet50, ResNet152)
shows an overall consistent pattern.
413
Thesis
D2, IMPR-CS
Y. He
“Improved Methods and Analysis for Semantic Image Segmentation,” Universität des Saarlandes, Saarbrücken, 2020.
Abstract
Modern deep learning has enabled amazing developments of computer vision in recent years (Hinton and Salakhutdinov, 2006; Krizhevsky et al., 2012). As a fundamental task, semantic segmentation aims to predict class labels for each pixel of images, which empowers machines' perception of the visual world. In spite of recent successes of fully convolutional networks (Long et al., 2015), several challenges remain to be addressed. In this thesis, we focus on this topic, under different kinds of input formats and various types of scenes. Specifically, our study contains two aspects: (1) Data-driven neural modules for improved performance. (2) Leverage of datasets w.r.t. training systems with higher performances and better data privacy guarantees. In the first part of this thesis, we improve semantic segmentation by designing new modules which are compatible with existing architectures. First, we develop a spatio-temporal data-driven pooling, which brings additional information of data (i.e. superpixels) into neural networks, benefiting the training of neural networks as well as the inference on novel data. We investigate our approach in RGB-D videos for segmenting indoor scenes, where depth provides complementary cues to colors and our model performs particularly well. Second, we design learnable dilated convolutions, which are the extension of standard dilated convolutions, whose dilation factors (Yu and Koltun, 2016) need to be carefully determined by hand to obtain decent performance. We present a method to learn dilation factors together with filter weights of convolutions to avoid a complicated search of dilation factors. We conduct extensive studies on challenging street scenes, across various baselines with different complexity as well as several datasets at varying image resolutions. In the second part, we investigate how to utilize expensive training data.
First, we start from the generative modelling and study the network architectures and the learning pipeline for generating multiple examples. We aim to improve the diversity of generated examples but also to preserve the comparable quality of the examples. Second, we develop a generative model for synthesizing features of a network. With a mixture of real images and synthetic features, we are able to train a segmentation model with better generalization capability. Our approach is evaluated on different scene parsing tasks to demonstrate the effectiveness of the proposed method. Finally, we study membership inference on the semantic segmentation task. We propose the first membership inference attack system against black-box semantic segmentation models, that tries to infer if a data pair is used as training data or not. From our observations, information on training data is indeed leaking. To mitigate the leakage, we leverage our synthetic features to perform prediction obfuscations, reducing the posterior distribution gaps between a training and a testing set. Consequently, our study provides not only an approach for detecting illegal use of data, but also the foundations for a safer use of semantic segmentation models.
414
Thesis
D2, IMPR-CS, D4
E. Insafutdinov
“Towards Accurate Multi-Person Pose Estimation in the Wild,” Universität des Saarlandes, Saarbrücken, 2020.
415
Thesis
D2, IMPR-CS
J.-H. Lange
“Multicut Optimization Guarantees & Geometry of Lifted Multicuts,” Universität des Saarlandes, Saarbrücken, 2020.
416
Thesis
D2, IMPR-CS
P. Müller
“Sensing, Interpreting, and Anticipating Human Social Behaviour in the Real World,” Universität des Saarlandes, Saarbrücken, 2020.
417
Thesis
D2, IMPR-CS
T. Orekondy
“Understanding and Controlling Leakage in Machine Learning,” Universität des Saarlandes, Saarbrücken, 2020.
418
Thesis
D2, IMPR-CS
Y. Xian
“Learning from Limited Labeled Data - Zero-Shot and Few-Shot Learning,” Universität des Saarlandes, Saarbrücken, 2020.

2019

419
Article
D4, D2
M. Habermann, W. Xu, M. Zollhöfer, G. Pons-Moll, and C. Theobalt
“LiveCap: Real-time Human Performance Capture from Monocular Video,” ACM Transactions on Graphics, vol. 38, no. 2, 2019.
420
Conference paper
D2
R. Corona, S. Alaniz, and Z. Akata
“Modeling Conceptual Understanding in Image Reference Games,” in Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, Canada, 2019.
- PuRe
- BibTeX
421
Conference paper
D2
V. Garcia Satorras, Z. Akata, and M. Welling
“Combining Generative and Discriminative Models for Hybrid Inference,” in Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, Canada, 2019.
- PuRe
- BibTeX
422
Conference paper
D2
X. Li, Q. Sun, Y. Liu, Q. Zhou, S. Zheng, T.-S. Chua, and B. Schiele
“Learning to Self-Train for Semi-Supervised Few-Shot Classification,” in Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, Canada, 2019.
- PuRe
- BibTeX
423
Book chapter / section
D2
A. Bulling and M. Wedel
“Everyday Eye Tracking for Real-World Consumer Behavior Analysis,” in A Handbook of Process Tracing Methods for Decision Research, 2nd ed., New York, NY: Taylor & Francis, 2019.
- PuRe
- BibTeX
424
Conference paper
D2
A. Bhattacharyya, M. Hanselmann, M. Fritz, B. Schiele, and C.-N. Straehle
“Conditional Flow Variational Autoencoders for Structured Sequence Prediction,” in Bayesian Deep Learning NeurIPS 2019 Workshop, Vancouver, Canada, 2019.
- PuRe
- BibTeX
425
Conference paper
D2
X. Zhang, Y. Sugano, and A. Bulling
“Evaluation of Appearance-Based Methods and Implications for Gaze-Based Applications,” in CHI 2019, CHI Conference on Human Factors in Computing Systems, Glasgow, UK, 2019.
426
Conference paper
D4, D2
D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, H.-P. Seidel, P. Fua, M. Elgharib, H. Rhodin, G. Pons-Moll, and C. Theobalt
“XNect Demo (v2): Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera,” in CVPR 2019 Demonstrations, Long Beach, CA, USA, 2019.
- PuRe
- BibTeX
427
Book chapter / section
D2
S. J. Oh, B. Schiele, and M. Fritz
“Towards Reverse-Engineering Black-Box Neural Networks,” in Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Berlin: Springer, 2019.
428
Article
D2
J. Steil, M. Tonsen, Y. Sugano, and A. Bulling
“InvisibleEye: Fully Embedded Mobile Eye Tracking Using Appearance-Based Gaze Estimation,” GetMobile, vol. 23, no. 2, 2019.
429
Conference paper
D2
P. Müller and A. Bulling
“Emergent Leadership Detection Across Datasets,” in ICMI ’19, International Conference on Multimodal Interaction, Suzhou, China, 2019.
Abstract
Automatic detection of emergent leaders in small groups from nonverbal
behaviour is a growing research topic in social signal processing but existing
methods were evaluated on single datasets -- an unrealistic assumption for
real-world applications in which systems are required to also work in settings
unseen at training time. It therefore remains unclear whether current methods
for emergent leadership detection generalise to similar but new settings and to
what extent. To overcome this limitation, we are the first to study a
cross-dataset evaluation setting for the emergent leadership detection task. We
provide evaluations for within- and cross-dataset prediction using two current
datasets (PAVIS and MPIIGroupInteraction), as well as an investigation on the
robustness of commonly used feature channels (visual focus of attention, body
pose, facial action units, speaking activity) and online prediction in the
cross-dataset setting. Our evaluations show that using pose and eye contact
based features, cross-dataset prediction is possible with an accuracy of 0.68,
as such providing another important piece of the puzzle towards emergent
leadership detection in the real world.
430
Conference paper
D2, D4
T. Alldieck, M. A. Magnor, B. L. Bhatnagar, C. Theobalt, and G. Pons-Moll
“Learning to Reconstruct People in Clothing from a Single RGB Camera,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
431
Conference paper
D2
A. Dutta and Z. Akata
“Semantically Tied Paired Cycle Consistency for Zero-Shot Sketch-based Image Retrieval,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
432
Conference paper
D4, D2
I. Habibie, W. Xu, D. Mehta, G. Pons-Moll, and C. Theobalt
“In the Wild Human Pose Estimation using Explicit 2D Features and Intermediate 3D Representations,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
433
Conference paper
D2
Q. Ke, M. Fritz, and B. Schiele
“Time-Conditioned Action Anticipation in One Shot,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
434
Conference paper
D2
J.-H. Lange, B. Andres, and P. Swoboda
“Combinatorial Persistency Criteria for Multicut and Max-Cut,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
435
Conference paper
D2
T. Orekondy, B. Schiele, and M. Fritz
“Knockoff Nets: Stealing Functionality of Black-Box Models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
436
Conference paper
D2
E. Schönfeld, S. Ebrahimi, S. Sinha, T. Darrell, and Z. Akata
“Generalized Zero- and Few-Shot Learning via Aligned Variational Autoencoders,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
437
Conference paper
D2
R. Shetty, B. Schiele, and M. Fritz
“Not Using the Car to See the Sidewalk: Quantifying and Controlling the Effects of Context in Classification and Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
438
Conference paper
D2
D. Stutz, M. Hein, and B. Schiele
“Disentangling Adversarial Robustness and Generalization,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
439
Conference paper
D2
Q. Sun, Y. Liu, T.-S. Chua, and B. Schiele
“Meta-Transfer Learning for Few-Shot Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
440
Conference paper
D2
P. Swoboda and V. Kolmogorov
“MAP Inference via Block-Coordinate Frank-Wolfe Algorithm,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
441
Conference paper
D2, D4
P. Swoboda, D. Kainmüller, A. Mokarian, C. Theobalt, and F. Bernard
“A Convex Relaxation for Multi-Graph Matching,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
442
Conference paper
D2
Y. Xian, S. Sharma, B. Schiele, and Z. Akata
“f-VAEGAN-D2: A Feature Generating Framework for Any-Shot Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
Abstract
When labeled training data is scarce, a promising data augmentation approach is to generate visual features of unknown classes using their attributes. To learn the class conditional distribution of CNN features, these models rely on pairs of image features and class attributes. Hence, they cannot make use of the abundance of unlabeled data samples. In this paper, we tackle any-shot learning problems, i.e. zero-shot and few-shot, in a unified feature generating framework that operates in both inductive and transductive learning settings. We develop a conditional generative model that combines the strength of VAE and GANs and, in addition, via an unconditional discriminator, learns the marginal feature distribution of unlabeled images. We empirically show that our model learns highly discriminative CNN features for five datasets, i.e. CUB, SUN, AWA and ImageNet, and establish a new state-of-the-art in any-shot learning, i.e. inductive and transductive (generalized) zero- and few-shot learning settings. We also demonstrate that our learned features are interpretable: we visualize them by inverting them back to the pixel space and we explain them by generating textual arguments of why they are associated with a certain label.
443
Conference paper
D2
Y. Xian, S. Choudhury, Y. He, B. Schiele, and Z. Akata
“Semantic Projection Network for Zero- and Few-Label Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
- PuRe
- BibTeX
444
Conference paper
D2
N. Yu, C. Barnes, E. Shechtman, S. Amirghodsi, and M. Lukáč
“Texture Mixer: A Network for Controllable Synthesis and Interpolation of Texture,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
445
Conference paper
D2, D4
T. Yu, Z. Zheng, Y. Zhong, J. Zhao, Q. Dai, G. Pons-Moll, and Y. Liu
“SimulCap: Single-View Human Performance Capture with Cloth Simulation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
446
Conference paper
D2
S. Abdelnabi, M. X. Huang, and A. Bulling
“Towards High-Frequency SSVEP-Based Target Discrimination with an Extended Alphanumeric Keyboard,” in IEEE International Conference on Systems, Man, and Cybernetics (SMC 2019), Bari, Italy, 2019.
447
Article
D2
Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata
“Zero-shot Learning - A Comprehensive Evaluation of the Good, the Bad and the Ugly,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 9, 2019.
Abstract
Due to the importance of zero-shot learning, i.e. classifying images where
there is a lack of labeled training data, the number of proposed approaches has
recently increased steadily. We argue that it is time to take a step back and
to analyze the status quo of the area. The purpose of this paper is three-fold.
First, given the fact that there is no agreed upon zero-shot learning
benchmark, we first define a new benchmark by unifying both the evaluation
protocols and data splits of publicly available datasets used for this task.
This is an important contribution as published results are often not comparable
and sometimes even flawed due to, e.g. pre-training on zero-shot test classes.
Moreover, we propose a new zero-shot learning dataset, the Animals with
Attributes 2 (AWA2) dataset which we make publicly available both in terms of
image features and the images themselves. Second, we compare and analyze a
significant number of the state-of-the-art methods in depth, both in the
classic zero-shot setting but also in the more realistic generalized zero-shot
setting. Finally, we discuss in detail the limitations of the current status of
the area which can be taken as a basis for advancing it.
448
Article
D2
X. Zhang, Y. Sugano, M. Fritz, and A. Bulling
“MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 1, 2019.
449
Conference paper
D2
H. Sattar, G. Pons-Moll, and M. Fritz
“Fashion is Taking Shape: Understanding Clothing Preference Based on Body Shape From Online Sources,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV 2019), Waikoloa Village, HI, USA, 2019.
450
Conference paper
D2
V. Lazova, E. Insafutdinov, and G. Pons-Moll
“360-Degree Textures of People in Clothing from a Single Image,” in International Conference on 3D Vision, Québec City, Canada, 2019.
451
Conference paper
D2
A. Abbas and P. Swoboda
“Bottleneck Potentials in Markov Random Fields,” in International Conference on Computer Vision (ICCV 2019), Seoul, Korea, 2019.
452
Conference paper
D2, D4
T. Alldieck, G. Pons-Moll, C. Theobalt, and M. A. Magnor
“Tex2Shape: Detailed Full Human Body Geometry from a Single Image,” in International Conference on Computer Vision (ICCV 2019), Seoul, Korea, 2019.
Abstract
We present a simple yet effective method to infer detailed full human body
shape from only a single photograph. Our model can infer full-body shape
including face, hair, and clothing including wrinkles at interactive
frame-rates. Results feature details even on parts that are occluded in the
input image. Our main idea is to turn shape regression into an aligned
image-to-image translation problem. The input to our method is a partial
texture map of the visible region obtained from off-the-shelf methods. From a
partial texture, we estimate detailed normal and vector displacement maps,
which can be applied to a low-resolution smooth body model to add detail and
clothing. Despite being trained purely with synthetic data, our model
generalizes well to real-world photographs. Numerous results demonstrate the
versatility and robustness of our method.
453
Conference paper
D4, D2
F. Bernard, J. Thunberg, P. Swoboda, and C. Theobalt
“HiPPI: Higher-Order Projected Power Iterations for Scalable Multi-Matching,” in International Conference on Computer Vision (ICCV 2019), Seoul, Korea, 2019.
454
Conference paper
D2, D4
B. L. Bhatnagar, G. Tiwari, C. Theobalt, and G. Pons-Moll
“Multi-Garment Net: Learning to Dress 3D People from Images,” in International Conference on Computer Vision (ICCV 2019), Seoul, Korea, 2019.
455
Conference paper
D2
N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black
“AMASS: Archive of Motion Capture as Surface Shapes,” in International Conference on Computer Vision (ICCV 2019), Seoul, Korea, 2019.
456
Conference paper
D2
S. Sharma, P. T. Varigonda, P. Bindal, A. Sharma, and A. Jain
“Monocular 3D Human Pose Estimation by Generation and Ordinal Ranking,” in International Conference on Computer Vision (ICCV 2019), Seoul, Korea, 2019.
457
Conference paper
D2
N. Yu, L. Davis, and M. Fritz
“Attributing Fake Images to GANs: Learning and Analyzing GAN Fingerprints,” in International Conference on Computer Vision (ICCV 2019), Seoul, Korea, 2019.
458
Conference paper
D2
A. Bhattacharyya, M. Fritz, and B. Schiele
“Bayesian Prediction of Future Street Scenes using Synthetic Likelihoods,” in International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA, 2019.
- PuRe
- BibTeX
459
Article
D2
A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele
“Lucid Data Dreaming for Video Object Segmentation,” International Journal of Computer Vision, vol. 127, no. 9, 2019.
460
Conference paper
D2
M. X. Huang, J. Li, G. Ngai, H. V. Leong, and A. Bulling
“Moment-to-Moment Detection of Internal Thought from Eye Vergence Behaviour,” in MM ’19, 27th ACM International Conference on Multimedia, Nice, France, 2019.
Abstract
Internal thought refers to the process of directing attention away from a
primary visual task to internal cognitive processing. Internal thought is a
pervasive mental activity and closely related to primary task performance. As
such, automatic detection of internal thought has significant potential for
user modelling in intelligent interfaces, particularly for e-learning
applications. Despite the close link between the eyes and the human mind, only
a few studies have investigated vergence behaviour during internal thought and
none has studied moment-to-moment detection of internal thought from gaze.
While prior studies relied on long-term data analysis and required a large
number of gaze characteristics, we describe a novel method that is
computationally light-weight and that only requires eye vergence information
that is readily available from binocular eye trackers. We further propose a
novel paradigm to obtain ground truth internal thought annotations that
exploits human blur perception. We evaluate our method for three increasingly
challenging detection tasks: (1) during a controlled math-solving task, (2)
during natural viewing of lecture videos, and (3) during daily activities, such
as coding, browsing, and reading. Results from these evaluations demonstrate
the performance and robustness of vergence-based detection of internal thought
and, as such, open up new directions for research on interfaces that adapt to
shifts of mental attention.
461
Conference paper
D2
X. Hong, E. Chang, and V. Demberg
“Improving Language Generation from Feature-Rich Tree-Structured Data with Relational Graph Convolutional Encoders,” in Multilingual Surface Realisation (MSR 2019), Hong Kong, China, 2019.
462
Conference paper
D2
M. X. Huang and A. Bulling
“SacCalib: Reducing Calibration Distortion for Stationary Eye Trackers Using Saccadic Eye Movements,” in Proceedings ETRA 2019, Denver, CO, USA, 2019.
Abstract
Recent methods to automatically calibrate stationary eye trackers were shown
to effectively reduce inherent calibration distortion. However, these methods
require additional information, such as mouse clicks or on-screen content. We
propose the first method that only requires users' eye movements to reduce
calibration distortion in the background while users naturally look at an
interface. Our method exploits that calibration distortion makes straight
saccade trajectories appear curved between the saccadic start and end points.
We show that this curving effect is systematic and the result of a distorted gaze
projection plane. To mitigate calibration distortion, our method undistorts
this plane by straightening saccade trajectories using image warping. We show
that this approach improves over the common six-point calibration and is
promising for reducing distortion. As such, it provides a non-intrusive
solution to alleviating the accuracy decrease of eye trackers during long-term use.
463
Conference paper
D2
P. Müller, D. Buschek, M. X. Huang, and A. Bulling
“Reducing Calibration Drift in Mobile Eye Trackers by Exploiting Mobile Phone Usage,” in Proceedings ETRA 2019, Denver, CO, USA, 2019.
464
Conference paper
D2
J. Steil, I. Hagestedt, M. X. Huang, and A. Bulling
“Privacy-Aware Eye Tracking Using Differential Privacy,” in Proceedings ETRA 2019, Denver, CO, USA, 2019.
465
Conference paper
D2
J. Steil, M. Koelle, W. Heuten, S. Boll, and A. Bulling
“PrivacEye: Privacy-Preserving Head-Mounted Eye Tracking Using Egocentric Scene Image and Eye Movement Features,” in Proceedings ETRA 2019, Denver, CO, USA, 2019.
466
Conference paper
D2
J. Wang, E. Y. Fu, G. Ngai, H. V. Leong, and M. X. Huang
“Detecting Stress from Mouse-Gaze Attraction,” in Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing (SAC 2019), Limassol, Cyprus, 2019.
467
Conference paper
D2
T. Orekondy, S. J. Oh, Y. Zhang, B. Schiele, and M. Fritz
“Gradient-Leaks: Understanding Deanonymization in Federated Learning,” in The 2nd International Workshop on Federated Learning for Data Privacy and Confidentiality (FL-NeurIPS 2019), Vancouver, Canada, 2019.
- PuRe
- BibTeX
468
Paper
D2
A. Abbas and P. Swoboda
“Bottleneck Potentials in Markov Random Fields,” 2019. [Online]. Available: http://arxiv.org/abs/1904.08080.
Abstract
We consider general discrete Markov Random Fields (MRFs) with additional
bottleneck potentials which penalize the maximum (instead of the sum) over
local potential value taken by the MRF-assignment. Bottleneck potentials or
analogous constructions have been considered in (i) combinatorial optimization
(e.g. bottleneck shortest path problem, the minimum bottleneck spanning tree
problem, bottleneck function minimization in greedoids), (ii) inverse problems
with $L_{\infty}$-norm regularization, and (iii) valued constraint satisfaction
on the $(\min,\max)$-pre-semirings. Bottleneck potentials for general discrete
MRFs are a natural generalization of the above direction of modeling work to
Maximum-A-Posteriori (MAP) inference in MRFs. To this end, we propose MRFs
whose objective consists of two parts: terms that factorize according to (i)
$(\min,+)$, i.e. potentials as in plain MRFs, and (ii) $(\min,\max)$, i.e.
bottleneck potentials. To solve the ensuing inference problem, we propose
high-quality relaxations and efficient algorithms for solving them. We
empirically show efficacy of our approach on large scale seismic horizon
tracking problems.
- PuRe
- BibTeX
469
Paper
D2
A. Bhattacharyya, M. Fritz, and B. Schiele
“‘Best-of-Many-Samples’ Distribution Matching,” 2019. [Online]. Available: http://arxiv.org/abs/1909.12598.
Abstract
Generative Adversarial Networks (GANs) can achieve state-of-the-art sample
quality in generative modelling tasks but suffer from the mode collapse
problem. Variational Autoencoders (VAE) on the other hand explicitly maximize a
reconstruction-based data log-likelihood forcing it to cover all modes, but
suffer from poorer sample quality. Recent works have proposed hybrid VAE-GAN
frameworks which integrate a GAN-based synthetic likelihood to the VAE
objective to address both the mode collapse and sample quality issues, with
limited success. This is because the VAE objective forces a trade-off between
the data log-likelihood and divergence to the latent prior. The synthetic
likelihood ratio term also shows instability during training. We propose a
novel objective with a "Best-of-Many-Samples" reconstruction cost and a stable
direct estimate of the synthetic likelihood. This enables our hybrid VAE-GAN
framework to achieve high data log-likelihood and low divergence to the latent
prior at the same time and shows significant improvement over both hybrid
VAE-GANs and plain GANs in mode coverage and quality.
- PuRe
- BibTeX
470
Paper
D2
Y. Liu, Q. Sun, A.-A. Liu, Y. Su, B. Schiele, and T.-S. Chua
“LCC: Learning to Customize and Combine Neural Networks for Few-Shot Learning,” 2019. [Online]. Available: http://arxiv.org/abs/1904.08479.
Abstract
Meta-learning has been shown to be an effective strategy for few-shot
learning. The key idea is to leverage a large number of similar few-shot tasks
in order to meta-learn how to best initiate a (single) base-learner for novel
few-shot tasks. While meta-learning how to initialize a base-learner has shown
promising results, it is well known that hyperparameter settings such as the
learning rate and the weighting of the regularization term are important to
achieve best performance. We thus propose to also meta-learn these
hyperparameters and in fact learn a time- and layer-varying scheme for learning
a base-learner on novel tasks. Additionally, we propose to learn not only a
single base-learner but an ensemble of several base-learners to obtain more
robust results. While ensembles of learners have been shown to improve performance
in various settings, this is challenging for few-shot learning tasks due to the
limited number of training samples. Therefore, our approach also aims to
meta-learn how to effectively combine several base-learners. We conduct
extensive experiments and report top performance for five-class few-shot
recognition tasks on two challenging benchmarks: miniImageNet and
Fewshot-CIFAR100 (FC100).
- PuRe
- BibTeX
471
Paper
D2
W. Li, A. Leonardis, J. Bohg, and M. Fritz
“Learning Manipulation under Physics Constraints with Visual Perception,” 2019. [Online]. Available: http://arxiv.org/abs/1904.09860.
Abstract
Understanding physical phenomena is a key competence that enables humans and
animals to act and interact under uncertain perception in previously unseen
environments containing novel objects and their configurations. In this work,
we consider the problem of autonomous block stacking and explore solutions to
learning manipulation under physics constraints with visual perception inherent
to the task. Inspired by the intuitive physics in humans, we first present an
end-to-end learning-based approach to predict stability directly from
appearance, contrasting a more traditional model-based approach with explicit
3D representations and physical simulation. We study the model's behavior
together with an accompanying human subject test. It is then integrated into a
real-world robotic system to guide the placement of a single wood block into
the scene without collapsing the existing tower structure. To further automate the
process of consecutive block stacking, we present an alternative approach
where the model learns the physics constraint through the interaction with the
environment, bypassing the dedicated physics learning as in the former part of
this work. In particular, we are interested in the type of tasks that require
the agent to reach a given goal state that may be different for every new
trial. Thereby we propose a deep reinforcement learning framework that learns
policies for stacking tasks which are parametrized by a target structure.
- PuRe
- BibTeX
472
Paper
D2
M. Losch, M. Fritz, and B. Schiele
“Interpretability Beyond Classification Output: Semantic Bottleneck Networks,” 2019. [Online]. Available: http://arxiv.org/abs/1907.10882.
Abstract
Today's deep learning systems deliver high performance based on end-to-end
training. While they deliver strong performance, these systems are hard to
interpret. To address this issue, we propose Semantic Bottleneck Networks
(SBN): deep networks with semantically interpretable intermediate layers that
all downstream results are based on. As a consequence, the analysis on what the
final prediction is based on is transparent to the engineer and failure cases
and modes can be analyzed and avoided by high-level reasoning. We present a
case study on street scene segmentation to demonstrate the feasibility and
power of SBN. In particular, we start from a well performing classic deep
network which we adapt to house a SB-Layer containing task related semantic
concepts (such as object-parts and materials). Importantly, we can recover
state of the art performance despite a drastic dimensionality reduction from
1000s (non-semantic feature) to 10s (semantic concept) channels. Additionally
we show how the activations of the SB-Layer can be used for both the
interpretation of failure cases of the network as well as for confidence
prediction of the resulting output. For the first time, e.g., we show
interpretable segmentation results for most predictions at over 99% accuracy.
- PuRe
- BibTeX
473
Paper
D2
L. Ma, Q. Sun, B. Schiele, and L. Van Gool
“A Novel BiLevel Paradigm for Image-to-Image Translation,” 2019. [Online]. Available: http://arxiv.org/abs/1904.09028.
Abstract
Image-to-image (I2I) translation is a pixel-level mapping that requires a
large number of paired training data and often suffers from the problems of
high diversity and strong category bias in image scenes. In order to tackle
these problems, we propose a novel BiLevel (BiL) learning paradigm that
alternates the learning of two models, respectively at an instance-specific
(IS) and a general-purpose (GP) level. In each scene, the IS model learns to
maintain the specific scene attributes. It is initialized by the GP model that
learns from all the scenes to obtain the generalizable translation knowledge.
This GP initialization gives the IS model an efficient starting point, thus
enabling its fast adaptation to the new scene with scarce training data. We
conduct extensive I2I translation experiments on human face and street view
datasets. Quantitative results validate that our approach can significantly
boost the performance of classical I2I translation models, such as PG2 and
Pix2Pix. Our visualization results show both higher image quality and more
appropriate instance-specific details, e.g., the translated image of a person
looks more like that person in terms of identity.
- PuRe
- BibTeX
474
Paper
D4, D2
D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, M. Elgharib, P. Fua, H.-P. Seidel, H. Rhodin, G. Pons-Moll, and C. Theobalt
“XNect: Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera,” 2019. [Online]. Available: http://arxiv.org/abs/1907.00837.
Abstract
We present a real-time approach for multi-person 3D motion capture at over 30
fps using a single RGB camera. It operates in generic scenes and is robust to
difficult occlusions both by other people and objects. Our method operates in
subsequent stages. The first stage is a convolutional neural network (CNN) that
estimates 2D and 3D pose features along with identity assignments for all
visible joints of all individuals. We contribute a new architecture for this
CNN, called SelecSLS Net, that uses novel selective long and short range skip
connections to improve the information flow allowing for a drastically faster
network without compromising accuracy. In the second stage, a fully-connected
neural network turns the possibly partial (on account of occlusion) 2D pose and
3D pose features for each subject into a complete 3D pose estimate per
individual. The third stage applies space-time skeletal model fitting to the
predicted 2D and 3D pose per subject to further reconcile the 2D and 3D pose,
and enforce temporal coherence. Our method returns the full skeletal pose in
joint angles for each subject. This is a further key distinction from previous
work that neither extracted global body positions nor joint angle results of a
coherent skeleton in real time for multi-person scenes. The proposed system
runs on consumer hardware at a previously unseen speed of more than 30 fps
given 512x320 images as input while achieving state-of-the-art accuracy, which
we will demonstrate on a range of challenging real-world scenes.
- PuRe
- BibTeX
475
Paper
D2
H. Sattar, K. Krombholz, G. Pons-Moll, and M. Fritz
“Shape Evasion: Preventing Body Shape Inference of Multi-Stage Approaches,” 2019. [Online]. Available: http://arxiv.org/abs/1905.11503.
Abstract
Modern approaches to pose and body shape estimation have recently achieved
strong performance even under challenging real-world conditions. Even from a
single image of a clothed person, a realistic looking body shape can be
inferred that captures a user's weight group and body shape type well. This
opens up a whole spectrum of applications -- in particular in fashion -- where
virtual try-on and recommendation systems can make use of these new and
automatized cues. However, a realistic depiction of the undressed body is
regarded as highly private, and most people might therefore not consent to it.
Hence, we ask if the automatic extraction of such information can be
effectively evaded. While adversarial perturbations have been shown to be
effective for manipulating the output of machine learning models -- in
particular, end-to-end deep learning approaches -- state of the art shape
estimation methods are composed of multiple stages. We perform the first
investigation of different strategies that can be used to effectively
manipulate the automatic shape estimation while preserving the overall
appearance of the original image.
476
Thesis
D2IMPR-CS
H. Sattar
“Intents and Preferences Prediction Based on Implicit Human Cues,” Universität des Saarlandes, Saarbrücken, 2019.
Abstract
Visual search is an important task, and it is part of daily human life. Thus, it has been a long-standing goal in Computer Vision to develop methods aiming at analysing human search intent and preferences. As the target of the search only exists in the mind of the person, search intent prediction remains challenging for machine perception. In this thesis, we focus on advancing techniques for search target and preference prediction from implicit human cues. First, we propose a search target inference algorithm from human fixation data recorded during visual search. In contrast to previous work that has focused on individual instances as a search target in a closed world, we propose the first approach to predict the search target in open-world settings by learning the compatibility between observed fixations and potential search targets. Second, we further broaden the scope of search target prediction to categorical classes, such as object categories and attributes. However, state-of-the-art models for categorical recognition, in general, require large amounts of training data, which is prohibitive for gaze data. To address this challenge, we propose a novel Gaze Pooling Layer that integrates gaze information into CNN-based architectures as an attention mechanism – incorporating both spatial and temporal aspects of human gaze behaviour. Third, we go one step further and investigate the feasibility of combining our gaze embedding approach with the power of generative image models to visually decode, i.e. create a visual representation of, the search target. Fourth, for the first time, we study the effect of body shape on people's outfit preferences. We propose a novel and robust multi-photo approach to estimate the body shape of each user and build a conditional model of clothing categories given body shape. We demonstrate that in real-world data, clothing categories and body shapes are correlated.
We show that our approach estimates a realistic looking body shape that captures a user’s weight group and body shape type, even from a single image of a clothed person. However, an accurate depiction of the naked body is considered highly private, and most people might therefore not consent to it. First, we study the perception of such technology via a user study. Then, in the last part of this thesis, we ask if the automatic extraction of such information can be effectively evaded. In summary, this thesis addresses several different tasks that aim to enable the vision system to analyse human search intent and preferences in real-world scenarios. In particular, the thesis proposes several novel ideas and models in visual search target prediction from human fixation data, studies for the first time the correlation between shape and clothing categories, opening a new direction in clothing recommendation systems, and introduces a new topic in privacy and computer vision, aimed at preventing automatic 3D shape extraction from images.
477
Thesis
D2IMPR-CS
J. Steil
“Mobile Eye Tracking for Everyone,” Universität des Saarlandes, Saarbrücken, 2019.
Abstract
Eye tracking and gaze-based human-computer interfaces have become a practical modality in desktop settings, since remote eye tracking is efficient and affordable. However, remote eye tracking remains constrained to indoor, laboratory-like conditions, in which lighting and user position need to be controlled. Mobile eye tracking has the potential to overcome these limitations and to allow people to move around freely and to use eye tracking on a daily basis during their everyday routine. However, mobile eye tracking currently faces two fundamental challenges that prevent it from being practically usable and that, consequently, have to be addressed before mobile eye tracking can truly be used by everyone: Mobile eye tracking needs to be advanced and made fully functional in unconstrained environments, and it needs to be made socially acceptable. Numerous sensing and analysis methods were initially developed for remote eye tracking and have been successfully applied for decades. Unfortunately, these methods are limited in terms of functionality and correctness, or even unsuitable for application in mobile eye tracking. Therefore, the majority of fundamental definitions, eye tracking methods, and gaze estimation approaches cannot be borrowed from remote eye tracking without adaptation. For example, the definitions of specific eye movements, like classical fixations, need to be extended to mobile settings where natural user and head motion are omnipresent. Corresponding analytical methods need to be adjusted or completely reimplemented based on novel approaches encoding the human gaze behaviour. Apart from these technical challenges, an entirely new, and yet under-explored, topic required for the breakthrough of mobile eye tracking as everyday technology is the overcoming of social obstacles. A first crucial key issue to defuse social objections is the building of acceptance towards mobile eye tracking. 
Hence, it is essential to replace the bulky appearance of current head-mounted eye trackers with an unobtrusive, appealing, and trendy design. The second high-priority theme of increasing importance for everyone is privacy and its protection, given that research and industry have not focused on or taken care of this problem at all. To establish true confidence, future devices have to find a fine balance between protecting users’ and bystanders’ privacy and attracting and convincing users of their necessity, utility, and potential with useful and beneficial features. The solution of technical challenges and social obstacles is the prerequisite for the development of a variety of novel and exciting applications in order to establish mobile eye tracking as a new paradigm, which eases our everyday life. This thesis addresses core technical challenges of mobile eye tracking that currently prevent it from being widely adopted. Specifically, this thesis proves that 3D data used for the calibration of mobile eye trackers improves gaze estimation and significantly reduces the parallax error. Further, it presents the first effective fixation detection method for head-mounted devices that is robust against the prevalence of user and gaze target motion. In order to achieve social acceptability, this thesis proposes an innovative and unobtrusive design for future mobile eye tracking devices and builds the first prototype with fully frame-embedded eye cameras combined with a calibration-free deep-trained appearance-based gaze estimation approach. To protect users’ and bystanders’ privacy in the presence of head-mounted eye trackers, this thesis presents another first-of-its-kind prototype. It is able to identify privacy-sensitive situations to automatically enable and disable the eye tracker’s first-person camera by means of a mechanical shutter, leveraging the combination of deep scene and eye movement features.
Nevertheless, solving technical challenges and social obstacles alone is not sufficient to make mobile eye tracking attractive for the masses. The key to success is the development of convincingly useful, innovative, and essential applications. To extend the protection of users’ privacy on the software side as well, this thesis presents the first privacy-aware VR gaze interface using differential privacy. This method adds noise to recorded eye tracking data so that privacy-sensitive information like a user’s gender or identity is protected without impeding the utility of the data itself. In addition, the first large-scale online survey is conducted to understand users’ concerns with eye tracking. To develop and evaluate novel applications, this thesis presents the first publicly available long-term eye tracking datasets. They are used to show the unsupervised detection of users’ activities from eye movements alone using novel and efficient video-based encoding approaches as well as to propose the first proof-of-concept method to forecast users’ attentive behaviour during everyday mobile interactions from phone-integrated and body-worn sensors. This opens up possibilities for the development of a variety of novel and exciting applications. With more advanced features, accompanied by technological progress and sensor miniaturisation, eye tracking is increasingly integrated into conventional glasses as well as virtual and augmented reality (VR/AR) head-mounted displays, becoming an integral component of mobile interfaces. This thesis paves the way for the development of socially acceptable, privacy-aware, but highly functional mobile eye tracking devices and novel applications, so that mobile eye tracking can develop its full potential to become an everyday technology for everyone.
478
Paper
D2
D. Stutz, M. Hein, and B. Schiele
“Confidence-Calibrated Adversarial Training and Detection: More Robust Models Generalizing Beyond the Attack Used During Training,” 2019. [Online]. Available: http://arxiv.org/abs/1910.06259.
Abstract
Adversarial training is the standard to train models robust against
adversarial examples. However, especially for complex datasets, adversarial
training incurs a significant loss in accuracy and is known to generalize
poorly to stronger attacks, e.g., larger perturbations or other threat models.
In this paper, we introduce confidence-calibrated adversarial training (CCAT)
where the key idea is to enforce that the confidence on adversarial examples
decays with their distance to the attacked examples. We show that CCAT
better preserves the accuracy of normal training while robustness against
adversarial examples is achieved via confidence thresholding, i.e., detecting
adversarial examples based on their confidence. Most importantly, in strong
contrast to adversarial training, the robustness of CCAT generalizes to larger
perturbations and other threat models, not encountered during training. For
evaluation, we extend the commonly used robust test error to our detection
setting, present an adaptive attack with backtracking and allow the attacker to
select, per test example, the worst-case adversarial example from multiple
black- and white-box attacks. We present experimental results using $L_\infty$,
$L_2$, $L_1$ and $L_0$ attacks on MNIST, SVHN and Cifar10.

2018

479
Conference paper
D4D2
E. Tretschk, S. J. Oh, and M. Fritz
“Sequential Attacks on Agents for Long-Term Adversarial Goals,” in 2nd ACM Computer Science in Cars Symposium (CSCS 2018), Munich, Germany, 2018.
480
Conference paper
D2D4
T. Alldieck, M. A. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll
“Detailed Human Avatars from Monocular Video,” in 3DV 2018, International Conference on 3D Vision, Verona, Italy, 2018.
481
Conference paper
D4D2
D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, S. Sridhar, G. Pons-Moll, and C. Theobalt
“Single-Shot Multi-person 3D Pose Estimation from Monocular RGB,” in 3DV 2018, International Conference on 3D Vision, Verona, Italy, 2018.
482
Conference paper
D2
M. Omran, C. Lassner, G. Pons-Moll, P. Gehler, and B. Schiele
“Neural Body Fitting: Unifying Deep Learning and Model Based Human Pose and Shape Estimation,” in 3DV 2018, International Conference on 3D Vision, Verona, Italy, 2018.
483
Article
D2
Y. Huang, M. Kaufmann, E. Aksan, M. J. Black, O. Hilliges, and G. Pons-Moll
“Deep Inertial Poser: Learning to Reconstruct Human Pose from Sparse Inertial Measurements in Real Time,” ACM Transactions on Graphics (Proc. ACM SIGGRAPH Asia 2018), vol. 37, no. 6, 2018.
484
Article
D2
M. X. Huang, J. Li, G. Ngai, and H. V. Leong
“Quick Bootstrapping of a Personalized Gaze Model from Real-Use Interactions,” ACM Transactions on Intelligent Systems and Technology, vol. 9, no. 4, 2018.
485
Conference paper
D2
E. Insafutdinov and A. Dosovitskiy
“Unsupervised Learning of Shape and Pose with Differentiable Point Clouds,” in Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Montréal, Canada, 2018.
486
Conference paper
D2
R. Shetty, M. Fritz, and B. Schiele
“Adversarial Scene Editing: Automatic Object Removal from Weak Supervision,” in Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Montréal, Canada, 2018.
Abstract
While great progress has been made recently in automatic image manipulation,
it has been limited to object centric images like faces or structured scene
datasets. In this work, we take a step towards general scene-level image
editing by developing an automatic interaction-free object removal model. Our
model learns to find and remove objects from general scene images using
image-level labels and unpaired data in a generative adversarial network (GAN)
framework. We achieve this with two key contributions: a two-stage editor
architecture consisting of a mask generator and image in-painter that
co-operate to remove objects, and a novel GAN based prior for the mask
generator that allows us to flexibly incorporate knowledge about object shapes.
We experimentally show on two datasets that our method effectively removes a
wide variety of objects using weak supervision only.
487
Conference paper
D2
M. Khamis, C. Oechsner, F. Alt, and A. Bulling
“VRPursuits: Interaction in Virtual Reality using Smooth Pursuit Eye Movements,” in AVI 2018, International Conference on Advanced Visual Interfaces, Grosseto, Italy, 2018.
488
Article
D2BIOD5
A. Horňáková, M. List, J. Vreeken, and M. H. Schulz
“JAMI: Fast Computation of Conditional Mutual Information for ceRNA Network Analysis,” Bioinformatics, vol. 34, no. 17, 2018.
489
Conference paper
D2
M. Khamis, A. Baier, N. Henze, F. Alt, and A. Bulling
“Understanding Face and Eye Visibility in Front-Facing Cameras of Smartphones used in the Wild,” in CHI 2018, CHI Conference on Human Factors in Computing Systems, Montréal, Canada, 2018.
490
Conference paper
D2
M. Khamis, C. Becker, A. Bulling, and F. Alt
“Which one is me? Identifying Oneself on Public Displays,” in CHI 2018, CHI Conference on Human Factors in Computing Systems, Montréal, Canada, 2018.
491
Conference paper
D2
X. Zhang, M. X. Huang, Y. Sugano, and A. Bulling
“Training Person-Specific Gaze Estimators from Interactions with Multiple Devices,” in CHI 2018, CHI Conference on Human Factors in Computing Systems, Montréal, Canada, 2018.
492
Article
D2
E. Wood, T. Baltrusaitis, L.-P. Morency, P. Robinson, and A. Bulling
“GazeDirector: Fully Articulated Eye Gaze Redirection in Video,” Computer Graphics Forum (Proc. EUROGRAPHICS 2018), vol. 37, no. 2, 2018.
493
Conference paper
D2
A. Khoreva, A. Rohrbach, and B. Schiele
“Video Object Segmentation with Language Referring Expressions,” in Computer Vision - ACCV 2018, Perth, Australia, 2019.
494
Conference paper
D2
L. Neumann, M. Karg, S. Zhang, C. Scharfenberger, E. Piegert, S. Mistr, O. Prokofyeva, R. Thiel, A. Vedaldi, A. Zisserman, and B. Schiele
“NightOwls: A Pedestrians at Night Dataset,” in Computer Vision - ACCV 2018, Perth, Australia, 2019.
495
Conference paper
D2
L. A. Hendricks, R. Hu, T. Darrell, and Z. Akata
“Grounding Visual Explanations,” in Computer Vision -- ECCV 2018, Munich, Germany, 2018.
496
Conference paper
D2
Y. He, B. Schiele, and M. Fritz
“Diverse Conditional Image Generation by Stochastic Regression with Latent Drop-Out Codes,” in Computer Vision -- ECCV 2018, Munich, Germany, 2018.
497
Conference paper
D2
J. Kim, A. Rohrbach, T. Darrell, J. Canny, and Z. Akata
“Textual Explanations for Self-Driving Vehicles,” in Computer Vision -- ECCV 2018, Munich, Germany, 2018.
Abstract
Deep neural perception and control networks have become key components of
self-driving vehicles. User acceptance is likely to benefit from
easy-to-interpret textual explanations which allow end-users to understand what
triggered a particular behavior. Explanations may be triggered by the neural
controller, namely introspective explanations, or informed by the neural
controller’s output, namely rationalizations. We propose a new approach to
introspective explanations which consists of two parts. First, we use a visual
(spatial) attention model to train a convolutional network end-to-end from
images to the vehicle control commands, i.e., acceleration and change of
course. The controller’s attention identifies image regions that potentially
influence the network’s output. Second, we use an attention-based video-to-text
model to produce textual explanations of model actions. The attention maps of
controller and explanation model are aligned so that explanations are grounded
in the parts of the scene that mattered to the controller. We explore two
approaches to attention alignment, strong- and weak-alignment. Finally, we
explore a version of our model that generates rationalizations, and compare
with introspective explanations on the same video segments. We evaluate these
models on a novel driving dataset with ground-truth human explanations, the
Berkeley DeepDrive eXplanation (BDD-X) dataset. Code is available at
github.com/JinkyuKimUCB/explainable-deep-driving
498
Conference paper
D2D4
Q. Sun, A. Tewari, W. Xu, M. Fritz, C. Theobalt, and B. Schiele
“A Hybrid Model for Identity Obfuscation by Face Replacement,” in Computer Vision -- ECCV 2018, Munich, Germany, 2018.
499
Conference paper
D2
T. von Marcard, R. Henschel, M. J. Black, B. Rosenhahn, and G. Pons-Moll
“Recovering Accurate 3D Human Pose in the Wild Using IMUs and a Moving Camera,” in Computer Vision -- ECCV 2018, Munich, Germany, 2018.
500
Conference paper
D2
M. Wagner, H. Basevi, R. Shetty, W. Li, M. Malinowski, M. Fritz, and A. Leonardis
“Answering Visual What-If Questions: From Actions to Predicted Scene Descriptions,” in Computer Vision - ECCV 2018 Workshops, Munich, Germany, 2019.
501
Conference paper
D2
M. Khamis, A. Kienle, F. Alt, and A. Bulling
“GazeDrone: Mobile Eye-Based Interaction in Public Space Without Augmenting the User,” in DroNet’18, 4th ACM Workshop on Micro Aerial Vehicle Networks, Systems, and Applications, Munich, Germany, 2018.
502
Conference paper
D4D2
D. Mehta, O. Sotnychenko, F. Mueller, H. Rhodin, W. Xu, G. Pons-Moll, and C. Theobalt
“Demo of XNect: Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera,” in ECCV 2018 Demo Sessions, Munich, Germany, 2018.
503
Conference paper
D2
N. Mukuze, A. Rohrbach, V. Demberg, and B. Schiele
“A Vision-grounded Dataset for Predicting Typical Locations for Verbs,” in Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018.
504
Article
D2
S. Hoppe, T. Loetscher, S. Morey, and A. Bulling
“Eye Movements During Everyday Behavior Predict Personality Traits,” Frontiers in Human Neuroscience, vol. 12, 2018.
505
Conference paper
D2
H. Zhang and Q. Sun
“Objects, Relationships, and Context in Visual Data,” in ICMR’18, International Conference on Multimedia Retrieval, Yokohama, Japan, 2018.
506
Conference paper
D4D2
T. Alldieck, M. A. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll
“Video Based Reconstruction of 3D People Models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
507
Conference paper
D2
M. Andriluka, U. Iqbal, A. Milan, E. Insafutdinov, L. Pishchulin, J. Gall, and B. Schiele
“PoseTrack: A Benchmark for Human Pose Estimation and Tracking,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
508
Conference paper
D2
A. Bhattacharyya, M. Fritz, and B. Schiele
“Accurate and Diverse Sampling of Sequences based on a ‘Best of Many’ Sample Objective,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
509
Conference paper
D2
A. Bhattacharyya, M. Fritz, and B. Schiele
“Long-Term On-Board Prediction of People in Traffic Scenes under Uncertainty,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
510
Conference paper
D2
E. Laude, J.-H. Lange, J. Schüpfer, C. Domokos, L. Leal-Taixé, F. R. Schmidt, B. Andres, and D. Cremers
“Discrete-Continuous ADMM for Transductive Inference in Higher-Order MRFs,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
511
Conference paper
D2
L. Ma, Q. Sun, S. Georgoulis, L. Van Gool, B. Schiele, and M. Fritz
“Disentangled Person Image Generation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
512
Conference paper
D2
T. Orekondy, M. Fritz, and B. Schiele
“Connecting Pixels to Privacy and Utility: Automatic Redaction of Private Information in Images,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
513
Conference paper
D2
D. H. Park, L. A. Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell, and M. Rohrbach
“Multimodal Explanations: Justifying Decisions and Pointing to the Evidence,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
514
Conference paper
D2
D. Stutz and A. Geiger
“Learning 3D Shape Completion from Laser Scan Data with Weak Supervision,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
515
Conference paper
D2
Q. Sun, L. Ma, S. J. Oh, L. Van Gool, B. Schiele, and M. Fritz
“Natural and Effective Obfuscation by Head Inpainting,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
516
Conference paper
D2
Y. Xian, T. Lorenz, B. Schiele, and Z. Akata
“Feature Generating Networks for Zero-Shot Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
517
Conference paper
D2
X. Xu, X. Chen, C. Liu, A. Rohrbach, T. Darrell, and D. Song
“Fooling Vision and Language Models Despite Localization and Attention Mechanism,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
518
Conference paper
D2
T. Yu, Z. Zheng, K. Guo, J. Zhao, Q. Dai, H. Li, G. Pons-Moll, and Y. Liu
“DoubleFusion: Real-time Capture of Human Performances with Inner Body Shapes from a Single Depth Sensor,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
519
Conference paper
D2
S. Zhang, J. Yang, and B. Schiele
“Occluded Pedestrian Detection through Guided Attention in CNNs,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
520
Conference paper
D2
M. Fieraru, A. Khoreva, L. Pishchulin, and B. Schiele
“Learning to Refine Human Pose Estimation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2018), Salt Lake City, UT, USA, 2018.
521
Article
D2
R. Shetty, H. R. Tavakoli, and J. Laaksonen
“Image and Video Captioning with Augmented Neural Architectures,” IEEE MultiMedia, vol. 25, no. 2, 2018.
522
Article
D2
M. X. Huang, J. Li, G. Ngai, H. V. Leong, and K. A. Hua
“Fast-PADMA: Rapidly Adapting Facial Affect Model from Similar Individuals,” IEEE Transactions on Multimedia, vol. 20, no. 7, 2018.
523
Article
D2
S. Georgoulis, K. Rematas, T. Ritschel, E. Gavves, M. Fritz, L. Van Gool, and T. Tuytelaars
“Reflectance and Natural Illumination from Single-Material Specular Objects Using Deep Learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 8, 2018.
524
Article
D2
M. Lapin, M. Hein, and B. Schiele
“Analysis and Optimization of Loss Functions for Multiclass, Top-k, and Multilabel Classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 7, 2018.
525
Article
D2
K. Sikka and G. Sharma
“Discriminatively Trained Latent Ordinal Model for Video Classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 8, 2018.
526
Article
D2
S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele
“Towards Reaching Human Performance in Pedestrian Detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, 2018.
Abstract
Encouraged by the recent progress in pedestrian detection, we investigate the
gap between current state-of-the-art methods and the “perfect single frame
detector”. We enable our analysis by creating a human baseline for pedestrian
detection (over the Caltech pedestrian dataset). After manually clustering the
frequent errors of a top detector, we characterise both localisation and
background-versus-foreground errors.
To address localisation errors we study the impact of training annotation noise
on the detector performance, and show that we can improve results even with a
small portion of sanitised training data. To address background/foreground
discrimination, we study convnets for pedestrian detection, and discuss which
factors affect their performance.
Other than our in-depth analysis, we report top performance on the Caltech
pedestrian dataset, and provide a new sanitised set of training and test
annotations.
527
Article
D2
D. Stutz and A. Geiger
“Learning 3D Shape Completion under Weak Supervision,” International Journal of Computer Vision, vol. 128, 2018.
528
Article
D2
S. Yeung, O. Russakovsky, N. Jin, M. Andriluka, G. Mori, and L. Fei-Fei
“Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos,” International Journal of Computer Vision, vol. 126, no. 2–4, 2018.
529
Conference paper
D2
T. C. K. Kwok, E. Y. Fu, E. Y. Wu, M. X. Huang, G. Ngai, and H.-V. Leong
“Every Little Movement Has a Meaning of Its Own: Using Past Mouse Movements to Predict the Next Interaction,” in IUI 2018, 23rd International Conference on Intelligent User Interfaces, Tokyo, Japan, 2018.
530
Conference paper
D2
P. Müller, M. X. Huang, and A. Bulling
“Detecting Low Rapport During Natural Interactions in Small Groups from Non-Verbal Behaviour,” in IUI 2018, 23rd International Conference on Intelligent User Interfaces, Tokyo, Japan, 2018.
531
Conference paper
D2
R. Goebel, A. Chander, K. Holzinger, F. Lecue, Z. Akata, S. Stumpf, P. Kieseberg, and A. Holzinger
“Explainable AI: The New 42?,” in Machine Learning and Knowledge Extraction (CD-MAKE 2018), Hamburg, Germany, 2018.
532
Article
D2
M. Rempfler, V. Stierle, K. Ditzel, S. Kumar, P. Paulitschke, B. Andres, and B. H. Menze
“Tracing Cell Lineages in Videos of Lens-free Microscopy,” Medical Image Analysis, vol. 48, 2018.
533
Conference paper
D2
E. Y. Fu, M. X. Huang, H. V. Leong, and G. Ngai
“Cross-Species Learning: A Low-Cost Approach to Learning Human Fight from Animal Fight,” in MM’18, 26th ACM Multimedia Conference, Seoul, Korea, 2018.
534
Conference paper
D2
M. Khamis, F. Alt, and A. Bulling
“The Past, Present, and Future of Gaze-enabled Handheld Mobile Devices: Survey and Lessons Learned,” in MobileHCI 2018, 20th International Conference on Human-Computer Interaction with Mobile Devices and Services, Barcelona, Spain, 2018.
535
Conference paper
D2
J. Steil, P. Müller, Y. Sugano, and A. Bulling
“Forecasting User Attention During Everyday Mobile Interactions Using Device-Integrated and Wearable Sensors,” in MobileHCI 2018, 20th International Conference on Human-Computer Interaction with Mobile Devices and Services, Barcelona, Spain, 2018.
536
Conference paper
D4D2
M. Habermann, W. Xu, H. Rhodin, M. Zollhöfer, G. Pons-Moll, and C. Theobalt
“NRST: Non-rigid Surface Tracking from Monocular Video,” in Pattern Recognition (GCPR 2018), Stuttgart, Germany, 2019.
537
Conference paper
D2
M. Barz, F. Daiber, D. Sonntag, and A. Bulling
“Error-Aware Gaze-Based Interfaces for Robust Mobile Gaze Interaction,” in Proceedings ETRA 2018, Warsaw, Poland, 2018.
538
Conference paper
D2
T. Mattusch, M. Mirzamohammad, M. Khamis, A. Bulling, and F. Alt
“Hidden Pursuits: Evaluating Gaze-selection via Pursuits when the Stimuli’s Trajectory is Partially Hidden,” in Proceedings ETRA 2018, Warsaw, Poland, 2018.
539
Conference paper
D2
P. Müller, M. X. Huang, X. Zhang, and A. Bulling
“Robust Eye Contact Detection in Natural Multi-Person Interactions Using Gaze and Speaking Behaviour,” in Proceedings ETRA 2018, Warsaw, Poland, 2018.
540
Conference paper
D2
S. Park, X. Zhang, A. Bulling, and O. Hilliges
“Learning to Find Eye Region Landmarks for Remote Gaze Estimation in Unconstrained Settings,” in Proceedings ETRA 2018, Warsaw, Poland, 2018.
541
Conference paper
D2
J. Steil, M. X. Huang, and A. Bulling
“Fixation Detection for Head-Mounted Eye Tracking Based on Visual Similarity of Gaze Targets,” in Proceedings ETRA 2018, Warsaw, Poland, 2018.
542
Conference paper
D2
X. Zhang, Y. Sugano, and A. Bulling
“Revisiting Data Normalization for Appearance-Based Gaze Estimation,” in Proceedings ETRA 2018, Warsaw, Poland, 2018.
543
Conference paper
D2
R. Shetty, B. Schiele, and M. Fritz
“A4NT: Author Attribute Anonymity by Adversarial Training of Neural Machine Translation,” in Proceedings of the 27th USENIX Security Symposium, Baltimore, MD, USA, 2018.
544
Conference paper
D2D1
J.-H. Lange, A. Karrenbauer, and B. Andres
“Partial Optimality and Fast Lower Bounds for Weighted Correlation Clustering,” in Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 2018.
545
Conference paper
D2
A. Khan, I. Steiner, Y. Sugano, A. Bulling, and R. Macdonald
“A Multimodal Corpus of Expert Gaze and Behavior during Phonetic Segmentation Tasks,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018.
546
Conference paper
D2
L. A. Hendricks, R. Hu, T. Darrell, and Z. Akata
“Generating Counterfactual Explanations with Natural Language,” in Proceedings of the 2018 ICML Workshop on Human Interpretability in Machine Learning (WHI 2018), Stockholm, Sweden, 2018.
Abstract
Natural language explanations of deep neural network decisions provide an
intuitive way for an AI agent to articulate a reasoning process. Current textual
explanations learn to discuss class discriminative features in an image.
However, it is also helpful to understand which attributes might change a
classification decision if present in an image (e.g., "This is not a Scarlet
Tanager because it does not have black wings.") We call such textual
explanations counterfactual explanations, and propose an intuitive method to
generate counterfactual explanations by inspecting which evidence in an input
is missing, but might contribute to a different classification decision if
present in the image. To demonstrate our method we consider a fine-grained
image classification task in which we take as input an image and a
counterfactual class and output text which explains why the image does not
belong to a counterfactual class. We then analyze our generated counterfactual
explanations both qualitatively and quantitatively using proposed automatic
metrics.
547
Article
D2
S. M. Azimi, D. Britz, M. Engstler, M. Fritz, and F. Mücklich
“Advanced Steel Microstructure Classification by Deep Learning Methods,” Scientific Reports, vol. 8, 2018.
Abstract
The inner structure of a material is called microstructure. It stores the
genesis of a material and determines all its physical and chemical properties.
While microstructural characterization is widespread and well known,
microstructural classification is mostly done manually by human experts, which
introduces large uncertainties. Since the microstructure can be a
combination of different phases with complex substructures, its automatic
classification is very challenging, and only little work in this field has
been carried out. Prior related work mostly applies features designed and
engineered by experts and classifies microstructure separately from the feature
extraction step. Recently, Deep Learning methods have shown surprisingly good
performance in vision applications by learning the features from data together
with the classification step. In this work, we propose a deep learning method
for microstructure classification in the examples of certain microstructural
constituents of low carbon steel. This novel method employs pixel-wise
segmentation via Fully Convolutional Neural Networks (FCNN) accompanied by a
max-voting scheme. Our system achieves 93.94% classification accuracy,
drastically outperforming the state-of-the-art method of 48.89% accuracy,
indicating the effectiveness of pixel-wise approaches. Beyond the success
presented in this paper, this line of research offers a more robust and, above
all, objective way for the difficult task of steel quality appreciation.
548
Conference paper
D2
S. J. Oh, M. Augustin, B. Schiele, and M. Fritz
“Towards Reverse-Engineering Black-Box Neural Networks,” in Sixth International Conference on Learning Representations (ICLR 2018), Vancouver, Canada, 2018.
- PuRe
- BibTeX
549
Conference paper
D2
A. Bhattacharyya, M. Malinowski, B. Schiele, and M. Fritz
“Long-Term Image Boundary Prediction,” in Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2018.
- PuRe
- BibTeX
550
Paper
D4, D2
F. Bernard, J. Thunberg, P. Swoboda, and C. Theobalt
“Higher-order Projected Power Iterations for Scalable Multi-Matching,” 2018. [Online]. Available: http://arxiv.org/abs/1811.10541.
Abstract
The matching of multiple objects (e.g. shapes or images) is a fundamental
problem in vision and graphics. In order to robustly handle ambiguities, noise
and repetitive patterns in challenging real-world settings, it is essential to
take geometric consistency between points into account. Computationally, the
multi-matching problem is difficult. It can be phrased as simultaneously
solving multiple (NP-hard) quadratic assignment problems (QAPs) that are
coupled via cycle-consistency constraints. The main limitations of existing
multi-matching methods are that they either ignore geometric consistency and
thus have limited robustness, or they are restricted to small-scale problems
due to their (relatively) high computational cost. We address these
shortcomings by introducing a Higher-order Projected Power Iteration method,
which is (i) efficient and scales to tens of thousands of points, (ii)
straightforward to implement, (iii) able to incorporate geometric consistency,
and (iv) guarantees cycle-consistent multi-matchings. Experimentally we show
that our approach is superior to existing methods.
- PuRe
- BibTeX
551
Paper
D2
A. Bhattacharyya, M. Fritz, and B. Schiele
“Bayesian Prediction of Future Street Scenes through Importance Sampling based Optimization,” 2018. [Online]. Available: http://arxiv.org/abs/1806.06939.
Abstract
For autonomous agents to successfully operate in the real world, anticipation
of future events and states of their environment is a key competence. This
problem can be formalized as a sequence prediction problem, where a number of
observations are used to predict the sequence into the future. However,
real-world scenarios demand a model of uncertainty of such predictions, as
future states become increasingly uncertain and multi-modal -- in particular on
long time horizons. This makes modelling and learning challenging. We cast
state of the art semantic segmentation and future prediction models based on
deep learning into a Bayesian formulation that in turn allows for a full
Bayesian treatment of the prediction problem. We present a new sampling scheme
for this model that draws from the success of variational autoencoders by
incorporating a recognition network. In the experiments we show that our model
outperforms prior work in accuracy of the predicted segmentation and provides
calibrated probabilities that also better capture the multi-modal aspects of
possible future states of street scenes.
- PuRe
- BibTeX
552
Proceedings
D2
A. Bulling, E. Kasneci, and C. Lander
Eds., Proceedings PETMEI 2018. ACM, 2018.
- PuRe
- BibTeX
553
Paper
D2
M. Gemici, Z. Akata, and M. Welling
“Primal-Dual Wasserstein GAN,” 2018. [Online]. Available: http://arxiv.org/abs/1805.09575.
Abstract
We introduce Primal-Dual Wasserstein GAN, a new learning algorithm for
building latent variable models of the data distribution based on the primal
and the dual formulations of the optimal transport (OT) problem. We utilize the
primal formulation to learn a flexible inference mechanism and to create an
optimal approximate coupling between the data distribution and the generative
model. In order to learn the generative model, we use the dual formulation and
train the decoder adversarially through a critic network that is regularized by
the approximate coupling obtained from the primal. Unlike previous methods that
violate various properties of the optimal critic, we regularize the norm and
the direction of the gradients of the critic function. Our model shares many of
the desirable properties of auto-encoding models in terms of mode coverage and
latent structure, while avoiding their undesirable averaging properties, e.g.
their inability to capture sharp visual features when modeling real images. We
compare our algorithm with several other generative modeling techniques that
utilize Wasserstein distances on Frechet Inception Distance (FID) and Inception
Scores (IS).
- PuRe
- BibTeX
554
Paper
D2
L. Hanzlik, Y. Zhang, K. Grosse, A. Salem, M. Augustin, M. Backes, and M. Fritz
“MLCapsule: Guarded Offline Deployment of Machine Learning as a Service,” 2018. [Online]. Available: http://arxiv.org/abs/1808.00590.
Abstract
With the widespread use of machine learning (ML) techniques, ML as a service
has become increasingly popular. In this setting, an ML model resides on a
server and users can query the model with their data via an API. However, if
the user's input is sensitive, sending it to the server is not an option.
Equally, the service provider does not want to share the model by sending it to
the client for protecting its intellectual property and pay-per-query business
model. In this paper, we propose MLCapsule, a guarded offline deployment of
machine learning as a service. MLCapsule executes the machine learning model
locally on the user's client and therefore the data never leaves the client.
Meanwhile, MLCapsule offers the service provider the same level of control and
security of its model as the commonly used server-side execution. In addition,
MLCapsule is applicable to offline applications that require local execution.
Beyond protecting against direct model access, we demonstrate that MLCapsule
allows for implementing defenses against advanced attacks on machine learning
models such as model stealing/reverse engineering and membership inference.
- PuRe
- BibTeX
555
Paper
D2
L. Karacan, Z. Akata, A. Erdem, and E. Erdem
“Manipulating Attributes of Natural Scenes via Hallucination,” 2018. [Online]. Available: http://arxiv.org/abs/1808.07413.
Abstract
In this study, we explore building a two-stage framework for enabling users
to directly manipulate high-level attributes of a natural scene. The key to our
approach is a deep generative network which can hallucinate images of a scene
as if they were taken at a different season (e.g. during winter), weather
condition (e.g. in a cloudy day) or time of the day (e.g. at sunset). Once the
scene is hallucinated with the given attributes, the corresponding look is then
transferred to the input image while preserving the semantic details intact,
giving a photo-realistic manipulation result. As the proposed framework
hallucinates what the scene will look like, it does not require any reference
style image as commonly utilized in most of the appearance or style transfer
approaches. Moreover, it allows to simultaneously manipulate a given scene
according to a diverse set of transient attributes within a single model,
eliminating the need of training multiple networks per each translation task.
Our comprehensive set of qualitative and quantitative results demonstrate the
effectiveness of our approach against the competing methods.
- PuRe
- BibTeX
556
Paper
D4, D2
K. Z. Lin, W. Xu, Q. Sun, C. Theobalt, and T.-S. Chua
“Learning a Disentangled Embedding for Monocular 3D Shape Retrieval and Pose Estimation,” 2018. [Online]. Available: http://arxiv.org/abs/1812.09899.
Abstract
We propose a novel approach to jointly perform 3D object retrieval and pose
estimation from monocular images. In order to make the method robust to real
world scene variations in the images, e.g. texture, lighting and background, we
learn an embedding space from 3D data that only includes the relevant
information, namely the shape and pose. Our method can then be trained for
robustness under real world scene variations without having to render a large
training set simulating these variations. Our learned embedding explicitly
disentangles a shape vector and a pose vector, which alleviates both pose bias
for 3D shape retrieval and categorical bias for pose estimation. Having the
learned disentangled embedding, we train a CNN to map the images to the
embedding space, and then retrieve the closest 3D shape from the database and
estimate the 6D pose of the object using the embedding vectors. Our method
achieves 10.8 median error for pose estimation and 0.514 top-1-accuracy for
category agnostic 3D object retrieval on the Pascal3D+ dataset. It therefore
outperforms the previous state-of-the-art methods on both tasks.
- PuRe
- BibTeX
557
Thesis
D2, IMPR-CS
W. Li
“From Perception over Anticipation to Manipulation,” Universität des Saarlandes, Saarbrücken, 2018.
Abstract
From autonomous driving cars to surgical robots, robotic system has enjoyed significant growth over the past decade. With the rapid development in robotics alongside the evolution in the related fields, such as computer vision and machine learning, integrating perception, anticipation and manipulation is key to the success of future robotic system. In this thesis, we explore different ways of such integration to extend the capabilities of a robotic system to take on more challenging real world tasks. On anticipation and perception, we address the recognition of ongoing activity from videos. In particular we focus on long-duration and complex activities and hence propose a new challenging dataset to facilitate the work. We introduce hierarchical labels over the activity classes and investigate the temporal accuracy-specificity trade-offs. We propose a new method based on recurrent neural networks that learns to predict over this hierarchy and realize accuracy specificity trade-offs. Our method outperforms several baselines on this new challenge. On manipulation with perception, we propose an efficient framework for programming a robot to use human tools. We first present a novel and compact model for using tools described by a tip model. Then we explore a strategy of utilizing a dual-gripper approach for manipulating tools – motivated by the absence of dexterous hands on widely available general purpose robots. Afterwards, we embed the tool use learning into a hierarchical architecture and evaluate it on a Baxter research robot. Finally, combining perception, anticipation and manipulation, we focus on a block stacking task. First we explore how to guide robot to place a single block into the scene without collapsing the existing structure. We introduce a mechanism to predict physical stability directly from visual input and evaluate it first on a synthetic data and then on real-world block stacking. 
Further, we introduce the target stacking task where the agent stacks blocks to reproduce a tower shown in an image. To do so, we create a synthetic block stacking environment with physics simulation in which the agent can learn block stacking end-to-end through trial and error, bypassing to explicitly model the corresponding physics knowledge. We propose a goal-parametrized GDQN model to plan with respect to the specific goal. We validate the model on both a navigation task in a classic gridworld environment and the block stacking task.
558
Paper
D2
M. Maximov, T. Ritschel, and M. Fritz
“Deep Appearance Maps,” 2018. [Online]. Available: http://arxiv.org/abs/1804.00863.
Abstract
We propose a deep representation of appearance, i. e. the relation of color,
surface orientation, viewer position, material and illumination. Previous
approaches have used deep learning to extract classic appearance
representations relating to reflectance model parameters (e. g. Phong) or
illumination (e. g. HDR environment maps). We suggest to directly represent
appearance itself as a network we call a deep appearance map (DAM). This is a
4D generalization over 2D reflectance maps, which held the view direction
fixed. First, we show how a DAM can be learned from images or video frames and
later be used to synthesize appearance, given new surface orientations and
viewer positions. Second, we demonstrate how another network can be used to map
from an image or video frames to a DAM network to reproduce this appearance,
without using a lengthy optimization such as stochastic gradient descent
(learning-to-learn). Finally, we generalize this to an appearance
estimation-and-segmentation task, where we map from an image showing multiple
materials to multiple networks reproducing their appearance, as well as
per-pixel segmentation.
- PuRe
- BibTeX
559
Thesis
D2, IMPR-CS
S. J. Oh
“Image Manipulation against Learned Models: Privacy and Security Implications,” Universität des Saarlandes, Saarbrücken, 2018.
Abstract
Machine learning is transforming the world. Its application areas span privacy
sensitive and security critical tasks such as human identification and self-driving
cars. These applications raise privacy and security related questions that are not
fully understood or answered yet: Can automatic person recognisers identify people
in photos even when their faces are blurred? How easy is it to find an adversarial
input for a self-driving car that makes it drive off the road?
This thesis contributes one of the first steps towards a better understanding of
such concerns. We observe that many privacy and security critical scenarios for
learned models involve input data manipulation: users obfuscate their identity by
blurring their faces and adversaries inject imperceptible perturbations to the input
signal. We introduce a data manipulator framework as a tool for collectively describing
and analysing privacy and security relevant scenarios involving learned models.
A data manipulator introduces a shift in data distribution for achieving privacy or
security related goals, and feeds the transformed input to the target model. This
framework provides a common perspective on the studies presented in the thesis.
We begin the studies from the user’s privacy point of view. We analyse the
efficacy of common obfuscation methods like face blurring, and show that they
are surprisingly ineffective against state of the art person recognition systems. We
then propose alternatives based on head inpainting and adversarial examples. By
studying the user privacy, we also study the dual problem: model security. In model
security perspective, a model ought to be robust and reliable against small amounts
of data manipulation. In both cases, data are manipulated with the goal of changing
the target model prediction. User privacy and model security problems can be
described with the same objective.
We then study the knowledge aspect of the data manipulation problem. The more
one knows about the target model, the more effective manipulations one can craft.
We propose a game theoretic manipulation framework to systematically represent
the knowledge level on the target model and derive privacy and security guarantees.
We then discuss ways to increase knowledge about a black-box model by only querying
it, deriving implications that are relevant to both privacy and security perspectives.
560
Paper
D2
T. Orekondy, S. J. Oh, B. Schiele, and M. Fritz
“Understanding and Controlling User Linkability in Decentralized Learning,” 2018. [Online]. Available: http://arxiv.org/abs/1805.05838.
Abstract
Machine Learning techniques are widely used by online services (e.g. Google,
Apple) in order to analyze and make predictions on user data. As many of the
provided services are user-centric (e.g. personal photo collections, speech
recognition, personal assistance), user data generated on personal devices is
key to provide the service. In order to protect the data and the privacy of the
user, federated learning techniques have been proposed where the data never
leaves the user's device and "only" model updates are communicated back to the
server. In our work, we propose a new threat model that is not concerned with
learning about the content - but rather is concerned with the linkability of
users during such decentralized learning scenarios.
We show that model updates are characteristic for users and therefore lend
themselves to linkability attacks. We show identification and matching of users
across devices in closed and open world scenarios. In our experiments, we find
our attacks to be highly effective, achieving 20x-175x chance-level
performance.
In order to mitigate the risks of linkability attacks, we study various
strategies. As adding random noise does not offer convincing operation points,
we propose strategies based on using calibrated domain-specific data; we find
these strategies offer substantial protection against linkability threats with
little effect to utility.
- PuRe
- BibTeX
561
Paper
D2
J. Song, B. Andres, M. Black, O. Hilliges, and S. Tang
“End-to-end Learning for Graph Decomposition,” 2018. [Online]. Available: http://arxiv.org/abs/1812.09737.
Abstract
We propose a novel end-to-end trainable framework for the graph decomposition
problem. The minimum cost multicut problem is first converted to an
unconstrained binary cubic formulation where cycle consistency constraints are
incorporated into the objective function. The new optimization problem can be
viewed as a Conditional Random Field (CRF) in which the random variables are
associated with the binary edge labels of the initial graph and the hard
constraints are introduced in the CRF as high-order potentials. The parameters
of a standard Neural Network and the fully differentiable CRF are optimized in
an end-to-end manner. Furthermore, our method utilizes the cycle constraints as
meta-supervisory signals during the learning of the deep feature
representations by taking the dependencies between the output random variables
into account. We present analyses of the end-to-end learned representations,
showing the impact of the joint training, on the task of clustering images of
MNIST. We also validate the effectiveness of our approach both for the feature
learning and the final clustering on the challenging task of real-world
multi-person pose estimation.
- PuRe
- BibTeX
562
Paper
D2
J. Steil, M. Koelle, W. Heuten, S. Boll, and A. Bulling
“PrivacEye: Privacy-Preserving First-Person Vision Using Image Features and Eye Movement Analysis,” 2018. [Online]. Available: http://arxiv.org/abs/1801.04457.
Abstract
As first-person cameras in head-mounted displays become increasingly
prevalent, so does the problem of infringing user and bystander privacy. To
address this challenge, we present PrivacEye, a proof-of-concept system that
detects privacy-sensitive everyday situations and automatically enables and
disables the first-person camera using a mechanical shutter. To close the
shutter, PrivacEye detects sensitive situations from first-person camera videos
using an end-to-end deep-learning model. To open the shutter without visual
input, PrivacEye uses a separate, smaller eye camera to detect changes in
users' eye movements to gauge changes in the "privacy level" of the current
situation. We evaluate PrivacEye on a dataset of first-person videos recorded
in the daily life of 17 participants that they annotated with privacy
sensitivity levels. We discuss the strengths and weaknesses of our
proof-of-concept system based on a quantitative technical evaluation as well as
qualitative insights from semi-structured interviews.
- PuRe
- BibTeX
563
Thesis
D2, IMPR-CS
X. Zhang
“Gaze Estimation and Interaction in Real-World Environments,” Universität des Saarlandes, Saarbrücken, 2018.
Abstract
Following a period of expedited progress in the capabilities of digital systems, the society begins to realize that systems designed to assist people in various tasks can also harm individuals and society. Mediating access to information and explicitly or implicitly ranking people in increasingly many applications, search systems have a substantial potential to contribute to such unwanted outcomes. Since they collect vast amounts of data about both searchers and search subjects, they have the potential to violate the privacy of both of these groups of users. Moreover, in applications where rankings influence people's economic livelihood outside of the platform, such as sharing economy or hiring support websites, search engines have an immense economic power over their users in that they control user exposure in ranked results. This thesis develops new models and methods broadly covering different aspects of privacy and fairness in search systems for both searchers and search subjects. Specifically, it makes the following contributions: (1) We propose a model for computing individually fair rankings where search subjects get exposure proportional to their relevance. The exposure is amortized over time using constrained optimization to overcome searcher attention biases while preserving ranking utility. (2) We propose a model for computing sensitive search exposure where each subject gets to know the sensitive queries that lead to her profile in the top-k search results. The problem of finding exposing queries is technically modeled as reverse nearest neighbor search, followed by a weakly-supervised learning to rank model ordering the queries by privacy-sensitivity. (3) We propose a model for quantifying privacy risks from textual data in online communities. The method builds on a topic model where each topic is annotated by a crowdsourced sensitivity score, and privacy risks are associated with a user's relevance to sensitive topics.
We propose relevance measures capturing different dimensions of user interest in a topic and show how they correlate with human risk perceptions. (4) We propose a model for privacy-preserving personalized search where search queries of different users are split and merged into synthetic profiles. The model mediates the privacy-utility trade-off by keeping semantically coherent fragments of search histories within individual profiles, while trying to minimize the similarity of any of the synthetic profiles to the original user profiles. The models are evaluated using information retrieval techniques and user studies over a variety of datasets, ranging from query logs, through social media and community question answering postings, to item listings from sharing economy platforms.

2017

564
Conference paper
D2
M. Khamis, L. Bandelow, S. Schick, D. Casadevall, A. Bulling, and F. Alt
“They are all after you: Investigating the Viability of a Threat Model that involves Multiple Shoulder Surfers,” in 16th International Conference on Mobile and Ubiquitous Multimedia (MUM 2017), Stuttgart, Germany, 2017.
565
Conference paper
D2
C. Lander, S. Gehring, M. Löchtefeld, A. Bulling, and A. Krüger
“EyeMirror: Mobile Calibration-Free Gaze Approximation using Corneal Imaging,” in 16th International Conference on Mobile and Ubiquitous Multimedia (MUM 2017), Stuttgart, Germany, 2017.
566
Conference paper
D2
A. Bhattacharyya, M. Fritz, and B. Schiele
“Long-Term On-Board Prediction of Pedestrians in Traffic Scenes,” in 1st Conference on Robot Learning (CoRL 2017), Mountain View, CA, USA, 2017.
- PuRe
- BibTeX
567
Conference paper
D2
S. Ebrahimi, A. Rohrbach, and T. Darrell
“Gradient-free Policy Architecture Search and Adaptation,” in 1st Conference on Robot Learning (CoRL 2017), Mountain View, CA, USA, 2017.
- PuRe
- BibTeX
568
Conference paper
D2
Y. He, W.-C. Chiu, M. Keuper, and M. Fritz
“STD2P: RGBD Semantic Segmentation Using Spatio-Temporal Data-Driven Pooling,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
569
Conference paper
D2
J. Hosang, R. Benenson, and B. Schiele
“Learning Non-maximum Suppression,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
570
Conference paper
D2
E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres, and B. Schiele
“ArtTrack: Articulated Multi-Person Tracking in the Wild,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
- PDF
- DOI
- PuRe
- BibTeX
571
Conference paper
D2
N. Karessli, Z. Akata, B. Schiele, and A. Bulling
“Gaze Embeddings for Zero-Shot Image Classification,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
572
Conference paper
D2
A. Khoreva, F. Perazzi, R. Benenson, B. Schiele, and A. Sorkine-Hornung
“Learning Video Object Segmentation from Static Images,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
573
Conference paper
D2
A. Khoreva, R. Benenson, J. Hosang, M. Hein, and B. Schiele
“Simple Does It: Weakly Supervised Instance and Semantic Segmentation,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
- PDF
- DOI
- PuRe
- BibTeX
574
Conference paper
D2
A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, and C. Rother
“InstanceCut: from Edges to Instances with MultiCut,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
575
Conference paper
D2
E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele, and B. Andres
“Joint Graph Decomposition and Node Labeling: Problem, Algorithms, Applications,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
576
Conference paper
D2
T. Maharaj, N. Ballas, A. Rohrbach, A. Courville, and C. Pal
“A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-blank Question-answering,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
- PDF
- DOI
- PuRe
- BibTeX
577
Conference paper
D2
S. J. Oh, R. Benenson, A. Khoreva, Z. Akata, M. Fritz, and B. Schiele
“Exploiting Saliency for Object Segmentation from Image Level Labels,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
578
Conference paper
D2
A. Rohrbach, M. Rohrbach, S. Tang, S. J. Oh, and B. Schiele
“Generating Descriptions with Grounded and Co-Referenced People,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
- PDF
- DOI
- PuRe
- BibTeX
579
Conference paper
D2
Q. Sun, B. Schiele, and M. Fritz
“A Domain Based Approach to Social Relation Recognition,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
580
Conference paper
D2
P. Swoboda and B. Andres
“A Message Passing Algorithm for the Minimum Cost Multicut Problem,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
581
Conference paper
D2
S. Tang, M. Andriluka, B. Andres, and B. Schiele
“Multiple People Tracking by Lifted Multicut and Person Re-identification,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
- PDF
- DOI
- PuRe
- BibTeX
582
Conference paper
D2
Y. Xian, B. Schiele, and Z. Akata
“Zero-shot learning - The Good, the Bad and the Ugly,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
- PDF
- DOI
- PuRe
- BibTeX
583
Conference paper
D2
S. Zhang, R. Benenson, and B. Schiele
“CityPersons: A Diverse Dataset for Pedestrian Detection,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
Abstract
Convnets have enabled significant progress in pedestrian detection recently,
but there are still open questions regarding suitable architectures and
training data. We revisit CNN design and point out key adaptations, enabling
plain FasterRCNN to obtain state-of-the-art results on the Caltech dataset.
To achieve further improvement from more and better data, we introduce
CityPersons, a new set of person annotations on top of the Cityscapes dataset.
The diversity of CityPersons allows us for the first time to train one single
CNN model that generalizes well over multiple benchmarks. Moreover, with
additional training with CityPersons, we obtain top results using FasterRCNN on
Caltech, improving especially for more difficult cases (heavy occlusion and
small scale) and providing higher localization quality.
584
Conference paper
D2
X. Zhang, Y. Sugano, M. Fritz, and A. Bulling
“It’s Written All Over Your Face: Full-Face Appearance-Based Gaze Estimation,” in 30th IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2017), Honolulu, HI, USA, 2017.
585
Conference paper
D2
W. Li, A. Leonardis, and M. Fritz
“Visual Stability Prediction and Its Application to Manipulation,” in AAAI 2017 Spring Symposia 05, Interactive Multisensory Object Perception for Embodied Agents, Palo Alto, CA, 2017.
- PuRe
- BibTeX
586
Conference paper
D2
L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool
“Pose Guided Person Image Generation,” in Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 2017.
- PuRe
- BibTeX
587
Conference paper
D2
M. X. Huang, J. Li, G. Ngai, and H. V. Leong
“ScreenGlint: Practical, In-situ Gaze Estimation on Smartphones,” in CHI’17, 35th Annual ACM Conference on Human Factors in Computing Systems, Denver, CO, USA, 2017.
588
Conference paper
D2
M. Klauck, Y. Sugano, and A. Bulling
“Noticeable or Distractive? A Design Space for Gaze-Contingent User Interface Notifications,” in CHI 2017 Extended Abstracts, Denver, CO, USA, 2017.
589
Conference paper
D2
A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele
“Lucid Data Dreaming for Object Tracking,” in DAVIS Challenge on Video Object Segmentation 2017, Honolulu, HI, USA, 2017.
- PuRe
- BibTeX
590
Conference paper
D2
M. Khamis, M. Hassib, E. von Zezschwitz, A. Bulling, and F. Alt
“GazeTouchPIN: Protecting Sensitive Data on Mobile Devices using Secure Multimodal Authentication,” in ICMI’17, 19th ACM International Conference on Multimodal Interaction, Glasgow, UK, 2017.
591
Conference paper
D2, D4
S. Georgoulis, K. Rematas, T. Ritschel, M. Fritz, T. Tuytelaars, and L. Van Gool
“What Is Around The Camera?,” in IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 2017.
592
Conference paper
D2
S. J. Oh, M. Fritz, and B. Schiele
“Adversarial Image Perturbation for Privacy Protection -- A Game Theory Perspective,” in IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 2017.
593
Conference paper
D2
T. Orekondy, B. Schiele, and M. Fritz
“Towards a Visual Privacy Advisor: Understanding and Predicting Privacy Risks in Images,” in IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 2017.
594
Conference paper
D2
M. Rempfler, J.-H. Lange, F. Jug, C. Blasse, E. W. Myers, B. H. Menze, and B. Andres
“Efficient Algorithms for Moral Lineage Tracing,” in IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 2017.
595
Conference paper
D2
R. Shetty, M. Rohrbach, L. A. Hendricks, M. Fritz, and B. Schiele
“Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training,” in IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 2017.
596
Conference paper
D2
H. R. Tavakoli, R. Shetty, A. Borji, and J. Laaksonen
“Paying Attention to Descriptions Generated by Image Captioning Models,” in IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 2017.
597
Conference paper
D2
H. Sattar, A. Bulling, and M. Fritz
“Predicting the Category and Attributes of Visual Search Targets Using Deep Gaze Pooling,” in 2017 IEEE International Conference on Computer Vision Workshops (MBCC @ICCV 2017), Venice, Italy, 2017.
Abstract
Previous work focused on predicting visual search targets from human
fixations but, in the real world, a specific target is often not known, e.g.
when searching for a present for a friend. In this work we instead study the
problem of predicting the mental picture, i.e. only an abstract idea instead of
a specific target. This task is significantly more challenging given that
mental pictures of the same target category can vary widely depending on
personal biases, and given that characteristic target attributes can often not
be verbalised explicitly. We instead propose to use gaze information as
implicit information on users' mental picture and present a novel gaze pooling
layer to seamlessly integrate semantic and localized fixation information into
a deep image representation. We show that we can robustly predict both the
mental picture's category as well as attributes on a novel dataset containing
fixation data of 14 users searching for targets on a subset of the DeepFashion
dataset. Our results have important implications for future search interfaces
and suggest deep gaze pooling as a general-purpose approach for gaze-supported
computer vision systems.
598
Conference paper
D2
W. Li, A. Leonardis, and M. Fritz
“Visual Stability Prediction for Robotic Manipulation,” in IEEE International Conference on Robotics and Automation (ICRA 2017), Singapore, 2017.
599
Article
D4, D2
A. Elhayek, E. de Aguiar, A. Jain, J. Tompson, L. Pishchulin, M. Andriluka, C. Bregler, B. Schiele, and C. Theobalt
“MARCOnI -- ConvNet-Based MARker-Less Motion Capture in Outdoor and Indoor Scenes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 3, 2017.
600
Article
D4, D2
K. Rematas, C. Nguyen, T. Ritschel, M. Fritz, and T. Tuytelaars
“Novel Views of Objects from a Single Image,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 8, 2017.
601
Article
D2
G. Sharma, F. Jurie, and C. Schmid
“Expanded Parts Model for Semantic Description of Humans in Still Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 1, 2017.
602
Article
D2
R. Ding, Q. Sun, M. Liu, and H. Liu
“A Compact Representation of Human Actions by Sliding Coordinate Coding,” International Journal of Advanced Robotic Systems, vol. 14, no. 6, 2017.
603
Article
D2
M. Malinowski, M. Rohrbach, and M. Fritz
“Ask Your Neurons: A Deep Learning Approach to Visual Question Answering,” International Journal of Computer Vision, vol. 125, no. 1–3, 2017.
- PDF
- DOI
- PuRe
- BibTeX
604
Article
D2, D5
A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville, and B. Schiele
“Movie Description,” International Journal of Computer Vision, vol. 123, no. 1, 2017.
Abstract
Audio Description (AD) provides linguistic descriptions of movies and allows
visually impaired people to follow a movie along with their peers. Such
descriptions are by design mainly visual and thus naturally form an interesting
data source for computer vision and computational linguistics. In this work we
propose a novel dataset which contains transcribed ADs, which are temporally
aligned to full length movies. In addition we also collected and aligned movie
scripts used in prior work and compare the two sources of descriptions. In
total the Large Scale Movie Description Challenge (LSMDC) contains a parallel
corpus of 118,114 sentences and video clips from 202 movies. First we
characterize the dataset by benchmarking different approaches for generating
video descriptions. Comparing ADs to scripts, we find that ADs are indeed more
visual and describe precisely what is shown rather than what should happen
according to the scripts created prior to movie production. Furthermore, we
present and compare the results of several teams who participated in a
challenge organized in the context of the workshop "Describing and
Understanding Video & The Large Scale Movie Description Challenge (LSMDC)", at
ICCV 2015.
- PDF
- DOI
- PuRe
- BibTeX
605
Conference paper
D2
M. Rempfler, S. Kumar, V. Stierle, P. Paulitschke, B. Andres, and B. H. Menze
“Cell Lineage Tracing in Lens-Free Microscopy Videos,” in Medical Image Computing and Computer Assisted Intervention -- MICCAI 2017, Quebec City, Canada, 2017.
606
Article
D2, D4
L. Pishchulin, S. Wuhrer, T. Helten, C. Theobalt, and B. Schiele
“Building Statistical Shape Spaces for 3D Human Modeling,” Pattern Recognition, vol. 67, 2017.
607
Article
D2
Q. Sun, H. Liu, and T. Harada
“Online Growing Neural Gas for Anomaly Detection in Changing Surveillance Scenes,” Pattern Recognition, vol. 64, 2017.
608
Conference paper
D2
Y. He, M. Keuper, B. Schiele, and M. Fritz
“Learning Dilation Factors for Semantic Segmentation of Street Scenes,” in Pattern Recognition (GCPR 2017), Basel, Switzerland, 2017.
609
Conference paper
D2
E. Levinkov, A. Kirillov, and B. Andres
“A Comparative Study of Local Search Algorithms for Correlation Clustering,” in Pattern Recognition (GCPR 2017), Basel, Switzerland, 2017.
610
Article
D2
Y. Zhang, K. Pfeuffer, M. K. Chong, J. Alexander, A. Bulling, and H. Gellersen
“Look Together: Using Gaze for Assisting Co-located Collaborative Search,” Personal and Ubiquitous Computing, vol. 21, no. 1, 2017.
611
Conference paper
D2
M. Khamis, R. Hasholzner, A. Bulling, and F. Alt
“GTmoPass: Two-factor Authentication on Public Displays Using GazeTouch passwords and Personal Mobile Devices,” in Pervasive Displays 2017 (PerDis 2017), Lugano, Switzerland, 2017.
612
Conference paper
D2
A. Horňáková, J.-H. Lange, and B. Andres
“Analysis and Optimization of Graph Decompositions by Lifted Multicuts,” in Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, Australia, 2017.
- PuRe
- BibTeX
613
Article
D2
M. Khamis, D. Buschek, T. Thieron, F. Alt, and A. Bulling
“EyePACT: Eye-Based Parallax Correction on Touch-Enabled Interactive Displays,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 4, 2017.
614
Article
D2
M. Tonsen, J. Steil, Y. Sugano, and A. Bulling
“InvisibleEye: Mobile Eye Tracking Using Multiple Low-Resolution Cameras and Learning-Based Gaze Estimation,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 3, 2017.
615
Conference paper
D5, D2
A. Bhattacharyya and J. Vreeken
“Efficiently Summarising Event Sequences with Rich Interleaving Patterns,” in Proceedings of the Seventeenth SIAM International Conference on Data Mining (SDM 2017), Houston, TX, USA, 2017.
616
Conference paper
D2
J. Wang, M. X. Huang, G. Ngai, and H. V. Leong
“Are you stressed? Your eyes and the mouse can tell,” in Seventh International Conference on Affective Computing and Intelligent Interaction (ACII 2017), San Antonio, TX, USA, 2017.
617
Conference paper
D2
M. Khamis, A. Hoesl, A. Klimczak, M. Reiss, F. Alt, and A. Bulling
“EyeScout: Active Eye Tracking for Position and Movement Independent Gaze Interaction with Large Public Displays,” in UIST’17, 30th Annual Symposium on User Interface Software and Technology, Quebec City, Canada, 2017.
618
Conference paper
D2
X. Zhang, Y. Sugano, and A. Bulling
“Everyday Eye Contact Detection Using Unsupervised Gaze Target Discovery,” in UIST’17, 30th Annual Symposium on User Interface Software and Technology, Quebec City, Canada, 2017.
619
Thesis
D2, IMPR-CS
J. Hosang
“Analysis and Improvement of the Visual Object Detection Pipeline,” Universität des Saarlandes, Saarbrücken, 2017.
Abstract
Visual object detection has seen substantial improvements during the last years due to the possibilities enabled by deep learning. While research on image classification provides continuous progress on how to learn image representations and classifiers jointly, object detection research focuses on identifying how to properly use deep learning technology to effectively localise objects. In this thesis, we analyse and improve different aspects of the commonly used detection pipeline. We analyse ten years of research on pedestrian detection and find that improvement of feature representations was the driving factor. Motivated by this finding, we adapt an end-to-end learned detector architecture from general object detection to pedestrian detection. Our deep network outperforms all previous neural networks for pedestrian detection by a large margin, even without using additional training data. After substantial improvements on pedestrian detection in recent years, we investigate the gap between human performance and state-of-the-art pedestrian detectors. We find that pedestrian detectors still have a long way to go before they reach human performance, and we diagnose failure modes of several top performing detectors, giving direction to future research. As a side-effect we publish new, better localised annotations for the Caltech pedestrian benchmark. We analyse detection proposals as a preprocessing step for object detectors. We establish different metrics and compare a wide range of methods according to these metrics. By examining the relationship between localisation of proposals and final object detection performance, we define and experimentally verify a metric that can be used as a proxy for detector performance. Furthermore, we address a structural weakness of virtually all object detection pipelines: non-maximum suppression. We analyse why it is necessary and what the shortcomings of the most common approach are. 
To address these problems, we present work to overcome these shortcomings and to replace typical non-maximum suppression with a learnable alternative. The introduced paradigm paves the way to true end-to-end learning of object detectors without any post-processing. In summary, this thesis provides analyses of recent pedestrian detectors and detection proposals, improves pedestrian detection by employing deep neural networks, and presents a viable alternative to traditional non-maximum suppression.
620
Thesis
D2, IMPR-CS
A. Khoreva
“Learning to Segment in Images and Videos with Different Forms of Supervision,” Universität des Saarlandes, Saarbrücken, 2017.
Abstract
Much progress has been made in image and video segmentation
over the last years. To a large extent, the success can be attributed to
the strong appearance models completely learned from data, in particular
using deep learning methods. However, to perform best these methods require
large representative datasets for training with expensive pixel-level
annotations, which in case of videos are prohibitive to obtain. Therefore,
there is a need to relax this constraint and to consider alternative forms
of supervision, which are easier and cheaper to collect. In this thesis,
we aim to develop algorithms for learning to segment in images and videos
with different levels of supervision.
First, we develop approaches for training convolutional networks with weaker
forms of supervision, such as bounding boxes or image labels, for object
boundary estimation and semantic/instance labelling tasks. We propose to
generate pixel-level approximate groundtruth from these weaker forms of
annotations to train a network, which makes it possible to achieve high-quality
results comparable to the full supervision quality without any
modifications of the network architecture or the training procedure.
Second, we address the problem of the excessive computational and memory
costs inherent to solving video segmentation via graphs. We propose
approaches to improve the runtime and memory efficiency as well as the
output segmentation quality by learning from the available training data
the best representation of the graph. In particular, we contribute with
learning must-link constraints, the topology and edge weights of the graph
as well as enhancing the graph nodes - superpixels - themselves.
Third, we tackle the task of pixel-level object tracking and address the
problem of the limited amount of densely annotated video data for training
convolutional networks. We introduce an architecture which allows training
with static images only and propose an elaborate data synthesis scheme
which creates a large number of training examples close to the target
domain from the given first frame mask. With the proposed techniques we
show that densely annotated consecutive video data is not necessary to
achieve high-quality temporally coherent video segmentation results.
In summary, this thesis advances the state of the art in weakly supervised
image segmentation, graph-based video segmentation and pixel-level object
tracking and contributes new ways of training convolutional
networks with a limited amount of pixel-level annotated training data.
621
Paper
D2
A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele
“Lucid Data Dreaming for Multiple Object Tracking,” 2017. [Online]. Available: http://arxiv.org/abs/1703.09554.
Abstract
Convolutional networks reach top quality in pixel-level object tracking but
require a large amount of training data (1k ~ 10k) to deliver such results. We
propose a new training strategy which achieves state-of-the-art results across
three evaluation datasets while using 20x ~ 100x less annotated data than
competing methods. Instead of using large training sets hoping to generalize
across domains, we generate in-domain training data using the provided
annotation on the first frame of each video to synthesize ("lucid dream")
plausible future video frames. In-domain per-video training data allows us to
train high quality appearance- and motion-based models, as well as tune the
post-processing stage. This approach reaches competitive results even
when training from only a single annotated frame, without ImageNet
pre-training. Our results indicate that using a larger training set is not
automatically better, and that for the tracking task a smaller training set
that is closer to the target domain is more effective. This changes the mindset
regarding how many training samples and general "objectness" knowledge are
required for the object tracking task.
- PuRe
- BibTeX
622
Thesis
D2, IMPR-CS
M. Lapin
“Image Classification with Limited Training Data and Class Ambiguity,” Universität des Saarlandes, Saarbrücken, 2017.
Abstract
Modern image classification methods are based on supervised learning algorithms that require labeled training data. However, only a limited amount of annotated data may be available in certain applications due to scarcity of the data itself or high costs associated with human annotation. Introduction of additional information and structural constraints can help improve the performance of a learning algorithm. In this thesis, we study the framework of learning using privileged information and demonstrate its relation to learning with instance weights. We also consider multitask feature learning and develop an efficient dual optimization scheme that is particularly well suited to problems with high dimensional image descriptors. Scaling annotation to a large number of image categories leads to the problem of class ambiguity where clear distinction between the classes is no longer possible. Many real world images are naturally multilabel yet the existing annotation might only contain a single label. In this thesis, we propose and analyze a number of loss functions that allow for a certain tolerance in top k predictions of a learner. Our results indicate consistent improvements over the standard loss functions that put more penalty on the first incorrect prediction compared to the proposed losses. All proposed learning methods are complemented with efficient optimization schemes that are based on stochastic dual coordinate ascent for convex problems and on gradient descent for nonconvex formulations.
623
Paper
D2
W. Li, J. Bohg, and M. Fritz
“Acquiring Target Stacking Skills by Goal-Parameterized Deep Reinforcement Learning,” 2017. [Online]. Available: http://arxiv.org/abs/1711.00267.
Abstract
Understanding physical phenomena is a key component of human intelligence and
enables physical interaction with previously unseen environments. In this
paper, we study how an artificial agent can autonomously acquire this intuition
through interaction with the environment. We created a synthetic block stacking
environment with physics simulation in which the agent can learn a policy
end-to-end through trial and error. Thereby, we avoid explicitly modeling
physical knowledge within the policy. We are specifically interested in tasks
that require the agent to reach a given goal state that may be different for
every new trial. To this end, we propose a deep reinforcement learning
framework that learns policies which are parametrized by a goal. We validated
the model on a toy example navigating in a grid world with different target
positions and in a block stacking task with different target structures of the
final tower. In contrast to prior work, our policies show better generalization
across different goals.
- PuRe
- BibTeX
624
Thesis
D2, IMPR-CS
M. Malinowski
“Towards Holistic Machines: From Visual Recognition To Question Answering About Real-world Image,” Universität des Saarlandes, Saarbrücken, 2017.
Abstract
Computer Vision has undergone major changes over the past five years. Here, we investigate if the performance of such architectures generalizes to more complex tasks that require a more holistic approach to scene comprehension. The presented work focuses on learning spatial and multi-modal representations, and the foundations of a Visual Turing Test, where the scene understanding is tested by a series of questions about its content. In our studies, we propose DAQUAR, the first ‘question answering about real-world images’ dataset, together with two methods that address the problem: a symbolic-based and a neural-based visual question answering architecture. The symbolic-based method relies on a semantic parser, a database of visual facts, and a Bayesian formulation that accounts for various interpretations of the visual scene. The neural-based method is an end-to-end architecture composed of a question encoder, image encoder, multimodal embedding, and answer decoder. This architecture has proven to be effective in capturing language-based biases. It has also become a standard component of other visual question answering architectures. Along with the methods, we also investigate various evaluation metrics that embrace uncertainty in word meaning and various interpretations of the scene and the question.
625
Paper
D2
S. J. Oh, R. Benenson, M. Fritz, and B. Schiele
“Person Recognition in Social Media Photos,” 2017. [Online]. Available: http://arxiv.org/abs/1710.03224.
Abstract
People nowadays share large parts of their personal lives through social
media. Being able to automatically recognise people in personal photos may
greatly enhance user convenience by easing photo album organisation. For the human
identification task, however, the traditional focus of computer vision has been
face recognition and pedestrian re-identification. Person recognition in social
media photos sets new challenges for computer vision, including non-cooperative
subjects (e.g. backward viewpoints, unusual poses) and great changes in
appearance. To tackle this problem, we build a simple person recognition
framework that leverages convnet features from multiple image regions (head,
body, etc.). We propose new recognition scenarios that focus on the time and
appearance gap between training and testing samples. We present an in-depth
analysis of the importance of different features according to time and
viewpoint generalisability. In the process, we verify that our simple approach
achieves the state of the art result on the PIPA benchmark, arguably the
largest social media based benchmark for person recognition to date with
diverse poses, viewpoints, social groups, and events.
Compared to the conference version of the paper, this paper additionally
presents (1) analysis of a face recogniser (DeepID2+), (2) new method naeil2
that combines the conference version method naeil and DeepID2+ to achieve state
of the art results even compared to post-conference works, (3) discussion of
related work since the conference version, (4) additional analysis including
the head viewpoint-wise breakdown of performance, and (5) results on the
open-world setup.
- PuRe
- BibTeX
626
Paper
D2
S. J. Oh, M. Augustin, B. Schiele, and M. Fritz
“Whitening Black-Box Neural Networks,” 2017. [Online]. Available: http://arxiv.org/abs/1711.01768.
Abstract
Many deployed learned models are black boxes: given an input, they return an output.
Internal information about the model, such as the architecture, optimisation
procedure, or training data, is not disclosed explicitly as it might contain
proprietary information or make the system more vulnerable. This work shows
that such attributes of neural networks can be exposed from a sequence of
queries. This has multiple implications. On the one hand, our work exposes the
vulnerability of black-box neural networks to different types of attacks -- we
show that the revealed internal information helps generate more effective
adversarial examples against the black box model. On the other hand, this
technique can be used for better protection of private content from automatic
recognition models using adversarial examples. Our paper suggests that it is
actually hard to draw a line between white box and black box models.
- PuRe
- BibTeX
627
Paper
D2
D. H. Park, L. A. Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell, and M. Rohrbach
“Attentive Explanations: Justifying Decisions and Pointing to the Evidence (Extended Abstract),” 2017. [Online]. Available: http://arxiv.org/abs/1711.07373.
Abstract
Deep models are the de facto standard in visual decision problems due to their
impressive performance on a wide array of visual tasks. On the other hand,
their opaqueness has led to a surge of interest in explainable systems. In this
work, we emphasize the importance of model explanation in various forms such as
visual pointing and textual justification. The lack of data with justification
annotations is one of the bottlenecks of generating multimodal explanations.
Thus, we propose two large-scale datasets with annotations that visually and
textually justify a classification decision for various activities, i.e. ACT-X,
and for question answering, i.e. VQA-X. We also introduce a multimodal
methodology for generating visual and textual explanations simultaneously. We
quantitatively show that training with the textual explanations not only yields
better textual justification models, but also models that better localize the
evidence that supports their decision.
- PuRe
- BibTeX
628
Thesis
D2, IMPR-CS
A. Rohrbach
“Generation and Grounding of Natural Language Descriptions for Visual Data,” Universität des Saarlandes, Saarbrücken, 2017.
Abstract
Generating natural language descriptions for visual data links computer vision and computational linguistics. Being able to generate a concise and human-readable description of a video is a step towards visual understanding. At the same time, grounding natural language in visual data provides disambiguation for the linguistic concepts, necessary for many applications. This thesis focuses on both directions and tackles three specific problems. First, we develop recognition approaches to understand video of complex cooking activities. We propose an approach to generate coherent multi-sentence descriptions for our videos. Furthermore, we tackle the new task of describing videos at variable level of detail. Second, we present a large-scale dataset of movies and aligned professional descriptions. We propose an approach, which learns from videos and sentences to describe movie clips relying on robust recognition of visual semantic concepts. Third, we propose an approach to ground textual phrases in images with little or no localization supervision, which we further improve by introducing Multimodal Compact Bilinear Pooling for combining language and vision representations. Finally, we jointly address the task of describing videos and grounding the described people. To summarize, this thesis advances the state-of-the-art in automatic video description and visual grounding and also contributes large datasets for studying the intersection of computer vision and computational linguistics.
- PDF
- DOI
- PuRe
- BibTeX
629
Paper
D2
H. Sattar, M. Fritz, and A. Bulling
“Visual Decoding of Targets During Visual Search From Human Eye Fixations,” 2017. [Online]. Available: http://arxiv.org/abs/1706.05993.
Abstract
What does human gaze reveal about a user's intents and to what extent can
these intents be inferred or even visualized? Gaze was proposed as an implicit
source of information to predict the target of visual search and, more
recently, to predict the object class and attributes of the search target. In
this work, we go one step further and investigate the feasibility of combining
recent advances in encoding human gaze information using deep convolutional
neural networks with the power of generative image models to visually decode,
i.e. create a visual representation of, the search target. Such visual decoding
is challenging for two reasons: 1) the search target only resides in the user's
mind as a subjective visual pattern, and can most often not even be described
verbally by the person, and 2) it is, as of yet, unclear if gaze fixations
contain sufficient information for this task at all. We show, for the first
time, that visual representations of search targets can indeed be decoded only
from human gaze fixations. We propose to first encode fixations into a semantic
representation and then decode this representation into an image. We evaluate
our method on a recent gaze dataset of 14 participants searching for clothing
in image collages and validate the model's predictions using two human studies.
Our results show that 62% of the time (chance level = 10%) users were able to
correctly select the category of the decoded image. In our second study we show
the importance of a local gaze encoding for decoding visual search targets of
user
- PuRe
- BibTeX
630
Thesis
D2, IMPR-CS
S. Tang
“People detection and tracking in crowded scenes,” Universität des Saarlandes, Saarbrücken, 2017.
Abstract
People are often a central element of visual scenes, particularly in real-world street scenes. Thus it has been a long-standing goal in Computer Vision to develop methods aiming at analyzing humans in visual data. Due to the complexity of real-world scenes, visual understanding of people remains challenging for machine perception. In this thesis we focus on advancing the techniques for people detection and tracking in crowded street scenes. We also propose new models for human pose estimation and motion segmentation in realistic images and videos. First, we propose detection models that are jointly trained to detect single person as well as pairs of people under varying degrees of occlusion. The learning algorithm of our joint detector facilitates a tight integration of tracking and detection, because it is designed to address common failure cases during tracking due to long-term inter-object occlusions. Second, we propose novel multi person tracking models that formulate tracking as a graph partitioning problem. Our models jointly cluster detection hypotheses in space and time, eliminating the need for a heuristic non-maximum suppression. Furthermore, for crowded scenes, our tracking model encodes long-range person re-identification information into the detection clustering process in a unified and rigorous manner. Third, we explore the visual tracking task in different granularity. We present a tracking model that simultaneously clusters object bounding boxes and pixel level trajectories over time. This approach provides a rich understanding of the motion of objects in the scene. Last, we extend our tracking model for the multi person pose estimation task. We introduce a joint subset partitioning and labelling model where we simultaneously estimate the poses of all the people in the scene. In summary, this thesis addresses a number of diverse tasks that aim to enable vision systems to analyze people in realistic images and videos. 
In particular, the thesis proposes several novel ideas and rigorous mathematical formulations, pushes the boundary of the state of the art, and achieves superior performance.

2016

631
Conference paper
D2
Z. Akata, M. Malinowski, M. Fritz, and B. Schiele
“Multi-Cue Zero-Shot Learning with Strong Supervision,” in 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 2016.
- PDF
- DOI
- PuRe
- BibTeX
632
Conference paper
D2
B. Bhattarai, G. Sharma, and F. Jurie
“CP-mtML: Coupled Projection Multi-task Metric Learning for Large Scale Face Retrieval,” in 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 2016.
633
Conference paper
D2
M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele
“The Cityscapes Dataset for Semantic Urban Scene Understanding,” in 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 2016.
634
Conference paper
D2
F. Jug, E. Levinkov, C. Blasse, E. W. Myers, and B. Andres
“Moral Lineage Tracing,” in 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 2016.
635
Conference paper
D2
A. Khoreva, R. Benenson, M. Omran, M. Hein, and B. Schiele
“Weakly Supervised Object Boundaries,” in 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 2016.
Abstract
State-of-the-art learning based boundary detection methods require extensive
training data. Since labelling object boundaries is one of the most expensive
types of annotations, there is a need to relax the requirement to carefully
annotate images to make both the training more affordable and to extend the
amount of training data. In this paper we propose a technique to generate
weakly supervised annotations and show that bounding box annotations alone
suffice to reach high-quality object boundaries without using any
object-specific boundary annotations. With the proposed weak supervision
techniques we achieve the top performance on the object boundary detection
task, outperforming by a large margin the current fully supervised
state-of-the-art methods.
- PDF
- DOI
- PuRe
- BibTeX
636
Conference paper
D2
M. Lapin, M. Hein, and B. Schiele
“Loss Functions for Top-k Error: Analysis and Insights,” in 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 2016.
- PDF
- DOI
- PuRe
- BibTeX
637
Conference paper
D2
L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele
“DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation,” in 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 2016.
- PDF
- DOI
- PuRe
- BibTeX
638
Conference paper
D2
S. Reed, Z. Akata, H. Lee, and B. Schiele
“Learning Deep Representations of Fine-Grained Visual Descriptions,” in 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 2016.
- PuRe
- BibTeX
639
Conference paper
D2
K. Rematas, T. Ritschel, M. Fritz, E. Gavves, and T. Tuytelaars
“Deep Reflectance Maps,” in 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 2016.
Abstract
Undoing the image formation process and therefore decomposing appearance into
its intrinsic properties is a challenging task due to the under-constrained
nature of this inverse problem. While significant progress has been made on
inferring shape, materials and illumination from images only, progress in an
unconstrained setting is still limited. We propose a convolutional neural
architecture to estimate reflectance maps of specular materials in natural
lighting conditions. We achieve this in an end-to-end learning formulation that
directly predicts a reflectance map from the image itself. We show how to
improve estimates by facilitating additional supervision in an indirect scheme
that first predicts surface orientation and afterwards predicts the reflectance
map by a learning-based sparse data interpolation.
In order to analyze performance on this difficult task, we propose a new
challenge of Specular MAterials on SHapes with complex IllumiNation (SMASHINg)
using both synthetic and real images. Furthermore, we show the application of
our method to a range of image-based editing tasks on real images.
640
Conference paper
D2
L. A. Royer, D. L. Richmond, B. Andres, and D. Kainmueller
“Convexity Shape Constraints for Image Segmentation,” in 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 2016.
641
Conference paper
D2
K. Sikka, G. Sharma, and M. Bartlett
“LOMo: Latent Ordinal Model for Facial Analysis in Videos,” in 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 2016.
642
Conference paper
D2
R. Stewart and M. Andriluka
“End-to-end People Detection in Crowded Scenes,” in 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 2016.
643
Conference paper
D2
Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele
“Latent Embeddings for Zero-shot Classification,” in 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 2016.
- PDF
- DOI
- PuRe
- BibTeX
644
Conference paper
D2
S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele
“How Far are We from Solving Pedestrian Detection?,” in 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 2016.
645
Article
D4, D2
H. Rhodin, C. Richardt, D. Casas, E. Insafutdinov, M. Shafiei, H.-P. Seidel, B. Schiele, and C. Theobalt
“EgoCap: Egocentric Marker-less Motion Capture with Two Fisheye Cameras,” ACM Transactions on Graphics (Proc. ACM SIGGRAPH Asia 2016), vol. 35, no. 6, 2016.
646
Conference paper
D2
S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee
“Learning What and Where to Draw,” in Advances in Neural Information Processing Systems 29 (NIPS 2016), Barcelona, Spain, 2016.
- PuRe
- BibTeX
647
Conference paper
D2
S. Schneegass, Y. Oualil, and A. Bulling
“SkullConduct: Biometric User Identification on Eyewear Computers Using Bone Conduction Through the Skull,” in CHI 2016, 34th Annual ACM Conference on Human Factors in Computing Systems, San Jose, CA, USA, 2016.
648
Conference paper
D2
P. Xu, Y. Sugano, and A. Bulling
“Spatio-Temporal Modeling and Prediction of Visual Attention in Graphical User Interfaces,” in CHI 2016, 34th Annual ACM Conference on Human Factors in Computing Systems, San Jose, CA, USA, 2016.
649
Conference paper
D2
M. Khamis, F. Alt, M. Hassib, E. von Zezschwitz, R. Hasholzner, and A. Bulling
“GazeTouchPass: Multimodal Authentication Using Gaze and Touch on Mobile Devices,” in CHI 2016 Extended Abstracts, San Jose, CA, USA, 2016.
650
Conference paper
D2
D. Kirst and A. Bulling
“On the Verge: Voluntary Convergences for Accurate and Precise Timing of Gaze Input,” in CHI 2016 Extended Abstracts, San Jose, CA, USA, 2016.
Abstract
Rotations performed with the index finger and thumb involve some of the most complex motor action among common multi-touch gestures, yet little is known about the factors affecting performance and ergonomics. This note presents results from a study where the angle, direction, diameter, and position of rotations were systematically manipulated. Subjects were asked to perform the rotations as quickly as possible without losing contact with the display, and were allowed to skip rotations that were too uncomfortable. The data show surprising interaction effects among the variables, and help us identify whole categories of rotations that are slow and cumbersome for users.
651
Article
D2
A. Bulling
“Pervasive Attentive User Interfaces,” Computer, vol. 49, no. 1, 2016.
652
Conference paper
D2
W.-C. Chiu, F. Galasso, and M. Fritz
“Towards Segmenting Consumer Stereo Videos: Benchmark, Baselines and Ensembles,” in Computer Vision -- ACCV 2016, Taipei, Taiwan, 2017.
653
Article
D2
G. Sharma and F. Jurie
“Local Higher-order Statistics (LHS) Describing Images with Statistics of Local Non-binarized Pixel Patterns,” Computer Vision and Image Understanding, vol. 142, 2016.
- PDF
- DOI
- PuRe
- BibTeX
654
Conference paper
D2
T. Beier, B. Andres, U. Köthe, and F. A. Hamprecht
“An Efficient Fusion Move Algorithm for the Minimum Cost Lifted Multicut Problem,” in Computer Vision - ECCV 2016, Amsterdam, The Netherlands, 2016.
655
Conference paper
D2
L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele, and T. Darrell
“Generating Visual Explanations,” in Computer Vision -- ECCV 2016, Amsterdam, The Netherlands, 2016.
Abstract
Clearly explaining a rationale for a classification decision to an end-user
can be as important as the decision itself. Existing approaches for deep visual
recognition are generally opaque and do not output any justification text;
contemporary vision-language models can describe image content but fail to take
into account class-discriminative image aspects which justify visual
predictions. We propose a new model that focuses on the discriminating
properties of the visible object, jointly predicts a class label, and explains
why the predicted label is appropriate for the image. We propose a novel loss
function based on sampling and reinforcement learning that learns to generate
sentences that realize a global sentence property, such as class specificity.
Our results on a fine-grained bird species classification dataset show that our
model is able to generate explanations which are not only consistent with an
image but also more discriminative than descriptions produced by existing
captioning methods.
656
Conference paper
D2
E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele
“DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model,” in Computer Vision -- ECCV 2016, Amsterdam, The Netherlands, 2016.
Abstract
The goal of this paper is to advance the state-of-the-art of articulated pose
estimation in scenes with multiple people. To that end we contribute on three
fronts. We propose (1) improved body part detectors that generate effective
bottom-up proposals for body parts; (2) novel image-conditioned pairwise terms
that allow assembling the proposals into a variable number of consistent body
part configurations; and (3) an incremental optimization strategy that explores
the search space more efficiently thus leading both to better performance and
significant speed-up factors. We evaluate our approach on two single-person and
two multi-person pose estimation benchmarks. The proposed approach
significantly outperforms best known multi-person pose estimation results while
demonstrating competitive performance on the task of single person pose
estimation. Models and code are available at pose.mpi-inf.mpg.de
657
Conference paper
D2
S. J. Oh, R. Benenson, M. Fritz, and B. Schiele
“Faceless Person Recognition: Privacy Implications in Social Media,” in Computer Vision -- ECCV 2016, Amsterdam, The Netherlands, 2016.
- PDF
- DOI
- PuRe
- BibTeX
658
Conference paper
D2
A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele
“Grounding of Textual Phrases in Images by Reconstruction,” in Computer Vision -- ECCV 2016, Amsterdam, The Netherlands, 2016.
- PDF
- DOI
- PuRe
- BibTeX
659
Conference paper
D2
E. Wood, T. Baltrušaitis, L.-P. Morency, P. Robinson, and A. Bulling
“A 3D Morphable Eye Region Model for Gaze Estimation,” in Computer Vision -- ECCV 2016, Amsterdam, The Netherlands, 2016.
660
Conference paper
D2
A. Sharma, O. Grau, and M. Fritz
“VConv-DAE: Deep Volumetric Shape Learning Without Object Labels,” in Computer Vision - ECCV 2016 Workshops, Amsterdam, The Netherlands, 2016.
Abstract
With the advent of affordable depth sensors, 3D capture becomes more and more
ubiquitous and already has made its way into commercial products. Yet,
capturing the geometry or complete shapes of everyday objects using scanning
devices (e.g. Kinect) still comes with several challenges that result in noise
or even incomplete shapes. Recent success in deep learning has shown how to
learn complex shape distributions in a data-driven way from large scale 3D CAD
model collections and to utilize them for 3D processing on volumetric
representations, thereby circumventing problems of topology and
tessellation. Prior work has shown encouraging results on problems ranging from
shape completion to recognition. We provide an analysis of such approaches and
discover that training as well as the resulting representation are strongly and
unnecessarily tied to the notion of object labels. Furthermore, deep learning
research argues [Vincent08] that learning representations with
over-complete models is more prone to overfitting compared to the approach that
learns from noisy data. Thus, we investigate a fully convolutional volumetric
denoising auto encoder that is trained in an unsupervised fashion. It
outperforms prior work on recognition as well as more challenging tasks like
denoising and shape completion. In addition, our approach is at least two orders
of magnitude faster at test time and thus provides a path to scaling up 3D
deep learning.
661
Conference paper
D2
S. Tang, B. Andres, M. Andriluka, and B. Schiele
“Multi-Person Tracking by Multicut and Deep Matching,” in Computer Vision - ECCV 2016 Workshops, Amsterdam, The Netherlands, 2016.
662
Conference paper
D2
A. Khoreva, R. Benenson, F. Galasso, M. Hein, and B. Schiele
“Improved Image Boundaries for Better Video Segmentation,” in Computer Vision -- ECCV 2016 Workshops, Amsterdam, The Netherlands, 2016.
Abstract
Graph-based video segmentation methods rely on superpixels as starting point.
While most previous work has focused on the construction of the graph edges and
weights as well as solving the graph partitioning problem, this paper focuses
on better superpixels for video segmentation. We demonstrate by a comparative
analysis that superpixels extracted from boundaries perform best, and show that
boundary estimation can be significantly improved via image and time domain
cues. With superpixels generated from our better boundaries we observe
consistent improvement for two video segmentation methods in two different
datasets.
- PDF
- DOI
- PuRe
- BibTeX
663
Proceedings
D2
A. Bulling, O. Cakmakci, K. Kunze, and J. M. Rehg
Eds., Eyewear Computing -- Augmenting the Human with Head-mounted Wearable Assistants, no. 1. Schloss Dagstuhl, 2016.
664
Conference paper
D2
F. Alt, A. Bulling, L. Mecke, and D. Buschek
“Attention, please!: Comparing Features for Measuring Audience Attention Towards Pervasive Displays,” in DIS 2016, 11th ACM SIGCHI Designing Interactive Systems Conference, Brisbane, Australia, 2016.
665
Book chapter / section
D2
Y. Sato, Y. Sugano, A. Sugimoto, Y. Kuno, and H. Koike
“Sensing and Controlling Human Gaze in Daily Living Space for Human-Harmonized Information Environments,” in Human-Harmonized Information Technology, Tokyo: Springer, 2016.
666
Conference paper
D2
M. Dhuliawala, J. Lee, J. Shimizu, A. Bulling, K. Kunze, T. Starner, and W. Woo
“Smooth Eye Movement Interaction Using EOG Glasses,” in ICMI’16, 18th ACM International Conference on Multimodal Interaction, Tokyo, Japan, 2016.
667
Conference paper
D2
S. Nag Chowdhury, M. Malinowski, A. Bulling, and M. Fritz
“Xplore-M-Ego: Contextual Media Retrieval Using Natural Language Queries,” in ICMR’16, ACM International Conference on Multimedia Retrieval, New York, NY, USA, 2016.
- PDF
- DOI
- PuRe
- BibTeX
668
Conference poster
D2
M. Malinowski, M. Rohrbach, and M. Fritz
“Ask Your Neurons Again: Analysis of Deep Methods with Global Image Representation,” IEEE Conference on Computer Vision and Pattern Recognition Workshops (VQA 2016). IEEE, Piscataway, NJ.
Abstract
We are addressing an open-ended question answering task
about real-world images. With the help of currently available methods
developed in Computer Vision and Natural Language Processing, we would
like to push an architecture with a global visual representation to its
limits. In our contribution, we show how to achieve competitive
performance on VQA with global visual features (Residual Net) together
with a carefully designed architecture.
- PuRe
- BibTeX
669
Conference paper
D2
B. Bhattarai, G. Sharma, A. Lechervy, and F. Jurie
“A Joint Learning Approach for Cross Domain Age Estimation,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2016), Shanghai, China, 2016.
670
Article
D2
H. Oh Song, M. Fritz, D. Goehring, and T. Darrell
“Learning to Detect Visual Grasp Affordance,” IEEE Transactions on Automation Science and Engineering, vol. 13, no. 2, 2016.
- PDF
- DOI
- PuRe
- BibTeX
671
Article
D2
Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid
“Label-Embedding for Image Classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 7, 2016.
672
Article
D2
V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab, and S. Ilic
“3D Pictorial Structures Revisited: Multiple Human Pose Estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, 2016.
673
Article
D2
J. Deng, J. Krause, M. Stark, and L. Fei-Fei
“Leveraging the Wisdom of the Crowd for Fine-Grained Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 4, 2016.
674
Article
D2
J. Hosang, R. Benenson, P. Dollár, and B. Schiele
“What Makes for Effective Detection Proposals?,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 4, 2016.
675
Article
D2
E. T. Turetken, F. Benmansour, B. Andres, P. Głowacki, and H. Pfister
“Reconstructing Curvilinear Networks using Path Classifiers and Integer Programming,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 12, 2016.
676
Conference paper
D2
D. Pohl, X. Zhang, and A. Bulling
“Combining Eye Tracking with Optimizations for Lens Astigmatism in modern wide-angle HMDs,” in 2016 IEEE Virtual Reality Conference (VR), Greenville, SC, USA, 2016.
677
Conference paper
D2
W. Li and M. Fritz
“Recognition of Ongoing Complex Activities by Sequence Prediction Over a Hierarchical Label Space,” in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV 2016), Lake Placid, NY, USA, 2016.
678
Article
D2
A. Bulling and K. Kunze
“Eyewear Computers for Human-Computer Interaction,” Interactions, vol. 23, no. 3, 2016.
679
Article
D2
H. Jeong, D. Saakes, U. Lee, A. Esteves, E. Velloso, A. Bulling, K. Masai, Y. Sugiura, M. Ogata, K. Kunze, M. Inami, M. Sugimoto, A. Rathnayake, and T. Dias
“Demo hour,” Interactions, vol. 23, no. 1, 2016.
680
Article
D2
M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. Andriluka, M. Pinkal, and B. Schiele
“Recognizing Fine-grained and Composite Activities Using Hand-centric Features and Script Data,” International Journal of Computer Vision, vol. 119, no. 3, 2016.
- PDF
- DOI
- PuRe
- BibTeX
681
Proceedings
D2
B. Rosenhahn and B. Andres
Eds., Pattern Recognition. Springer, 2016.
682
Article
D2
W. Fuhl, M. Tonsen, A. Bulling, and E. Kasneci
“Pupil Detection for Head-mounted Eye Tracking in the Wild: An Evaluation of the State of the Art,” Machine Vision and Applications, vol. 27, no. 8, 2016.
683
Conference paper
D2
M. Rempfler, B. Andres, and B. H. Menze
“The Minimum Cost Connected Subgraph Problem in Medical Image Analysis,” in Medical Image Computing and Computer-Assisted Intervention -- MICCAI 2016, Athens, Greece, 2016.
684
Conference paper
D2
P. Aditya, R. Sen, P. Druschel, S. J. Oh, R. Benenson, M. Fritz, B. Schiele, B. Bhattacharjee, and T. T. Wu
“Demo: I-Pic: A Platform for Privacy-Compliant Image Capture,” in MobiSys’16, 14th Annual International Conference on Mobile Systems, Applications, and Services, Singapore, 2016.
685
Conference paper
D2
P. Aditya, R. Sen, P. Druschel, S. J. Oh, R. Benenson, M. Fritz, B. Schiele, B. Bhattacharjee, and T. T. Wu
“I-Pic: A Platform for Privacy-Compliant Image Capture,” in MobiSys’16, 14th Annual International Conference on Mobile Systems, Applications, and Services, Singapore, 2016.
686
Conference paper
D2
P. Aditya, R. Sen, P. Druschel, S. J. Oh, R. Benenson, M. Fritz, B. Schiele, B. Bhattacharjee, and T. T. Wu
“I-Pic: A Platform for Privacy-Compliant Image Capture,” in MobiSys’16, 14th Annual International Conference on Mobile Systems, Applications, and Services, Singapore, 2016.
687
Conference paper
D2
A. Bhattacharyya, M. Malinowski, and M. Fritz
“Long Term Boundary Extrapolation for Deterministic Motion,” in NIPS Workshop on Intuitive Physics, Barcelona, Spain, 2016.
- PuRe
- BibTeX
688
Conference paper
D2
J. Hosang, R. Benenson, and B. Schiele
“A Convnet for Non-maximum Suppression,” in Pattern Recognition (GCPR 2016), Hannover, Germany, 2016.
Abstract
Non-maximum suppression (NMS) is used in virtually all state-of-the-art
object detection pipelines. While essential object detection ingredients such
as features, classifiers, and proposal methods have been extensively researched,
surprisingly little work has aimed to systematically address NMS. The de-facto
standard for NMS is based on greedy clustering with a fixed distance threshold,
which forces a trade-off between recall and precision. We propose a convnet
designed to perform NMS of a given set of detections. We report experiments on
a synthetic setup, and results on crowded pedestrian detection scenes. Our
approach overcomes the intrinsic limitations of greedy NMS, obtaining better
recall and precision.
- PDF
- DOI
- PuRe
- BibTeX
689
Conference paper
D2
J. Scheer, M. Fritz, and O. Grau
“Learning to Select Long-Track Features for Structure-From-Motion and Visual SLAM,” in Pattern Recognition (GCPR 2016), Hannover, Germany, 2016.
690
Conference paper
D2
I. Shcherbatyi and B. Andres
“Convexification of Learning from Constraints,” in Pattern Recognition (GCPR 2016), Hannover, Germany, 2016.
691
Article
D2
D. J. Cook, A. Bulling, and Z. Yu
“Special Issue Introduction,” Pervasive and Mobile Computing (Proc. PerCom 2015), vol. 26, 2016.
692
Conference paper
D2
M. Barz, F. Daiber, and A. Bulling
“Prediction of Gaze Estimation Error for Error-Aware Gaze-Based Interfaces,” in Proceedings ETRA 2016, Charleston, SC, USA, 2016.
693
Conference paper
D2
M. Mansouryar, J. Steil, Y. Sugano, and A. Bulling
“3D Gaze Estimation from 2D Pupil Positions on Monocular Head-Mounted Eye Trackers,” in Proceedings ETRA 2016, Charleston, SC, USA, 2016.
694
Conference paper
D2
L. Sesma-Sanchez, Y. Zhang, H. Gellersen, and A. Bulling
“Gaussian Processes as an Alternative to Polynomial Gaze Estimation Functions,” in Proceedings ETRA 2016, Charleston, SC, USA, 2016.
695
Conference paper
D2
M. Tonsen, X. Zhang, Y. Sugano, and A. Bulling
“Labelled Pupils in the Wild: A Dataset for Studying Pupil Detection in Unconstrained Environments,” in Proceedings ETRA 2016, Charleston, SC, USA, 2016.
696
Conference paper
D2
E. Wood, T. Baltrušaitis, L.-P. Morency, P. Robinson, and A. Bulling
“Learning an Appearance-based Gaze Estimator from One Million Synthesised Images,” in Proceedings ETRA 2016, Charleston, SC, USA, 2016.
697
Conference paper
D2
F. Alt, M. Mikusz, S. Schneegass, and A. Bulling
“Long-term Memorability of Cued-Recall Graphical Passwords with Saliency Masks,” in Proceedings of the 15th International Conference on Mobile and Ubiquitous Multimedia (MUM 2016), Rovaniemi, Finland, 2016.
698
Conference paper
D2
M. Khamis, L. Trotter, M. Tessman, C. Dannhart, A. Bulling, and F. Alt
“EyeVote in the Wild: Do Users bother Correcting System Errors on Public Displays?,” in Proceedings of the 15th International Conference on Mobile and Ubiquitous Multimedia (MUM 2016), Rovaniemi, Finland, 2016.
699
Conference paper
D2
S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee
“Generative Adversarial Text to Image Synthesis,” in Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), New York, NY, USA, 2016.
- PuRe
- BibTeX
700
Conference paper
D2
A. Mokarian Forooshani, M. Malinowski, and M. Fritz
“Mean Box Pooling: A Rich Image Representation and Output Embedding for the Visual Madlibs Task,” in Proceedings of the British Machine Vision Conference (BMVC 2016), York, UK, 2016.
- PDF
- DOI
- PuRe
- BibTeX
701
Conference paper
D2
A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach
“Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), Austin, TX, USA, 2016.
702
Conference paper
D2
A. L. Simeone, A. Bulling, J. Alexander, and H. Gellersen
“Three-Point Interaction: Combining Bi-manual Direct Touch with Gaze,” in Proceedings of the 2016 International Working Conference on Advanced Visual Interfaces (AVI 2016), Bari, Italy, 2016.
703
Conference paper
D5, D2
N. Tandon, C. D. Hariman, J. Urbani, A. Rohrbach, M. Rohrbach, and G. Weikum
“Commonsense in Parts: Mining Part-Whole Relations from the Web and Image Tags,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 2016.
704
Conference paper
D2
D. Pohl, X. Zhang, A. Bulling, and O. Grau
“Concept for Using Eye Tracking in a Head-mounted Display to Adapt Rendering to the User’s Current Visual Field,” in Proceedings VRST 2016, Munich, Germany, 2016.
705
Book chapter / section
D2
M. Stark, B. Schiele, and A. Leonardis
“Visual Object Class Recognition,” in Springer Handbook of Robotics, 2nd ed., New York, NY: Springer, 2016.
706
Conference paper
D2
E. Levinkov, J. Tompkin, N. Bonneel, S. Kirchhoff, B. Andres, and H. Pfister
“Interactive Multicut Video Segmentation,” in The 24th Pacific Conference on Computer Graphics and Applications Short Papers Proceedings (Pacific Graphics 2016), Okinawa, Japan, 2016.
707
Conference paper
D2
M. Khamis, O. Saltuk, A. Hang, K. Stolz, A. Bulling, and F. Alt
“TextPursuits: Using Text for Pursuits-based Interaction and Calibration on Public Displays,” in UbiComp’16, ACM International Joint Conference on Pervasive and Ubiquitous Computing, Heidelberg, Germany, 2016.
708
Conference paper
D2
A. Bulling, O. Cakmakci, K. Kunze, and J. M. Rehg
“EyeWear 2016: First Workshop on EyeWear Computing,” in UbiComp’16 Adjunct, Heidelberg, Germany, 2016.
709
Conference paper
D2
M. Khamis, F. Alt, and A. Bulling
“Challenges and Design Space of Gaze-enabled Public Displays,” in UbiComp’16 Adjunct, Heidelberg, Germany, 2016.
710
Conference paper
D2
J. Shimizu, J. Lee, M. Dhuliawala, A. Bulling, T. Starner, W. Woo, and K. Kunze
“Solar System: Smooth Pursuit Interactions Using EOG Glasses,” in UbiComp’16 Adjunct, Heidelberg, Germany, 2016.
711
Conference paper
D2
Y. Sugano, X. Zhang, and A. Bulling
“AggreGaze: Collective Estimation of Audience Attention on Public Displays,” in UIST 2016, 29th Annual Symposium on User Interface Software and Technology, Tokyo, Japan, 2016.
712
Paper
D2
A. Bhattacharyya, M. Malinowski, and M. Fritz
“Spatio-Temporal Image Boundary Extrapolation,” 2016. [Online]. Available: http://arxiv.org/abs/1605.07363.
Abstract
Boundary prediction in images as well as video has been a very active topic
of research, and organizing visual information into boundaries and segments is
believed to be a cornerstone of visual perception. While prior work has
focused on predicting boundaries for observed frames, our work aims at
predicting boundaries of future unobserved frames. This requires our model to
learn about the fate of boundaries and extrapolate motion patterns. We
experiment on an established real-world video segmentation dataset, which provides
a testbed for this new task. We show for the first time spatio-temporal
boundary extrapolation in this challenging scenario. Furthermore, we show
long-term prediction of boundaries in situations where the motion is governed
by the laws of physics. We successfully predict boundaries in a billiard
scenario without any assumptions of a strong parametric model or any object
notion. We argue that our model has, with minimalistic model assumptions, derived
a notion of 'intuitive physics' that can be applied to novel scenes.
713
Thesis
D2, IMPR-CS
W.-C. Chiu
“Bayesian Non-Parametrics for Multi-Modal Segmentation,” Universität des Saarlandes, Saarbrücken, 2016.
714
Paper
D2, D4
S. Georgoulis, K. Rematas, T. Ritschel, M. Fritz, T. Tuytelaars, and L. Van Gool
“Natural Illumination from Multiple Materials Using Deep Learning,” 2016. [Online]. Available: http://arxiv.org/abs/1611.09325.
Abstract
Recovering natural illumination from a single Low-Dynamic Range (LDR) image
is a challenging task. To remedy this situation we exploit two properties often
found in everyday images. First, images rarely show a single material, but
rather multiple ones that all reflect the same illumination. However, the
appearance of each material is observed only for some surface orientations, not
all. Second, parts of the illumination are often directly observed in the
background, without being affected by reflection. Typically, this directly
observed part of the illumination is even smaller. We propose a deep
Convolutional Neural Network (CNN) that combines prior knowledge about the
statistics of illumination and reflectance with an input that makes explicit
use of these two observations. Our approach maps multiple partial LDR material
observations represented as reflectance maps and a background image to a
spherical High-Dynamic Range (HDR) illumination map. For training and testing
we propose a new data set comprising synthetic and real images with multiple
materials observed under the same illumination. Qualitative and quantitative
evidence shows that both using multiple materials and using a background are essential to
improve illumination estimations.
- PuRe
- BibTeX
715
Paper
D2
S. Georgoulis, K. Rematas, T. Ritschel, M. Fritz, L. Van Gool, and T. Tuytelaars
“DeLight-Net: Decomposing Reflectance Maps into Specular Materials and Natural Illumination,” 2016. [Online]. Available: http://arxiv.org/abs/1603.08240.
Abstract
In this paper we extract surface reflectance and natural environmental
illumination from a reflectance map, i.e. from a single 2D image of a sphere of
one material under one illumination. This is a notoriously difficult problem,
yet key to various re-rendering applications. With the recent advances in
estimating reflectance maps from 2D images their further decomposition has
become increasingly relevant.
To this end, we propose a Convolutional Neural Network (CNN) architecture to
reconstruct both material parameters (i.e. Phong) as well as illumination (i.e.
high-resolution spherical illumination maps), that is solely trained on
synthetic data. We demonstrate decomposition of synthetic as well as real
photographs of reflectance maps, both in High Dynamic Range (HDR) and, for the
first time, in Low Dynamic Range (LDR). Results are compared to
previous approaches quantitatively as well as qualitatively in terms of
re-renderings where illumination, material, view or shape are changed.
- PuRe
- BibTeX
716
Paper
D2
Y. He, W.-C. Chiu, M. Keuper, and M. Fritz
“RGBD Semantic Segmentation Using Spatio-Temporal Data-Driven Pooling,” 2016. [Online]. Available: http://arxiv.org/abs/1604.02388.
Abstract
Beyond the success in classification, neural networks have recently shown
strong results on pixel-wise prediction tasks like image semantic segmentation
on RGBD data. However, the commonly used deconvolutional layers for upsampling
intermediate representations to the full-resolution output still show different
failure modes, like imprecise segmentation boundaries and label mistakes in
particular on large, weakly textured objects (e.g. fridge, whiteboard, door).
We attribute these errors in part to the rigid way current networks aggregate
information, which can be either too local (missing context) or too global
(inaccurate boundaries). Therefore we propose a data-driven pooling layer that
integrates with fully convolutional architectures and utilizes boundary
detection from RGBD image segmentation approaches. We extend our approach to
leverage region-level correspondences across images with an additional temporal
pooling stage. We evaluate our approach on the NYU-Depth-V2 dataset comprised
of indoor RGBD video sequences and compare it to various state-of-the-art
baselines. Besides a general improvement over the state-of-the-art, our
approach shows particularly good results in terms of accuracy of the predicted
boundaries and in segmenting previously problematic classes.
- PuRe
- BibTeX
717
Paper
D2
S. Hoppe and A. Bulling
“End-to-End Eye Movement Detection Using Convolutional Neural Networks,” 2016. [Online]. Available: http://arxiv.org/abs/1609.02452.
Abstract
Common computational methods for automated eye movement detection - i.e. the
task of detecting different types of eye movement in a continuous stream of
gaze data - are limited in that they either involve thresholding on
hand-crafted signal features, require individual detectors each only detecting
a single movement, or require pre-segmented data. We propose a novel approach
for eye movement detection that only involves learning a single detector
end-to-end, i.e. directly from the continuous gaze data stream and
simultaneously for different eye movements without any manual feature crafting
or segmentation. Our method is based on convolutional neural networks (CNN)
that recently demonstrated superior performance in a variety of tasks in
computer vision, signal processing, and machine learning. We further introduce
a novel multi-participant dataset that contains scripted and free-viewing
sequences of ground-truth annotated saccades, fixations, and smooth pursuits.
We show that our CNN-based method outperforms state-of-the-art baselines by a
large margin on this challenging dataset, thereby underlining the significant
potential of this approach for holistic, robust, and accurate eye movement
protocol analysis.
- PuRe
- BibTeX
718
Paper
D2
M. Keuper, S. Tang, Z. Yu, B. Andres, T. Brox, and B. Schiele
“A Multi-cut Formulation for Joint Segmentation and Tracking of Multiple Objects,” 2016. [Online]. Available: http://arxiv.org/abs/1607.06317.
Abstract
Recently, Minimum Cost Multicut Formulations have been proposed and proven to
be successful in both motion trajectory segmentation and multi-target tracking
scenarios. Both tasks benefit from decomposing a graphical model into an
optimal number of connected components based on attractive and repulsive
pairwise terms. The two tasks are formulated on different levels of granularity
and, accordingly, leverage mostly local information for motion segmentation and
mostly high-level information for multi-target tracking. In this paper we argue
that point trajectories and their local relationships can contribute to the
high-level task of multi-target tracking and also argue that high-level cues
from object detection and tracking are helpful to solve motion segmentation. We
propose a joint graphical model for point trajectories and object detections
whose Multicuts are solutions to motion segmentation and multi-target
tracking problems at once. Results on the FBMS59 motion segmentation benchmark
as well as on pedestrian tracking sequences from the 2D MOT 2015 benchmark
demonstrate the promise of this joint approach.
- PuRe
- BibTeX
719
Paper
D2
W. Li, S. Azimi, A. Leonardis, and M. Fritz
“To Fall Or Not To Fall: A Visual Approach to Physical Stability Prediction,” 2016. [Online]. Available: http://arxiv.org/abs/1604.00066.
Abstract
Understanding physical phenomena is a key competence that enables humans and
animals to act and interact under uncertain perception in previously unseen
environments containing novel objects and their configurations. Developmental
psychology has shown that such skills are acquired by infants from observations
at a very early stage.
In this paper, we contrast a more traditional approach of taking a
model-based route with explicit 3D representations and physical simulation with
an end-to-end approach that directly predicts stability and related quantities
from appearance. We ask the question if and to what extent and quality such a
skill can directly be acquired in a data-driven way bypassing the need for an
explicit simulation.
We present a learning-based approach based on simulated data that predicts
stability of towers comprised of wooden blocks under different conditions and
quantities related to the potential fall of the towers. The evaluation is
carried out on synthetic data and compared to human judgments on the same
stimuli.
- PuRe
- BibTeX
720
Paper
D2
M. Malinowski and M. Fritz
“Tutorial on Answering Questions about Images with Deep Learning,” 2016. [Online]. Available: http://arxiv.org/abs/1610.01076.
Abstract
Together with the development of more accurate methods in Computer Vision and
Natural Language Understanding, holistic architectures that answer questions
about the content of real-world images have emerged. In this tutorial, we build
a neural-based approach to answer questions about images. We base our tutorial
on two datasets: (mostly on) DAQUAR, and (a bit on) VQA. With small tweaks the
models that we present here can achieve a competitive performance on both
datasets; in fact, they are among the best methods that use a combination of
LSTM with a global, full frame CNN representation of an image. We hope that
after reading this tutorial, the reader will be able to use Deep Learning
frameworks, such as Keras and the introduced Kraino, to build various architectures
that will lead to a further performance improvement on this challenging task.
721
Paper
D2
D. H. Park, L. A. Hendricks, Z. Akata, B. Schiele, T. Darrell, and M. Rohrbach
“Attentive Explanations: Justifying Decisions and Pointing to the Evidence,” 2016. [Online]. Available: http://arxiv.org/abs/1612.04757.
Abstract
Deep models are the de facto standard in visual decision models due to their
impressive performance on a wide array of visual tasks. However, they are
frequently seen as opaque and are unable to explain their decisions. In
contrast, humans can justify their decisions with natural language and point to
the evidence in the visual world which led to their decisions. We postulate
that deep models can do this as well and propose our Pointing and Justification
(PJ-X) model which can justify its decision with a sentence and point to the
evidence by introspecting its decision and explanation process using an
attention mechanism. Unfortunately, there is no dataset available with reference
explanations for visual decision making. We thus collect two datasets in two
domains where it is interesting and challenging to explain decisions. First, we
extend the visual question answering task to not only provide an answer but
also a natural language explanation for the answer. Second, we focus on
explaining human activities, which is traditionally more challenging than object
classification. We extensively evaluate our PJ-X model, both on the
justification and pointing tasks, by comparing it to prior models and ablations
using both automatic and human evaluations.
- PuRe
- BibTeX
722
Thesis
D2 IMPR-CS D4
L. Pishchulin
“Articulated People Detection and Pose Estimation in Challenging Real World Environments,” Universität des Saarlandes, Saarbrücken, 2016.
723
Paper
D4 D2
H. Rhodin, C. Richardt, D. Casas, E. Insafutdinov, M. Shafiei, H.-P. Seidel, B. Schiele, and C. Theobalt
“EgoCap: Egocentric Marker-less Motion Capture with Two Fisheye Cameras (Extended Abstract),” 2016. [Online]. Available: http://arxiv.org/abs/1701.00142.
Abstract
Marker-based and marker-less optical skeletal motion-capture methods use an
outside-in arrangement of cameras placed around a scene, with viewpoints
converging on the center. They often cause discomfort through the marker suits
that may be required, and their recording volume is severely restricted and often
constrained to indoor scenes with controlled backgrounds. We therefore propose
a new method for real-time, marker-less and egocentric motion capture which
estimates the full-body skeleton pose from a lightweight stereo pair of fisheye
cameras that are attached to a helmet or virtual-reality headset. It combines
the strength of a new generative pose estimation framework for fisheye views
with a ConvNet-based body-part detector trained on a new automatically
annotated and augmented dataset. Our inside-in method captures full-body motion
in general indoor and outdoor scenes, and also crowded scenes.
- PuRe
- BibTeX
724
Paper
D2
Y. Sugano and A. Bulling
“Seeing with Humans: Gaze-Assisted Neural Image Captioning,” 2016. [Online]. Available: http://arxiv.org/abs/1608.05203.
Abstract
Gaze reflects how humans process visual scenes and is therefore increasingly
used in computer vision systems. Previous works demonstrated the potential of
gaze for object-centric tasks, such as object localization and recognition, but
it remains unclear if gaze can also be beneficial for scene-centric tasks, such
as image captioning. We present a new perspective on gaze-assisted image
captioning by studying the interplay between human gaze and the attention
mechanism of deep neural networks. Using a public large-scale gaze dataset, we
first assess the relationship between state-of-the-art object and scene
recognition models, bottom-up visual saliency, and human gaze. We then propose
a novel split attention model for image captioning. Our model integrates human
gaze information into an attention-based long short-term memory architecture,
and allows the algorithm to allocate attention selectively to both fixated and
non-fixated image regions. Through evaluation on the COCO/SALICON datasets we
show that our method improves image captioning performance and that gaze can
complement machine attention for semantic scene understanding tasks.
- PuRe
- BibTeX

2015

725
Conference paper
D2
N. Koleva, S. Hoppe, M. M. Moniri, M. Staudte, and A. Bulling
“On the Interplay between Spontaneous Spoken Instructions and Human Visual Behaviour in an Indoor Guidance Task,” in 37th Annual Meeting of the Cognitive Science Society (COGSCI 2015), Pasadena, CA, USA, 2015.
726
Conference paper
D2
A. Khan, I. Steiner, R. G. Macdonald, Y. Sugano, and A. Bulling
“Scene Viewing and Gaze Analysis during Phonetic Segmentation Tasks,” in Abstracts of the 18th European Conference on Eye Movements (ECEM 2015), Vienna, Austria, 2015.
727
Article
D2
E. Velloso, D. Schmidt, J. Alexander, H. Gellersen, and A. Bulling
“The Feet in Human-Computer Interaction: A Survey of Foot-Based Interaction,” ACM Computing Surveys, vol. 48, no. 2, 2015.
- PDF
- DOI
- PuRe
- BibTeX
728
Article
D2
A. Bulling, U. Blanke, D. Tan, J. Rekimoto, and G. Abowd
“Introduction to the Special Issue on Activity Recognition for Interaction,” ACM Transactions on Interactive Intelligent Systems, vol. 4, no. 4, 2015.
- PDF
- DOI
- PuRe
- BibTeX
729
Conference paper
D2
P. Jawanpuria, M. Lapin, M. Hein, and B. Schiele
“Efficient Output Kernel Learning for Multiple Tasks,” in Advances in Neural Information Processing Systems 28 (NIPS 2015), Montréal, Canada, 2016.
730
Conference paper
D2
M. Lapin, M. Hein, and B. Schiele
“Top-k Multiclass SVM,” in Advances in Neural Information Processing Systems 28 (NIPS 2015), Montréal, Canada, 2016.
731
Conference paper
D2
M. Rempfler, M. Schneider, G. D. Ielacqua, T. Sprenger, X. Xiao, S. R. Stock, J. Klohs, G. Székely, B. Andres, and B. H. Menze
“Rekonstruktion zerebraler Gefässnetzwerke aus in-vivo μMRA mittels physiologischem Vorwissen zur lokalen Gefässgeometrie,” in Bildverarbeitung für die Medizin 2015 (BVM 2015), Lübeck, Germany, 2015.
732
Article
D2
T. Loetscher, C. Chen, S. Wignall, A. Bulling, S. Hoppe, O. Churches, N. A. Thomas, M. E. R. Nicholls, and A. Lee
“A Study on the Natural History of Scanning Behaviour in Patients with Visual Field Defects after Stroke,” BMC Neurology, vol. 15, 2015.
- PDF
- DOI
- PuRe
- BibTeX
733
Conference paper
D2
J. Turner, J. Alexander, A. Bulling, and H. Gellersen
“Gaze+RST: Integrating Gaze and Multitouch for Remote Rotate-scale-translate Tasks,” in CHI 2015, 33rd Annual ACM Conference on Human Factors in Computing Systems, Seoul, Korea, 2015.
- PDF
- DOI
- PuRe
- BibTeX
734
Conference paper
D2
M. Vidal, R. Bismuth, A. Bulling, and H. Gellersen
“The Royal Corgi: Exploring Social Gaze Interaction for Immersive Gameplay,” in CHI 2015, 33rd Annual ACM Conference on Human Factors in Computing Systems, Seoul, Korea, 2015.
Abstract
The eyes are a rich channel for non-verbal communication in
our daily interactions. We propose social gaze interaction as a game
mechanic to enhance user interactions with virtual characters. We
develop a game from the ground up in which characters are designed to be
reactive to the player’s gaze in social ways, such as getting annoyed
when the player seems distracted or changing their dialogue depending on
the player’s apparent focus of attention. Results from a qualitative user
study provide insights about how social gaze interaction is intuitive for
users, elicits deep feelings of immersion, and highlight the players’
self-consciousness of their own eye movements through their strong
reactions to the characters.
- PDF
- DOI
- PuRe
- BibTeX
735
Article
D2
S. Savarese, M. Sun, and M. Stark
“Editorial of Special Issue on Shape Representations Meet Visual Recognition,” Computer Vision and Image Understanding, vol. 139, 2015.
736
Report
D2
M. Barz, A. Bulling, and F. Daiber
“Computational Modelling and Prediction of Gaze Estimation Error for Head-mounted Eye Trackers,” DFKI, Saarbrücken, 15-01, 2015.
Abstract
Head-mounted eye tracking has significant potential for
mobile gaze-based interaction with ambient displays but current
interfaces lack information about the tracker's gaze estimation error.
Consequently, current interfaces do not exploit the full potential of
gaze input as the inherent estimation error cannot be dealt with. The
error depends on the physical properties of the display and constantly
varies with changes in position and distance of the user to the display.
In this work we present a computational model of gaze estimation error
for head-mounted eye trackers. Our model covers the full processing
pipeline for mobile gaze estimation, namely mapping of pupil positions
to scene camera coordinates, marker-based display detection, and display
mapping. We build the model based on a series of controlled measurements
of a sample state-of-the-art monocular head-mounted eye tracker. Results
show that our model can predict gaze estimation error with a root mean
squared error of 17.99 px (1.96°).
737
Report
D2
C. Lander, S. Gehring, A. Krüger, S. Boring, and A. Bulling
“GazeProjector: Location-independent Gaze Interaction on and Across Multiple Displays,” DFKI, Saarbrücken, 15-01, 2015.
Abstract
Mobile gaze-based interaction with multiple displays may
occur from arbitrary positions and orientations. However, maintaining
high gaze estimation accuracy still represents a significant challenge.
To address this, we present GazeProjector, a system that combines
accurate point-of-gaze estimation with natural feature tracking on
displays to determine the mobile eye tracker’s position relative to a
display. The detected eye positions are transformed onto that display
allowing for gaze-based interaction. This allows for seamless gaze
estimation and interaction on (1) multiple displays of arbitrary sizes,
(2) independently of the user’s position and orientation to the display.
In a user study with 12 participants we compared GazeProjector to
existing well-established methods such as visual on-screen markers and
a state-of-the-art motion capture system. Our results show that our
approach is robust to varying head poses, orientations, and distances to
the display, while still providing high gaze estimation accuracy across
multiple displays without re-calibration. The system represents an
important step towards the vision of pervasive gaze-based interfaces.
- PuRe
- BibTeX
738
Conference paper
D2
E. Velloso, J. Alexander, A. Bulling, and H. Gellersen
“Interactions Under the Desk: A Characterisation of Foot Movements for Input in a Seated Position,” in Human-Computer Interaction – INTERACT 2015, Bamberg, Germany, 2015.
- PDF
- DOI
- PuRe
- BibTeX
739
Conference paper
D2
E. Velloso, J. Turner, J. Alexander, A. Bulling, and H. Gellersen
“An Empirical Investigation of Gaze Selection in Mid-Air Gestural 3D Manipulation,” in Human-Computer Interaction – INTERACT 2015, Bamberg, Germany, 2015.
- PDF
- DOI
- PuRe
- BibTeX
740
Conference paper
D2
W.-C. Chiu and M. Fritz
“See the Difference: Direct Pre-Image Reconstruction and Pose Estimation by Differentiating HOG,” in ICCV 2015, IEEE International Conference on Computer Vision, Santiago, Chile, 2015.
- PDF
- DOI
- PuRe
- BibTeX
741
Conference paper
D2
M. Keuper, E. Levinkov, N. Bonneel, G. Lavoué, T. Brox, and B. Andres
“Efficient Decomposition of Image and Mesh Graphs by Lifted Multicuts,” in ICCV 2015, IEEE International Conference on Computer Vision, Santiago, Chile, 2015.
742
Conference paper
D2
M. Keuper, B. Andres, and T. Brox
“Motion Trajectory Segmentation via Minimum Cost Multicuts,” in ICCV 2015, IEEE International Conference on Computer Vision, Santiago, Chile, 2015.
743
Conference paper
D2
M. Malinowski, M. Rohrbach, and M. Fritz
“Ask Your Neurons: A Neural-based Approach to Answering Questions About Images,” in ICCV 2015, IEEE International Conference on Computer Vision, Santiago, Chile, 2015.
- PDF
- DOI
- PuRe
- BibTeX
744
Conference paper
D2
S. J. Oh, R. Benenson, M. Fritz, and B. Schiele
“Person Recognition in Personal Photo Collections,” in ICCV 2015, IEEE International Conference on Computer Vision, Santiago, Chile, 2015.
- PDF
- DOI
- PuRe
- BibTeX
745
Conference paper
D2
G. Sharma and B. Schiele
“Scalable Nonlinear Embeddings for Semantic Category-based Image Retrieval,” in ICCV 2015, IEEE International Conference on Computer Vision, Santiago, Chile, 2015.
- PDF
- DOI
- PuRe
- BibTeX
746
Conference paper
D2
E. Wood, T. Baltrusaitis, X. Zhang, Y. Sugano, P. Robinson, and A. Bulling
“Rendering of Eyes for Eye-Shape Registration and Gaze Estimation,” in ICCV 2015, IEEE International Conference on Computer Vision, Santiago, Chile, 2015.
- PDF
- DOI
- PuRe
- BibTeX
747
Conference paper
D2
Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele
“Evaluation of Output Embeddings for Fine-grained Image Classification,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 2015.
- PDF
- DOI
- PuRe
- BibTeX
748
Conference paper
D2
C. Choy, M. Stark, and S. Savarese
“Enriching Object Detection with 2D-3D Registration and Continuous Viewpoint Estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 2015.
- PDF
- DOI
- PuRe
- BibTeX
749
Conference paper
D4 D2
A. Elhayek, E. de Aguiar, J. Tompson, A. Jain, L. Pishchulin, M. Andriluka, C. Bregler, B. Schiele, and C. Theobalt
“Efficient ConvNet-based Marker-less Motion Capture in General Scenes with a Low Number of Cameras,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 2015.
- PDF
- DOI
- PuRe
- BibTeX
750
Conference paper
D2
J. Hosang, M. Omran, R. Benenson, and B. Schiele
“Taking a Deeper Look at Pedestrians,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 2015.
- PDF
- DOI
- PuRe
- BibTeX
751
Conference paper
D2
J. Johnson, R. Krishna, M. Stark, J. Li, M. Bernstein, and L. Fei-Fei
“Image Retrieval using Scene Graphs,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 2015.
- PDF
- DOI
- PuRe
- BibTeX
752
Conference paper
D2
A. Khoreva, F. Galasso, M. Hein, and B. Schiele
“Classifier Based Graph Construction for Video Segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 2015.
- PDF
- DOI
- PuRe
- BibTeX
753
Conference paper
D2
Q. N. Nguyen, A. Gautier, and M. Hein
“A Flexible Tensor Block Coordinate Ascent Scheme for Hypergraph Matching,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 2015.
754
Conference paper
D2 D5
A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele
“A Dataset for Movie Description,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 2015.
- PDF
- DOI
- PuRe
- BibTeX
755
Conference paper
D2
H. Sattar, S. Müller, M. Fritz, and A. Bulling
“Prediction of Search Targets from Fixations in Open-world Settings,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 2015.
756
Conference paper
D2
S. Tang, B. Andres, M. Andriluka, and B. Schiele
“Subgraph Decomposition for Multi-target Tracking,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 2015.
- PDF
- DOI
- PuRe
- BibTeX
757
Conference paper
D2
S. Zhang, R. Benenson, and B. Schiele
“Filtered Channel Features for Pedestrian Detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 2015.
- PDF
- DOI
- PuRe
- BibTeX
758
Conference paper
D2
X. Zhang, Y. Sugano, M. Fritz, and A. Bulling
“Appearance-based Gaze Estimation in the Wild,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 2015.
- PDF
- DOI
- PuRe
- BibTeX
759
Conference paper
D2 D4
B. Pepik, M. Stark, P. Gehler, T. Ritschel, and B. Schiele
“3D Object Class Detection in the Wild,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (3DSI 2015), Boston, MA, USA, 2015.
- PDF
- DOI
- PuRe
- BibTeX
760
Conference paper
D2
J. Seiter, W.-C. Chiu, M. Fritz, O. Amft, and G. Tröster
“Joint Segmentation and Activity Discovery using Semantic and Temporal Priors,” in IEEE International Conference on Pervasive Computing and Communication (PERCOM 2015), St. Louis, MO, USA, 2015.
- PDF
- DOI
- PuRe
- BibTeX
761
Conference paper
D2
W. Li and M. Fritz
“Teaching Robots the Use of Human Tools from Demonstration with Non-dexterous End-effectors,” in 2015 IEEE-RAS International Conference on Humanoid Robots (HUMANOIDS 2015), Seoul, South Korea, 2015.
- PDF
- DOI
- PuRe
- BibTeX
762
Article
D2
T. Deselaers, D. Keysers, J. Hosang, and H. Rowley
“GyroPen: Gyroscopes for Pen-Input with Mobile Phones,” IEEE Transactions on Human-Machine Systems, vol. 45, no. 2, 2015.
- PDF
- DOI
- PuRe
- BibTeX
763
Article
D2
Y. Sugano, Y. Matsushita, Y. Sato, and H. Koike
“Appearance-based Gaze Estimation with Online Calibration from Mouse Operations,” IEEE Transactions on Human-Machine Systems, vol. 45, no. 6, 2015.
- PDF
- DOI
- PuRe
- BibTeX
764
Article
D2
F. Lu, Y. Sugano, T. Okabe, and Y. Sato
“Gaze Estimation From Eye Appearance: A Head Pose-free Method via Eye Image Synthesis,” IEEE Transactions on Image Processing, vol. 24, no. 11, 2015.
765
Article
D2
D. Bouget, R. Benenson, M. Omran, L. Riffaud, B. Schiele, and P. Jannin
“Detecting Surgical Tools by Modelling Local Appearance and Global Shape,” IEEE Transactions on Medical Imaging, vol. 34, no. 12, 2015.
766
Article
D2
B. Pepik, M. Stark, P. Gehler, and B. Schiele
“Multi-view and 3D Deformable Part Models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 11, 2015.
- PDF
- DOI
- PuRe
- BibTeX
767
Conference paper
D2
P. Müller, S. Amin, P. Verma, M. Andriluka, and A. Bulling
“Emotion Recognition from Embedded Bodily Expressions and Speech During Dyadic Interactions,” in International Conference on Affective Computing and Intelligent Interaction (ACII 2015), Xi’an, China, 2015.
- PDF
- DOI
- PuRe
- BibTeX
768
Article
D2
J. H. Kappes, B. Andres, F. A. Hamprecht, C. Schnörr, S. Nowozin, D. Batra, S. Kim, B. X. Kausler, T. Kröger, J. Lellmann, N. Komodakis, B. Savchynskyy, and C. Rother
“A Comparative Study of Modern Inference Techniques for Structured Discrete Energy Minimization Problems,” International Journal of Computer Vision, vol. 115, no. 2, 2015.
Abstract
Szeliski et al. published an influential study in 2006 on energy minimization
methods for Markov Random Fields (MRF). This study provided valuable insights
in choosing the best optimization technique for certain classes of problems.
While these insights remain generally useful today, the phenomenal success of
random field models means that the kinds of inference problems that have to be
solved changed significantly. Specifically, the models today often include
higher order interactions, flexible connectivity structures, large
label-spaces of different cardinalities, or learned energy tables. To reflect
these changes, we provide a modernized and enlarged study. We present an
empirical comparison of 32 state-of-the-art optimization techniques on a corpus
of 2,453 energy minimization instances from diverse applications in computer
vision. To ensure reproducibility, we evaluate all methods in the OpenGM 2
framework and report extensive results regarding runtime and solution quality.
Key insights from our study agree with the results of Szeliski et al. for the
types of models they studied. However, on new and challenging types of models
our findings disagree and suggest that polyhedral methods and integer
programming solvers are competitive in terms of runtime and solution quality
over a large range of model types.
769
Article
D2
Z. Zia, M. Stark, and K. Schindler
“Towards Scene Understanding with Detailed 3D Object Representations,” International Journal of Computer Vision, vol. 112, no. 2, 2015.
- PDF
- DOI
- PuRe
- BibTeX
770
Conference paper
D2
T. Loetscher, C. Chen, S. Hoppe, A. Bulling, S. Wignall, C. Owen, N. Thomas, and A. Lee
“Walking Reduces Spatial Neglect,” in Journal of the International Neuropsychological Society, Sydney, Australia, 2015, vol. 21, no. S2.
- PDF
- DOI
- PuRe
- BibTeX
771
Conference paper
D2
M. Fritz
“Bridging the Gap Between Synthetic and Real Data,” in Machine Learning with Interdependent and Non-identically Distributed Data, Dagstuhl, Germany, 2016, no. 4.
772
Article
D2
M. Rempfler, M. Schneider, G. D. Ielacqua, X. Xiao, S. R. Stock, J. Klohs, G. Székely, B. Andres, and B. H. Menze
“Reconstructing Cerebrovascular Networks under Local Physiological Constraints by Integer Programming,” Medical Image Analysis, vol. 25, no. 1, 2015.
773
Conference paper
D2
F. Alt, S. Schneegass, A. Shirazi, M. Hassib, and A. Bulling
“Graphical Passwords in the Wild: Understanding How Users Choose Pictures and Passwords in Image-based Authentication Schemes,” in MobileHCI’15, 17th International Conference on Human-Computer Interaction with Mobile Devices and Services, Copenhagen, Denmark, 2015.
774
Conference paper
D2 D4
B. Pepik, R. Benenson, T. Ritschel, and B. Schiele
“What is Holding Back Convnets for Detection?,” in Pattern Recognition (GCPR 2015), Aachen, Germany, 2015.
- PDF
- DOI
- PuRe
- BibTeX
775
Conference paper
D2
A. Rohrbach, M. Rohrbach, and B. Schiele
“The Long-short Story of Movie Description,” in Pattern Recognition (GCPR 2015), Aachen, Germany, 2015.
- PDF
- DOI
- PuRe
- BibTeX
776
Article
D2
Y. Zhang, M. K. Chong, A. Bulling, and H. Gellersen
“Eye Tracking for Public Displays in the Wild,” Personal and Ubiquitous Computing, vol. 19, no. 5, 2015.
- PDF
- DOI
- PuRe
- BibTeX
777
Conference paper
D2
J. Kulshrestha, M. B. Zafar, L. E. Espin Noboa, K. Gummadi, and S. Ghosh
“Characterizing Information Diets of Social Media Users,” in Proceedings of the 9th International AAAI Conference on Web and Social Media (ICWSM 2015), Oxford, UK, 2015.
- PuRe
- BibTeX
778
Conference poster
D2
M. Cordts, M. Omran, S. Ramos, T. Scharwächter, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele
“The Cityscapes Dataset,” The Future of Datasets in Vision 2015 (CVPR 2015 Workshop), 2015.
779
Conference paper
D2
G. Sharma and P. Pérez
“Latent Max-margin Metric Learning for Comparing Video Face Tubes,” in The IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2015), Boston, MA, USA, 2015.
- PDF
- DOI
- PuRe
- BibTeX
780
Conference poster
D2
M. Malinowski and M. Fritz
“Hard to Cheat: A Turing Test based on Answering Questions about Images,” Twenty-Ninth AAAI Conference on Artificial Intelligence W6, Beyond the Turing Test (AAAI 2015 W6, Beyond the Turing Test), 2015. [Online]. Available: https://arxiv.org/abs/1501.03302.
Abstract
Progress in language and image understanding by machines has sparked the
interest of the research community in more open-ended, holistic tasks, and
refueled an old AI dream of building intelligent machines. We discuss a few
prominent challenges that characterize such holistic tasks and argue for
"question answering about images" as a particularly appealing instance of such a
holistic task. In particular, we point out that it is a version of a Turing
Test that is likely to be more robust to over-interpretations and contrast it
with tasks like grounding and generation of descriptions. Finally, we discuss
tools to measure progress in this field.
781
Conference paper
D2
J. Steil and A. Bulling
“Discovery of Everyday Human Activities From Long-Term Visual Behaviour Using Topic Models,” in UbiComp 2015, ACM International Joint Conference on Pervasive and Ubiquitous Computing, Osaka, Japan, 2015.
- PDF
- DOI
- PuRe
- BibTeX
782
Conference paper
D2
R. Walter, A. Bulling, D. Lindbauer, M. Schuessler, and J. Müller
“Analyzing Visual Attention During Whole Body Interaction with Public Displays,” in UbiComp 2015, ACM International Joint Conference on Pervasive and Ubiquitous Computing, Osaka, Japan, 2015.
- PDF
- DOI
- PuRe
- BibTeX
783
Conference paper
D2
A. Bulling
“Human Visual Behaviour for Collaborative Human-Machine Interaction,” in UbiComp & ISWC’15, ACM International Joint Conference on Pervasive and Ubiquitous Computing, Osaka, Japan, 2015.
- PDF
- DOI
- PuRe
- BibTeX
784
Conference paper
D2
A. Esteves, E. Velloso, A. Bulling, and H. Gellersen
“Orbits: Enabling Gaze Interaction in Smart Watches Using Moving Targets,” in UbiComp & ISWC’15, ACM International Joint Conference on Pervasive and Ubiquitous Computing, Osaka, Japan, 2015.
- PDF
- DOI
- PuRe
- BibTeX
785
Conference paper
D2
S. Hoppe, T. Loetscher, S. Morey, and A. Bulling
“Recognition of Curiosity Using Eye Movement Analysis,” in UbiComp & ISWC’15, ACM International Joint Conference on Pervasive and Ubiquitous Computing, Osaka, Japan, 2015.
- PDF
- DOI
- PuRe
- BibTeX
786
Conference paper
D2
M. Khamis, F. Alt, and A. Bulling
“A Field Study on Spontaneous Gaze-based Interaction with a Public Display using Pursuits,” in UbiComp & ISWC’15, ACM International Joint Conference on Pervasive and Ubiquitous Computing, Osaka, Japan, 2015.
- PDF
- DOI
- PuRe
- BibTeX
787
Conference paper
D2
M. Khamis, A. Bulling, and F. Alt
“Tackling Challenges of Interactive Public Displays Using Gaze,” in UbiComp & ISWC’15, ACM International Joint Conference on Pervasive and Ubiquitous Computing, Osaka, Japan, 2015.
- PDF
- DOI
- PuRe
- BibTeX
788
Conference paper
D2
F. Alt, A. Bulling, G. Gravanis, and D. Buschek
“GravitySpot: Guiding Users in Front of Public Displays Using On-Screen Visual Cues,” in UIST’15, 28th Annual ACM Symposium on User Interface Software and Technology, Charlotte, NC, USA, 2015.
- PDF
- DOI
- PuRe
- BibTeX
789
Conference paper
D2
A. Esteves, E. Velloso, A. Bulling, and H. Gellersen
“Orbits: Gaze Interaction for Smart Watches using Smooth Pursuit Eye Movements,” in UIST’15, 28th Annual ACM Symposium on User Interface Software and Technology, Charlotte, NC, USA, 2015.
790
Conference paper
D2
C. Lander, S. Gehring, A. Krüger, S. Boring, and A. Bulling
“GazeProjector: Accurate Gaze Estimation and Seamless Gaze Interaction Across Multiple Displays,” in UIST’15, 28th Annual ACM Symposium on User Interface Software and Technology, Charlotte, NC, USA, 2015.
- PDF
- DOI
- PuRe
- BibTeX
791
Conference paper
D2
Y. Sugano and A. Bulling
“Self-calibrating Head-mounted Eye Trackers Using Egocentric Visual Saliency,” in UIST’15, 28th Annual ACM Symposium on User Interface Software and Technology, Charlotte, NC, USA, 2015.
- PDF
- DOI
- PuRe
- BibTeX
792
Paper
D2
J. Hosang, R. Benenson, P. Dollár, and B. Schiele
“What Makes for Effective Detection Proposals?,” 2015. [Online]. Available: http://arxiv.org/abs/1502.05082.
Abstract
Current top performing object detectors employ detection proposals to guide
the search for objects, thereby avoiding exhaustive sliding window search
across images. Despite the popularity and widespread use of detection
proposals, it is unclear which trade-offs are made when using them during
object detection. We provide an in-depth analysis of twelve proposal methods
along with four baselines regarding proposal repeatability, ground truth
annotation recall on PASCAL and ImageNet, and impact on DPM and R-CNN detection
performance. Our analysis shows that for object detection improving proposal
localisation accuracy is as important as improving recall. We introduce a novel
metric, the average recall (AR), which rewards both high recall and good
localisation and correlates surprisingly well with detector performance. Our
findings show common strengths and weaknesses of existing methods, and provide
insights and metrics for selecting and tuning proposal methods.
- PuRe
- BibTeX
793
Thesis
D2 IMPR-CS D4
B. Pepik
“Richer Object Representations for Object Class Detection in Challenging Real World Images,” Universität des Saarlandes, Saarbrücken, 2015.
794
Paper
D2
I. Shcherbatyi, A. Bulling, and M. Fritz
“GazeDPM: Early Integration of Gaze Information in Deformable Part Models,” 2015. [Online]. Available: http://arxiv.org/abs/1505.05753.
Abstract
An increasing number of works explore collaborative human-computer systems in
which human gaze is used to enhance computer vision systems. For object
detection these efforts were so far restricted to late integration approaches
that have inherent limitations, such as increased precision without increase in
recall. We propose an early integration approach in a deformable part model,
which constitutes a joint formulation over gaze and visual data. We show that
our GazeDPM method improves over the state-of-the-art DPM baseline by 4% and a
recent method for gaze-supported object detection by 3% on the public POET
dataset. Our approach additionally provides introspection of the learnt models,
can reveal salient image structures, and allows us to investigate the interplay
between gaze attracting and repelling areas, the importance of view-specific
models, as well as viewers' personal biases in gaze patterns. We finally study
important practical aspects of our approach, such as the impact of using
saliency maps instead of real fixations, the impact of the number of fixations,
as well as robustness to gaze estimation error.
795
Paper
D2
M. Tonsen, X. Zhang, Y. Sugano, and A. Bulling
“Labeled Pupils in the Wild: A Dataset for Studying Pupil Detection in Unconstrained Environments,” 2015. [Online]. Available: http://arxiv.org/abs/1511.05768.
Abstract
We present labelled pupils in the wild (LPW), a novel dataset of 66
high-quality, high-speed eye region videos for the development and evaluation
of pupil detection algorithms. The videos in our dataset were recorded from 22
participants in everyday locations at about 95 FPS using a state-of-the-art
dark-pupil head-mounted eye tracker. They cover people with different
ethnicities, a diverse set of everyday indoor and outdoor illumination
environments, as well as natural gaze direction distributions. The dataset also
includes participants wearing glasses, contact lenses, as well as make-up. We
benchmark five state-of-the-art pupil detection algorithms on our dataset with
respect to robustness and accuracy. We further study the influence of image
resolution, vision aids, as well as recording location (indoor, outdoor) on
pupil detection performance. Our evaluations provide valuable insights into the
general pupil detection problem and allow us to identify key challenges for
robust pupil detection on head-mounted eye trackers.
- PuRe
- BibTeX

2014

796
Article
D2
A. Bulling, U. Blanke, and B. Schiele
“A Tutorial on Human Activity Recognition Using Body-worn Inertial Sensors,” ACM Computing Surveys, vol. 46, no. 3, 2014.
- PDF
- DOI
- PuRe
- BibTeX
797
Article
D2
M. Vidal, A. Bulling, and H. Gellersen
“Pursuits: Spontaneous Eye-based Interaction for Dynamic Interfaces,” ACM SIGMOBILE Mobile Computing and Communications Review, vol. 18, no. 4, 2014.
Abstract
Although gaze is an attractive modality for pervasive
interaction, real-world implementation of eye-based interfaces poses
significant challenges. In particular, user calibration is tedious and
time-consuming. Pursuits is an innovative interaction technique that
enables truly spontaneous interaction with eye-based interfaces. A user
can simply walk up to the screen and readily interact with moving
targets. Instead of being based on gaze location, Pursuits correlates
eye pursuit movements with objects dynamically moving on the interface.
- PDF
- DOI
- PuRe
- BibTeX
798
Conference paper
D2
M. Malinowski and M. Fritz
“A Multi-world Approach to Question Answering about Real-world Scenes based on Uncertain Input,” in Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, Canada, 2014.
799
Book chapter / section
D2
P. Majaranta and A. Bulling
“Eye Tracking and Eye-based Human–computer Interaction,” in Advances in Physiological Computing, London: Springer, 2014.
- PDF
- DOI
- PuRe
- BibTeX
800
Conference paper
D2
M. Simkin, A. Bulling, M. Fritz, and D. Schröder
“Ubic: Bridging the Gap Between Digital Cryptography and the Physical World,” in Computer Security - ESORICS 2014, Wrocław, Poland, 2014.
- PDF
- DOI
- PuRe
- BibTeX
801
Article
D2
S. Wuhrer, L. Pishchulin, A. Brunton, C. Shu, and J. Lang
“Estimation of Human Body Shape and Posture under Clothing,” Computer Vision and Image Understanding, vol. 127, 2014.
- PDF
- DOI
- PuRe
- BibTeX
802
Conference paper
D2
M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool
“Face Detection Without Bells and Whistles,” in Computer Vision - ECCV 2014, Zurich, Switzerland, 2014.
- PDF
- DOI
- PuRe
- BibTeX
803
Conference paper
D2
X. Wang, B. Schiele, P. Fua, V. Belagiannis, S. Ilic, and N. Navab
“Multiple Human Pose Estimation with Temporally Consistent 3D Pictorial Structures,” in Computer Vision - ECCV 2014 Workshops, Zürich, Switzerland, 2015.
804
Conference paper
D2
T. Brox, F. Galasso, F. Li, J. M. Rehg, and B. Schiele
“First International Workshop on Video Segmentation - Panel Discussion,” in Computer Vision - ECCV 2014 Workshops, Zurich, Switzerland, 2015.
805
Conference paper
D2
R. Benenson, M. Omran, J. Hosang, and B. Schiele
“Ten Years of Pedestrian Detection, What Have We Learned?,” in Computer Vision - ECCV 2014 Workshops (ECCV 2014 Workshop CVRSUAD), Zürich, Switzerland, 2015.
- PDF
- DOI
- PuRe
- BibTeX
806
Conference paper
D2
M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele
“2D Human Pose Estimation: New Benchmark and State of the Art Analysis,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), Columbus, OH, USA, 2014.
- PDF
- DOI
- PuRe
- BibTeX
807
Conference paper
D2
V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab, and S. Ilic
“3D Pictorial Structures for Multiple Human Pose Estimation,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), Columbus, OH, USA, 2014.
- PDF
- DOI
- PuRe
- BibTeX
808
Conference paper
D2
F. Galasso, M. Keuper, T. Brox, and B. Schiele
“Spectral Graph Reduction for Efficient Image and Streaming Video Segmentation,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), Columbus, OH, USA, 2014.
- PDF
- DOI
- PuRe
- BibTeX
809
Conference paper
D2
S. Karayev, M. Fritz, and T. Darrell
“Anytime Recognition of Objects and Scenes,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), Columbus, OH, USA, 2014.
- PDF
- DOI
- PuRe
- BibTeX
810
Conference paper
D2
M. Lapin, B. Schiele, and M. Hein
“Scalable Multitask Representation Learning for Scene Classification,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), Columbus, OH, USA, 2014.
- PDF
- DOI
- PuRe
- BibTeX
811
Conference paper
D4, D2
K. Rematas, T. Ritschel, M. Fritz, and T. Tuytelaars
“Image-based Synthesis and Re-Synthesis of Viewpoints Guided by 3D Models,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), Columbus, OH, USA, 2014.
- PDF
- DOI
- PuRe
- BibTeX
812
Conference paper
D2
M. Z. Zia, M. Stark, and K. Schindler
“Are Cars Just 3D Boxes? - Jointly Estimating the 3D Shape of Multiple Objects,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), Columbus, OH, USA, 2014.
- PDF
- DOI
- PuRe
- BibTeX
813
Article
D2
A. Bulling and T. O. Zander
“Cognition-aware Computing,” IEEE Pervasive Computing, vol. 13, no. 3, 2014.
- PDF
- DOI
- PuRe
- BibTeX
814
Article
D2
A. Geiger, M. Lauer, C. Wojek, C. Stiller, and R. Urtasun
“3D Traffic Scene Understanding from Movable Platforms,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 5, 2014.
- PDF
- DOI
- PuRe
- BibTeX
815
Conference paper
D2
A. Jain, J. Tompson, M. Andriluka, G. W. Taylor, and C. Bregler
“Learning Human Pose Estimation Features with Convolutional Networks,” in International Conference on Learning Representations 2014 (ICLR 2014), Banff, Canada, 2014.
Abstract
This paper introduces a new architecture for human pose estimation using a
multi-layer convolutional network architecture and a modified learning
technique that learns low-level features and higher-level weak spatial models.
Unconstrained human pose estimation is one of the hardest problems in computer
vision, and our new architecture and learning schema show significant
improvement over the current state-of-the-art results. The main contribution of
this paper is showing, for the first time, that a specific variation of deep
learning is able to outperform all existing traditional architectures on this
task. The paper also discusses several lessons learned while researching
alternatives, most notably, that it is possible to learn strong low-level
feature detectors on features that might even just cover a few pixels in the
image. Higher-level spatial models somewhat improve the overall result, but to
a much lesser extent than expected. Many researchers previously argued that
kinematic structure and top-down information are crucial for this domain, but
with our purely bottom-up, weak spatial model, we could improve on other, more
complicated architectures that currently produce the best results. This mirrors
what many other researchers, like those in the speech recognition, object
recognition, and other domains have experienced.
816
Conference paper
D2
B. Pepik, M. Stark, P. Gehler, and B. Schiele
“Multi-view Priors for Learning Detectors from Sparse Viewpoint Data,” in International Conference on Learning Representations 2014 (ICLR 2014), Banff, Canada, 2014.
Abstract
While the majority of today's object class models provide only 2D bounding
boxes, far richer output hypotheses are desirable including viewpoint,
fine-grained category, and 3D geometry estimate. However, models trained to
provide richer output require larger amounts of training data, preferably well
covering the relevant aspects such as viewpoint and fine-grained categories. In
this paper, we address this issue from the perspective of transfer learning,
and design an object class model that explicitly leverages correlations between
visual features. Specifically, our model represents prior distributions over
permissible multi-view detectors in a parametric way -- the priors are learned
once from training data of a source object class, and can later be used to
facilitate the learning of a detector for a target class. As we show in our
experiments, this transfer is not only beneficial for detectors based on
basic-level category representations, but also enables the robust learning of
detectors that represent classes at finer levels of granularity, where training
data is typically even scarcer and more unbalanced. As a result, we report
largely improved performance in simultaneous 2D object localization and
viewpoint estimation on a recent dataset of challenging street scenes.
817
Conference paper
D2
B. Pepik, M. Stark, P. Gehler, and B. Schiele
“Multi-View Priors for Learning Detectors from Sparse Viewpoint Data,” in International Conference on Learning Representations 2014 (ICLR 2014), Banff, Canada, 2014.
Abstract
While the majority of today's object class models provide only 2D bounding boxes, far richer output hypotheses are desirable including viewpoint, fine-grained category, and 3D geometry estimate. However, models trained to provide richer output require larger amounts of training data, preferably well covering the relevant aspects such as viewpoint and fine-grained categories. In this paper, we address this issue from the perspective of transfer learning, and design an object class model that explicitly leverages correlations between visual features. Specifically, our model represents prior distributions over permissible multi-view detectors in a parametric way -- the priors are learned once from training data of a source object class, and can later be used to facilitate the learning of a detector for a target class. As we show in our experiments, this transfer is not only beneficial for detectors based on basic-level category representations, but also enables the robust learning of detectors that represent classes at finer levels of granularity, where training data is typically even scarcer and more unbalanced. As a result, we report largely improved performance in simultaneous 2D object localization and viewpoint estimation on a recent dataset of challenging street scenes.
- PuRe
- BibTeX
818
Article
D2
S. Tang, M. Andriluka, and B. Schiele
“Detection and Tracking of Occluded People,” International Journal of Computer Vision, vol. 110, no. 1, 2014.
- PDF
- DOI
- PuRe
- BibTeX
819
Article
D2
A. Bulling and R. Bednarik
“Introduction to the PETMEI Special Issue,” Journal of Eye Movement Research, vol. 7, no. 3, 2014.
- PuRe
- BibTeX
820
Proceedings
D2
D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars
Eds., Computer Vision - ECCV 2014. Springer, 2014.
- PuRe
- BibTeX
821
Conference paper
D2
J. Funke, J. N. P. Martel, S. Gerhard, B. Andres, D. C. Ciresan, A. Giusti, L. M. Gambardella, J. Schmidhuber, H. Pfister, A. Cardona, and M. Cook
“Candidate Sampling for Neuron Reconstruction from Anisotropic Electron Microscopy Volumes,” in Medical Image Computing and Computer-assisted Intervention - MICCAI 2014, Boston, MA, USA, 2014.
822
Conference paper
D2
M. Rempfler, M. Schneider, G. D. Ielacqua, X. Xiao, S. R. Stock, J. Klohs, G. Székely, B. Andres, and B. H. Menze
“Extracting Vascular Networks under Physiological Constraints via Integer Programming,” in Medical Image Computing and Computer-assisted Intervention - MICCAI 2014, Boston, MA, USA, 2014.
823
Article
D2
M. Lapin, M. Hein, and B. Schiele
“Learning Using Privileged Information: SVM+ and Weighted SVM,” Neural Networks, vol. 53, 2014.
- PDF
- DOI
- PuRe
- BibTeX
824
Conference paper
D2
M. Malinowski and M. Fritz
“Towards a Visual Turing Challenge,” in NIPS 2014 Workshop on Learning Semantics, Montréal, Canada, 2014.
Abstract
As language and visual understanding by machines progress rapidly, we are observing an increasing interest in holistic architectures that tightly interlink both modalities in a joint learning and inference process. This trend has allowed the community to progress towards more challenging and open tasks and has refueled the hope of achieving the old AI dream of building machines that could pass a Turing test in open domains. In order to steadily make progress towards this goal, we realize that quantifying performance becomes increasingly difficult. Therefore we ask how we can precisely define such challenges and how we can evaluate different algorithms on these open tasks. In this paper, we summarize and discuss such challenges and try to give answers where appropriate options are available in the literature. We exemplify some of the solutions on a recently presented dataset for a question-answering task based on real-world indoor images that establishes a visual Turing challenge. Finally, we argue that, despite the success of unique ground-truth annotation, we likely have to step away from carefully curated datasets and rather rely on 'social consensus' as the main driving force to create suitable benchmarks. Providing coverage in this inherently ambiguous output space is an emerging challenge that we face in order to make quantifiable progress in this area.
825
Conference paper
D2
L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele
“Expressive Models and Comprehensive Benchmark for 2D Human Pose Estimation,” in Parts and Attributes (ECCV 2014 Workshop PA), Zurich, Switzerland, 2014.
826
Conference paper
D2
S. Amin, P. Müller, A. Bulling, and M. Andriluka
“Test-time Adaptation for 3D Human Pose Estimation,” in Pattern Recognition (GCPR 2014), Münster, Germany, 2014.
- PDF
- DOI
- PuRe
- BibTeX
827
Conference paper
D2
A. Khoreva, F. Galasso, M. Hein, and B. Schiele
“Learning Must-Link Constraints for Video Segmentation Based on Spectral Clustering,” in Pattern Recognition (GCPR 2014), Münster, Germany, 2014.
- PDF
- DOI
- PuRe
- BibTeX
828
Conference paper
D2
W. Li
“Learning Multi-scale Representations for Material Classification,” in Pattern Recognition (GCPR 2014), Münster, Germany, 2014.
- PDF
- DOI
- PuRe
- BibTeX
829
Conference paper
D2
L. Pishchulin, M. Andriluka, and B. Schiele
“Fine-grained Activity Recognition with Holistic and Pose Based Features,” in Pattern Recognition (GCPR 2014), Münster, Germany, 2014.
- PDF
- DOI
- PuRe
- BibTeX
830
Conference paper
D2
A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele
“Coherent Multi-sentence Video Description with Variable Level of Detail,” in Pattern Recognition (GCPR 2014), Münster, Germany, 2014.
- PDF
- DOI
- PuRe
- BibTeX
831
Conference paper
D2
J. Turner, A. Bulling, J. Alexander, and H. Gellersen
“Cross-device Gaze-supported Point-to-point Content Transfer,” in Proceedings ETRA 2014, Safety Harbor, FL, USA, 2014.
- PDF
- DOI
- PuRe
- BibTeX
832
Conference paper
D2
E. Wood and A. Bulling
“EyeTab: Model-based Gaze Estimation on Unmodified Tablet Computers,” in Proceedings ETRA 2014, Safety Harbor, FL, USA, 2014.
- PDF
- DOI
- PuRe
- BibTeX
833
Conference paper
D2
S. Ishimaru, K. Kunze, K. Kise, J. Weppner, A. Dengel, P. Lukowicz, and A. Bulling
“In the Blink of an Eye - Combining Head Motion and Eye Blink Frequency for Activity Recognition with Google Glass,” in Proceedings of the 5th Augmented Human International Conference (AH 2014), Kobe, Japan, 2014.
- PDF
- DOI
- PuRe
- BibTeX
834
Conference paper
D2
W.-C. Chiu, G. Johnson, D. McCulley, O. Grau, and M. Fritz
“Object Disambiguation for Augmented Reality Applications,” in Proceedings of the British Machine Vision Conference (BMVC 2014), Nottingham, UK, 2014.
835
Conference paper
D2
J. Hosang, R. Benenson, and B. Schiele
“How Good are Detection Proposals, really?,” in Proceedings of the British Machine Vision Conference (BMVC 2014), Nottingham, UK, 2014.
Abstract
Current top-performing Pascal VOC object detectors employ detection proposals to guide the search for objects, thereby avoiding exhaustive sliding window search across images. Despite the popularity of detection proposals, it is unclear which trade-offs are made when using them during object detection. We provide an in-depth analysis of ten object proposal methods along with four baselines regarding ground truth annotation recall (on Pascal VOC 2007 and ImageNet 2013), repeatability, and impact on DPM detector performance. Our findings show common weaknesses of existing methods, and provide insights to choose the most adequate method for different settings.
836
Conference paper
D2
Y. Zhang, A. Bulling, and H. Gellersen
“Pupil-Canthi-Ratio: A Calibration-free Method for Tracking Horizontal Gaze Direction,” in Proceedings of the 2014 International Working Conference on Advanced Visual Interfaces (AVI 2014), Como, Italy, 2014.
- PDF
- DOI
- PuRe
- BibTeX
837
Conference poster
D2
M. Lapin, B. Schiele, and M. Hein
“Scalable Multitask Representation Learning for Scene Classification,” Scene Understanding Workshop (SUNw 2014). 2014.
- PuRe
- BibTeX
838
Conference poster
D2
S. Tang, M. Andriluka, A. Milan, K. Schindler, S. Roth, and B. Schiele
“Learning People Detectors for Tracking in Crowded Scenes,” Scene Understanding Workshop (SUNw 2014). 2014.
- PuRe
- BibTeX
839
Conference poster
D2
M. Z. Zia, M. Stark, and K. Schindler
“High-Resolution 3D Layout from a Single View,” Scene Understanding Workshop (SUNw 2014). 2014.
840
Conference paper
D2
S. Schneegass, F. Steimle, A. Bulling, F. Alt, and A. Schmidt
“SmudgeSafe: Geometric Image Transformations for Smudge-resistant User Authentication,” in UbiComp’14, ACM International Joint Conference on Pervasive and Ubiquitous Computing, Seattle, WA, USA, 2014.
- PDF
- DOI
- PuRe
- BibTeX
841
Conference paper
D2
Y. Zhang, J. Müller, M. K. Chong, A. Bulling, and H. Gellersen
“GazeHorizon: Enabling Passers-by to Interact with Public Displays by Gaze,” in UbiComp’14, ACM International Joint Conference on Pervasive and Ubiquitous Computing, Seattle, WA, USA, 2014.
- PDF
- DOI
- PuRe
- BibTeX
842
Conference paper
D2
M. Kassner, W. Patera, and A. Bulling
“Pupil: An Open Source Platform for Pervasive Eye Tracking and Mobile Gaze-based Interaction,” in UbiComp’14 Adjunct, Seattle, WA, USA, 2014.
- PDF
- DOI
- PuRe
- BibTeX
843
Conference paper
D2
M. Z. Zia, M. Stark, and K. Schindler
“Physically Grounded 3D Scene Interpretation with Detailed Object Models,” in Vision Meets Cognition Workshop: Functionality, Physics, Intentionality, and Causality (CVPR 2014 Workshop FPIC), Columbus, OH, USA, 2014.
844
Paper
D2
Z. Akata, H. Lee, and B. Schiele
“Zero-Shot Learning with Structured Embeddings,” 2014. [Online]. Available: http://arxiv.org/abs/1409.8403.
Abstract
Despite significant recent advances in image classification, fine-grained
classification remains a challenge. In the present paper, we address the
zero-shot and few-shot learning scenarios as obtaining labeled data is
especially difficult for fine-grained classification tasks. First, we embed
state-of-the-art image descriptors in a label embedding space using side
information such as attributes. We argue that learning a joint embedding space,
that maximizes the compatibility between the input and output embeddings, is
highly effective for zero/few-shot learning. We show empirically that such
embeddings significantly outperform the current state-of-the-art methods on
two challenging datasets (Caltech-UCSD Birds and Animals with Attributes).
Second, to reduce the amount of costly manual attribute annotations, we use
alternate output embeddings based on the word-vector representations, obtained
from large text-corpora without any supervision. We report that such
unsupervised embeddings achieve encouraging results, and lead to further
improvements when combined with the supervised ones.
- PuRe
- BibTeX
845
Paper
D2
W. Li and M. Fritz
“Learning Multi-scale Representations for Material Classification,” 2014. [Online]. Available: http://arxiv.org/abs/1408.2938.
Abstract
The recent progress in sparse coding and deep learning has made unsupervised
feature learning methods a strong competitor to hand-crafted descriptors. In
computer vision, success stories of learned features have been predominantly
reported for object recognition tasks. In this paper, we investigate if and how
feature learning can be used for material recognition. We propose two
strategies to incorporate scale information into the learning procedure
resulting in a novel multi-scale coding procedure. Our results show that our
learned features for material recognition outperform hand-crafted descriptors
on the FMD and the KTH-TIPS2 material classification benchmarks.
846
Paper
D2
M. Malinowski and M. Fritz
“A Pooling Approach to Modelling Spatial Relations for Image Retrieval and Annotation,” 2014. [Online]. Available: http://arxiv.org/abs/1411.5190.
Abstract
Over the last two decades we have witnessed strong progress on modeling
visual object classes, scenes and attributes that have significantly
contributed to automated image understanding. On the other hand, surprisingly
little progress has been made on incorporating a spatial representation and
reasoning in the inference process. In this work, we propose a pooling
interpretation of spatial relations and show how it improves image retrieval
and annotation tasks involving spatial language. Due to the complexity of the
spatial language, we argue for a learning-based approach that acquires a
representation of spatial relations by learning parameters of the pooling
operator. We show improvements on previous work on two datasets and two
different tasks as well as provide additional insights on a new dataset with an
explicit focus on spatial relations.
847
Paper
D5, D2
L. Qu and B. Andres
“Estimating Maximally Probable Constrained Relations by Mathematical Programming,” 2014. [Online]. Available: http://arxiv.org/abs/1408.0838.
Abstract
Estimating a constrained relation is a fundamental problem in machine
learning. Special cases are classification (the problem of estimating a map
from a set of to-be-classified elements to a set of labels), clustering (the
problem of estimating an equivalence relation on a set) and ranking (the
problem of estimating a linear order on a set). We contribute a family of
probability measures on the set of all relations between two finite, non-empty
sets, which offers a joint abstraction of multi-label classification,
correlation clustering and ranking by linear ordering. Estimating (learning) a
maximally probable measure, given (a training set of) related and unrelated
pairs, is a convex optimization problem. Estimating (inferring) a maximally
probable relation, given a measure, is a 01-linear program. It is solved in
linear time for maps. It is NP-hard for equivalence relations and linear
orders. Practical solutions for all three cases are shown in experiments with
real data. Finally, estimating a maximally probable measure and relation
jointly is posed as a mixed-integer nonlinear program. This formulation
suggests a mathematical programming approach to semi-supervised learning.
- PuRe
- BibTeX
848
Thesis
D2, IMPR-CS
M. Rohrbach
“Combining Visual Recognition and Computational Linguistics : Linguistic Knowledge for Visual Recognition and Natural Language Descriptions of Visual Content,” Universität des Saarlandes, Saarbrücken, 2014.
- PDF
- DOI
- PuRe
- BibTeX
849
Paper
D2
A. Senina, M. Rohrbach, W. Qiu, A. Friedrich, S. Amin, M. Andriluka, M. Pinkal, and B. Schiele
“Coherent Multi-sentence Video Description with Variable Level of Detail,” 2014. [Online]. Available: http://arxiv.org/abs/1403.6173.
Abstract
Humans can easily describe what they see in a coherent way and at varying
levels of detail. However, existing approaches for automatic video description
are mainly focused on single sentence generation and produce descriptions at a
fixed level of detail. In this paper, we address both of these limitations: for
a variable level of detail we produce coherent multi-sentence descriptions of
complex videos. We follow a two-step approach where we first learn to predict a
semantic representation (SR) from video and then generate natural language
descriptions from the SR. To produce consistent multi-sentence descriptions, we
model across-sentence consistency at the level of the SR by enforcing a
consistent topic. We also contribute both to the visual recognition of objects
proposing a hand-centric approach as well as to the robust generation of
sentences using a word lattice. Human judges rate our multi-sentence
descriptions as more readable, correct, and relevant than related work. To
understand the difference between more detailed and shorter descriptions, we
collect and analyze a video description corpus of three levels of detail.

2013

850
Book chapter / section
D2
S. Ebert and B. Schiele
“Where Next in Object Recognition and how much Supervision Do We Need?,” in Advanced Topics in Computer Vision, London: Springer, 2013.
- PDF
- DOI
- PuRe
- BibTeX
851
Conference paper
D2
M. Rohrbach, S. Ebert, and B. Schiele
“Transfer Learning in a Transductive Setting,” in Advances in Neural Information Processing Systems 26 (NIPS 2013), Lake Tahoe, NV, USA, 2013.
Abstract
Category models for objects or activities typically rely on supervised
learning requiring sufficiently large training sets. Transferring
knowledge from known categories to novel classes with no or only
a few labels however is far less researched even though it is a common
scenario. In this work, we extend transfer learning with semi-supervised
learning to exploit unlabeled instances of (novel) categories with
no or only a few labeled instances. Our proposed approach Propagated
Semantic Transfer combines three main ingredients. First, we transfer
information from known to novel categories by incorporating external
knowledge, such as linguistic or expert-specified information, e.g.,
by a mid-level layer of semantic attributes. Second, we exploit the
manifold structure of novel classes. More specifically we adapt a
graph-based learning algorithm - so far only used for semi-supervised
learning - to zero-shot and few-shot learning. Third, we improve
the local neighborhood in such graph structures by replacing the
raw feature-based representation with a mid-level object- or attribute-based
representation. We evaluate our approach on three challenging datasets
in two different applications, namely on Animals with Attributes
and ImageNet for image classification and on MPII Composites for
activity recognition. Our approach consistently outperforms state-of-the-art
transfer and semi-supervised approaches on all datasets.
852
Conference paper
D2
A. Bulling, C. Weichel, and H. Gellersen
“EyeContext: Recognition of High-level Contextual Cues from Human Visual Behaviour,” in CHI 2013, The 31st Annual CHI Conference on Human Factors in Computing Systems, Paris, France, 2013.
Abstract
Automatic annotation of life logging data is challenging. In this
work we present EyeContext, a system to infer high-level contextual
cues from human visual behaviour. We conduct a user study to record
eye movements of four participants over a full day of their daily
life, totalling 42.5 hours of eye movement data. Participants were
asked to self-annotate four non-mutually exclusive cues: social (interacting
with somebody vs. no interaction), cognitive (concentrated work vs.
leisure), physical (physically active vs. not active), and spatial
(inside vs. outside a building). We evaluate a proof-of-concept EyeContext
system that combines encoding of eye movements into strings and a
spectrum string kernel support vector machine (SVM) classifier. Using
person-dependent training, we obtain a top performance of 85.3%
precision (98.0% recall) for recognising social interactions. Our
results demonstrate the large information content available in long-term
human visual behaviour and open up new avenues for research on eye-based
behavioural monitoring and life logging.
- PDF
- DOI
- PuRe
- BibTeX
853
Conference paper
D2
E. Velloso, A. Bulling, and H. Gellersen
“MotionMA: Motion Modelling and Analysis by Demonstration,” in CHI 2013, The 31st Annual CHI Conference on Human Factors in Computing Systems, Paris, France, 2013.
- PDF
- DOI
- PuRe
- BibTeX
854
Conference paper
D2
Y. Zhang, A. Bulling, and H. Gellersen
“SideWays: A Gaze Interface for Spontaneous Interaction with Situated Displays,” in CHI 2013, The 31st Annual CHI Conference on Human Factors in Computing Systems, Paris, France, 2013.
- PDF
- DOI
- PuRe
- BibTeX
855
Conference paper
D2
M. Vidal, K. Pfeuffer, A. Bulling, and H. W. Gellersen
“Pursuits: Eye-based Interaction with Moving Targets,” in CHI 2013 Extended Abstracts, Paris, France, 2013.
Abstract
Eye-based interaction has commonly been based on estimation of eye
gaze direction, to locate objects for interaction. We introduce Pursuits,
a novel and very different eye tracking method that instead is based
on following the trajectory of eye movement and comparing this with
trajectories of objects in the field of view. Because the eyes naturally
follow the trajectory of moving objects of interest, our method is
able to detect what the user is looking at, by matching eye movement
and object movement. We illustrate Pursuits with three applications
that demonstrate how the method facilitates natural interaction with
moving targets.
- PDF
- DOI
- PuRe
- BibTeX
856
Book chapter / section
D2
A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, and T. Darrell
“A Category-level 3D Object Dataset: Putting the Kinect to Work,” in Consumer Depth Cameras for Computer Vision, London: Springer, 2013.
857
Conference paper
D2
S. Amin, M. Andriluka, M. Rohrbach, and B. Schiele
“Multi-view Pictorial Structures for 3D Human Pose Estimation,” in Electronic Proceedings of the British Machine Vision Conference 2013 (BMVC 2013), Bristol, UK, 2013.
- PDF
- DOI
- PuRe
- BibTeX
858
Conference paper
D2
M. Malinowski and M. Fritz
“Learning Smooth Pooling Regions for Visual Recognition,” in Electronic Proceedings of the British Machine Vision Conference 2013 (BMVC 2013), Bristol, UK, 2013.
Abstract
From the early HMAX model to Spatial Pyramid Matching, spatial pooling
has played an important role in visual recognition pipelines. By
aggregating local statistics, it equips the recognition pipelines
with a certain degree of robustness to translation and deformation
while preserving spatial information. Despite its predominance in
current recognition systems, we have seen little progress to fully
adapt the pooling strategy to the task at hand. In this paper, we
propose a flexible parameterization of the spatial pooling step and
learn the pooling regions together with the classifier. We investigate
a smoothness regularization term that in conjunction with an efficient
learning scheme makes learning scalable. Our framework can work with
both popular pooling operators: sum-pooling and max-pooling. Finally,
we show benefits of our approach for object recognition tasks based
on visual words and higher level event recognition tasks based on
object-bank features. In both cases, we improve over the hand-crafted
spatial pooling step showing the importance of its adaptation to
the task.
- PDF
- DOI
- PuRe
- BibTeX
859
Conference paper
D2
B. Andres, J. Yarkony, B. S. Manjunath, S. Kirchhoff, E. Turetken, C. C. Fowlkes, and H. Pfister
“Segmenting Planar Superpixel Adjacency Graphs w.r.t. Non-planar Superpixel Affinity Graphs,” in Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR 2013), Lund, Sweden, 2013.
- PDF
- DOI
- PuRe
- BibTeX
860
Conference paper
D2
E. Velloso, A. Bulling, and H. Gellersen
“AutoBAP: Automatic Coding of Body Action and Posture Units from Wearable Sensors,” in 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII 2013), Geneva, Switzerland, 2013.
- PDF
- DOI
- PuRe
- BibTeX
861
Conference paper
D2
J. Turner, J. Alexander, A. Bulling, D. Schmidt, and H. Gellersen
“Eye Pull, Eye Push: Moving Objects between Large Screens and Personal Devices with Gaze & Touch,” in Human-Computer Interaction – INTERACT 2013, Cape Town, South Africa, 2013.
Abstract
Previous work has validated the eyes and mobile input as a viable
approach for pointing at and selecting out-of-reach objects. This
work presents Eye Pull, Eye Push, a novel interaction concept for
content transfer between public and personal devices using gaze and
touch. We present three techniques that enable this interaction:
Eye Cut & Paste, Eye Drag & Drop, and Eye Summon & Cast. We outline
and discuss several scenarios in which these techniques can be used.
In a user study we found that participants responded well to the
visual feedback provided by Eye Drag & Drop during object movement.
In contrast, we found that although Eye Summon & Cast significantly
improved performance, participants had difficulty coordinating their
hands and eyes during interaction.
- PDF
- DOI
- PuRe
- BibTeX
862
Conference paper
D2
F. Galasso, N. S. Nagaraja, T. Jiménez Cárdenas, T. Brox, and B. Schiele
“A Unified Video Segmentation Benchmark: Annotation, Metrics and Analysis,” in ICCV 2013, IEEE International Conference on Computer Vision, Sydney, Australia, 2013.
- PDF
- DOI
- PuRe
- BibTeX
863
Conference paper
D4, D2
E. Levinkov and M. Fritz
“Sequential Bayesian Model Update under Structured Scene Prior for Semantic Road Scenes Labeling,” in ICCV 2013, IEEE International Conference on Computer Vision, Sydney, Australia, 2013.
- PDF
- DOI
- PuRe
- BibTeX
864
Conference paper
D2
M. Mathias, R. Benenson, R. Timofte, and L. van Gool
“Handling Occlusions with Franken-classifiers,” in ICCV 2013, IEEE International Conference on Computer Vision, Sydney, Australia, 2013.
- PDF
- DOI
- PuRe
- BibTeX
865
Conference paper
D2
M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele
“Translating Video Content to Natural Language Descriptions,” in ICCV 2013, IEEE International Conference on Computer Vision, Sydney, Australia, 2013.
Abstract
Humans use rich natural language to describe and communicate visual
perceptions. In order to provide natural language descriptions for
visual content, this paper combines two important ingredients. First,
we generate a rich semantic representation of the visual content
including e.g. object and activity labels. To predict the semantic
representation we learn a CRF to model the relationships between
different components of the visual input. And second, we propose
to formulate the generation of natural language as a machine translation
problem using the semantic representation as source language and
the generated sentences as target language. For this we exploit the
power of a parallel corpus of videos and textual descriptions and
adapt statistical machine translation to translate between our two
languages. We evaluate our video descriptions on the TACoS dataset,
which contains video snippets aligned with sentence descriptions.
Using automatic evaluation and human judgments we show significant
improvements over several baseline approaches, motivated by prior
work. Our translation approach also shows improvements over related
work on an image description task.
- PDF
- DOI
- PuRe
- BibTeX
866
Conference paper
D2
S. Tang, M. Andriluka, A. Milan, K. Schindler, S. Roth, and B. Schiele
“Learning People Detectors for Tracking in Crowded Scenes,” in ICCV 2013, IEEE International Conference on Computer Vision, Sydney, Australia, 2013.
Abstract
People tracking in crowded real-world scenes is challenging due to
frequent and long-term occlusions. Recent tracking methods obtain
the image evidence from object (people) detectors, but typically
use off-the-shelf detectors and treat them as black box components.
In this paper we argue that for best performance one should explicitly
train people detectors on failure cases of the overall tracker instead.
To that end, we first propose a novel joint people detector that
combines a state-of-the-art single person detector with a detector
for pairs of people, which explicitly exploits common patterns of
person-person occlusions across multiple viewpoints that are a common
failure case for tracking in crowded scenes. To explicitly address
remaining failure cases of the tracker we explore two methods. First,
we analyze typical failure cases of trackers and train a detector
explicitly on those failure cases. And second, we train the detector
with the people tracker in the loop, focusing on the most common
tracker failures. We show that our joint multi-person detector significantly
improves both detection accuracy and tracker performance,
improving the state-of-the-art on standard benchmarks.
- PDF
- DOI
- PuRe
- BibTeX
867
Conference paper
D2
R. Benenson, M. Mathias, T. Tuytelaars, and L. van Gool
“Seeking the Strongest Rigid Detector,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013), Portland, OR, USA, 2013.
- PDF
- DOI
- PuRe
- BibTeX
868
Conference paper
D2
W.-C. Chiu and M. Fritz
“Multi-class Video Co-segmentation with a Generative Multi-video Model,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013), Portland, OR, USA, 2013.
- PDF
- DOI
- PuRe
- BibTeX
869
Conference paper
D2
J. H. Kappes, B. Andres, F. A. Hamprecht, C. Schnörr, S. Nowozin, D. Batra, S. Kim, B. X. Kausler, J. Lellmann, N. Komodakis, and C. Rother
“A Comparative Study of Modern Inference Techniques for Discrete Energy Minimization Problems,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013), Portland, OR, USA, 2013.
- PDF
- DOI
- PuRe
- BibTeX
870
Conference paper
D2
B. Pepik, M. Stark, P. Gehler, and B. Schiele
“Occlusion Patterns for Object Class Detection,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013), Portland, OR, USA, 2013.
871
Conference paper
D2
L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele
“Poselet Conditioned Pictorial Structures,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013), Portland, OR, USA, 2013.
- PDF
- DOI
- PuRe
- BibTeX
872
Conference paper
D2
E. Turetken, F. Benmansour, B. Andres, H. Pfister, and P. Fua
“Reconstructing Loopy Curvilinear Structures Using Integer Programming,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013), Portland, OR, USA, 2013.
- PDF
- DOI
- PuRe
- BibTeX
873
Conference paper
D2
Z. Zia, M. Stark, and K. Schindler
“Explicit Occlusion Modeling for 3D Object Class Representations,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013), Portland, OR, USA, 2013.
- PDF
- DOI
- PuRe
- BibTeX
874
Conference paper
D2
J. Krause, M. Stark, J. Deng, and L. Fei-Fei
“3D Object Representations for Fine-grained Categorization,” in 2013 IEEE International Conference on Computer Vision Workshops (ICCVW 2013), Sydney, Australia, 2013.
- PDF
- DOI
- PuRe
- BibTeX
875
Article
D2
C. Wojek, S. Walk, S. Roth, K. Schindler, and B. Schiele
“Monocular Visual Scene Understanding: Understanding Multi-object Traffic Scenes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 4, 2013.
- PDF
- DOI
- PuRe
- BibTeX
876
Article
D2
Z. Zia, M. Stark, B. Schiele, and K. Schindler
“Detailed 3D Representations for Object Recognition and Modeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, 2013.
- PDF
- DOI
- PuRe
- BibTeX
877
Conference paper
D2
M. Malinowski and M. Fritz
“Learnable Pooling Regions for Image Classification,” in International Conference on Learning Representations Workshop Proceedings (ICLR 2013), Scottsdale, AZ, USA, 2013.
Abstract
Biologically inspired, from the early HMAX model to Spatial Pyramid Matching,
pooling has played an important role in visual recognition pipelines. Spatial
pooling, by grouping of local codes, equips these methods with a certain degree
of robustness to translation and deformation while preserving important spatial
information. Despite the predominance of this approach in current recognition
systems, we have seen little progress to fully adapt the pooling strategy to
the task at hand. This paper proposes a model for learning a task-dependent
pooling scheme -- including previously proposed hand-crafted pooling schemes as
a particular instantiation. In our work, we investigate the role of different
regularization terms showing that the smooth regularization term is crucial to
achieve strong performance using the presented architecture. Finally, we
propose an efficient and parallel method to train the model. Our experiments
show improved performance over hand-crafted pooling schemes on the CIFAR-10 and
CIFAR-100 datasets -- in particular improving the state-of-the-art to 56.29% on
the latter.
878
Conference paper
D2
M. Mathias, R. Timofte, R. Benenson, and L. Van Gool
“Traffic Sign Recognition - How far are we from the solution?,” in 2013 International Joint Conference on Neural Networks (IJCNN 2013), Dallas, TX, USA, 2013.
- PDF
- DOI
- PuRe
- BibTeX
879
Conference paper
D2
K. Kunze, Y. Utsumi, S. Yuki, K. Kise, and A. Bulling
“I Know What You Are Reading - Recognition of Document Types Using Mobile Eye Tracking,” in ISWC’13, ACM International Symposium on Wearable Computers, Zurich, Switzerland, 2013.
- PDF
- DOI
- PuRe
- BibTeX
880
Proceedings
D2
J. Weickert, M. Hein, and B. Schiele
Eds., Pattern Recognition. Springer, 2013.
881
Book chapter / section
D2
D. Roggen, G. Tröster, and A. Bulling
“Signal Processing Technologies for Activity-aware Smart Textiles,” in Multidisciplinary Know-How for Smart-Textiles Developers, Philadelphia, PA: Woodhead Publishing, 2013.
Abstract
Garments made of smart textiles have an enormous potential for embedding sensors in close proximity to the body in an unobtrusive and comfortable manner. Combined with signal processing and pattern recognition technologies, complex high-level information about human behaviors or situations can be inferred from the sensor data. The goal of this chapter is to introduce the reader to the design of activity-aware systems that use body-worn sensors, such as those that can be made available through smart textiles. We start this chapter by emphasizing recent trends towards ‘wearable’ sensing and computing and we present several examples of activity-aware applications. Then we outline the role that smart textiles can play in activity-aware applications, but also the challenges that they pose. We conclude by discussing the design process followed to devise activity-aware systems: the choice of sensors, the available data processing methods, and the evaluation techniques. We discuss recent data processing methods that address the challenges resulting from the use of smart textiles.
- PDF
- DOI
- PuRe
- BibTeX
882
Conference paper
D2
S. Karayev, M. Fritz, and T. Darrell
“Dynamic Feature Selection for Classification on a Budget,” in Prediction with Sequential Models (ICML 2013 Workshop), Atlanta, GA, USA, 2013.
883
Conference paper
D2
J. Turner, A. Bulling, J. Alexander, and H. Gellersen
“Eye Drop: An Interaction Concept for Gaze-supported Point-to-point Content Transfer,” in Proceedings of the 12th International Conference on Mobile and Ubiquitous Multimedia (MUM 2013), Luleå, Sweden, 2013.
- PDF
- DOI
- PuRe
- BibTeX
884
Conference paper
D2
E. Velloso, A. Bulling, H. Gellersen, W. Ugulino, and H. Fuks
“Qualitative Activity Recognition of Weight Lifting Exercises,” in Proceedings of the 4th Augmented Human International Conference (AH 2013), Stuttgart, Germany, 2013.
Abstract
Research on human activity recognition has traditionally focused on
discriminating between different activities, i.e. to predict “which”
activity was performed at a specific point in time. The quality of
executing an activity, the “how (well)”,
has only received little attention so far, even though it potentially
provides useful information for a large variety of applications,
such as sports training. In this work we first define quality of
execution and investigate three aspects that pertain to qualitative
activity recognition: the problem of specifying correct execution,
the automatic and robust detection of execution mistakes, and how
to provide feedback on the quality of execution to the user. We illustrate
our approach on the example problem of qualitatively assessing and
providing feedback on weight lifting exercises. In two user studies
we try out a sensor- and a model-based approach to qualitative activity
recognition. Our results underline the potential of model-based assessment
and the positive impact of real-time user feedback on the quality
of execution.
- PDF
- DOI
- PuRe
- BibTeX
885
Conference poster
D2
Z. Zia, M. Stark, and K. Schindler
“Towards Scene Understanding with Detailed 3D Object Representations,” Scene Understanding Workshop (SUNw 2013). 2013.
886
Conference paper
D2
J. Krause, J. Deng, M. Stark, and L. Fei-Fei
“Collecting a Large-scale Dataset of Fine-grained Cars,” in Second Workshop on Fine-Grained Visual Categorization (FGVC2), 2013.
887
Conference paper
D2
A. Chou, H. Wang, M. Stark, and D. Koller
“Modeling Instance Appearance for Recognition - Can We Do Better Than EM?,” in Structured Prediction: Tractability, Learning, and Inference (CVPR 2013 Workshop SPTLI), Portland, OR, USA, 2013.
888
Article
D2
M. Regneri, M. Rohrbach, D. Wetzel, S. Thater, B. Schiele, and M. Pinkal
“Grounding Action Descriptions in Videos,” Transactions of the Association for Computational Linguistics, vol. 1, 2013.
889
Conference paper
D2
M. Vidal, A. Bulling, and H. Gellersen
“Pursuits: Spontaneous Interaction with Displays based on Smooth Pursuit Eye Movement and Moving Targets,” in UbiComp’13, ACM International Joint Conference on Pervasive and Ubiquitous Computing, Zurich, Switzerland, 2013.
- PDF
- DOI
- PuRe
- BibTeX
890
Conference paper
D2
K. Pfeuffer, M. Vidal, J. Turner, A. Bulling, and H. Gellersen
“Pursuit Calibration: Making Gaze Calibration Less Tedious and More Flexible,” in UIST’13, ACM Symposium on User Interface Software and Technology, St. Andrews, UK, 2013.
Abstract
Eye gaze is a compelling interaction modality but requires a user
calibration before interaction can commence. State-of-the-art procedures
require the user to fixate on a succession of calibration markers,
a task that is often experienced as difficult and tedious. We present
a novel approach, pursuit calibration, that instead uses moving targets
for calibration. Users naturally perform smooth pursuit eye movements
when they follow a moving target, and we use correlation of eye and
target movement to detect the user's attention and to sample data
for calibration. Because the method knows when the user is attending
to a target, the calibration can be performed implicitly, which enables
more flexible design of the calibration task. We demonstrate this
in application examples and user studies, and show that pursuit calibration
is tolerant to interruption, can blend naturally with applications,
and is able to calibrate users without their awareness.
- PDF
- DOI
- PuRe
- BibTeX
891
Proceedings
D2
A. Bulling and R. Bednarik
Eds., 3rd International Workshop on Pervasive Eye Tracking and Mobile Eye-based Interaction. petmei.org, 2013.
- PuRe
- BibTeX
892
Proceedings
D2
A. Schmidt, A. Bulling, and C. Holz
Eds., Proceedings of the 4th Augmented Human International Conference. ACM, 2013.
Abstract
We are very happy to present the proceedings of the 4th Augmented
Human International Conference (Augmented Human 2013). Augmented
Human 2013 focuses on augmenting human capabilities through technology
for increased well-being and enjoyable human experience. The conference
is in cooperation with ACM SIGCHI, with its proceedings to be archived
in ACM's Digital Library. With technological advances,
computing has progressively moved beyond the desktop into new physical
and social contexts. As physical artifacts gain new computational
behaviors, they become reprogrammable, customizable, repurposable,
and interoperable in rich ecologies and diverse contexts. They also
become more complex, and require intense design effort in order to
be functional, usable, and enjoyable. Designing such systems requires
interdisciplinary thinking. Their creation must not only encompass
software, electronics, and mechanics, but also the system's
physical form and behavior, its social and physical milieu, and beyond.

2012

893
Conference paper
D2
S. Karayev, T. Baumgartner, M. Fritz, and T. Darrell
“Timely Object Recognition,” in Advances in Neural Information Processing Systems 25 (NIPS 2012), Lake Tahoe, NV, USA, 2013.
894
Conference paper
D2
M. Andriluka and L. Sigal
“Human Context: Modeling Human-Human Interactions for Monocular 3D Pose Estimation,” in Articulated Motion and Deformable Objects (AMDO 2012), Port d’Andratx Mallorca, Spain, 2012.
- PDF
- DOI
- PuRe
- BibTeX
895
Conference paper
D2
S. Ebert, M. Fritz, and B. Schiele
“Semi-supervised Learning on a Budget: Scaling Up to Large Datasets,” in Computer Vision - ACCV 2012, Daejeon, Korea, 2012, vol. 1.
- PDF
- DOI
- PuRe
- BibTeX
896
Conference paper
D2
F. Galasso, R. Cipolla, and B. Schiele
“Video Segmentation with Superpixels,” in Computer Vision - ACCV 2012, Daejeon, Korea, 2013.
- PDF
- DOI
- PuRe
- BibTeX
897
Conference paper
D2
K. Rematas, M. Fritz, and T. Tuytelaars
“The Pooled NBNN Kernel: Beyond Image-to-Class and Image-to-Image,” in Computer Vision - ACCV 2012, Daejeon, Korea, 2013.
- PDF
- DOI
- PuRe
- BibTeX
898
Conference paper
D2
T. Gao, M. Stark, and D. Koller
“What Makes a Good Detector? - Structured Priors for Learning from Few Examples,” in Computer Vision - ECCV 2012, Florence, Italy, 2012.
- PDF
- DOI
- PuRe
- BibTeX
899
Conference paper
D2
B. X. Kausler, S. Martin, B. Andres, M. Lindner, U. Köthe, H. Leitte, H. Wittbrodt, L. Hufnagel, and F. A. Hamprecht
“A Discrete Chain Graph Model for 3d+t Cell Tracking with High Misdetection Robustness,” in Computer Vision - ECCV 2012, Florence, Italy, 2012.
- PDF
- DOI
- PuRe
- BibTeX
900
Conference paper
D2
W. Li and M. Fritz
“Recognizing Materials from Virtual Examples,” in Computer Vision - ECCV 2012, Florence, Italy, 2012.
- PDF
- DOI
- PuRe
- BibTeX
901
Conference paper
D2
B. Pepik, P. Gehler, M. Stark, and B. Schiele
“3D2PM - 3D Deformable Part Models,” in Computer Vision - ECCV 2012, Firenze, Italy, 2012.
902
Conference paper
D2
M. Rohrbach, M. Regneri, M. Andriluka, S. Amin, M. Pinkal, and B. Schiele
“Script Data for Attribute-based Recognition of Composite Activities,” in Computer Vision - ECCV 2012, Firenze, Italy, 2012.
- PDF
- DOI
- PuRe
- BibTeX
903
Conference paper
D2
H. O. Song, S. Zickler, T. Althoff, R. B. Girshick, M. Fritz, C. Geyer, P. F. Felzenszwalb, and T. Darrell
“Sparselet Models for Efficient Multiclass Object Detection,” in Computer Vision - ECCV 2012, Florence, Italy, 2012, vol. 2.
- PDF
- DOI
- PuRe
- BibTeX
904
Conference paper
D2
W. Susanto, M. Rohrbach, and B. Schiele
“3D Object Detection with Multiple Kinects,” in Computer Vision - ECCV 2012, Firenze, Italy, 2012.
- PDF
- DOI
- PuRe
- BibTeX
905
Conference paper
D2
M. Stark, J. Krause, B. Pepik, D. Meger, J. J. Little, B. Schiele, and D. Koller
“Fine-grained Categorization for 3D Scene Understanding,” in Electronic Proceedings of the British Machine Vision Conference 2012 (BMVC 2012), Surrey, UK, 2012.
- PDF
- DOI
- PuRe
- BibTeX
906
Conference paper
D2
S. Tang, M. Andriluka, and B. Schiele
“Detection and Tracking of Occluded People,” in Electronic Proceedings of the British Machine Vision Conference 2012 (BMVC 2012), Surrey, UK, 2012.
- PDF
- DOI
- PuRe
- BibTeX
907
Conference paper
D2
S. Ebert, M. Fritz, and B. Schiele
“RALF: A Reinforced Active Learning Formulation for Object Class Recognition,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012), Providence, RI, 2012.
- PDF
- DOI
- PuRe
- BibTeX
908
Conference paper
D2
B. Pepik, M. Stark, P. Gehler, and B. Schiele
“Teaching 3D Geometry to Deformable Part Models,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012), Providence, RI, 2012.
Abstract
Current object class recognition systems typically target 2D bounding box localization, encouraged by benchmark data sets, such as Pascal VOC. While this seems suitable for the detection of individual objects, higher-level applications such as 3D scene understanding or 3D object tracking would benefit from more fine-grained object hypotheses incorporating 3D geometric information, such as viewpoints or the locations of individual parts. In this paper, we help narrow the representational gap between the ideal input of a scene understanding system and object class detector output, by designing a detector particularly tailored towards 3D geometric reasoning. In particular, we extend the successful discriminatively trained deformable part models to include both estimates of viewpoint and 3D parts that are consistent across viewpoints. We experimentally verify that adding 3D geometric information comes at minimal performance loss w.r.t. 2D bounding box localization, but outperforms prior work in 3D viewpoint estimation and ultra-wide baseline matching.
- PDF
- DOI
- PuRe
- BibTeX
909
Conference paper
D2, D4
L. Pishchulin, A. Jain, M. Andriluka, T. Thormaehlen, and B. Schiele
“Articulated People Detection and Pose Estimation: Reshaping the Future,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012), Providence, RI, 2012.
Abstract
State-of-the-art methods for human detection and pose estimation require many training samples for best performance. While large, manually collected datasets exist, the captured variations w.r.t. appearance, shape and pose are often uncontrolled, thus limiting the overall performance. In order to overcome this limitation we propose a new technique to extend an existing training set that allows explicit control of pose and shape variations. For this we build on recent advances in computer graphics to generate samples with realistic appearance and background while modifying body shape and pose. We validate the effectiveness of our approach on the task of articulated human detection and articulated pose estimation. We report close to state-of-the-art results on the popular Image Parsing human pose estimation benchmark and demonstrate superior performance for articulated human detection. In addition we define a new challenge of combined articulated human detection and pose estimation in real-world scenes.
- PDF
- DOI
- PuRe
- BibTeX
910
Conference paper
D2
M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele
“A Database for Fine Grained Activity Detection of Cooking Activities,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 2012, vol. 2.
- PDF
- DOI
- PuRe
- BibTeX
911
Article
D2
P. Dollár, C. Wojek, and B. Schiele
“Pedestrian Detection: An Evaluation of the State of the Art,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 4, 2012.
- PDF
- DOI
- PuRe
- BibTeX
912
Article
D2
M. Andriluka, S. Roth, and B. Schiele
“Discriminative Appearance Models for Pictorial Structures,” International Journal of Computer Vision, vol. 99, no. 3, 2012.
Abstract
In this paper we consider people detection and articulated pose estimation, two closely related and challenging problems in computer vision. Conceptually, both of these problems can be addressed within the pictorial structures framework (Felzenszwalb and Huttenlocher in Int. J. Comput. Vis. 61(1):55–79, 2005; Fischler and Elschlager in IEEE Trans. Comput. C-22(1):67–92, 1973), even though previous approaches have not shown such generality. A principal difficulty for such a general approach is to model the appearance of body parts. The model has to be discriminative enough to enable reliable detection in cluttered scenes and general enough to capture highly variable appearance. Therefore, as the first important component of our approach, we propose a discriminative appearance model based on densely sampled local descriptors and AdaBoost classifiers. Secondly, we interpret the normalized margin of each classifier as likelihood in a generative model and compute marginal posteriors for each part using belief propagation. Thirdly, non-Gaussian relationships between parts are represented as Gaussians in the coordinate system of the joint between the parts. Additionally, in order to cope with shortcomings of tree-based pictorial structures models, we augment our model with additional repulsive factors in order to discourage overcounting of image evidence. We demonstrate that the combination of these components within the pictorial structures framework results in a generic model that yields state-of-the-art performance for several datasets on a variety of tasks: people detection, upper body pose estimation, and full body pose estimation.
- PDF
- DOI
- PuRe
- BibTeX
913
Article
D2
S. Miller, J. van den Berg, M. Fritz, T. Darrell, K. Goldberg, and P. Abbeel
“A Geometric Approach To Robotic Laundry Folding,” International Journal of Robotics Research, vol. 31, no. 2, 2012.
- PDF
- DOI
- PuRe
- BibTeX
914
Proceedings
D2
K. Rematas, M. Fritz, and T. Tuytelaars
Kernel Density Topic Models: Visual Topics Without Visual Words. NIPS, 2012.
915
Conference paper
D2
S. Ebert, M. Fritz, and B. Schiele
“Active Metric Learning for Object Recognition,” in Pattern Recognition (DAGM-OAGM 2012), Graz, Austria, 2012.
- PDF
- DOI
- PuRe
- BibTeX
916
Thesis
D2, IMPR-CS
S. Ebert
“Semi-supervised Learning for Image Classification,” Universität des Saarlandes, Saarbrücken, 2012.
Abstract
Object class recognition is an active topic in computer vision still
presenting many challenges. In most approaches, this task is addressed
by supervised learning algorithms that need a large quantity of labels
to perform well. This leads either to small datasets (< 10,000 images)
that capture only a subset of the real-world class distribution (but
with a controlled and verified labeling procedure), or to large datasets
that are more representative but also add more label noise. Therefore,
semi-supervised learning is a promising direction. It requires only
a few labels while simultaneously making use of the vast amount of
images available today. We address object class recognition with
semi-supervised learning. These algorithms depend on the underlying
structure given by the data, the image description, the similarity
measure, and the quality of the labels. This insight leads to the
main research questions of this thesis: Is the structure given by
labeled and unlabeled data more important than the algorithm itself?
Can we improve this neighborhood structure by a better similarity
metric or with more representative unlabeled data? Is there a connection
between the quality of labels and the overall performance and how
can we get more representative labels? We answer all these questions,
i.e., we provide an extensive evaluation, we propose several graph
improvements, and we introduce a novel active learning framework
to get more representative labels.
- PDF
- DOI
- PuRe
- BibTeX

2011

917
Conference paper
D2
U. Blanke, R. Rehner, and B. Schiele
“South by South-east or Sitting at the Desk: Can Orientation be a Place?,” in 15th Annual International Symposium on Wearable Computers (ISWC 2011), San Francisco, CA, 2011.
Abstract
Location is key information for context-aware systems. While coarse-grained
indoor location estimates may be obtained quite easily (e.g. based on WiFi or
GSM), finer-grained estimates typically require additional infrastructure (e.g.
ultrasound). This work explores an approach to estimate significant places,
e.g., at the fridge, with no additional setup or infrastructure. We use a
pocket-based inertial measurement sensor, which can be found in many recent
phones. We analyze how the spatial layout such as geographic orientation of
buildings, arrangement and type of furniture can serve as the basis to estimate
typical places in a daily scenario. Initial experiments reveal that our
approach can detect fine-grained locations without relying on any
infrastructure or additional devices.
- PDF
- DOI
- PuRe
- BibTeX
918
Conference paper
D2
P. Gehler, C. Rother, M. Kiefel, L. Zhang, and B. Schölkopf
“Recovering Intrinsic Images with a Global Sparsity Prior on Reflectance,” in Advances in Neural Information Processing Systems 24 (NIPS 2011), Granada, Spain, 2011.
Abstract
We address the challenging task of decoupling material properties from lighting
properties given a single image. In the last two decades virtually all works
have concentrated on exploiting edge information to address this problem. We
take a different route by introducing a new prior on reflectance, that models
reflectance values as being drawn from a sparse set of basis colors. This
results in a Random Field model with global, latent variables (basis colors)
and pixel-accurate output reflectance values. We show that without edge
information high-quality results can be achieved, that are on par with methods
exploiting this source of information. Finally, we are able to improve on
state-of-the-art results by integrating edge information into our model. We
believe that our new approach is an excellent starting point for future
developments in this field.
919
Conference paper
D2
A. Geiger, C. Wojek, and R. Urtasun
“Joint 3D Estimation of Objects and Scene Layout,” in Advances in Neural Information Processing Systems 24 (NIPS 2011), Granada, Spain, 2011.
920
Conference paper
D2
S. Walk, K. Schindler, and B. Schiele
“Disparity Statistics for Pedestrian Detection: Combining Appearance, Motion and Stereo,” in Computer Vision - ECCV 2010, Heraklion, Crete, Greece, 2010.
- PDF
- DOI
- PuRe
- BibTeX
921
Conference paper
D2
C. Wojek, S. Roth, K. Schindler, and B. Schiele
“Monocular 3D Scene Modeling and Inference: Understanding Multi-Object Traffic Scenes,” in Computer Vision - ECCV 2010, Heraklion, Crete, Greece, 2010.
- PDF
- DOI
- PuRe
- BibTeX
922
Conference paper
D2
K. Saenko, S. Karayev, Y. Jia, A. Shyr, A. Janoch, J. Long, M. Fritz, and T. Darrell
“Practical 3-D Object Detection Using Category and Instance-level Appearance Models,” in 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’11), San Francisco, USA, 2011.
- PDF
- DOI
- PuRe
- BibTeX
923
Conference paper
D2
P. C. Wang, S. Miller, M. Fritz, T. Darrell, and P. Abbeel
“Perception for the Manipulation of Socks,” in 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2011), San Francisco, USA, 2011.
- PDF
- DOI
- PuRe
- BibTeX
924
Conference paper
D2
S. Karayev, M. Fritz, S. Fidler, and T. Darrell
“A Probabilistic Model for Recursive Factorized Image Features,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), Colorado Springs, CO, 2011.
- PDF
- DOI
- PuRe
- BibTeX
925
Conference paper
D2, D4
L. Pishchulin, A. Jain, C. Wojek, M. Andriluka, T. Thormaehlen, and B. Schiele
“Learning People Detection Models from Few Training Samples,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), Colorado Springs, CO, USA, 2011.
- PDF
- DOI
- PuRe
- BibTeX
926
Conference paper
D2
M. Rohrbach, M. Stark, and B. Schiele
“Evaluating Knowledge Transfer and Zero-shot Learning in a Large-scale Setting,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), Colorado Springs, CO, USA, 2011.
- PDF
- DOI
- PuRe
- BibTeX
927
Conference paper
D2
C. Wojek, S. Walk, S. Roth, and B. Schiele
“Monocular 3D Scene Understanding with Explicit Occlusion Reasoning,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), Colorado Springs, CO, USA, 2011.
- PDF
- DOI
- PuRe
- BibTeX
928
Conference paper
D2
T. Gass, L. Pishchulin, P. Dreuw, and H. Ney
“Warp that Smile on your Face: Optimal and Smooth Deformations for Face Recognition,” in IEEE International Conference on Automatic Face & Gesture Recognition and Workshops (FG 2011), Santa Barbara, USA, 2011.
- PDF
- DOI
- PuRe
- BibTeX
929
Conference paper
D2
A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, and T. Darrell
“A Category-level 3-D Object Dataset: Putting the Kinect to Work,” in 2011 IEEE International Conference on Computer Vision (ICCV 2011), Barcelona, Spain, 2011.
- PDF
- DOI
- PuRe
- BibTeX
930
Conference paper
D2
T. Tuytelaars, M. Fritz, K. Saenko, and T. Darrell
“The NBNN Kernel,” in IEEE International Conference on Computer Vision (ICCV 2011), Barcelona, Spain, 2011.
- PDF
- DOI
- PuRe
- BibTeX
931
Conference paper
D2
M. Z. Zia, M. Stark, B. Schiele, and K. Schindler
“Revisiting 3D Geometric Models for Accurate Object Shape and Pose,” in IEEE International Conference on Computer Vision (ICCV 3dRR 2011), Barcelona, Spain, 2011.
Abstract
Geometric 3D reasoning has received renewed attention recently, in the context
of visual scene understanding. The level of geometric detail, however, is
typically limited to qualitative or coarse-grained quantitative
representations. This is linked to the fact that today's object class detectors
are tuned towards robust 2D matching rather than accurate 3D pose estimation,
encouraged by 2D bounding box-based benchmarks such as Pascal VOC. In this
paper, we therefore revisit ideas from the early days of computer vision,
namely, 3D geometric object class representations for recognition. These
representations can recover geometrically far more accurate object hypotheses
than just 2D bounding boxes, including relative 3D positions of object parts.
In combination with recent robust techniques for shape description and
inference, our approach outperforms state-of-the-art results in 3D pose
estimation, while at the same time improving 2D localization. In a series of
experiments, we analyze our approach in detail, and demonstrate novel
applications enabled by our geometric object class representation, such as
fine-grained categorization of cars according to their 3D geometry and
ultra-wide baseline matching.
- PDF
- DOI
- PuRe
- BibTeX
932
Conference paper
D2
H. O. Song, M. Fritz, C. Gu, and T. Darrell
“Visual Grasp Affordances From Appearance-based Cues,” in 2011 IEEE International Conference on Computer Vision (ICCW 2011), Barcelona, Spain, 2011.
- PDF
- DOI
- PuRe
- BibTeX
933
Conference paper
D2
W.-C. Chiu, U. Blanke, and M. Fritz
“I Spy with my Little Eye: Learning Optimal Filters for Cross-Modal Stereo under Projected Patterns,” in 2011 IEEE International Conference on Computer Vision (WS 2011), Barcelona, Spain, 2011.
- PDF
- DOI
- PuRe
- BibTeX
934
Article
D2
C. G. Keller, M. Enzweiler, M. Rohrbach, D. F. Llorca, C. Schnörr, and D. M. Gavrila
“The Benefits of Dense Stereo for Pedestrian Detection,” IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 4, 2011.
Abstract
This paper presents a novel pedestrian detection system for intelligent
vehicles. We propose the use of dense stereo for both the generation of regions
of interest and pedestrian classification. Dense stereo allows the dynamic
estimation of camera parameters and the road profile, which, in turn, provides
strong scene constraints on possible pedestrian locations. For classification,
we extract spatial features (gradient orientation histograms) directly from
dense depth and intensity images. Both modalities are represented in terms of
individual feature spaces, in which discriminative classifiers (linear support
vector machines) are learned. We refrain from the construction of a joint
feature space but instead employ a fusion of depth and intensity on the
classifier level. Our experiments involve challenging image data captured in
complex urban environments (i.e., undulating roads and speed bumps). Our
results show a performance improvement by up to a factor of 7.5 at the
classification level and up to a factor of 5 at the tracking level (reduction
in false alarms at constant detection rates) over a system with static scene
constraints and intensity-only classification.
- PDF
- DOI
- PuRe
- BibTeX
935
Article
D2
M. Stikic, D. Larlus, S. Ebert, and B. Schiele
“Weakly Supervised Recognition of Daily Life Activities with Wearable Sensors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, 2011.
- PDF
- DOI
- PuRe
- BibTeX
936
Conference paper
D2
S. Ebert, M. Fritz, and B. Schiele
“Pick your Neighborhood - Improving Labels and Neighborhood Structure for Label Propagation,” in Pattern Recognition (DAGM 2011), Frankfurt/Main, Germany, 2011.
- PDF
- DOI
- PuRe
- BibTeX
937
Conference paper
D2
L. Pishchulin, T. Gass, P. Dreuw, and H. Ney
“Image Warping for Face Recognition: From Local Optimality Towards Global Optimization,” in Pattern Recognition (Proc. IbPRIA 2011), Las Palmas de Gran Canaria, Spain, 2012, vol. 45, no. 9.
- PDF
- DOI
- PuRe
- BibTeX
938
Conference paper
D2
L. Pishchulin, T. Gass, P. Dreuw, and H. Ney
“The Fast and the Flexible: Extended Pseudo Two-dimensional Warping for Face Recognition,” in Pattern Recognition and Image Analysis (IbPRIA 2011), Las Palmas, Gran Canaria, Spain, 2011.
- PDF
- DOI
- PuRe
- BibTeX
939
Conference paper
D2
B. Tessendorf, A. Bulling, D. Roggen, T. Stiefmeier, M. Feilner, P. Derleth, and G. Tröster
“Recognition of Hearing Needs From Body and Eye Movements to Improve Hearing Instruments,” in Pervasive Computing, San Francisco, CA, 2011.
Abstract
Hearing instruments (HIs) have emerged as true pervasive computers
as they continuously adapt the hearing program to the user's
context. However, current HIs are not able to distinguish different
hearing needs in the same acoustic environment. In this work, we
explore how information derived from body and eye movements can be
used to improve the recognition of such hearing needs. We conduct
an experiment to provoke an acoustic environment in which different
hearing needs arise: active conversation and working while colleagues
are having a conversation in a noisy office environment. We record
body movements on nine body locations, eye movements using electrooculography
(EOG), and sound using commercial HIs for eleven participants. Using
a support vector machine (SVM) classifier and person-independent
training we improve the accuracy of 77% based on sound to an accuracy
of 92% using body movements. With a view to a future implementation
into a HI we then perform a detailed analysis of the sensors attached
to the head. We achieve the best accuracy of 86% using eye movements
compared to 84% for head movements. Our work demonstrates the potential
of additional sensor modalities for future HIs and motivates to investigate
the wider applicability of this approach on further hearing situations
and needs.
- PDF
- DOI
- PuRe
- BibTeX
940
Conference paper
D2
F. Dinuzzo, C. S. Ong, P. Gehler, and G. Pillonetto
“Learning Output Kernels with Block Coordinate Descent,” in Proceedings of the 28th International Conference on Machine Learning (ICML 2011), Bellevue, Wash., 2011.
Abstract
We propose a method to learn simultaneously a vector-valued function and a
kernel between its components. The obtained kernel can be used both to improve
learning performance and to reveal structures in the output space which may be
important in their own right. Our method is based on the solution of a suitable
regularization problem over a reproducing kernel Hilbert space of vector-valued
functions. Although the regularized risk functional is non-convex, we show that
it is invex, implying that all local minimizers are global minimizers. We
derive a block-wise coordinate descent method that efficiently exploits the
structure of the objective functional. Then, we empirically demonstrate that
the proposed method can improve classification accuracy. Finally, we provide a
visual interpretation of the learned kernel matrix for some well known
datasets.
941
Conference paper
D2
W.-C. Chiu, U. Blanke, and M. Fritz
“Improving the Kinect by Cross-modal Stereo,” in Proceedings of the British Machine Vision Conference 2011 (BMVC 2011), Dundee, UK, 2011.
- PDF
- DOI
- PuRe
- BibTeX
942
Conference paper
D2
A. Lehmann, P. Gehler, and L. Van Gool
“Branch&Rank: Non-linear Object Detection,” in Proceedings of the British Machine Vision Conference 2011 (BMVC 2011), Dundee, Scotland, 2011.
Abstract
Branch&rank is an object detection scheme that overcomes the inherent
limitation of branch&bound: this method works with arbitrary (classifier)
functions whereas tight bounds exist only for simple functions. Objects are
usually detected with fewer than 100 classifier evaluations, which paves the way
for using strong (and thus costly) classifiers: We utilize non-linear SVMs with
RBF-χ² kernels without a cascade-like approximation. Our approach features
three key components: a ranking function that operates on sets of hypotheses
and a grouping of these into different tasks. Detection efficiency results from
adaptively sub-dividing the object search space into decreasingly smaller sets.
This is inherited from branch&bound, while the ranking function supersedes a
tight bound which is often unavailable (except for too simple function
classes). The grouping makes the system effective: it separates image
classification from object recognition, yet combines them in a single,
structured SVM formulation. A novel aspect of branch&rank is that a better
ranking function is expected to decrease the number of classifier calls during
detection. We demonstrate the algorithmic properties using the VOC'07 dataset.
- PDF
- DOI
- PuRe
- BibTeX
943
Conference paper
D2
D. Meger, C. Wojek, B. Schiele, and J. J. Little
“Explicit Occlusion Reasoning for 3D Object Detection,” in Proceedings of the British Machine Vision Conference 2011 (BMVC 2011), Dundee, Scotland, 2011.
- PDF
- DOI
- PuRe
- BibTeX
944
Conference paper
D2
L. Pishchulin, A. Jain, C. Wojek, T. Thormaehlen, and B. Schiele
“In Good Shape: Robust People Detection Based on Appearance and Shape,” in Proceedings of the British Machine Vision Conference 2011 (BMVC 2011), Dundee, Scotland, 2011.
945
Book chapter / section
D2
M. Andriluka, L. Sigal, and M. Black
“Benchmark Datasets for Pose Estimation and Tracking,” in Visual Analysis of Humans: Looking at People, New York, NY: Springer, 2011.
- PDF
- DOI
- PuRe
- BibTeX

2010

946
Conference paper
D2
M. Stark, M. Goesele, and B. Schiele
“Back to the Future: Learning Shape Models from 3D CAD Data,” in 21st British Machine Vision Conference (BMVC 2010), Aberystwyth, UK, 2010.
Abstract
Recognizing 3D objects from arbitrary view points is one of the
most fundamental problems in computer vision. A major challenge lies
in the transition between the 3D geometry of objects and 2D
representations that can be robustly matched to natural images. Most
approaches thus rely on 2D natural images either as the sole source of
training data for building an implicit 3D representation, or by
enriching 3D models with natural image features.
In this paper, we go back to the ideas from the early days of computer
vision, by using 3D object models as the only source of information for
building a multi-view object class detector. In particular, we use
these models for learning 2D shape that can be robustly matched to 2D
natural images. Our experiments confirm the validity of our approach,
which outperforms current state-of-the-art techniques on a multi-view
detection data set.
- PDF
- DOI
- PuRe
- BibTeX
947
Conference paper
D2
U. Blanke, M. Kreil, B. Schiele, P. Lukowicz, B. Sick, and T. Gruber
“All for one or one for all? – Combining Heterogeneous Features for Activity Spotting,” in 2010 8th IEEE International Conference on Pervasive Computing and Communications workshops : PerCom workshops 2010 : 7th IEEE International Workshop on Context Modeling and Reasoning (CoMoRea 2010), Mannheim, Germany, 2010.
- PDF
- DOI
- PuRe
- BibTeX
948
Conference paper
D2
M. Fritz, K. Saenko, and T. Darrell
“Size Matters: Metric Visual Search Constraints from Monocular Metadata,” in Advances in Neural Information Processing Systems 23 (NIPS 2010), Vancouver, Canada, 2010.
949
Book chapter / section
D2
D. Skocaj, K. Matej, A. Vrecko, A. Leonardis, M. Fritz, M. Stark, B. Schiele, S. Hongeng, and J. L. Wyatt
“Multi-Modal Learning,” in Cognitive Systems, Berlin: Springer, 2010.
950
Article
D2
M. Fritz, G.-J. M. Kruijff, and B. Schiele
“Tutor-based Learning of Visual Categories Using Different Levels of Supervision,” Computer Vision and Image Understanding, vol. 114, no. 5, 2010.
- PDF
- DOI
- PuRe
- BibTeX
951
Conference paper
D2
S. Ebert, D. Larlus, and B. Schiele
“Extracting Structures in Image Collections for Object Recognition,” in Computer Vision - ECCV 2010, Crete, Greece, 2010.
Abstract
Many computer vision methods rely on annotated image databases without taking
advantage of the increasing number of unlabeled images available. This paper
explores an alternative approach involving unsupervised structure discovery and
semi-supervised learning (SSL) in image collections. Focusing on object
classes, the first part of the paper contributes with an extensive evaluation of
state-of-the-art image representations underlining the decisive influence of the
local neighborhood structure, its direct consequences on SSL results, and the
importance of developing powerful object representations. In a second part, we
propose and explore promising directions to improve results by looking at the
local topology between images and feature combination strategies.
- PDF
- DOI
- PuRe
- BibTeX
952
Conference paper
D2
M. Rohrbach, M. Stark, G. Szarvas, and B. Schiele
“Combining Language Sources and Robust Semantic Relatedness for Attribute-Based Knowledge Transfer,” in First International Workshop on Parts and Attributes in conjunction with ECCV 2010, Crete, Greece, 2010.
Abstract
Knowledge transfer between object classes has been identified as an important
tool for scalable recognition. However, determining which knowledge to transfer
where remains a key challenge. While most approaches employ varying levels of
human supervision, we follow the idea of mining linguistic knowledge bases to
automatically infer transferable knowledge. In contrast to previous work, we
explicitly aim to design robust semantic relatedness measures and to combine
different language sources for attribute-based knowledge transfer. On the
challenging Animals with Attributes (AwA) data set, we report largely improved
attribute-based zero-shot object class recognition performance that matches the
performance of human supervision.
953
Conference paper
D2
M. Andriluka, P. Schnitzspan, J. Meyer, S. Kohlbrecher, K. Petersen, O. von Stryk, S. Roth, and B. Schiele
“Vision Based Victim Detection from Unmanned Aerial Vehicles,” in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Taipei, Taiwan, 2010.
Abstract
Finding injured humans is one of the primary
goals of any search and rescue operation. The aim of this paper
is to address the task of automatically finding people lying on
the ground in images taken from the on-board camera of an
unmanned aerial vehicle (UAV).
In this paper we evaluate various state-of-the-art visual
people detection methods in the context of vision based victim
detection from a UAV. The top-performing approaches in
this comparison are those that rely on flexible part-based
representations and discriminatively trained part detectors. We
discuss their strengths and weaknesses and demonstrate that by
combining multiple models we can increase the reliability of the
system. We also demonstrate that the detection performance
can be substantially improved by integrating the height and
pitch information provided by on-board sensors. Jointly these
improvements allow us to significantly boost the detection
performance over the current de-facto standard, which provides
a substantial step towards making autonomous victim detection
for UAVs practical.
- PDF
- DOI
- PuRe
- BibTeX
954
Conference paper
D2
M. Andriluka, S. Roth, and B. Schiele
“Monocular 3D Pose Estimation and Tracking by Detection,” in 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010), San Francisco, USA, 2010.
Abstract
Automatic recovery of 3D human pose from monocular image sequences is a
challenging and important research topic with numerous applications. Although
current methods are able to recover 3D pose for a single person in controlled
environments, they are severely challenged by real-world scenarios, such as
crowded street scenes. To address this problem, we propose a three-stage
process building on a number of recent advances. The first stage obtains an
initial estimate of the 2D articulation and viewpoint of the person from single
frames. The second stage allows early data association across frames based on
tracking-by-detection. These two stages successfully accumulate the available
2D image evidence into robust estimates of 2D limb positions over short image
sequences (= tracklets). The third and final stage uses those tracklet-based
estimates as robust image observations to reliably recover 3D pose. We
demonstrate state-of-the-art performance on the HumanEva II benchmark, and also
show the applicability of our approach to articulated 3D tracking in realistic
street conditions.
- PDF
- DOI
- PuRe
- BibTeX
955
Conference paper
D2
M. Enzweiler, A. Eigenstetter, B. Schiele, and D. M. Gavrila
“Multi-cue Pedestrian Classification with Partial Occlusion Handling,” in 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010), San Francisco, CA, 2010.
- PDF
- DOI
- PuRe
- BibTeX
956
Conference paper
D2
M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, and B. Schiele
“What helps Where - and Why? Semantic Relatedness for Knowledge Transfer,” in 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010), San Francisco, USA, 2010.
Abstract
Remarkable performance has been reported to recognize single object classes.
Scalability to large numbers of classes however remains an important challenge
for today's recognition methods. Several authors have promoted knowledge
transfer between classes as a key ingredient to address this challenge.
However, in previous work the decision which knowledge to transfer has required
either manual supervision or at least a few training examples limiting the
scalability of these approaches. In this work we explicitly address the
question of how to automatically decide which information to transfer between
classes without the need of any human intervention. For this we tap into
linguistic knowledge bases to provide the semantic link between sources (what)
and targets (where) of knowledge transfer. We provide a rigorous experimental
evaluation of different knowledge bases and state-of-the-art techniques from
Natural Language Processing which goes far beyond the limited use of language
in related work. We also give insights into the applicability (why) of
different knowledge sources and similarity measures for knowledge transfer.
- PDF
- DOI
- PuRe
- BibTeX
957
Conference paper
D2
P. Schnitzspan, S. Roth, and B. Schiele
“Automatic Discovery of Meaningful Object Parts with Latent CRFs,” in 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010), San Francisco, USA, 2010.
Abstract
Object recognition is challenging due to high intra-class variability caused,
e.g., by articulation, viewpoint changes, and partial occlusion. Successful
methods need to strike a balance between being flexible enough to model such
variation and discriminative enough to detect objects in cluttered, real world
scenes. Motivated by these challenges we propose a latent conditional random
field (CRF) based on a flexible assembly of parts. By modeling part labels as
hidden nodes and developing an EM algorithm for learning from class labels
alone, this new approach enables the automatic discovery of semantically
meaningful object part representations. To increase the flexibility and
expressiveness of the model, we learn the pairwise structure of the underlying
graphical model at the level of object part interactions. Efficient
gradient-based techniques are used to estimate the structure of the domain of
interest and carried forward to the multi-label or object part case. Our
experiments illustrate the meaningfulness of the discovered parts and
demonstrate state-of-the-art performance of the approach.
- PDF
- DOI
- PuRe
- BibTeX
958
Conference paper
D2
S. Walk, N. Majer, K. Schindler, and B. Schiele
“New Features and Insights for Pedestrian Detection,” in 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010), San Francisco, USA, 2010.
- PDF
- DOI
- PuRe
- BibTeX
959
Conference paper
D2
U. Steinhoff and B. Schiele
“Dead Reckoning from the Pocket - An Experimental Study,” in IEEE 2010 International Conference on Pervasive Computing and Communications (PerCom 2010), Mannheim, Germany, 2010.
- PDF
- DOI
- PuRe
- BibTeX
960
Conference paper
D2
U. Blanke and B. Schiele
“Towards Human Motion Capturing using Gyroscopeless Orientation Estimation,” in International Symposium on Wearable Computers 2010 (ISWC 2010), Seoul, Korea, 2010.
- PDF
- DOI
- PuRe
- BibTeX
961
Conference paper
D2
U. Blanke and B. Schiele
“Remember and Transfer what you have Learned - Recognizing Composite Activities based on Activity Spotting,” in International Symposium on Wearable Computers 2010 (ISWC 2010), Seoul, Korea, 2010.
- PDF
- DOI
- PuRe
- BibTeX
962
Conference paper
D2
J. Meyer, P. Schnitzspan, S. Kohlbrecher, K. Petersen, O. Schwahn, M. Andriluka, U. Klingauf, S. Roth, B. Schiele, and O. von Stryk
“A Semantic World Model for Urban Search and Rescue Based on Heterogeneous Sensors,” in RoboCup 2010, 14th International RoboCup Symposium, Singapore, 2010.
- PDF
- DOI
- PuRe
- BibTeX
963
Conference paper
D2
M. Rohrbach, M. Stark, G. Szarvas, and B. Schiele
“Combining Language Sources and Robust Semantic Relatedness for Attribute-based Knowledge Transfer,” in Trends and Topics in Computer Vision (ECCV 2010 Workshops), Heraklion, Crete, Greece, 2012, vol. 1.
- PDF
- DOI
- PuRe
- BibTeX
964
Conference paper
D2
C. Jung, R. Tausch, and C. Wojek
“Real-time Full-body Visual Traits Recognition from Image Sequences,” in VMV 2010, Siegen, Germany, 2010.
- PDF
- DOI
- PuRe
- BibTeX

2004

965
Conference paper
D2
N. Kern, S. Antifakos, B. Schiele, and A. Schwaninger
“A Model for Human Interruptability: Experimental Evaluation and Automatic Estimation from Wearable Sensors,” in Eighth International Symposium on Wearable Computers (ISWC 2004), Arlington, VA, USA, 2004.
- PDF
- DOI
- PuRe
- BibTeX
966
Conference paper
D2
F. Michahelles, R. Wicki, and B. Schiele
“Less Contact: Heart-rate Detection Without Even Touching the User,” in Eighth International Symposium on Wearable Computers (ISWC 2004), Arlington, VA, USA, 2004.
- PDF
- DOI
- PuRe
- BibTeX
