MSc Steffen Jung
- Address
- Max-Planck-Institut für Informatik
Saarland Informatics Campus
Campus E1 4
66123 Saarbrücken - Location
- E1 4 - 624
- Phone
- +49 681 9325 2131
- Fax
- +49 681 9325 2099
- Get email via email
Vision language models (VLMs) have drastically changed the computer vision
model landscape in only a few years, opening an exciting array of new
applications from zero-shot image classification, over to image captioning, and
visual question answering. Unlike pure vision models, they offer an intuitive
way to access visual content through language prompting. The wide applicability
of such models encourages us to ask whether they also align with human vision -
specifically, how far they adopt human-induced visual biases through multimodal
fusion, or whether they simply inherit biases from pure vision models. One
important visual bias is the texture vs. shape bias, or the dominance of local
over global information. In this paper, we study this bias in a wide range of
popular VLMs. Interestingly, we find that VLMs are often more shape-biased than
their vision encoders, indicating that visual biases are modulated to some
extent through text in multimodal models. If text does indeed influence visual
biases, this suggests that we may be able to steer visual biases not just
through visual input but also through language: a hypothesis that we confirm
through extensive experiments. For instance, we are able to steer shape bias
from as low as 49% to as high as 72% through prompting alone. For now, the
strong human bias towards shape (96%) remains out of reach for all tested VLMs.
Deep neural networks are susceptible to adversarial attacks and common
corruptions, which undermine their robustness. In order to enhance model
resilience against such challenges, Adversarial Training (AT) has emerged as a
prominent solution. Nevertheless, adversarial robustness is often attained at
the expense of model fairness during AT, i.e., disparity in class-wise
robustness of the model. While distinctive classes become more robust towards
such adversaries, hard to detect classes suffer. Recently, research has focused
on improving model fairness specifically for perturbed images, overlooking the
accuracy of the most likely non-perturbed data. Additionally, despite their
robustness against the adversaries encountered during model training,
state-of-the-art adversarial trained models have difficulty maintaining
robustness and fairness when confronted with diverse adversarial threats or
common corruptions. In this work, we address the above concerns by introducing
a novel approach called Fair Targeted Adversarial Training (FAIR-TAT). We show
that using targeted adversarial attacks for adversarial training (instead of
untargeted attacks) can allow for more favorable trade-offs with respect to
adversarial fairness. Empirical results validate the efficacy of our approach.
Deep learning models have proven to be successful in a wide
range of machine learning tasks. Yet, they are often highly sensitive to
perturbations on the input data which can lead to incorrect decisions
with high confidence, hampering their deployment for practical
use-cases. Thus, finding architectures that are (more) robust against
perturbations has received much attention in recent years. Just like the
search for well-performing architectures in terms of clean accuracy,
this usually involves a tedious trial-and-error process with one
additional challenge: the evaluation of a network's robustness is
significantly more expensive than its evaluation for clean accuracy.
Thus, the aim of this paper is to facilitate better streamlined research
on architectural design choices with respect to their impact on
robustness as well as, for example, the evaluation of surrogate measures
for robustness. We therefore borrow one of the most commonly considered
search spaces for neural architecture search for image classification,
NAS-Bench-201, which contains a manageable size of 6466 non-isomorphic
network designs. We evaluate all these networks on a range of common
adversarial attacks and corruption types and introduce a database on
neural architecture design and robustness evaluations. We further
present three exemplary use cases of this dataset, in which we (i)
benchmark robustness measurements based on Jacobian and Hessian matrices
for their robustness predictability, (ii) perform neural architecture
search on robust accuracies, and (iii) provide an initial analysis of
how architectural design choices affect robustness. We find that
carefully crafting the topology of a network can have substantial impact
on its robustness, where networks with the same parameter count range in
mean adversarial robust accuracy from 20%-41%.
The minimum cost multicut problem is the NP-hard/APX-hard combinatorial
optimization problem of partitioning a real-valued edge-weighted graph such as
to minimize the total cost of the partition. While graph convolutional neural
networks (GNN) have proven to be promising in the context of combinatorial
optimization, most of them are only tailored to or tested on positive-valued
edge weights, i.e. they do not comply to the nature of the multicut problem. We
therefore adapt various GNN architectures including Graph Convolutional
Networks, Signed Graph Convolutional Networks and Graph Isomorphic Networks to
facilitate the efficient encoding of real-valued edge costs. Moreover, we
employ a reformulation of the multicut ILP constraints to a polynomial program
as loss function that allows to learn feasible multicut solutions in a scalable
way. Thus, we provide the first approach towards end-to-end trainable
multicuts. Our findings support that GNN approaches can produce good solutions
in practice while providing lower computation times and largely improved
scalability compared to LP solvers and optimized heuristics, especially when
considering large instances.
The Minimum Cost Multicut Problem (MP) is a popular way for obtaining a graph
decomposition by optimizing binary edge labels over edge costs. While the
formulation of a MP from independently estimated costs per edge is highly
flexible and intuitive, solving the MP is NP-hard and time-expensive. As a
remedy, recent work proposed to predict edge probabilities with awareness to
potential conflicts by incorporating cycle constraints in the prediction
process. We argue that such formulation, while providing a first step towards
end-to-end learnable edge weights, is suboptimal, since it is built upon a
loose relaxation of the MP. We therefore propose an adaptive CRF that allows to
progressively consider more violated constraints and, in consequence, to issue
solutions with higher validity. Experiments on the BSDS500 benchmark for
natural image segmentation as well as on electron microscopic recordings show
that our approach yields more precise edge detection and image segmentation.