Abstract
Humans are at the centre of a significant amount of research in computer vision. Endowing machines with the ability to perceive people from visual data is an immense scientific challenge with a high degree of direct practical relevance. Success in automatic perception can be measured at different levels of abstraction, depending on which intelligent behaviour we are trying to replicate: localising persons in an image or in the environment, understanding how they move at the skeleton and at the surface level, interpreting their interactions with the environment, including with other people, and perhaps even anticipating future actions. In this thesis we tackle different sub-problems of the broad research area referred to as "looking at people", aiming to perceive humans in images at different levels of granularity.
We start with bounding box-level pedestrian detection: We present a retrospective analysis of methods published in the decade preceding our work, identifying various strands of research that have advanced the state of the art. With quantitative experiments, we demonstrate the critical role of developing better feature representations and having the right training distribution. We then contribute two methods based on the insights derived from our analysis: one that combines the strongest aspects of past detectors and another that focuses purely on learning representations. The latter method outperforms more complicated approaches, especially those based on hand-crafted features. We conclude our work on pedestrian detection with a forward-looking analysis that maps out potential avenues for future research.
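For context, detectors in this literature are conventionally compared by their log-average miss rate (LAMR) over false positives per image (FPPI), as in the Caltech benchmark protocol. The following is a minimal Python sketch of that metric with assumed inputs (a miss-rate/FPPI curve); it is an illustration of the evaluation convention, not code from the thesis.

import numpy as np

def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Caltech-style LAMR: geometric mean of the miss rate sampled at
    log-spaced FPPI reference points in [1e-2, 1e0]."""
    refs = np.logspace(-2.0, 0.0, num_points)
    sampled = []
    for ref in refs:
        # Best miss rate achieved without exceeding the FPPI budget;
        # if the curve never gets that low, fall back to its first point.
        idx = np.where(fppi <= ref)[0]
        sampled.append(miss_rate[idx].min() if idx.size else miss_rate[0])
    return float(np.exp(np.log(np.maximum(sampled, 1e-10)).mean()))

# Toy curve: miss rate drops as more false positives per image are tolerated.
fppi = np.array([0.01, 0.05, 0.1, 0.5, 1.0])
mr = np.array([0.60, 0.45, 0.35, 0.20, 0.15])
print(log_average_miss_rate(fppi, mr))  # ~0.37 (lower is better)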
We then turn to pixel-level methods: Perceiving humans requires us to both separate them precisely from the background and identify their surroundings. To this end, we introduce Cityscapes, a large-scale dataset for street scene understanding, which has since established itself as a standard benchmark for segmentation and detection. We additionally develop methods that relax the requirement for expensive pixel-level annotations, focusing on the task of boundary detection, i.e. identifying the outlines of relevant objects and surfaces. Next, we make the jump from pixels to 3D surfaces, from localising and labelling to fine-grained spatial understanding. We contribute a method for recovering 3D human shape and pose, which marries the advantages of learning-based and model-based approaches.
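To make the hybrid idea concrete: the general recipe is to let a convolutional network regress the parameters of an explicit statistical body model, so that learning contributes robustness to image evidence while the model guarantees a plausible, interpretable 3D surface. Below is a toy PyTorch sketch of this pattern; the tiny linear "body model" and all names are placeholders standing in for a real model such as SMPL, not the thesis implementation.

import torch
import torch.nn as nn

class ToyBodyModel(nn.Module):
    """Placeholder for a statistical body model: maps low-dimensional
    shape/pose parameters to a fixed set of 3D surface vertices."""
    def __init__(self, num_vertices=6890, param_dim=82):
        super().__init__()
        self.register_buffer("template", torch.zeros(num_vertices, 3))
        # Frozen random basis standing in for learned blend shapes.
        self.register_buffer("basis", 1e-3 * torch.randn(param_dim, num_vertices * 3))

    def forward(self, params):                 # params: (B, param_dim)
        offsets = params @ self.basis          # (B, V * 3)
        return self.template + offsets.view(params.shape[0], -1, 3)

class HybridEstimator(nn.Module):
    """Learning-based front end, model-based back end."""
    def __init__(self, param_dim=82):
        super().__init__()
        self.encoder = nn.Sequential(          # stand-in for a real CNN backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, param_dim),
        )
        self.body = ToyBodyModel(param_dim=param_dim)

    def forward(self, image):
        params = self.encoder(image)           # regress body-model parameters
        return self.body(params)               # decode to a 3D surface

vertices = HybridEstimator()(torch.rand(1, 3, 224, 224))
print(vertices.shape)                          # torch.Size([1, 6890, 3])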
We conclude the thesis with a detailed discussion of benchmarking practices in computer vision. Among other things, we argue that the design of future datasets should be driven by the general goal of combinatorial robustness in addition to task-specific considerations.
BibTeX
@phdthesis{Omranphd2021,
  TITLE    = {From Pixels to People},
  AUTHOR   = {Omran, Mohamed},
  LANGUAGE = {eng},
  URL      = {nbn:de:bsz:291--ds-366053},
  DOI      = {10.22028/D291-36605},
  SCHOOL   = {Universit{\"a}t des Saarlandes},
  ADDRESS  = {Saarbr{\"u}cken},
  YEAR     = {2021},
}
Endnote
%0 Thesis
%A Omran, Mohamed
%Y Schiele, Bernt
%A referee: Gall, Jürgen
%+ Computer Vision and Machine Learning, MPI for Informatics, Max Planck Society International Max Planck Research School, MPI for Informatics, Max Planck Society Computer Vision and Machine Learning, MPI for Informatics, Max Planck Society Computer Graphics, MPI for Informatics, Max Planck Society
%T From Pixels to People : Recovering Location, Shape and Pose of Humans in Images
%G eng
%U http://hdl.handle.net/21.11116/0000-000A-CDBF-9
%R 10.22028/D291-36605
%U nbn:de:bsz:291--ds-366053
%F OTHER: hdl:20.500.11880/33466
%I Universität des Saarlandes
%C Saarbrücken
%D 2021
%P 252 p.
%V phd
%9 phd
%U https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/33466