Recognition of Kitchen Activities in Videos

Marcus Rohrbach


The recognition of human activities in video has received increasing interest in recent years. Activity recognition has a wide range of applications, including human-computer interaction, surveillance, intelligent environments, automatic extraction of knowledge from (internet) videos, and assistance for people with disabilities. At the same time, this research area still faces a large number of challenges, such as complex human motions and interactions, very diverse activities, limited image resolution and quality, and limited observability.

Fine-grained activities in a kitchen scenario

The current focus of many scientific studies is to distinguish very diverse activities such as running, swimming, or drinking in videos of limited quality. In contrast, many assistance systems need to distinguish very similar activities; for example, it is important for a blind person to know whether the person opposite wants to shake hands or grab their handbag. In our work, we examine this problem more closely and have selected the "kitchen" scenario. In this scenario, the complexity of activities can be varied from very simple actions such as "peeling a carrot" to complex activities of several people preparing a complete meal or dish.

Figure 1: Examples of fine-grained kitchen activities

Figure 2: Trajectories of joints for activity recognition

For this research project, we built a fully functional kitchen and equipped it with several cameras [Figure 1a]. In a first dataset, we recorded twelve subjects preparing a diverse set of dishes, totaling eight hours of high-resolution video. On the published dataset, we evaluated different approaches for activity recognition to distinguish fine-grained activities such as cut, dice, squeeze, peel, or wash [Figure 1b-h]. One approach estimates the trajectories of body joints and learns the differences between activities from them [Figure 2]. A second approach extracts trajectories of moving points in the entire video and computes image and video descriptors along these moving elements. In comparison, the second approach achieves much higher performance, as the joint-trajectory approach does not use any visual information such as color or shape. Overall, we found that very similar, fine-grained activities such as "cut slices" and "cut stripes" are especially difficult to distinguish from one another.
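The following minimal sketch illustrates the second approach in the hedged form described above: trajectory descriptors extracted from a video (e.g., by an external dense-trajectory tool, which is assumed here and not shown) are pooled into a bag-of-features histogram and classified with a linear SVM. Function names and parameters are illustrative assumptions, not the exact pipeline used in the project.

```python
# Sketch: bag-of-features encoding of trajectory descriptors + linear SVM.
# Assumes trajectory descriptors (e.g., HOG/HOF along moving points) are
# already extracted per video; only encoding and classification are shown.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def encode_video(descriptors, codebook):
    """Assign each trajectory descriptor to its nearest codeword and
    return a normalized histogram as the video representation."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-8)

def train_activity_classifier(train_descriptors, train_labels, n_words=256):
    """train_descriptors: list of (n_trajectories_i x dim) arrays per video,
    train_labels: one fine-grained activity label per video (e.g., "cut")."""
    codebook = KMeans(n_clusters=n_words, n_init=4, random_state=0)
    codebook.fit(np.vstack(train_descriptors))
    X = np.array([encode_video(d, codebook) for d in train_descriptors])
    clf = LinearSVC(C=1.0).fit(X, train_labels)
    return codebook, clf

def predict_activity(descriptors, codebook, clf):
    return clf.predict(encode_video(descriptors, codebook)[None, :])[0]
```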


Recognition of composite activities with attributes and scripts

The previously described fine-grained activities are in most cases composed into more complex composite activities, e.g., the preparation of a dish in the kitchen scenario. Learning these composite activities is difficult because the same dish can be prepared very differently by different people, and it is nearly impossible to observe all possible combinations in the training data. We address this problem by representing composite activities with attributes; in our scenario, these are fine-grained activities, ingredients, and kitchen tools. The attribute representation allows recognizing new variants or even completely unseen composite activities, as sketched below. To determine possible variants and attributes of unseen composite activities, we use textual descriptions (scripts), which can easily be collected, e.g., via crowdsourcing.

Figure 3: Composite activities represented as attributes which can be transferred using textual script information
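The short sketch below illustrates the idea of attribute-based transfer in simplified form: each composite activity is described by a set of attributes (fine-grained activities, ingredients, tools), and an unseen composite can be scored against attribute probabilities predicted from video. The attribute vocabulary, composite definitions, and matching score are illustrative assumptions, not the model actually used in the project.

```python
# Sketch: recognizing composite activities from attribute predictions.
# Attribute sets per composite could be mined from crowd-sourced scripts.
import numpy as np

ATTRIBUTES = ["cut", "peel", "wash", "carrot", "cucumber", "knife", "peeler"]

COMPOSITES = {
    "prepare carrot sticks": {"cut", "peel", "carrot", "knife", "peeler"},
    "prepare cucumber salad": {"cut", "wash", "cucumber", "knife"},
}

def composite_scores(attribute_probs):
    """Score each composite by how well the predicted attribute probabilities
    (one per entry of ATTRIBUTES, e.g., from per-attribute classifiers)
    match its script-derived attribute set."""
    scores = {}
    for name, attrs in COMPOSITES.items():
        target = np.array([1.0 if a in attrs else 0.0 for a in ATTRIBUTES])
        # High score when present attributes have high probability and
        # absent attributes have low probability.
        scores[name] = float(np.sum(target * attribute_probs
                                    + (1 - target) * (1 - attribute_probs)))
    return scores

# Example: attribute probabilities predicted from a test video.
probs = np.array([0.9, 0.8, 0.1, 0.95, 0.05, 0.7, 0.6])
scores = composite_scores(probs)
print(max(scores, key=scores.get))  # -> "prepare carrot sticks"
```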

Marcus Rohrbach

DEPT. 2 Computer Vision and Multimodal Computing
Phone +49 681 9325-1206
Email rohrbach@mpi-inf.mpg.de
Internet http://www.d2.mpi-inf.mpg.de/nlp4vision