Know2Look: Commonsense Knowledge for Visual Search
Overview
With the rise in popularity of social media, images accompanied by contextual text form a huge portion of the web. However, image search and retrieval still rely largely on textual cues alone. Although visual cues have started to gain attention, imperfections in object and scene detection mean they do not yet lead to significantly improved results. We hypothesize that background commonsense knowledge about query terms can significantly aid the retrieval of documents with associated images. To this end we combine three modalities - text, visual cues, and commonsense knowledge pertaining to the query - as a recipe for efficient search and retrieval. Know2Look is an image retrieval framework that demonstrates the ensemble effect of these three noisy components, improving image search over conventional text-based approaches.
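The ranking idea can be pictured as a weighted combination of per-modality relevance signals. The snippet below is a minimal sketch of such an ensemble, assuming simple overlap-based scores and hand-set weights; the function names, weights, and example data are illustrative assumptions, not Know2Look's actual formulation.

```python
# Illustrative sketch: linear combination of per-modality relevance scores.
# Scores and weights are toy assumptions, not the framework's real model.

def text_score(doc_text, query_terms):
    """Fraction of query terms found in the document's caption/text."""
    tokens = set(doc_text.lower().split())
    return sum(t in tokens for t in query_terms) / max(len(query_terms), 1)

def visual_score(detected_labels, query_terms):
    """Overlap between query terms and detected object labels (incl. hypernyms)."""
    labels = {l.lower() for l in detected_labels}
    return sum(t in labels for t in query_terms) / max(len(query_terms), 1)

def commonsense_score(expansion_terms, doc_text):
    """Overlap between commonsense expansion terms and the document text."""
    tokens = set(doc_text.lower().split())
    return sum(t in tokens for t in expansion_terms) / max(len(expansion_terms), 1)

def combined_score(doc, query_terms, expansion_terms, w=(0.5, 0.3, 0.2)):
    """Weighted sum of the three noisy signals: text, visual, commonsense."""
    return (w[0] * text_score(doc["caption"], query_terms)
            + w[1] * visual_score(doc["objects"], query_terms)
            + w[2] * commonsense_score(expansion_terms, doc["caption"]))

doc = {"caption": "tourists ride a gondola through the canals of venice",
       "objects": ["boat", "person", "watercraft"]}
print(combined_score(doc, ["gondola", "venice"], ["canal", "boat"]))
```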
Approach
Our method is based on statistical language models over unigram and bigram textual features. We use visual features in the form of object classes (and their WordNet hypernyms) detected by the LSDA object detection algorithm. Our commonsense knowledge features are OpenIE (subject, predicate, object) triples extracted from Wikipedia documents.
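As a concrete illustration of the textual component, the sketch below scores a document with a smoothed query-likelihood language model over unigrams and bigrams. The Jelinek-Mercer-style interpolation, the smoothing weight, and the treatment of detected object labels as extra document tokens are assumptions made for illustration, not the exact Know2Look model.

```python
# Minimal sketch of a smoothed unigram/bigram query-likelihood language model.
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def lm_score(doc_tokens, collection_tokens, query_tokens, lam=0.7):
    """log P(query | doc), linearly interpolated with collection statistics."""
    score = 0.0
    for n in (1, 2):  # unigram and bigram features
        doc_counts = Counter(ngrams(doc_tokens, n))
        col_counts = Counter(ngrams(collection_tokens, n))
        doc_total = max(sum(doc_counts.values()), 1)
        col_total = max(sum(col_counts.values()), 1)
        for gram in ngrams(query_tokens, n):
            p = (lam * doc_counts[gram] / doc_total
                 + (1 - lam) * col_counts[gram] / col_total)
            if p > 0:  # grams unseen everywhere contribute nothing in this toy sketch
                score += math.log(p)
    return score

# Caption tokens plus detected object labels treated as extra document terms.
doc_tokens = "a gondola on a canal in venice".split() + ["boat", "watercraft"]
collection_tokens = "venice canal boat tourist eiffel tower paris museum".split()
print(lm_score(doc_tokens, collection_tokens, "gondola venice".split()))
```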
Preliminary evaluation on a small benchmark of 20 queries shows promising performance of Know2Look over the baseline (conventional Google search).
Datasets
- Images and corresponding captions in the "tourism" domain, used for the evaluation of Know2Look, were collected from four datasets - Flickr 30K, Pascal Sentence Dataset, SBU Captioned Photo Dataset, and MSCOCO.
- OpenIE commonsense knowledge triples used for query expansion were extracted from Wikipedia documents (a query-expansion sketch follows this list).
- The query benchmark for evaluation was constructed from Flickr co-occurrence tags.
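The commonsense triples drive query expansion roughly as sketched below (referenced in the list above). The toy triple store, the subject-match rule, and the cap on expansion terms are illustrative assumptions, not the actual extraction pipeline.

```python
# Hypothetical sketch of query expansion with OpenIE-style triples.
TRIPLES = [
    ("gondola", "is found in", "venice"),
    ("gondola", "travels along", "canal"),
    ("eiffel tower", "is located in", "paris"),
]

def expand_query(query_terms, triples, max_terms=5):
    """Add objects of triples whose subject matches a query term."""
    expansions = []
    for subj, _pred, obj in triples:
        if subj in query_terms and obj not in query_terms:
            expansions.append(obj)
    # Deduplicate while preserving order, then cap the number of expansions.
    return list(dict.fromkeys(expansions))[:max_terms]

print(expand_query(["gondola"], TRIPLES))  # -> ['venice', 'canal']
```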