Determining the similarity between images is a fundamental step in many applications, such as image categorization, image labeling, and image retrieval. Automatic methods for similarity estimation often fall short when the task requires semantic context, raising the need for human judgment. Such judgments can be collected via crowdsourcing techniques, based on tasks posed to web users. However, to estimate image similarities at reasonable time and cost, the tasks posed to the crowd must be generated carefully. We observe that distances within local neighborhoods provide valuable information that allows a quick and accurate construction of the global similarity metric. This key observation leads to a solution based on clustering tasks that compare relatively similar images. In each query, crowd members cluster a small set of images into bins. The results yield many relative similarities between images, which are used to construct a global image similarity metric. This metric is progressively refined, and serves to generate finer, more local queries in subsequent iterations. We demonstrate the effectiveness of our method on datasets where ground truth is available, and on a collection of images where semantic similarities cannot be quantified. In particular, we show that our method outperforms alternative baseline approaches, and demonstrate the usefulness of clustering queries and of our progressive refinement process.
In recent years, there have been many advances in the image-capturing capabilities of mobile devices, encouraging end users to capture more images of higher quality. As a result, there is an abundance of constantly growing image collections, both on personal computers and on websites such as Facebook, Flickr, and Instagram. Such vast collections require efficient methods for image categorization, image labeling, and in particular image retrieval, which allows users to quickly locate an image suitable for their needs. These methods necessarily rely on the availability of pairwise similarities between the images in the collection.
It is extremely hard to define a distance metric that captures well the intuitive or semantic similarity between images. State-of-the-art analytical methods for computing such a metric fall short when similarities are derived from a broad semantic context. These may include elusive relations such as a similar emotion or sensation evoked by the images (e.g., images that convey “fear” or “comfort”); images of things that are semantically related (e.g., different types of garden furniture); likeness between the photographed people; and so on. Consider, for instance, the similarity between the movie posters in Fig. 1. Identifying such similarities is usually easy for a human observer, but poses a hard computational problem nonetheless.
The natural solution is thus to gather information about semantic similarities between images from people, for example using a crowdsourcing technique.1 This approach was taken in recent work [9,13] to collect style similarity measures. The typical comparison task that the crowd performs has the following form: given three images A, B, and C, choose whether A is more similar to B or to C (a triplet query). Assuming consistent query responses, querying every image triplet yields the full relative similarity metric over the set of images. However, the number of triplets is prohibitively large, growing cubically with the collection size. Thus, typically only a sample of the triplets is queried, and the rest are estimated based on extracted image features [9,13].
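To make the scale concrete, here is a minimal Python check of how the number of unordered image triples grows with collection size (each triple additionally admits up to three distinct anchored queries):

```python
from math import comb

# The number of distinct image triples grows cubically with the
# collection size, which is why exhaustive querying is infeasible.
for n in (100, 1_000, 10_000):
    print(f"{n:>6} images -> {comb(n, 3):>16,} triples")
# 100 images ->          161,700 triples
# 1,000 images ->    166,167,000 triples
```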
Another challenge in this respect is that people often need context to perform comparison tasks. For example, consider the triplet in Fig. 2. Is the image of a bridge in London (b) more similar to another image of a different London bridge from a different angle (c), or to an image of a Parisian bridge from the same angle (a)? In a larger context, it often becomes clearer which option is more reasonable; e.g., in the context of Fig. 3, image (b) is more similar to (c) than to (a).
In this work, we propose an alternative approach for learning image similarities, based on clustering queries posed to the crowd. Instead of being shown three images, crowd members are given a small set of images and are asked to cluster them into bins of similar images using a drag-and-drop graphical UI (see Fig. 4). While a single clustering task requires more effort than comparing three images, our approach has two important advantages. First, the result of a single clustering task provides a great deal of information, equivalent to many triplet comparison tasks: images placed in the same bin are considered closer to one another than to images in other bins. Second, each query provides crowd members with additional context that assists them in performing a more faithful and meaningful comparison.
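To illustrate the first advantage, the following sketch (the function name and bin encoding are our own illustrative choices, not the paper's) expands a single clustering answer into the relative comparisons it implies:

```python
from itertools import combinations

def implied_constraints(bins):
    """Expand one clustering answer into the triplet comparisons it implies.

    `bins` is a list of lists of image IDs, one list per bin. Images placed
    in the same bin are taken to be closer to each other than to any image
    in a different bin, i.e. d(a, b) < d(a, c) whenever a and b share a bin
    and c does not.
    """
    constraints = []
    for i, bin_i in enumerate(bins):
        others = [img for j, b in enumerate(bins) if j != i for img in b]
        for a, b in combinations(bin_i, 2):
            for c in others:
                constraints.append((a, b, c))  # a is closer to b than to c
    return constraints

# One 8-image query split into three bins already implies 36 comparisons:
answer = [["img1", "img2", "img3"], ["img4", "img5"], ["img6", "img7", "img8"]]
print(len(implied_constraints(answer)))  # 36
```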
A key observation of this work is that a similarity metric can be constructed more efficiently by performing comparisons on similar images rather than on dissimilar ones. This is particularly true in the context of semantic similarities, where local similarities are often more meaningful. Following this observation, we develop a novel, adaptive algorithm that aims to generate queries that are as local as possible. The challenge here is that similarities are unknown in advance. Thus, our algorithm works iteratively. At each phase, we generate and pose clustering queries to the crowd. As information is collected, we progressively refine the queries to focus on similar images in a narrower local neighborhood. Local similarity comparisons are embedded in Euclidean space to obtain a refined estimate of the global similarity metric. This refined metric is then leveraged to compute more locally focused queries in the next phase. This progressive method converges efficiently to a meaningful similarity estimate.
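To make the pipeline concrete, here is a minimal sketch of two of its building blocks: embedding the collected relative comparisons in Euclidean space, and using the current embedding to form a more local query. The triplet hinge loss and helper names are our own illustrative choices; the paper does not prescribe this particular embedding procedure.

```python
import numpy as np

def embed_constraints(constraints, n_images, dim=2, margin=1.0,
                      lr=0.05, epochs=200, seed=0):
    """Embed images in Euclidean space so that each constraint (a, b, c),
    read as d(a, b) < d(a, c), is satisfied with a margin.
    A stochastic-gradient sketch on a triplet hinge loss."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_images, dim))
    for _ in range(epochs):
        for a, b, c in constraints:
            d_ab, d_ac = X[a] - X[b], X[a] - X[c]
            # Hinge: update only when (a, b) is not yet closer by `margin`.
            if d_ab @ d_ab + margin > d_ac @ d_ac:
                X[a] -= lr * 2 * (d_ab - d_ac)
                X[b] += lr * 2 * d_ab
                X[c] -= lr * 2 * d_ac
    return X

def local_query(X, anchor, query_size=8):
    """Form the next, more local, clustering query from the anchor and
    its nearest neighbors in the current embedding."""
    d = np.linalg.norm(X - X[anchor], axis=1)
    return np.argsort(d)[:query_size]

# Toy usage: four images, three constraints, then a 3-image local query.
X = embed_constraints([(0, 1, 2), (0, 1, 3), (2, 3, 0)], n_images=4)
print(local_query(X, anchor=0, query_size=3))
```

Each phase would alternate these two steps: re-embed all constraints collected so far, then use the refined embedding to pick tighter neighborhoods for the next round of clustering queries.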
Evaluation and experimental study To test the efficiency of our approach, we implement our technique in a prototype system and use it to conduct a thorough experimental study, with both synthetic and real crowd data. First, we test our technique on two image datasets where the ground truth is known, examine the results, and compare them to a baseline approach that uses the same number of queries but chooses them randomly. Second, we compute the k-NN images for real-world image datasets, where the ground truth is unknown, and evaluate the results manually. Last, we study the effect of parameters such as the number of phases and queries in a series of synthetic experiments. Our experimental results demonstrate the efficiency of our approach for computing semantic image similarity based solely on the answers of the crowd, while using a relatively small number of clustering queries.
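As one plausible way to score the ground-truth experiments (the paper does not specify its exact metric, so the name and form below are our own), the learned embedding can be evaluated by the fraction of true k-nearest neighbors it recovers:

```python
import numpy as np

def knn_recovery(X_learned, D_true, k=10):
    """Fraction of true k-nearest neighbors recovered by the learned
    embedding, averaged over all images. `D_true` is the known pairwise
    distance matrix available in the ground-truth datasets."""
    n = len(D_true)
    hits = 0
    for i in range(n):
        d = np.linalg.norm(X_learned - X_learned[i], axis=1)
        true_nn = set(np.argsort(D_true[i])[1:k + 1])  # drop self at rank 0
        found_nn = set(np.argsort(d)[1:k + 1])
        hits += len(true_nn & found_nn)
    return hits / (n * k)
```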