When precomputing a similarity matrix, there is a large cost in both time and memory: computing pairwise similarity between every pair of samples is O(N^2). There are two main ways to reduce the cost of one or the other of these demands...

Approximate Hashing: Hashing projects a set of features into a lower-dimensional representation while simultaneously spreading out the samples. Hash strategies can be chosen either to find exact copies, as is common for traditional algorithms like MD5 and SHA1, or for approximate nearest-neighbor indexing, as with LSH. These algorithms provide efficient approximations (see Voronoi diagram below), but the lookup happens at query time, which makes aggregating similarity scores across many samples much slower (e.g. finding a list of images similar to a set of other images can be slow).

Precomputed Tables: Precomputing pairwise similarity under some distance metric (usually L2, i.e. plain Euclidean distance) is a simple way to get fast similarity lookups at the cost of additional storage for the lookup table. While this method requires a lot of resources to save the pairwise distances, query speed is limited only by the speed of your database lookups.
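To make the hashing alternative concrete, here is a minimal Python sketch of random-hyperplane LSH, one common approximate nearest-neighbor scheme. All names here (`lsh_signature`, the bucket dictionary) are illustrative, not from the attached code.

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_signature(x, planes):
    """Hash a feature vector to a bit signature: one bit per hyperplane."""
    return tuple((x @ planes.T) > 0)

# 8 random hyperplanes over 16-dimensional features.
planes = rng.standard_normal((8, 16))

# Bucket samples by signature; nearby samples tend to share a bucket.
data = rng.standard_normal((100, 16))
buckets = {}
for i, x in enumerate(data):
    buckets.setdefault(lsh_signature(x, planes), []).append(i)

# A query is compared only against its own bucket, not all N samples --
# fast for a single query, but aggregating scores across many queries
# still means issuing many lookups.
query = data[0]
candidates = buckets.get(lsh_signature(query, planes), [])
```

Note the trade-off the post describes: each lookup is cheap, but every lookup happens at query time, so batch similarity aggregation pays the cost repeatedly.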
In most work, storage resources are easier to acquire than additional processing power, so I currently prefer to use precomputed tables. If you look closely at most search applications and ask whether the user really cares what the 1,000,000th most similar item to the query is, it's probable that they don't. This is the case for applications that find the closest restaurant to a geographic location (using GPS to search the yellow pages) as well as for most image-based search operations. Thus, when computing this table, a full-depth analysis (i.e. comparing every image to every other image) is likely unnecessary at query time.
Motivated by the discussion above, top-K indexing takes advantage of the fact that only the top K (best K) results matter. However, we must also be mindful that not all similarity computations can be stored in memory at once. For example, if you have 20 million items, it may not be resource-efficient to hold an entire 20 million x 20 million result in memory while you search for the top-K items. The solution? Break the large N x N matrix (here, 20 million x 20 million) into small chunks that can be analyzed more easily (step 1). After computing similarity for a chunk, save the sparse top-K matrix (step 2) by merging the latest chunk with everything you've seen before (step 3). In the image below, I illustrate the matrix from step 2 as I save it on disk: two separate, simple lists containing the scores and indices of the new top-K.
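The chunked top-K procedure above can be sketched as follows. This is a Python/NumPy approximation of the idea (the attached script is MATLAB); it assumes Euclidean distance, and the helper names `chunk_topk` and `merge_topk` are mine, not from the attached code.

```python
import numpy as np

def chunk_topk(queries, chunk, chunk_offset, K):
    """Distances from each query to one chunk of items; keep the best K."""
    # Pairwise Euclidean distances, shape (num_queries, chunk_size).
    d = np.linalg.norm(queries[:, None, :] - chunk[None, :, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :K]            # best K within this chunk
    scores = np.take_along_axis(d, idx, axis=1)
    return scores, idx + chunk_offset             # shift to global indices

def merge_topk(scores_a, idx_a, scores_b, idx_b, K):
    """Step 3: combine the new chunk's top-K with the running top-K."""
    scores = np.concatenate([scores_a, scores_b], axis=1)
    idx = np.concatenate([idx_a, idx_b], axis=1)
    order = np.argsort(scores, axis=1)[:, :K]
    return (np.take_along_axis(scores, order, axis=1),
            np.take_along_axis(idx, order, axis=1))

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 8))             # toy stand-in for 20M items
K, chunk_size = 5, 250

# Running top-K: two simple arrays of scores and indices, as in the post.
best_s = np.full((len(data), K), np.inf)
best_i = np.zeros((len(data), K), dtype=int)
for start in range(0, len(data), chunk_size):     # step 1: chunk the matrix
    s, i = chunk_topk(data, data[start:start + chunk_size], start, K)
    best_s, best_i = merge_topk(best_s, best_i, s, i, K)  # steps 2-3
```

Only a (N x K) pair of score/index arrays ever lives in memory, never the full N x N matrix, which is exactly what makes the 20-million-item case tractable.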
Also attached is an example MATLAB script containing code for each of the steps defined above. Hopefully this will help solve at least one part of the problem when indexing pairwise similarity of a very large dataset.
Example source code: topk.tar.gz