
fast top-k similarity computation in matlab

When pre-computing a similarity matrix, there is a large cost in both time and memory: computing pairwise similarity between every sample is O(N^2).  There are two main ways to reduce the cost of one or the other of these demands...

  • Approximate Hashing: Hashing projects a set of features into a lower-dimensional representation while simultaneously spreading out the samples.   Hash strategies can be chosen either to find exact copies, as is common for traditional algorithms like MD5 and SHA1, or to build approximate nearest-neighbor indexes, as in LSH.  These algorithms provide efficient approximations (see the Voronoi diagram below), but the work happens at query time, which makes aggregating similarity scores across many samples much slower (e.g. finding a list of images similar to a set of other images can be slow).  A minimal sketch of one such scheme appears after this list.
    Approximate Voronoi diagram of a set of points. Notice the blended colors in the fuzzy boundary of the Voronoi cells. (from Wikipedia, http://en.wikipedia.org/wiki/Voronoi_diagram)

  • Precomputed Tables: Precomputing pairwise similarity in some distance metric (usually L2, i.e. plain Euclidean distance) is a simple way to get fast similarity look-ups at the cost of additional storage for the look-up table.  While this method requires a lot of resources to save the pairwise distances, query speed is limited only by your database lookup process.  A second sketch after this list shows the idea.
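To make the hashing bullet concrete, here is a minimal sketch of one common approximate-hashing scheme, sign random projections (a form of LSH).  It is not part of the attached script; the feature matrix X and the signature length nbits are illustrative assumptions.

    % Sign-random-projection LSH sketch: nearby samples tend to get the
    % same bit signature. X is an assumed N-by-D feature matrix.
    nbits = 32;                     % illustrative signature length
    R = randn(size(X, 2), nbits);   % random hyperplane normals
    codes = (X * R) > 0;            % N-by-nbits logical bit codes; rows of X
                                    % that are close tend to collide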
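And here is a minimal sketch of the precomputed-table approach, again assuming an N-by-D feature matrix X.  It builds the full N-by-N Euclidean distance table in one shot, which is exactly the part that becomes infeasible at large N.

    % Full pairwise Euclidean distance table (only feasible for modest N).
    sqnorm = sum(X.^2, 2);                            % squared norm per row
    D2 = bsxfun(@plus, sqnorm, sqnorm') - 2*(X*X');   % squared distances
    D2(D2 < 0) = 0;                                   % clamp rounding noise
    D = sqrt(D2);                                     % N-by-N distance table
    save('pairwise_table.mat', 'D');                  % store for fast lookup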

In most work, storage resources are easier to acquire than additional processing power, so I currently prefer to use precomputed tables.  If you look closely at most search applications and ask whether the user really cares what the 1,000,000th most similar item to the query is, it's probable that they don't.  This holds for applications that find the closest restaurant to a geographic location (using GPS to search the yellow pages) as well as for most image-based search operations.  Thus, when computing this table, it's likely that full depth (i.e. storing every image's similarity to every other image) is not necessary at query time.

Motivated by the discussion above, top-K indexing takes advantage of the fact that only the top-K (i.e. the best K) results are needed.  However, we must also be mindful of the fact that not all similarity computations can be stored in memory at once.  For example, if you have 20 million items, it is not resource efficient to hold the entire 20 million x 20 million result in memory while you search for the top-K items.  The solution?  Break apart the large N x N matrix (here N = 20 million) into small chunks that can be analyzed more easily (step 1).  After computing the similarity for each chunk, keep the sparse top-K matrix (step 2) obtained by merging the latest chunk with everything you've seen before (step 3).  A MATLAB sketch of this loop appears just below.  In the image after it, I illustrate the matrix in step 2 the way I save it on disk: two separate and simple lists containing the score and index of the new top-K.
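Here is a minimal sketch of steps 1-3 (my own illustration, not the attached topk.tar.gz script).  It assumes an N-by-D feature matrix X, uses negative squared Euclidean distance as the similarity score, and takes K and chunkSize as illustrative parameters.

    function [topScore, topIdx] = topk_chunked(X, K, chunkSize)
    % Step 1: walk the N-by-N similarity matrix in column chunks.
    % Step 2: keep only the top-K scores and indices per row.
    % Step 3: merge each new chunk with the running top-K.
    N = size(X, 1);
    topScore = -inf(N, K);    % running best-K scores per row
    topIdx   = zeros(N, K);   % matching neighbor indices
    sqnorm   = sum(X.^2, 2);
    for c0 = 1:chunkSize:N
        cols = c0:min(c0 + chunkSize - 1, N);    % step 1: one column chunk
        % similarity = negative squared Euclidean distance for this chunk
        S = -(bsxfun(@plus, sqnorm, sqnorm(cols)') - 2*(X*X(cols, :)'));
        cand    = [topScore, S];                 % old top-K plus new chunk
        candIdx = [topIdx, repmat(cols, N, 1)];
        [cand, ord] = sort(cand, 2, 'descend'); % step 2: sort each row
        topScore = cand(:, 1:K);                 % step 3: keep the best K
        lin    = sub2ind(size(candIdx), repmat((1:N)', 1, K), ord(:, 1:K));
        topIdx = candIdx(lin);
    end

Note that each row's own index will appear in its top-K with a score of 0 (a self-match); drop it afterward if self-similarity is not wanted.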

Sparse top-K similarity matrix.
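In MATLAB this two-list layout is trivial to produce.  A minimal sketch, assuming the topk_chunked function sketched above and illustrative parameter values and file names:

    [topScore, topIdx] = topk_chunked(X, 10, 5000);  % K = 10, 5000-column chunks
    save('topk_score.mat', 'topScore');              % list 1: the top-K scores
    save('topk_index.mat', 'topIdx');                % list 2: the matching indices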

Sure, this method has a slight increase in time complexity because you're constantly sorting and storing new results as part of the top-K process, but overall it should run much faster because of the reduced memory requirements.

Also attached is an example MATLAB script that contains code for each of the steps defined above.  Hopefully this will help solve at least one part of the problem when indexing pairwise similarity of a very large dataset.

Example source code: topk.tar.gz
