K-Means using kd-trees
Datasets
- corel: SIFT descriptors from the COREL dataset (/common/Image_Datasets/COREL/).
- cards: SIFT descriptors from the playing cards dataset (/common/Image_Datasets/PlayingCards/crops/).
- Note that the cards have a lot of recurrent features, and thus have many tight clusters in SIFT space.
- random: Just random points
- Note that since the SIFT descriptors are normalized, the random points should be normalized as well (see the sketch after this list).
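A minimal sketch of how such normalized random points could be generated (Python/numpy; the 128-dimensional layout and unit L2 norm are assumptions about the SIFT convention used here):

```python
import numpy as np

def random_sift_like_points(n, dim=128, seed=0):
    """Generate random points and L2-normalize them, so they live on the
    same unit sphere as unit-normalized SIFT descriptors.
    dim=128 and unit L2 norm are assumed conventions."""
    rng = np.random.default_rng(seed)
    pts = rng.random((n, dim))  # non-negative entries, like SIFT bin values
    pts /= np.linalg.norm(pts, axis=1, keepdims=True)
    return pts
```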
Basic Idea for Tests
- Methods to try:
- Marco's K-means (deterministic)
- K-means with kd-trees (using FLANN; see the sketch after this list)
- Hierarchical K-Means?
- Also look at the convergence of the mean squared error (i.e. how many k-means iterations are needed).
- how does the kd-tree method converge?
- what is the trade-off between the number of trees/checks and runtime for the kd-tree method, across different vocabulary and training set sizes?
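As a reference for the kd-tree method, here is a minimal sketch of approximate K-means (AKM) with FLANN-backed assignment, assuming the pyflann Python bindings; the function akm and its defaults (taken from the Initial Testing parameters below) are illustrative, not the actual test code:

```python
import numpy as np
from pyflann import FLANN

def akm(points, k, iterations=7, trees=8, checks=52, seed=0):
    """Approximate K-means: the assignment step queries a randomized
    kd-tree forest (FLANN) instead of doing an exhaustive NN search."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)].copy()
    flann = FLANN()
    for _ in range(iterations):
        # Rebuild the kd-tree index over the current centers.
        flann.build_index(centers, algorithm="kdtree", trees=trees)
        # Approximate nearest-center lookup; `checks` bounds back-tracking.
        assign, sqd = flann.nn_index(points, 1, checks=checks)
        assign = assign.ravel()
        flann.delete_index()
        # Standard Lloyd update: move each center to the mean of its members.
        for j in range(k):
            members = points[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, assign, sqd.ravel()
```

Note that FLANN reports squared L2 distances by default, which matches the squared-distance axes in the plots below.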
Measurements:
- Mean squared error to assigned centers (distribution).
- Box-plots of distance to nearest center vs some test parameter.
- how does this distribution compare to using randomly chosen points as centers? (a sanity check; see the sketch after this list)
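A sketch of how these measurements could be collected; sqdist_to_nearest is a hypothetical helper, and the random-centers baseline implements the sanity check above:

```python
import numpy as np

def sqdist_to_nearest(points, centers, chunk=1000):
    """Exact squared distance from each point to its nearest center,
    computed in chunks so the (chunk x k) distance matrix stays small."""
    out = np.empty(len(points))
    for i in range(0, len(points), chunk):
        d2 = ((points[i:i + chunk, None, :] - centers[None, :, :]) ** 2).sum(-1)
        out[i:i + chunk] = d2.min(axis=1)
    return out

def random_center_baseline(points, k, seed=0):
    """Sanity check: k data points drawn at random act as 'centers'."""
    rng = np.random.default_rng(seed)
    return points[rng.choice(len(points), k, replace=False)]
```

The two resulting distance distributions can then go straight into a box plot.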
Initial Testing
This is the first round of testing, where I try to determine what results to expect.
Parameters
- K = 3000 words
- N = 30000 points
- Deterministic: Marco's k-means implementation with 7 iterations.
- Approximate: kd-tree with 8 trees, 52 checks and 7 iterations.
- NOTE: there is a problem here which Pietro pointed out: the random points haven't been normalized, which explains why they are shifted on the x-axis.
Results
- The bottom diagram is the same as the top one but with a logarithmic y-axis.
- Note that the distance on the x-axis is really the squared distance of the point to the center of the assigned word (a plotting sketch follows this list).
- Try a sample of the CARD descriptors on vocabularies built with flat K-means using either the COREL, CARD or RANDOM datasets.
- Also a sample of COREL descriptors on vocabularies built from COREL and CARD datasets.
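The diagrams themselves are not reproduced here, but a sketch of how such a histogram pair (linear plus logarithmic y-axis) could be drawn from the squared distances (matplotlib; names are illustrative):

```python
import matplotlib.pyplot as plt

def plot_sqdist_hists(sqdists, bins=100):
    """Top: linear y-axis; bottom: the same histogram with a log y-axis."""
    fig, (ax_lin, ax_log) = plt.subplots(2, 1, sharex=True)
    ax_lin.hist(sqdists, bins=bins)
    ax_log.hist(sqdists, bins=bins)
    ax_log.set_yscale("log")
    ax_log.set_xlabel("squared distance to assigned cluster center")
    plt.show()
```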
Optimal no. of trees and checks for AKM
Datasets
- Try the COREL dataset.
Parameters
- K = 3000 words
- N = 30,000 points
- checks = {10, 30} recursive steps
- trees = {1, 2, 4, 8, 16, 32} (a sweep sketch follows this list)
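A sketch of the parameter sweep over trees and checks, assuming pyflann and the akm() helper sketched earlier; points stands for the 30,000 training descriptors, and k = 3000 with 10 iterations follows the parameter lists on this page:

```python
import itertools
import time

TREES = [1, 2, 4, 8, 16, 32]
CHECKS = [10, 30]

results = {}
for trees, checks in itertools.product(TREES, CHECKS):
    t0 = time.perf_counter()
    centers, assign, sqdists = akm(points, k=3000, iterations=10,
                                   trees=trees, checks=checks)
    per_iter = (time.perf_counter() - t0) / 10  # rough time per iteration
    results[(trees, checks)] = (per_iter, sqdists)
```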
Measure
- Time spent per iteration.
- Cluster diameter (90th percentile) vs. cluster size (box-plot; see the diameter sketch after this list)
- Cluster diameter vs K
- Cluster diameter distribution
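A sketch of the diameter measurement; interpreting "cluster diameter" as twice the 90th-percentile member-to-center distance is an assumption, not necessarily the definition used in the plots:

```python
import numpy as np

def cluster_diameters(points, centers, assign, pct=90):
    """Per-cluster diameter, here assumed to be twice the pct-th
    percentile of member distances to the cluster center."""
    diam, size = [], []
    for j in range(len(centers)):
        members = points[assign == j]
        if len(members) == 0:
            continue  # empty clusters contribute no diameter
        d = np.linalg.norm(members - centers[j], axis=1)
        diam.append(2 * np.percentile(d, pct))
        size.append(len(members))
    return np.array(diam), np.array(size)
```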
Results
These results were produced by the test_kdtree_trees.m script in /common/welinder/benchmarking/kmeans_kd-tree.
Distance to assigned cluster center (box plot)
- Squared distances to assigned cluster centers (of the training data), after running 10 iterations of the approximate kd-tree K-means algorithm.
- The columns show the results for different numbers of back-tracking steps.
- The upper row shows the distances to the nearest neighbors as assigned by the approximate FLANN code, and the bottom row shows the distances to the real nearest neighbors (a computation sketch follows this list).
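A sketch of how the two rows could be computed, assuming pyflann and the sqdist_to_nearest helper from the earlier sketch:

```python
from pyflann import FLANN

def approx_and_exact_sqdists(points, centers, trees, checks):
    """Upper row: squared distances to the approximate FLANN neighbor.
    Bottom row: squared distances to the true nearest center."""
    flann = FLANN()
    flann.build_index(centers, algorithm="kdtree", trees=trees)
    _, approx_d2 = flann.nn_index(points, 1, checks=checks)
    flann.delete_index()
    exact_d2 = sqdist_to_nearest(points, centers)  # exhaustive search
    return approx_d2.ravel(), exact_d2
```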
Average distances to assigned centers
- Same plot as the previous one, but this time only plotting the mean values.
- Left column: Distances of dataset points to approximate nearest neighbors (NNs) using FLANN
- Middle column: Distances of dataset points to real nearest neighbors (FLANN was only used to produce the centers, not to find the final assignment).
- Right column: Distances of points from a 10,000 point test set to their real nearest neighbors.
- The bottom row is the same as the upper row, but using loglog plots instead.
- NOTE: using the median instead of the mean gives very similar results.
Sanity check with 100 centers
- Same plot as the previous one, but with K = 100.
- Note especially the left-most column: it looks like we're not overfitting with high no. of checks anymore. This is because it is much harder to overfit with 100 centers.
Distribution of distances to assigned centers
- Distribution of distances using 8 checks. This shows more information about each data point than the previous plot (which only showed the mean).
- See descriptions above for more info.
- Lower plots are simply loglog plots of the upper plots.
- Same but using 32 checks. Notice two things:
- How much better the results get by increasing the no. of trees when using few checks; conversely, the no. of trees matters little if you have many checks.
- That the distribution of the test points looks the same regardless of the number of checks and trees.
Error convergence
- Error convergence of the approximate K-means algorithm for different numbers of checks and trees (a tracking sketch follows).
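A sketch of how the convergence curve could be recorded, assuming pyflann; the mean squared assignment error is stored after every iteration:

```python
import numpy as np
from pyflann import FLANN

def akm_error_trace(points, k, iterations, trees, checks, seed=0):
    """Run AKM and record the mean squared error after each iteration."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)].copy()
    flann, errors = FLANN(), []
    for _ in range(iterations):
        flann.build_index(centers, algorithm="kdtree", trees=trees)
        assign, d2 = flann.nn_index(points, 1, checks=checks)
        flann.delete_index()
        errors.append(d2.mean())  # FLANN returns squared L2 distances
        for j in range(k):
            members = points[assign.ravel() == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return errors
```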
Speed
- Time taken per iteration of the K-means algorithm, broken down by step, for different numbers of trees and checks (a timing sketch follows this list).
- Left: time taken to build index (sanity check -- no. of checks shouldn't matter as they are not part of index building)
- Middle: time taken to search for nearest neighbors using the index
- Right: total time per iteration (including the time to calculate new cluster centers from nearest neighbors found)
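A sketch of how the three timings could be separated, assuming pyflann; the split mirrors the three panels above:

```python
import time
from pyflann import FLANN

def time_one_iteration(points, centers, trees, checks):
    """Split one AKM iteration into build / search / update timings."""
    flann = FLANN()
    t0 = time.perf_counter()
    flann.build_index(centers, algorithm="kdtree", trees=trees)
    t_build = time.perf_counter() - t0   # left panel: index build time

    t0 = time.perf_counter()
    assign, _ = flann.nn_index(points, 1, checks=checks)
    t_search = time.perf_counter() - t0  # middle panel: NN search time
    flann.delete_index()

    t0 = time.perf_counter()
    for j in range(len(centers)):        # recompute centers from assignments
        members = points[assign.ravel() == j]
        if len(members):
            centers[j] = members.mean(axis=0)
    t_update = time.perf_counter() - t0
    # right panel: total time per iteration, including the center update
    return t_build, t_search, t_build + t_search + t_update
```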