Pollen Detector: Benchmarking
There are step-by-step examples of how to use the code, so hopefully this will also serve as an introductory tutorial.
Overview Of The Algorithm
Stage 1: Expert labels pollen grains (red boxes) using the MATLAB labelGUI
program. This is used as the ground-truth for testing.
Stage 2: Smooth and threshold (blue contours) to identify large and/or dark objects.
Stage 3: Remove contours whose radius, brightness or eccentricity fall outside the range spanned by 99% of the contours in the training set (gray). The remaining candidates are marked with blue boxes. (A rough sketch of stages 2 and 3 appears just after this list.)
Stage 4: Evaluate the contour shape + brightness parameters of each candidate using a nearest neighbor model. Candidates for which no category other than clutter is likely are dropped (gray).
Stage 5: The above shape+brightness model is combined with the appearance model to choose the final classification (green).
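As a rough illustration of stages 2 and 3, here is a hypothetical sketch using standard Image Processing Toolbox calls; the detector's actual code and parameter values may differ, and r_bounds, e_bounds and b_bounds are placeholder names for limits derived from the training set.

% Hypothetical sketch of stages 2-3; not the actual detector code.
img = im2double(imread('slide.tif'));         % hypothetical slide image (assumed grayscale)
smoothed = imfilter(img, fspecial('gaussian', 15, 3), 'replicate');   % stage 2: smooth
bw = smoothed < graythresh(smoothed);         % stage 2: keep large/dark regions
stats = regionprops(bwlabel(bw), smoothed, ...
            'EquivDiameter', 'Eccentricity', 'MeanIntensity', 'BoundingBox');
radius = [stats.EquivDiameter] / 2;
ecc    = [stats.Eccentricity];
bright = [stats.MeanIntensity];
% Stage 3: keep only contours inside the range spanned by 99% of the
% training contours; r_bounds, e_bounds, b_bounds are hypothetical
% [low high] limits computed from the training set (e.g. with prctile).
keep = radius > r_bounds(1) & radius < r_bounds(2) & ...
       ecc    > e_bounds(1) & ecc    < e_bounds(2) & ...
       bright > b_bounds(1) & bright < b_bounds(2);
candidates = stats(keep);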
Shape+Brightness Model Details
Appearance Model Details
The Datasets
To avoid confusion, I'll refer to what I think of as 3 distinct datasets located in /common/greg/pollen/data:
- "Original" Dataset
- Original slides used to train the pollen detector, mostly obtained in 2007
- "Transitional" Dataset
- Slides scanned in 2008, 2009 and early 2010 before switching to the new camera
- "New" Dataset
- As of October we have 6 days' worth of slides scanned with a new Retiga 4000R firewire camera and new micromanager software. They are in /common/greg/pollen/data/2010/RSAQMD.
Benchmarking the Original Dataset against itself is a bit like benchmarking on Caltech-101 using all Caltech-101 images for both training and testing. The New Dataset has some differences, which I'll point out below. So it's more like training on Caltech-101 and testing on, say, the Pascal image dataset: a more challenging proposition.
Benchmarking
Training: Original Dataset
Here's the complete list of everything you need to do to train the system:
Copy Training Images To Scratch
This is a useful first step so that we don't have different people analyzing data in the same place, and inadvertently stepping on each other's toes.
You don't have to use the scratch drive: anyplace will do. For the purposes of this analysis we'll refer to this location as $ROOT, even though I'm really using
ROOT=/scratch/greg/pollen/benchmarking
We start with a set of images cropped by experts using labelGUI back in 2007. Over the years we've fixed errors and removed a lot of images that seemed to be bad examples of their visual category. The resulting data is in /common/greg/pollen/scratch/images. We'll start by copying these over:
% mkdir -p $ROOT/images
% cp -R /common/greg/pollen/images/* $ROOT/images
% cd $ROOT/images
% rm -rf pecan crepemyrtle
The reason for removing pecan and crepemyrtle is that they have only 12 and 16 training examples, respectively (as compared to hundreds or even thousands in some categories).
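If you want to double-check which categories are too small before deleting them, a quick count of examples per category directory works. This is a hypothetical helper, and it assumes the crops are stored as one subdirectory of .jpg files per category.

% Hypothetical sanity check: count cropped examples per category directory.
cd('/scratch/greg/pollen/benchmarking/images')
d = dir;
d = d([d.isdir] & ~ismember({d.name}, {'.','..'}));
for i = 1:length(d)
    n = numel(dir(fullfile(d(i).name, '*.jpg')));
    fprintf('%-12s %4d\n', d(i).name, n);
end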
Generate The Appearance Model
Now generate sift features for all the images, storing the results in $ROOT/sift. First start matlab:
% matlab
At the matlab prompt,
ROOT='/scratch/greg/pollen/benchmarking';
cd(ROOT)
cd images
siftpollen('../sift',getcategories);
Optional: on a machine with multiple processors, you can speed siftpollen up by a factor of ~8 or so by typing "matlabpool" first.
When siftpollen is done, go to where the SIFT features are and generate the appearance model, storing the results in $ROOT/spm. The appearance model is based on the Spatial Pyramid Matching (SPM) algorithm of Lazebnik et al. (CVPR 2006).
cd ../sift
histpollen('../spm',getcategories,100,100,200);
cd ../spm
trainpollen(100,100,200);
The 3 numbers above tell trainpollen to choose 100 training and 100 test images per category, and to cluster the SIFT features into a vocabulary of 200 words (just as Lazebnik et al. did).
The resulting histograms, match kernel and fully trained SVM are stored as --- , --- and --- (respectively).
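For intuition about what histpollen and trainpollen are doing, here is a heavily simplified sketch of the bag-of-words step underlying SPM: cluster SIFT descriptors into a 200-word vocabulary and histogram each image's descriptors against it. The variable names (train_sift, img_sift) are hypothetical, and the real pipeline also histograms over a pyramid of spatial subregions and compares histograms with an intersection-kernel SVM.

% Illustration only: a single-level bag-of-words histogram, not the full
% spatial pyramid. 'train_sift' is a hypothetical N-by-128 matrix of SIFT
% descriptors pooled over the training images; 'img_sift' is M-by-128 for
% one image.
nwords = 200;
[~, vocab] = kmeans(train_sift, nwords);   % learn a 200-word visual vocabulary
d = pdist2(img_sift, vocab);               % distance of each descriptor to each word
[~, word] = min(d, [], 2);                 % assign each descriptor to its nearest word
h = histc(word, 1:nwords);                 % bag-of-words histogram for this image
h = h / sum(h);                            % normalize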
Generate the Shape+Brightness Model
Because it matches SIFT features only, SPM is insensitive to scale and brightness. These properties are captured by a second model, which we will now train:
cd(ROOT);
cd images
makestats(getcategories,200);
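The exact feature set makestats computes isn't documented here, but statistics in this spirit can be read off a cropped grain with regionprops; the snippet below is an illustration under that assumption, with a hypothetical file name.

% Illustration only: per-grain shape and brightness statistics of the kind
% the shape+brightness model relies on; makestats' actual features may differ.
crop = im2double(imread('example_crop.jpg'));    % hypothetical cropped grain
if size(crop,3) > 1, crop = rgb2gray(crop); end  % work in grayscale
mask = crop < graythresh(crop);                  % dark grain vs. bright background
s = regionprops(bwlabel(mask), crop, ...
        'EquivDiameter', 'Eccentricity', 'MeanIntensity');
[~, biggest] = max([s.EquivDiameter]);           % keep the dominant blob
feat = [s(biggest).EquivDiameter/2, ...          % radius
        s(biggest).Eccentricity, ...             % shape
        s(biggest).MeanIntensity];               % brightness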
Testing: Original Dataset, Pre-cropped Images
First let's test stages 4 and 5. How well can the computer classify images that have already had boxes drawn around them? This is equivalent to assuming stages 2 and 3 above perform perfectly.
Then we'll be ready to find the true end-to-end performance of the complete system by classifying pollen with classifypollen(), which is what we ultimately care about.
Appearance Model
I'll start by testing the appearance model by itself. This is the model that classifies each cropped pollen grain candidate using Spatial Pyramid Matching on a dense grid of SIFT features. In other words, it knows nothing about the absolute brightness or size of any image.
The reason for starting with this model is that we have Mechanical Turk data on how humans perform on the exact same task, so it makes for an interesting comparison.
Testing the appearance SVM works like this:
cd(ROOT);
cd spm;
testpollen(100,100,200);
which gives you the confusion matrix shown at the right.
One limitation here is that testpollen asks for 100 training examples per category, but the categories bubble, chenopod, chinelm, eucalyp, fungus, jacaran, liqamber, palm, and poplar all have fewer than 100 examples. These correspond to the empty rows in the confusion matrix. Excluding these categories,
over100=[1 2 3 7 8 9 12 13 16 ...
    17 18 20 22 23 24];
svm=testpollen(100,100,200, ...
    'possible',over100, ...
    'clist',over100);
viewconfuse(svm,'labelmode',true);
gives the confusion matrix to the right. The mean of the diagonal is 67.3%.
Actual performance is slightly better than that, because confusing clutter1 with clutter2 is not a mistake. These are just two different collections of clutter examples, intended to expand the variety of clutter the algorithm can deal with. Combining clutter rows and columns together:
svm2= fixsvm(svm);
viewconfuse(svm2,'labelmode',true);
The overall appearance model performance on these 13 categories + clutter is 68.8%.
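fixsvm's internals aren't shown here, but the row/column merge it presumably performs on the confusion matrix looks roughly like this, assuming conf is a row-normalized confusion matrix with equal test counts per class and i1, i2 are the clutter1/clutter2 indices (all hypothetical names):

% Rough sketch of merging the two clutter classes in a confusion matrix.
conf(:,i1) = conf(:,i1) + conf(:,i2);        % either clutter label counts as correct
conf(i1,:) = (conf(i1,:) + conf(i2,:)) / 2;  % average the two clutter rows
conf(i2,:) = [];                             % drop the redundant clutter row...
conf(:,i2) = [];                             % ...and column
acc = mean(diag(conf));                      % overall performance (mean of the diagonal)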
Appearance Model: Outperforming Human Test Subjects?
An interesting side note: while the above appearance model might not outperform a pollen expert, it seems to outperform most non-experts.
In 2008 Merrielle Spain and I used the Amazon Mechanical Turk to compare the computer's appearance-based classification errors with human classification errors on a slightly simplified version of the above benchmark: 6 categories + confusion. Results:
- Human performance: 65%
- Computer performance: 80%
- Expert performance: 100%*
The computer outperformed the humans by 15 percentage points. And not just any humans: the best 3 humans chosen out of 12 test subjects. It is also fascinating to note that the computer and the humans confuse similar categories.
Details: human test subjects had the exact same data that the computer did, except that they had only 20 images per category instead of 100 (as a practical matter, it was impossible to show the humans 100 examples per category on the same screen at the same time). To weed out bad test subjects, we averaged only the 3 out of 12 test subjects with the best individual performance.
*Food for thought: the ground truth is established by experts, which assumes a priori that the experts have perfect performance. This is by no means certain. In fact, we have observed many interesting examples of the computer "mislabeling" pollen, causing the expert to re-think their original classification and agree with the computer.
Shape+Brightness Model
Now we'll test classification performance when using only the shape + brightness statistics (without the appearance model). As above, the classes with fewer than 100 images per category are removed, so that enough images are left over to get reliable testing statistics.
Unlike the appearance model which uses a kernel-based SVM, the shape + brightness model is a nearest neighbor model. You can test as follows:
cd(ROOT);
cd images
model=teststats('stats.mat','ntrain',100,...
    'clist',over100);
model2= fixsvm(model);
viewconfuse(model2,'labelmode',true);
Overall performance: 69.1%.
Your performance numbers may be a few percent higher or lower, since the choice of training and testing sets is randomized each time you run teststats. But they should not be fundamentally different from what you see to the right.
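For intuition, 1-nearest-neighbor classification of a single feature vector reduces to a distance computation against the training set; teststats itself may normalize features or weight dimensions differently, and all of the names below are hypothetical.

% Minimal 1-NN sketch: 'train_feats' is N-by-D, 'train_labels' is N-by-1,
% and 'x' is a 1-by-D query feature vector.
d = sum(bsxfun(@minus, train_feats, x).^2, 2);  % squared Euclidean distances
[~, nn] = min(d);                               % index of the nearest training example
predicted = train_labels(nn);                   % its label is the prediction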
Shape+Brightness vs. Appearance Models: Comparison
So the overall classification performance of the shape+brightness model is almost identical to that of the appearance model.
How about category by category: how do they compare?
Assuming you've been following along so far, you should have the variables "model2" and "svm2" in your workspace memory. Let's compare the diagonals of the confusion matrices:
perf_sb= diag(model2.conf);
perf_a= diag(svm2.conf);
plot(perf_sb,perf_a,'+')
categories= svm2.categories;
text(perf_sb,perf_a,categories,...
    'verticalalignment','bottom',...
    'horizontalalignment','right');
hold on;
plot([0 1],[0 1],'r--');
grid
labelargs= {'fontsize', 13};
xlabel('shape+brightness performance',...
    labelargs{:});
ylabel('appearance performance',...
    labelargs{:});
Those categories above the red line perform better in the appearance model, while those below are classified better with shape and brightness.
As it turns out, most categories seem to be pretty close to the line, i.e. shape and appearance are roughly equally distinctive.
DO NOT GO BEYOND THIS POINT YET
Overall Performance
To test the full classification system, we need to find the files that were not used in the training phase. We start by collecting the union of the training files used by the Shape+Brightness and Appearance models:
cd(ROOT)
cd spm
hist=load('hist100_100_200.mat');
cd ../images
stats= load('stats.mat');
ftrain1= strrep(hist.ftrain,'.mat','.jpg');
ftrain2= stats.allfiles;
ftrain= union(ftrain1,ftrain2);
These are the names of the images of cropped pollen grains. What we need are the names of the microscope slide images from which they were cropped. The mapping from cropped image names to slide names is done by crop2slide:
cd /common/greg/pollen/data/training
strain= crop2slide(ftrain);
strain= unique(strain);
So we now have a list of slides that are off-limits for testing purposes.
for i=1:length(ftrain),
  indx= first(find(ftrain{i}=='_'));                       % position of the first underscore
  ftrain{i}= strrep(ftrain{i}(indx+1:end),'.mat','.tif');  % keep everything after it, switch .mat to .tif
  ftrain{i}= strrep(ftrain{i},'_','');                     % drop the remaining underscores
end
Now go to where those microscope slides are, and get a complete list of all slides; the ones used during training will be excluded below:
cd /common/greg/pollen/data/training
fall= pickfiles(getcategories);
fall= strrep(strrep(fall,'.gz',''),'./','');
fall= findstrings(fall,'.tif');
ftrain= strrep(ftrain,'.jpg','.tif');
Other topics
Clutter vs. Non-Clutter
Work In Progress
In MATLAB:
cd ~/pollen/images
makestats3(getcategories,100);
Work In Progress
cd ~/pollen/data/training
label= mergelabels;
cd ~/pollen/data/testing
save label.mat label
Keep only images that contain pollen and aren't in the training set:
files= findpollen();
load /common/greg/pollen/spm_v1/24x24_1014/200/hist050_010_010.mat
for i=1:length(ftrain),
  indx= first(find(ftrain{i}=='_'));
  ftrain{i}= strrep(ftrain{i}(indx+1:end),'.mat','.tif');
  ftrain{i}= strrep(ftrain{i},'_','');
end
[files,ftrain]= apply(@sort,files,ftrain);
ftest= setdiff(files,ftrain);
[length(ftrain) length(files) length(ftest)]
This leaves 7068 files in ftest to be classified:
pollen= classifypollen(ftest,'verbose',0);