Pollen Detector: Benchmarking



This page benchmarks the pollen classification MATLAB function classifypollen() as it stands as of October 2010.

There are step-by-step examples of how to use the code, so hopefully this will also serve as an introductory tutorial.

- Greg Griffin






Overview Of The Algorithm




Stage 1: An expert labels pollen grains (red boxes) using the MATLAB labelGUI program. These labels serve as the ground truth for testing.


Stage 2: Smooth and threshold (blue contours) to identify large and/or dark objects.



Stage 3: Remove contours whose radius, brightness, or eccentricity fall outside the range spanned by 99% of the contours in the training set (grey). The remaining detections (blue boxes) are pollen candidates.


Stage 4: Evaluate the contour shape + brightness parameters of each candidate using a nearest-neighbor model. Candidates for which no category other than clutter is likely are dropped (gray).



Stage 5: The above shape+brightness model is combined with the appearance model to choose the final classification (green).


Flowchart summarizing the algorithm described above
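
To make Stage 2 concrete, here is a minimal sketch of a smooth-and-threshold detector using standard Image Processing Toolbox calls. It assumes a grayscale slide image and is purely illustrative; it is not the actual code inside classifypollen, and the filename and parameter values are made up.

I  = im2double(imread('slide.tif'));                              % hypothetical grayscale slide image
Is = imfilter(I, fspecial('gaussian',[15 15],3), 'replicate');    % smooth
bw = Is < graythresh(Is);                                         % dark objects become foreground
bw = bwareaopen(bw, 200);                                         % discard tiny specks
B  = bwboundaries(bw);                                            % candidate contours (cf. the blue contours in Stage 2)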


Shape+Brightness Model Details

Appearance Model Details

The Datasets

To avoid confusion, I'll refer to three distinct datasets, all located in /common/greg/pollen/data:

"Original" Dataset
Original slides used to train the pollen detector, mostly obtained in 2007
"Transitional" Dataset
Slides scanned in 2008, 2009 and early 2010 before switching to the new camera
"New" Dataset
As of October 2010 we have six days' worth of slides scanned with a new Retiga 4000R firewire camera and new micromanager software. They are in /common/greg/pollen/data/2010/RSAQMD.

Benchmarking the Original Dataset against itself is a bit like benchmarking on the Caltech-101 using all Caltech-101 images for both training and testing. The New Dataset has some differences, which I'll point out below, so testing on it is more like training on the Caltech-101 and testing on, say, the Pascal Image Dataset: a more challenging proposition.




Benchmarking



Training: Original Dataset

Here's the complete list of everything you need to do to train the system:

Copy Training Images To Scratch

This is a useful first step so that we don't have different people analyzing data in the same place, and inadvertently stepping on each other's toes. You don't have to use the scratch drive: anyplace will do. For the purposes of this analysis we'll refer to this as $ROOT, even though I'm really using

ROOT=/scratch/greg/pollen/benchmarking

We start with a set of images cropped by experts using labelGUI back in 2007. Over the years we've fixed errors and removed a lot of images that seemed to be bad examples of their visual category. The resulting data is in /common/greg/pollen/scratch/images. We'll begin by copying it over:

% mkdir -p $ROOT/images
% cp -R /common/greg/pollen/images/* $ROOT/images
% cd $ROOT/images
% rm -rf pecan crepemyrtle

The reason for removing pecan and crepemyrtle is that they have only 12 and 16 training examples, respectively (as compared to hundreds or even thousands in some categories).

Generate The Appearance Model

Now generate SIFT features for all the images, storing the results in $ROOT/sift. First start MATLAB:

% matlab

At the matlab prompt,

ROOT='/scratch/greg/pollen/benchmarking';
cd(ROOT)
cd images
siftpollen('../sift',getcategories);

Optional: on a machine with multiple processors, you can speed siftpollen up by a factor of ~8 or so by typing "matlabpool" first.
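
For example, on an eight-core machine (2010-era Parallel Computing Toolbox syntax; the worker count is just an example, and this assumes siftpollen uses parfor internally):

matlabpool open 8                        % start 8 workers
siftpollen('../sift', getcategories);    % now runs its loops in parallel
matlabpool close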

When siftpollen is done, go to where the SIFT features are and generate the appearance model, storing the results in $ROOT/spm. The appearance model is based on the Spatial Pyramid Matching (SPM) algorithm of Lazebnik et al. (CVPR 2006).

cd ../sift
histpollen('../spm',getcategories,100,100,200);
cd ../spm
trainpollen(100,100,200);

The three numbers tell histpollen and trainpollen to choose 100 training and 100 test images per category, and to cluster the SIFT features into a vocabulary of 200 visual words (just as Lazebnik et al. did).

The resulting histograms, match kernel, and fully trained SVM are stored as ---, ---, and ---, respectively.
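
As a point of reference, the "match kernel" in SPM is a histogram intersection kernel evaluated over the spatial pyramid. The following minimal sketch shows the plain single-level histogram-intersection computation, ignoring SPM's per-level weights; H is dummy data standing in for the real word histograms, not anything produced by histpollen.

H = rand(5, 200);                            % 5 dummy histograms over a 200-word vocabulary
H = bsxfun(@rdivide, H, sum(H, 2));          % L1-normalize each histogram
n = size(H, 1);
K = zeros(n);
for i = 1:n
    for j = i:n
        K(i,j) = sum(min(H(i,:), H(j,:)));   % intersection of histograms i and j
        K(j,i) = K(i,j);                     % kernel matrix is symmetric
    end
end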

Generate the Shape+Brightness Model

Because it matches SIFT features only, SPM is insensitive to absolute scale and brightness. Those properties are captured by a second model, which we will now train:

cd(ROOT);
cd images
makestats(getcategories,200);
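
For intuition, per-contour statistics of the kind used in Stages 3 and 4 (radius, brightness, eccentricity) can be computed along the following lines. This is illustrative only, not the actual makestats code; bw and I are as in the Stage 2 sketch above.

s = regionprops(bw, I, 'EquivDiameter', 'Eccentricity', 'MeanIntensity');
radius       = [s.EquivDiameter] / 2;    % contour radius
eccentricity = [s.Eccentricity];         % 0 = circle, 1 = line segment
brightness   = [s.MeanIntensity];        % mean brightness inside each contour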



Testing: Original Dataset, Pre-cropped Images

First let's test stages 4 and 5. How well can the computer classify images that have already had boxes drawn around them? This is equivalent to assuming stages 2 and 3 above perform perfectly.

Then we'll be ready to find the true end-to-end performance of the complete system by classifying pollen with classifypollen() - which is what we ultimately care about.

Appearance Model

Appearance Oct12 2010.png

I'll start by testing the appearance model by itself. This is the model that classifies each cropped pollen grain candidate using Spatial Pyramid Matching on a dense grid of SIFT features. In other words, it knows nothing about the absolute brightness or size of any image.

The reason for starting with this model is that we have Mechanical Turk data on how humans perform on the exact same task, so it makes for an interesting comparison.

Testing the appearance SVM works like this:

cd(ROOT);
cd spm;
testpollen(100,100,200);

which gives you the confusion matrix shown at the right.

Appearance Oct12 2010b.png

One limitation here is that the test assumes 100 training examples per category, but the categories bubble, chenopod, chinelm, eucalyp, fungus, jacaran, liqamber, palm, and poplar all have fewer than 100 examples. These correspond to the empty rows in the confusion matrix. Excluding these categories,

over100=[1 2 3 7 8 9 12 13 16 ...
         17 18 20 22 23 24];
svm=testpollen(100,100,200, ...
               'possible',over100, ...
               'clist',over100);
viewconfuse(svm,'labelmode',true);

gives the confusion matrix to the right. The mean of the diagonal is 67.3%.
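
That figure is just the mean of the confusion-matrix diagonal; assuming testpollen returns the confusion matrix in the conf field (the same field used in the comparison code further down), you can recompute it directly:

mean(diag(svm.conf))    % roughly 0.673 for this run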

Appearance Oct12 2010c.png


Actual performance is slightly better than that, because confusing clutter1 with clutter2 is not a mistake. These are just two different collections of clutter examples, intended to expand the variety of clutter the algorithm can deal with. Combining clutter rows and columns together:

svm2= fixsvm(svm);
viewconfuse(svm2,'labelmode',true);

The overall appearance-model performance on these 13 categories + clutter is then 68.8%.
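
Conceptually, the merge that fixsvm performs amounts to summing the clutter rows and columns of the confusion counts and renormalizing, along these lines. This is a sketch, not the actual fixsvm code; it assumes C holds raw counts, and the clutter indices i1 and i2 are made up.

C  = svm.conf;                                 % assume rows = true class, columns = predicted class
i1 = 4;  i2 = 5;                               % hypothetical indices of clutter1 and clutter2
C(i1,:) = C(i1,:) + C(i2,:);   C(i2,:) = [];   % merge the two true-class rows
C(:,i1) = C(:,i1) + C(:,i2);   C(:,i2) = [];   % merge the two predicted-class columns
C = bsxfun(@rdivide, C, sum(C, 2));            % renormalize each row to rates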

Appearance Model: Outperforming Human Test Subjects?

An interesting side note: while the above appearance model might not outperform a pollen expert, it seems to outperform most non-experts.

In 2008 Merrielle Spain and I used the Amazon Mechanical Turk to compare the computer's appearance-based classification errors with human classification errors on a slightly simplified version of the above benchmark: 6 categories + clutter. Results:

  • Human performance: 65%
  • Computer performance: 80%
  • Expert performance: 100%*

The computer outperformed the humans by 15 percentage points. And not just any humans: the best 3 humans chosen out of 12 test subjects. It is also fascinating that the computer and the humans tend to confuse similar categories.

Mechanical Turk results
Computer results
Idealized expert


Details: human test subjects had the exact same data that the computer did, except that they had only 20 images per category instead of 100 (as a practical matter, it was impossible to show the humans 100 examples per category on the same screen at the same time). To weed out bad test subjects, we averaged only the 3 out of 12 test subjects with the best individual performance.

*Food for thought: the ground truth is established by experts, which assumes a priori that the experts have perfect performance. This is by no means certain. In fact, we have observed plenty of interesting examples of the computer "mislabeling" pollen, causing the expert to re-think their original classification and agree with the computer.

Shape+Brightness Model

Now we'll test classification performance using only the shape + brightness statistics (without the appearance model). As above, the classes with fewer than 100 images are removed so that enough images remain to get reliable testing statistics.

Shapebrigthness Oct19 2010.png

Unlike the appearance model, which uses a kernel-based SVM, the shape + brightness model is a nearest-neighbor model. You can test it as follows:

cd(ROOT);
cd images
model=teststats('stats.mat','ntrain',100,...
                   'clist',over100);
model2= fixsvm(model);
viewconfuse(model2,'labelmode',true);

Overall performance: 69.1%.

Your performance numbers may be a few percent higher or lower, since the choice of training and testing sets is randomized each time you run teststats. But they should not be fundamentally different from what you see to the right.
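
If you want a repeatable split, you can seed MATLAB's random number generator before calling teststats (this assumes teststats draws its split from the default generator; the syntax below is the 2010-era form):

rand('twister', 12345);                                            % fix the seed
model = teststats('stats.mat', 'ntrain', 100, 'clist', over100);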

Shape+Brightness vs. Appearance Models: Comparison

So the overall classification performance of the shape+brightness model is almost identical to that of the appearance model.

How about category by category: how do they compare?

Assuming you've been following along so far, you should have the variables "model2" and "svm2" in your workspace. Let's compare the diagonals of the two confusion matrices:

Comparison Oct19 2010.png
perf_sb= diag(model2.conf);                 % per-category shape+brightness performance
perf_a=  diag(svm2.conf);                   % per-category appearance performance
plot(perf_sb,perf_a,'+')
categories= svm2.categories;
text(perf_sb,perf_a,categories,...          % label each point with its category name
        'verticalalignment','bottom',...
        'horizontalalignment','right');
hold on;
plot([0 1],[0 1],'r--');                    % equal-performance reference line
grid
labelargs= {'fontsize', 13};
xlabel('shape+brightness performance',...
            labelargs{:});
ylabel('appearance performance',...
          labelargs{:});

Those categories above the red line perform better in the appearance model, while those below are classified better with shape and brightness.

As it turns out, most categories lie fairly close to the line, i.e. shape+brightness and appearance are roughly equally distinctive for them.
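
You can quantify the per-category gap directly from the two diagonals:

d = perf_a - perf_sb;      % positive means appearance wins for that category
[mean(d) max(abs(d))]      % average gap and largest per-category gap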









DO NOT GO BEYOND THIS POINT YET-

TESTING SECTION UNDER CONSTRUCTION

Overall Performance

To test the full classification system, we first need to identify the files that were used in the training phase, i.e. the union of the training files used by the Shape+Brightness and Appearance models, so that they can be excluded:

cd(ROOT)
cd spm
hist= load('hist100_100_200.mat');           % appearance-model histograms and file lists
cd ../images
stats= load('stats.mat');                    % shape+brightness model statistics
ftrain1= strrep(hist.ftrain,'.mat','.jpg');  % appearance-model training files
ftrain2= stats.allfiles;                     % shape+brightness training files
ftrain= union(ftrain1,ftrain2);              % every cropped image used during training

These are the names of the cropped pollen grain images. What we need are the names of the microscope slide images from which they were cropped. The mapping from cropped-image names to slide names is done by crop2slide:

cd /common/greg/pollen/data/training
strain= crop2slide(ftrain);     % map cropped-image names to their source slides
strain= unique(strain);         % each slide only needs to appear once

So we now have a list of slides that are off-limits for testing purposes.


% Strip the category prefix and underscores from each training filename to
% recover the corresponding slide .tif name:
for i=1:length(ftrain),
   indx= find(ftrain{i}=='_', 1, 'first');
   ftrain{i}= strrep(ftrain{i}(indx+1:end),'.mat','.tif');
   ftrain{i}= strrep(ftrain{i},'_','');
end

Now go to where those microscope slides are, and get a complete list of all slides not used during training:

cd /common/greg/pollen/data/training
fall= pickfiles(getcategories);                 % every slide file, all categories
fall= strrep(strrep(fall,'.gz',''),'./','');    % clean up the filenames
fall= findstrings(fall,'.tif');                 % keep only the .tif slides
ftrain= strrep(ftrain,'.jpg','.tif');           % convert any remaining .jpg names to .tif
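
The page stops here; presumably the remaining step is to exclude the training slides from fall, something like the following (hypothetical, mirroring the setdiff used in the next section):

stest= setdiff(fall, ftrain);    % slides never seen during training, available for testing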





Other topics

Clutter vs. Non-Clutter

Work In Progress

In MATLAB:

cd ~/pollen/images
makestats3(getcategories,100);

Work In Progress

cd ~/pollen/data/training
label= mergelabels;
cd ~/pollen/data/testing
save label.mat label

Keep only images that contain pollen and are not in the training set:

files= findpollen();                           % all labeled images containing pollen
load /common/greg/pollen/spm_v1/24x24_1014/200/hist050_010_010.mat
% Strip the category prefix and underscores to recover the slide .tif names:
for i=1:length(ftrain),
   indx= find(ftrain{i}=='_', 1, 'first');
   ftrain{i}= strrep(ftrain{i}(indx+1:end),'.mat','.tif');
   ftrain{i}= strrep(ftrain{i},'_','');
end
[files,ftrain]= apply(@sort,files,ftrain);     % sort both lists
ftest= setdiff(files,ftrain);                  % keep only files not used in training
[length(ftrain) length(files) length(ftest)]

This leaves 7068 files in ftest to be classified:

 pollen= classifypollen(ftest,'verbose',0);

Testing