Pollen Notebook, Fall 2008




Oct 7


Evaluating The Appearance Model

confusion matrix for Ntrain=50
cd ~/pollen/mar/sift8;
cats=getcategories;
cats([6 7 10 14 16])=[];
histpollen('../hist8/1',cats,50,50,200);
trainpollen(50,50,200); 
svm1= testpollen(50,50,200);
share pollen_svm_Oct7 svm1

I generate 4 such confusion matrices svm[1234] in the corresponding directories hist8/[1234] and average them to get the confusion matrix shown here:

conf=(svm1.conf+svm2.conf+svm3.conf+svm4.conf)/4;
svm= svm1;
viewconfuse(svm);

Categories eucalyp, fungus, liqamber, palm, poplar were removed because they had fewer than 100 training examples.

Bottom line:

  • This looks reasonably good to me, especially since we're only using ntrain=50 (the real model uses ntrain=200).
  • If any of the 5 categories that were left out are a problem, maybe makesvm is having some problem with categories smaller than ntrain=200. Possible reasons:
    • Bug in makesvm
    • Expected svm bias, which should be corrected (see the variable ntrain_bias in classifypollen.m)


Bug Fix

In classifypollen the appearance model wasn't doing anything, due to a problem in getsift when the input img is not of type uint8. See the subversion log for details (revision 1877).
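A minimal sketch of the kind of guard that would prevent this, assuming getsift's thresholds expect 8-bit intensities (an illustration only, not the actual revision-1877 fix):

if ~isa(img,'uint8')
    img= im2uint8(mat2gray(img));  % rescale to [0,255] and cast to uint8
end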


March Counts

cd ~/pollen/data/03-25-08; 
files= findpollen(); 
tic; 
[g,t]= classifypollen(files,'displaymode',1,'shape_bias',xxx,'clutter_bias',xxx); 
toc/3600
Counts Oct7.png

With code revision 1878 and the following parameter values:

id  machine    shape_bias  clutter_bias
 1  vision315      .25          .20
 2  vision305      .40          .20
 3  vision303      .25          .40
 4  vision302      .40          .40
 5  vision301      .25          .60
 6  vision308      .40          .60
 7  vision107      .25          .80
 8  vision121      .40          .80

Variables g[1-8] and t[1-8] can be obtained via take pollen_mar_Oct8. Results:

>> viewcount({'expert','1','2','3','4','5','6','7','8'},t1,g1,g2,g3,g4,g5,g6,g7,g8)
 1        birc:        3      12      14      15      15      13      12       8      14
 2        bubb:       --     306     189     289     180     261     170     247     143
 3        chen:       --       1       0       1       0       1       0       1       0
 4        clut:       --     712     823     844     907     957     990    1031    1058
 5        cypr:        3      20      14      16      15      12      10      11      11
 6        euca:        1       3       1       2       2       0       0       1       1
 7        fung:       --      92      55      58      35      25      21      19      13
 8        gink:        3      83     119      67      96      66      80      64      82
 9        gras:        2      17      24      14      24      18      22      14      23
10        liq.:        3       7       0       4       1       3       1       2       3
11        liqa:       --       7       0       4       1       3       1       2       3
12        mulb:       37      97     141      81     118      66     103      53      85
13         oak:       57      56      73      51      66      50      65      43      52
14        oliv:       --      76      39      58      36      37      32      25      24
15        palm:       --       0       0       0       0       0       0       0       0
16        pine:        8     100      80      85      76      74      71      62      71
17        popl:        7       0       0       0       0       0       0       0       0
18        sage:        8       3       3       2       3       1       4       2       4
19        syca:       13      --      --      --      --      --      --      --      --
20        unkn:       47      --      --      --      --      --      --      --      --
21        waln:        3       7      10       5       9       5       9       6       9


September Counts

Preprocessing August data on vision311 and vision310:

cd ~/pollen/aug/images8; 
siftpollen('../sift8',getcategories,xxx,2);
histpollen('../hist8',getcategories,200,0,200);
trainpollen(200,0,200);

The intended target is the 9-22-08 set, which (I believe?) was actually taken on 8-28-08 (confirm this with Jim). For now I am just creating a Chinese elm detector, with the following categories in ~/pollen/aug/images8:

1 bubble    2 chinelm   3 clutter   4 fungus


Things to Try Next

1. Lots of errors at the edges, where the pollen got truncated. Might get better counts by removing those detections that fell near the edge. In principle these images don't overlap... but is this true? (confirm)

2. Now that appearance model bug is fixed, try reinstating the following in classifypollen:

possible= unique([guesses{i} clutter.id]);

3. I still see it picking up a lot of large dim objects. Maybe a good statistic would be mean brightness, i.e. total luminosity / total area (see the sketch below)? A large dim object (e.g. a smudge) is more suspicious than a small dim object (e.g. mulberry).
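A minimal sketch of that statistic, assuming img is the grayscale image and mask is a logical mask of the detection (both names hypothetical):

% mean brightness = total luminosity / total area; a large dim smudge
% scores low, while a small dim grain scores closer to normal
lum= double(img(mask));           % intensities inside the detection
meanbright= sum(lum)/nnz(mask);   % equivalently, mean(lum)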


Interesting Test Image

Image 72+/-1 (of the March files I'm running counts on) is really densely clumped. This could be useful for seeing (in a single image) whether my clutter_bias or other parameters are in the right ballpark. Looking at it, I already get the feeling that my clutter_bias may need to be larger than .60.


Oct 8

Continuing to Explore Parameter Space

id  machine    shape_bias  clutter_bias
 9  vision315      .25          1.2
10  vision305      .40          1.2
11  vision303      .55          1.2
12  vision302      .25          1.8
13  vision301      .40          1.8
14  vision308      .55          1.8
15  vision107      0            1.2
16  vision121      0            1.8
>> viewcount({'expert','16','14','13','12','11','10','9','8','7','6','5','4','3','2','1'},t1,g16,g14,g13,g12,g11,g10,g9,g8,g7,g6,g5,g4,g3,g2,g1)
 1        birc:        3       0       9       4       0      13      12       6      14       8      12      13      15      15      14      12
 2        bubb:       --     225      95     123     155      94     131     207     143     247     170     261     180     289     189     306
 3        chen:       --       0       0       0       0       0       0       0       0       1       0       1       0       1       0       1
 4        clut:       --    1338    1219    1286    1348    1096    1156    1197    1058    1031     990     957     907     844     823     712
 5        cypr:        3       2       4       3       2       6       6       3      11      11      10      12      15      16      14      20
 6        euca:        1       0       0       0       0       0       0       0       1       1       0       0       2       2       1       3
 7        fung:       --       1       0       0       1       2       2       4      13      19      21      25      35      58      55      92
 8        gink:        3       3      41      30      11      71      66      41      82      64      80      66      96      67     119      83
 9        gras:        2       8      24      21      14      33      15      11      23      14      22      18      24      14      24      17
10        liq.:        3       0       0       0       0       0       2       1       3       2       1       3       1       4       0       7
11        liqa:       --       0       0       0       0       0       2       1       3       2       1       3       1       4       0       7
12        mulb:       37       2      43      30      17      87      55      30      85      53     103      66     118      81     141      97
13         oak:       57       0      52      30      13      61      51      30      52      43      65      50      66      51      73      56
14        oliv:       --       1      15       8       3      25      19      13      24      25      32      37      36      58      39      76
15        palm:       --       0       0       0       0       0       0       1       0       0       0       0       0       0       0       0
16        pine:        8      11      76      56      29      87      69      49      71      62      71      74      76      85      80     100
17        popl:        7       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
18        sage:        8       0       2       1       1       3       2       0       4       2       4       1       3       2       3       3
19        syca:       13      --      --      --      --      --      --      --      --      --      --      --      --      --      --      --
20        unkn:       47      --      --      --      --      --      --      --      --      --      --      --      --      --      --      --
21        waln:        3       0       6       4       2      11       7       6       9       6       9       5       9       5      10       7

What seems to work best so far is 16, where there is no shape bias at all. Maybe instead of the shape bias approach, I should use the shape information (i.e. a nearest-neighbor filter on loop and brightness parameters) to restrict the search carried out by the appearance model.

So try this:

  • switch to pairwise svm, see if this gets similar results
  • switch to using bias as listed above (which, at the moment, only works in pairwise mode I think?)

Notes (from Meeting w/ Jim)

Need at least a smattering of small features to catch things like beaded walls (e.g. olive). On the other extreme: perhaps they are well-described by just one BIG sift feature? I need to consider feature size more carefully.

The word "weed" includes: sagebrush (artemisia), chenopod, ambrosia (ragweed), rumex.

Oct 9

Modified classifypollen

Code revision 1883 includes the following modifications:

  • New attribute strict can have 3 values
    • 0: nearest-neighbor stage merely weights classifications
    • 1: nearest-neighbor filters final classification
    • 2: same as 1, but clutter category is also filtered
  • Default values (first 3 seen below)
 defaults= struct(            ...
      'clutter_bias',   1.00 , ...
      'shape_bias'  ,   0.20 , ...
      'radius_bias' ,     5  , ...  % (remaining fields omitted)
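A rough sketch of how the three strict modes could act on the per-category scores (scores, nnok, and isclutter are assumed names for illustration, not the actual implementation):

% nnok marks categories passed by the nearest-neighbor stage;
% isclutter marks the clutter category
switch strict
  case 0  % nn stage merely weights the classifications
    scores= scores .* (1 + shape_bias*nnok);
  case 1  % nn stage filters the final classification
    scores(~nnok & ~isclutter)= -inf;
  case 2  % same as 1, but clutter is filtered too
    scores(~nnok)= -inf;
end
[mx,guess]= max(scores);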


Tests

cd ~/pollen/data/03-25-08
files= findpollen();
[g,t]= deal({}); 
for i=1:3, [g{i},t{i}]= classifypollen(files,'strict',i-1, xxx);
end

Where xxx is replaced with the following parameters:

vision310: 'radius_bias',3
vision309: 'radius_bias',5
vision302: 'radius_bias',5,'clutter_bias',0.8

This gives a total of 9 runs, stored in pollen_mar_Oct10.

Results

>> viewcount({'expert','310 0','310 1','310 2','308 0','308 1','308 2','302 0','302 1','302 2'},t310{1},g310{1},g310{2},g310{3},g308{1},g308{2},g308{3},g302{1},g302{2},g302{3});
                  expert   310 0   310 1   310 2   308 0   308 1   308 2   302 0   302 1   302 2
 1        birc:        3       6       6      21       7       6      20       5       9      22
 2        bubb:       --      86      19      20      81      28      26      98      30      30
 3        chen:       --       1       2      12       2       0      10       1       1      11
 4        clut:       --     844     916     691     866     922     701     772     861     688
 5        cypr:        3      16      19      31      16      20      31      19      21      28
 6        euca:        1       0       1       2       0       2       2       2       2       5
 7        fung:       --       7       0       0       8       0       0      10       0       1
 8        gink:        3       8       2       4       6       1       4       7       4       7
 9        gras:        2       3       6      12       2       7      12       3       4      12
10        liq.:        3       2       3      13       1       2       9       1       2       9
11        liqa:       --       2       3      13       1       2       9       1       2       9
12        mulb:       37      87      99     199      78      86     189     113     120     191
13         oak:       57       8      11      31      11      17      32      19      21      31
14        oliv:       --      29      21      47      19      17      43      31      23      49
15        palm:       --       0       0       2       1       0       1       0       0       2
16        pine:        8      46      35      33      46      32      35      59      39      39
17        popl:        7       0       0       0       0       0       0       0       0       0
18        sage:        8       5       5      16       4       5      19       5       6      17
19        syca:       13      --      --      --      --      --      --      --      --      --
20        unkn:       47      --      --      --      --      --      --      --      --      --
21        waln:        3      10      12      25       9      12      28      12      17      26

Oct 12

Improvements to implement:

  1. Make sure training (and testing?) sets don't have interrupted loops
    1. current center is just the mean of contour points
    2. but contour points may not be smoothly distributed (i.e. gaps)
    3. this can happen also with "post-pinching"
    4. also caused by training sets being cropped (when running getstats)
    5. determine radius, ellipticity and jaggedness in a less fragile way?
    6. maybe compute loop stats in a totally different way (i.e. use the FFT of r, first few components)
  2. Of nearest n neighbors, remove those that are too far away
    1. adds an additional parameter nnlimit
  3. Use pairwise multi-class comparisons and limit the pairs
    1. better discriminative power?
  4. Add old clutter set back in as additional category (see mar/images9) DONE
    1. call it clutter1 and the current one clutter2
    2. right now nn filtering stage often misses the clutter category
  5. Remove detections that overlap right and bottom edges FIXED
    1. currently only removing those on left and top
    2. their ellipticity is wrong, so they get misclassified as pine/eucalyptus??
  6. Better choice of sift grid DONE
    1. regular scales instead of random
    2. how discriminative is it to just use the entire box as one big feature?
    3. Different appearance models based on large, medium and small features
    4. lower cutoff threshold (right now misses, for example, structure inside pine)
    5. different cutoffs for getsift if it is a file or cropped image?

SIFT features extraction (#6 above)

Going to try extracting features at 3 different scales and generating 3 different appearance models. This has a few advantages:

  • Find out which level (if any) is doing the most work
  • Capture a broader range of scales without "diluting" them during matching
    • For example, Jim often points at very small features in walls
    • Each scale will now have its own customized vocabulary
  • May need fewer features overall, therefore faster extraction at test-time
getsift(f,'scales',[16 14],'s',[12 12],'frame',.5);
getsift(f,'scales',  [8 7],'s',[18 18],'frame',.2);
getsift(f,'scales',  [4 3],'s',[24 24],'frame',0);

Which look like this:

Sift16 14.png
Sift8 7.png
Sift4 3.png


See siftpollen.m: (keeps backwards-compatibility with current syntax)

switch feature_mode
  case 0 % features used originally
    scales= [6 2];   grid= [48 48];  frame=1.0;
  case 1 % larger scale (worked better)
    scales= [8 4];   grid= [48 48];  frame=1.0;
  case 2 % new large scale
    scales= [16 14]; grid= [24 24];  frame=0.5;
  case 3 % new medium scale
    scales= [8 7];   grid= [24 24];  frame=0.2;
  case 4 % new small scale
    scales= [4 3];   grid= [24 24];  frame=0.0;
end

Note: grid values must be divisible by 4 in SPM.

Testing these features now:

cd ~/pollen/mar/sift9/large; 
cats=getcategories; 
cats([7 8 11 15 17])= []; 
rand_seed(1); % very important if summing match kernels later!
histpollen('../hist9/large/1',cats,50,50,100);
trainpollen(50,50,100); 
svm1= testpollen(50,50,100);


...and so on for medium and small. The thinking is that, since the features are not being smeared over many scales, nwords=100 will now capture enough distinctive features (not to mention, it's faster).

Oct 14

Meeting with Jim

On meeting with Jim and talking this over, I think I have a notion of why the small features seem to contribute very little. I'm adding a parameter thresh=0.5 which will keep it from matching images based on how much whitespace they contain. As you might imagine, this will vary a bit between the training set (where boxes are determined by a human) and the testing set.

See examples below:

f='palm/April06c_00190.jpg'
getsift(f,'scales',  [4 3],'s',[24 24],'frame',0,'thresh',xxx);
thresh=0.0
thresh=0.5
thresh=1.0


Note that larger-scale features are hardly affected because the gaps are usually so small that bigger features don't see them.

I'm going to use thresh=0.5 because it seems important to keep some inner structure (in the mulberry for example). The disadvantage would be a little less emphasis on the pollen walls. But still much more emphasis than you get with thresh=0.0!

Never Mind

Turns out I was already using thresh=0.9 (in both siftpollen and classifypollen). This should already be putting the emphasis on the pollen walls. I could still try thresh=0.5, but it is doubtful that this would make a huge difference. It might help with things like mulberry, which is currently a problem category.

Going to run sift9/wide2, which corresponds to feature_mode=6 and focuses a little more on the interior, just for comparison. Also switching from scale=[16 4] to scale=[12 4] in the hopes of putting the interior texture to some use.

Loop Statistics

Much-improved loop statistics now make use of fapfft and capture the first few Fourier components of the radius.

cd ~/pollen/mar/images9; 
cats= getcategories; 
makestats(cats,500);
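The core idea, sketched for a single contour (assuming x,y are contour points ordered and roughly uniform in angle; all names hypothetical):

cx= mean(x); cy= mean(y);        % contour centroid
r= hypot(x-cx, y-cy);            % radius at each contour point
R= abs(fft(r))/length(r);        % magnitude spectrum of r(theta)
loopstats= R(2:5)/R(1);          % R(1) ~ mean radius, so ratios are scale-free

Low components capture ellipticity; higher ones capture jaggedness.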

One rough way to see if this is working is to plot just the first 3 principal components:

load stats.mat
[xmean,xstd]= deal(mean(stats),std(stats));
stats= rescale(stats,xmean,xstd);
[eigenvecs,z,eigenvals]= princomp(stats);
newstats= stats*eigenvecs;
f1= find(cats<=10);
f2= find(cats>10);
hold off;
scatter3(newstats(f1,1),newstats(f1,2),newstats(f1,3),20,cats(f1),'filled');
hold on;
scatter3(newstats(f2,1),newstats(f2,2),newstats(f2,3),20,cats(f2)-10);
plot(eigenvals,'*');

The thought here is that, since the first 4 components contain most of the information, I should do the nearest-neighbor search in this space. In fact, as long as we're going to bother projecting the stats into another space, we might as well try Fisher LD.
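A sketch of that nearest-neighbor query in the truncated PCA space, continuing from the variables above (newloopstats is a hypothetical query row):

k= 4;                                                  % top-4 components
train= newstats(:,1:k);                                % projected training stats
q= rescale(newloopstats,xmean,xstd)*eigenvecs(:,1:k);  % projected query
d2= sum(bsxfun(@minus,train,q).^2,2);                  % squared distances
[d2min,nn]= min(d2);
guess= cats(nn);                                       % category of nearest stat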


Oct 15

New nearest-neighbor model (using PCA-ed loop statistics) is kicking ass.

3-25-08_1036.tif: example of finding a mulberry that was missed?

Next:

  • shape model apparently lacks big bubble pieces (3-25-08_1170.tif). Keeps calling pine.
    • or, alternatively, add clutter_bias to clutter1, clutter2 AND bubble
  • add sycam to training set, remove olive and palm
301: cd ~/pollen/mar/images10;  siftpollen('../sift10',getcategories,0,2,5);
305: cd ~/pollen/mar/images10;  siftpollen('../sift10',getcategories,1,2,5); 

note: have not added big bubble pieces yet, but adding them will only require a quick incremental rerun of siftpollen.

Oct 16

Final Flight Check: Appearance Model

[greg@vision304 images10]$ cat DUWC             >> showstrings(categories,2)
627      birch                                   1 birch      2 bubble
105      bubble                                  3 chenop     4 clutter1
102      chenop                                  5 clutter2   6 cypress
3492     clutter1                                7 eucalyp    8 fungus
803      clutter2                                9 ginkgo    10 grass
381      cypress                                11 liqamber  12 mulberry
64       eucalyp   % < 100                      13 oak       14 pine
78       fungus    % < 100                      15 poplar    16 sageb
241      ginkgo                                 17 sycam     18 walnut
562      grass                            
62       liqamber  % < 100
576      mulberry
483      oak
1942     pine
53       poplar    % < 100
197      sageb
452      sycam
540      walnut
confusion matrix for Ntrain=50
cd ~/pollen/mar/sift10/; 
cats=getcategories; 
cats([7 8 11 15])= []; % kill categories too small to train/test                                                                          
rand_seed(1);
histpollen('../hist10/1',cats,50,50,200);
trainpollen(50,50,200); 
svm1= testpollen(50,50,200);

...and so on for run 2, yielding the mean confusion matrix at left.

Presumably the actual appearance model will have better performance because it uses Ntrain=200.


Generate New Improved Appearance and Shape Models (March)

If we don't need to split datasets into train/test, we can now afford to use more training examples:

cd ~/pollen/mar/sift10/;                        cd ~/pollen/mar/images10;
cats=getcategories;                             cats=getcategories;
rand_seed(1);                                   rand_seed(1);
histpollen('../hist10/1',cats,200,0,200);       makestats(getcategories,200);
trainpollen(200,0,200);

There's nothing sacred about rand_seed. I'm just trying to make the results reproducible. Also, there's some concern that if I used npercat=1000 in makestats (as I originally did) it heavily biases large categories, to the exclusion of those with <100 elements (including fungus and bubble).

For the record: I'm at subversion code revision 1896.

Oct 17

Last Night's Counts

cd ~/pollen/data/03-25-08
files= findpollen();
[gxxx,txxx]=classifypollen(files,'displaymode',false,'verbose',0);

With the following extra parameters:

vision310: (defaults)
vision304: 'shape_bias',.25
vision302: 'clutter_bias',.5
vision112: 'radius_bias',10

This gives a total of 4 values, stored in pollen_mar_Oct17:

>> viewcount({'expert','310','304','302','121'},t310,g310,g304,g302,g121);
                  expert     310     304     302     121
 1       birch:        3       6       8       6      10
 2      bubble:       --      23      23      22      25
 3      chenop:       --       2       0       0       0
 4    clutter1:       --      84      97      80      92
 5    clutter2:       --     505     515     540     463
 6     cypress:        3       8       8       5       9
 7     eucalyp:        1       2       1       2       4
 8    eucalyp.:        1      --      --      --      --
 9      fungus:       --       1       2       1       2
10      ginkgo:        3      10      11       7       8
11       grass:        2       5       6       4       5
12    liq.ambe:        3      --      --      --      --
13    liqamber:       --       2       1       1       0
14    mulberry:       37      36      25      28      29
15         oak:       57      32      29      24      30
16        pine:        8      16      14      14      20
17      poplar:        7       0       0       0       0
18       sageb:        8       3       3       2       3
19      sageb.:        8      --      --      --      --
20       sycam:       13      12      10       9       7
21      sycam.:       13      --      --      --      --
22     unknown:       47      --      --      --      --
23      walnut:        3       3       4       4       6

The "Final" March Count

None of the parameters I added last night helped performance, so I'll stick with the defaults (the run on vision310):

viewcount({'expert','computer'},[2 4 5 8 13 19 21 22],t310,g310);
(excluding clutter, bubbles and other garbage)


              expert  computer
birch              3         6
bubble            --        23
chenop            --         2
clutter1          --        84
clutter2          --       505
cypress            3         8
eucalyp            1         2
fungus            --         1
ginkgo             3        10
grass              2         5
liqamber           3         2
mulberry          37        36
oak               57        32
pine               8        16
poplar             7         0
sageb              8         3
sycam             13        12
unknown           47        --
walnut             3         3


Revised Appearance Model

Using sift features of type 1 instead of type 5 (I'm not sure the type 5 features are performing very well).

confusion matrix for Ntrain=50
cd ~/pollen/mar/sift10/1; 
cats=getcategories; 
cats([7 8 11 15])= []; % remove smaller categories                                                                   
rand_seed(1);
histpollen('../hist10/1/1',cats,50,50,200);
trainpollen(50,50,200); 
svm1= testpollen(50,50,200);


confusion matrix for Ntrain=50



Here is the older Oct 16 model performance for comparison


I have yet to try this on the final counts, to see if they improve.

Meanwhile in August, Month of the Chinese Elm

classifypollen(files,'traindir','aug','histdir','hist8/1', ...
               'statdir','images8','shape_bias',.25, ...
               'showloops',false,'htmldir','html');

Nov 18 : Stuff for Demo

March 2008

Rerunning sift features from scratch, just to make sure.

images20: a copy of images8, except that clutter has been renamed clutter2 and images7b/clutter has been added as clutter1. (This was previously done in a hackish way with links in mar/images10.)
images21: same as above, except that oak of questionable size has been moved to oak/questionable, so it is no longer part of the training set.
cd ~/pollen/mar/images20; 
rand_seed(1); 
makestats(getcategories,400);
siftpollen('../sift20',getcategories);
histpollen('../hist20/1',getcategories,200,0,200);
trainpollen(200,0,200);

Classification next:

cd ~/pollen/data/03-25-08
files= findpollen();
displayargs=  {'displaymode',false, ...
               'verbose',0};
classifyargs= {'traindir','mar', ...
               'histdir','hist20/1', ...
               'statdir','images20', ...
               'trainsuffix','200_000_200'};
[gxxx,txxx]=classifypollen(files,displayargs{:},classifyargs{:});

Now repeat the same thing, replacing 20 with 21. This will tell me if tweaking the oak training set helped at all.

Demo

For the following sets:

  • Aug 2008
  • Mar 2008
  • Jan 2008

Show

  • classifypollen in action (realtime)
  • Final histogram counts
cd ~/pollen/data/09-22-08
files= findpollen();
classifyargs= { 'traindir'     'aug'   ...
                'histdir'    'hist8'   ...
                'statdir'  'images8'   ...
                'shape_bias'    .25    ...
                'trainsuffix', '200_000_200' };
classifypollen( files, classifyargs{:});
take pollen_aug_Nov17
taug.types{1}= 'chinelm';
viewcount({'expert','computer'},[1 3 5 9],taug,gaug);
zlim([0 120])
cd ~/pollen/data/03-25-08
files= findpollen();
classifyargs= { 'traindir'     'mar'    ...
                'histdir'  'hist20/1'   ...
                'statdir'  'images20'   ...
                'trainsuffix', '200_000_200' };
classifypollen( files, classifyargs{:});
take pollen_mar_Oct17
viewcount({'expert','computer'},[2 4 5 8 13 19 21 22],t310,g310);
zlim([0 60]);
take pollen_mar_Nov18
viewcount({'expert','computer'},[2 4 5 8 13 19 21 22],t20,g21b);
zlim([0 60]);

March: What Now?

  • Which is the best appearance + shape model?
  • Benchmark with jan/feb/mar datasets
    • Size
    • Appearance
    • Size + Appearance
  • How to make improvements?
    • Dataset modifications?
    • Use nbnn + spm?
    • Parameter search
    • Add different features?
  • Simplify
    • Instead of generating separate jan/feb/mar etc. directories
    • Have one 1-vs-1 model, and mark unwanted categories
    • Sanity check: does 1-vs-1 work as well as 1-vs-all?

Running the Ubermodel

On 16 machines at once (approx 30 minutes per machine):

cd ~/pollen/images
siftpollen('../sift',getcategories,0,16);

On one of the 8-processor machines, histogramming takes about 15 minutes (utilizing ~600% of the CPU) and generating the match kernel takes another 15 minutes (~430-530% utilization). That's about 25-28 kilomatches per second, with one other process running on the machine besides mine.

matlabpool  % using MATLAB 2009a I now get 8 labs
cd ~/pollen/sift
rand_seed(1);
histpollen('~/pollen/hist/1',getcategories,200,0,200);
trainpollen(200,0,200);

Unfortunately I am not utilizing parfor yet in makesvm or any of its daughter routines, so its utilization is still just 100%.
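The conversion would presumably look something like this, assuming the per-iteration work is independent (trainpair, npairs, pairs, and K are all hypothetical names):

svms= cell(1,npairs);
parfor i= 1:npairs
   svms{i}= trainpair(K,pairs(i,:));  % hypothetical per-pair trainer
end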

Testing the Appearance Model

>> cd ~/pollen/images
>> cats= getcategories;
>> showstrings(cats)        >> !duwc [a-z]*  
   1 alder                     723      alder          
   2 ash                       555      ash
   3 birch                     627      birch
   4 bubble                    105      bubble
   5 chenop                    102      chenop
   6 chinelm                   2596     chinelm
   7 clutter1                  3492     clutter1
   8 clutter2                  803      clutter2
   9 crepemyrtle               19       crepemyrtle     % <100
  10 cypress                   381      cypress
  11 eucalyp                   64       eucalyp         % <100
  12 fungus                    78       fungus          % <100
  13 ginkgo                    241      ginkgo
  14 grass                     562      grass
  15 jacaran                   79       jacaran         % <100
  16 liqamber                  62       liqamber        % <100
  17 mulberry                  576      mulberry
  18 oak                       420      oak
  19 olive                     842      olive
  20 palm                      85       palm            % <100
  21 pecan                     12       pecan           % <100
  22 pine                      1942     pine   
  23 poplar                    53       poplar          % <100
  24 sageb                     197      sageb
  25 sycam                     452      sycam
  26 walnut                    540      walnut

Categories 9,11,12,15,16,20,21,23 are too small.

cd ~/pollen/sift;
rand_seed(1);
cats=getcategories;
cats([9 11 12 15 16 20 21 23])=[];
histpollen('~/pollen/hist/1',cats,50,50,200);
trainpollen(50,50,200); 
svm1= testpollen(50,50,200);

Groupings

Confuse pollen Mar26 2009.png

What sort of groupings are found based on interconfusion?

cd ~/pollen/hist/1
[conf,cats]=fromstruct(svm1,'conf','categories');
clist= 1:length(cats);
groups= makegroups(conf,clist,2,5); 
viewgroups(conf,groups([1 4 3 2 5]),cats);


NBNN

cd ~/pollen/sift;
rand_seed(1);
cats=getcategories;
cats([9 11 12 15 16 20 21 23])=[];
makenn('~/pollen/nn/1',cats,50,50);
makenbnn('.',50,50,25);
runnbnn(50,50,25);

optionally,

cd ~/pollen/sift;
makenn('~/pollen/nn/1',cats,50,50,'teston','train');
makenbnn('.',50,50,25,'teston','train');
runnbnn(50,50,25,'teston','train');

Non-test Model: The real deal

Just to recap:

cd ~/pollen/mar/sift;                           cd ~/pollen/mar/images;
cats=getcategories;                             cats=getcategories;
rand_seed(1);                                   rand_seed(1);
histpollen('~/pollen/hist/1',cats,200,0,200);   makestats(getcategories,500);
trainpollen(200,0,200);

How to improve on the old way?

  • Add NBNN
  • Combine multiple SPM models (but are they entirely redundant??)
  • Add different features

02-19-09

Class              Linecount  LabelGUI  Computer v1  Computer v2  Computer v3
alder                  2         1
ash+privet             4+6       3
birch                  3         2
cypress                5         4          5            4            3
eucalyp                3         1
grass                  2         NA         3            1
olive                  1         1          1
oak                    3         1          1            1
palm                   4         1
pine                  13         7         16           17            6
poplar+cottenwood      5         4
sycam=planetree        3         6          6            6            3
unknown                2         8         NA

v2: ntrain_bias=0.5; shape_bias=0.2; clutter_bias=0.0;
v3: ntrain_bias=0.5; shape_bias=0.2; clutter_bias=-0.0

Here is a more complete record of the options that are in use for the above classifypollen tests.

April 7, 2009

load model.mat
boxes= classifypollen(findpollen,'verbose',true,'clutter_bias',0,...
                      'shape_bias',1.0,'possible',possible,'strict',1,...
                      'radius_bias',3,'outdir','images');

April 10, 2009

  • Cross-validation
    • Use half the dataset to classify the other half.
    • Do this repeatedly and gather statistics.
    • Iteratively remove the worst offenders.
      • What do they look like?
  • Shape stats before and after the above process
    • Are the stat distributions tighter?
    • Can stats be used to cull the training data even more?
    • Iteratively remove n-sigma outliers?
  • Try new dataset against Jim's most recent datasets (02-19-09, etc.)
    • Better performance?

Make master hist file (step 1)

cd ~/pollen/sift 
cats= getcategories 
files= pickfiles(cats);
makehist('~/pollen/spm',{},'999_999_200.mat',files,{});

Investigate data pruning based on size

Pollen Apr 10 2009.png
subplot(26,1,1)
n= max(cats);
for i=1:n,
  f= find(cats==i); subplot(13,2,i);
  hist(stats(f,1),0:150);
  xlim(0,150); title(categories{i});
end
scalesubplots(1.2,2.0);

There are some questionable areas, outlined in red. I need to:

  • scale these into units of microns
  • overlay the limits we found in the classification manual.
  • should I prepend the radius (in microns) to the filename, for convenience?
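For the first bullet, the conversion itself is trivial, assuming a known microns-per-pixel calibration (the constant below is made up):

um_per_px= 0.5;                    % assumed microscope calibration
radius_um= stats(:,1)*um_per_px;   % assuming stats(:,1) is radius in pixels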


sizenames

cd ~/pollen/images;
sizename('~/pollen/sizes');

Then I renamed sizes as images and put the old images in 2009/Apr10.

This allows me to easily sort by size (in gthumb) and remove size outliers that are pointed out in the histograms to the right.

Side note: one unintended consequence of this is that it makes it easy to spot duplicates. For example, there was a duplicate in ash and multiple duplicates in alder.

Problem with duplicates

Holy crow, now that I'm sorting by size with the new filename scheme, I can see that there are a ton of duplicates in some categories. Let's try to find these later: we should find, for example, that

03457_Jan06_00635.jpg
03574_Jan06_01363.jpg
03631_Jan06_00083.jpg
03724_Dec05_NoPine_00078.jpg

are all duplicates.

Conclusion: solve this later in the match stage by looking for abnormally high match kernel values.

Pruning

Meanwhile I moved about 340 images into _questionable subdirectories, mostly by looking for questionable files in the lower and upper size range for the category.

Running the New Pruned Data Set

cd ~/pollen/images;
siftpollen('~/pollen/sift',getcategories,0,4);
siftpollen('~/pollen/sift',getcategories,1,4);
siftpollen('~/pollen/sift',getcategories,2,4);
siftpollen('~/pollen/sift',getcategories,3,4);
cd ~/pollen/sift 
cats= getcategories 
files= pickfiles(cats);
makehist('~/pollen/spm',{},'999_999_200.mat',files,{});

Duplicates

semilogy(thresh,n,'x-');
grid on;

What threshold to use for identifying duplicates? Explore:

cd ~/pollen/spm;
load match999_999_200.mat;
thresh=0.5:0.02:1.0;
n= thresh*0;
nthresh= length(thresh);
for i= 1:nthresh
   printf('%d %d\n',i,nthresh);
   [x,y]=findmatch(mtrain,thresh(i)); 
   n(i)= length(x);
end

In conclusion, 0.6667 isn't bad. This is the new default for findmatch.
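So, presumably, the threshold argument can now be omitted:

[x,y]= findmatch(mtrain);   % equivalent to findmatch(mtrain,0.6667)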

Duplicate Removal

Continuing from the above,

cd ~/pollen/images;
pruneduplicates(ftrain(y));

Here are the results:

                  Apr10    Now
 1          alder  723     701
 2            ash  555     551
 3          birch  627     595
 4         bubble  105      88
 5         chenop  102      85
 6        chinelm 2596    2585
 7       clutter1 3492    3430
 8       clutter2  803     795
 9    crepemyrtle   19      16
10        cypress  381     369
11        eucalyp   64      64
12         fungus   78      78
13         ginkgo  241     225
14          grass  562     497
15        jacaran   79      74
16       liqamber   62      57
17       mulberry  576     573
18            oak  420     412
19          olive  842     811
20           palm   85      81
21          pecan   12      12
22           pine 1942    1758
23         poplar   53      53
24          sageb  197     189
25          sycam  363     359
26         walnut  540     525   

Testing The Appearance Model: Before and After

Apr10 Model
Current Model

Notice that the smallest usable category now has 85 examples instead of 100+. So I'm using ntrain=40 instead of 50.

rootdir='~/pollen/2009/Apr10'; 
% rootdir='~/pollen';
for i=1:10,
   chdir(rootdir);
   chdir('sift');
   rand_seed(i);
   pwd
   spmdir= sprintf('%s/spm/%d',rootdir,i)
   cats=getcategories;
   cats([9 11 12 15 16 20 21 23])=[];
   histpollen(spmdir,cats,40,40,200);
   trainpollen(40,40,200); 
   svm{i}= testpollen(40,40,200);
end

To view that,

svm= {}; 
for i=1:10, 
    cd(sprintf('%d',i)); 
    svm{i}=testpollen(40,40,200); 
    cd('..'); 
end 
for i=2:10, 
    svm{1}.conf= svm{1}.conf + svm{i}.conf; 
end; 
svm{1}.conf=svm{1}.conf/10;

Overall performance improved only 3% (from 63.75% to 66.95%). But this doesn't tell the whole story:

  • Out-of-sample data had been contaminated with duplicates of in-sample images
    • ~550 duplicates have been removed
    • Thus the new classification test is more difficult, and
    • Test results should now generalize better to out-of-sample sets
  • Removed many images whose shape stats were outliers
    • Better nn shape filter, presumably?

Running Some Actual Models (for use by classifypollen)

for i=1:10,
   rootdir='~/pollen';
   chdir(rootdir);
   chdir('sift');
   rand_seed(-1);
   pwd
   spmdir= sprintf('%s/spm/%d',rootdir,i)
   cats=getcategories;
   histpollen(spmdir,cats,100,0,200);
   trainpollen(100,0,200); 
end

The idea is to have classifypollen read multiple classifiers and aggregate all their margin scores to see who wins. Does this improve performance?
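A sketch of the aggregation (margins{i} is a hypothetical 1 x ncats margin vector from model i):

total= 0;
for i= 1:10
   total= total + margins{i};   % sum margin scores across models
end
[mx,winner]= max(total);        % category with the highest summed margin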

Let's try some counts and find out!

f= findpollen;
boxes= classifypollen(f,'verbose',true,'clutter_bias',-0.1,'shape_bias',0.5,...
'possible',possible,'radius_bias',3,'strict',0);
boxes= classifypollen(f,'verbose',true,'clutter_bias',-0.0,'shape_bias',0.5,...
'possible',possible,'radius_bias',3,'strict',1);

Appearance Bias

Need a new bias which biases against higher-performing categories. Pine just seems to get called more than other things... there should be something that biases the appearance margins accordingly.

Use ntrain/ntest/ntrial=40/40/10 results above to establish a useful bias. Then multiply margins by this.

At the moment things like alder and ash have a tough time getting called, because the margins are simply never all that high.
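Roughly, the proposal (margins and bias are assumed shapes for illustration, not real code):

margins= margins .* bias;   % bias > 1 boosts under-called categories
[mx,guess]= max(margins);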

Update: did not implement this. If the SVM is doing its job, it does not seem like this should matter.

Current Best Parameters?

boxes= classifypollen(f(end:-1:1),'verbose',true,'clutter_bias',0.333333,...
'shape_bias',1.0,'possible',possible,'radius_bias',1,'strict',0,...
'ntrain_bias',1.0);

Not very satisfied with the performance here. I need to go back to basics and

  • Evaluate shape model separately
  • Evaluate appearance model separately
    • Need to model blurred/non-blurred versions of each pollen?
    • Evaluate in-sample test performance vs. test performance on the 2009 datasets


How about using scores instead of votes?

Is n being incremented in the correct place in pairwise.fwd?

Apr 21

Need to evaluate two things separately, using the latest testing data. And figure out how and why they aren't working.

Out-of-sample Test Set

I'm going to start using the in-sample pre-2007 data for training and test directly against the out-of-sample 2008-2009 data. This is because I often get great performance when dividing pre-2007 data into train/test sets, but it feels like the performance is not generalizing well to the 2008-2009 data.

If that's true, let's figure out why.

For now the model is bogus for all but 02/19/09. We just want to extract test images so the classifications don't matter.

Note: for now, this just saves the expert boxes, which means no clutter category. To get one, we would have to take all boxes identified by the computer that were not labeled by the expert. That may be a logical next step. For now...

cd ~/pollen/data/02-19-09
load model.mat
args={'possible',possible,'outdir','images','displaymode',false};
% already did this on a previous data
% boxes= classifypollen(findpollen,args{:});
cd ~/pollen/data/02-22-09; boxes= classifypollen(findpollen,args{:});
cd ~/pollen/data/03-03-09; boxes= classifypollen(findpollen,args{:});

That turns out to be a pretty pitiful sampling, so I did the above for 01-08-08, 03-25-08, and 09-22-08 as well.

Now consolidate into one testing set.

mkdir ~/pollen/out
cd ~/pollen/data
for dir in 02-19-09 02-22-09 03-03-09 03-25-08 09-22-08; do
  echo $dir
  rsync -ax $dir/images/ ~/pollen/out/images/
done
cd ~/pollen/out/images; duwc *

With the result:

1        alder        33       ash            13       birch
0        chin.        101      chin.elm       15       cypress
6        eucalyp.     3        ginkgo         4        grass
3        liq.amber    42       mulberry       58       oak
2        palm         210      pine           14       poplar  
3        rumex        9        sageb.         26       sycam.
120      unknown      3        walnut

Finally, some trivial renaming is required:

cd ~/pollen/out/images
mv chin.elm chinelm
mv eucalyp. eucalyp
mv liq.amber liqamber
mv sageb. sageb
mv sycam. sycam
rm -rf chin.

Then run it like any other database:

cd ~/pollen/out/images
siftpollen('../sift',getcategories);

Note: had to remove bogus files: ash/3-3-09TL_0367_001.jpg

Finally, merge the in-sample and out-of-sample data into one big dataset. Just as there is 101, 256, and 357, there will now be (default), out, and inout:

mkdir inout
rsync -ax out/images/ inout/images/
rsync -ax out/sift/ inout/sift/
rsync -ax images/ inout/images/
rsync -ax sift/ inout/sift/

Out-Of-Sample Evaluation of Appearance Model

Analogous to:

cd ~/101/sift/80x60; cats101= getcategories; files101= pickfiles(cats101);
cd ~/256/sift/80x60; cats256= getcategories; files256= pickfiles(cats256);
cd ~/357/sift/80x60; makehist('~/357/spm/80x60',{},'101_256_200.mat',files101,files256);

we now try:

cd ~/pollen/sift; cats= getcategories; filesin=  pickfiles(cats);
cd ~/pollen/out/sift;  cats= getcategories; filesout= pickfiles(cats);
cd ~/pollen/inout/sift
makehist('~/pollen/inout/spm',{},'001_002_200.mat',filesin,filesout);

This may take a couple hours.

Bug?

Meanwhile, need to think about potential bug in NIPS2009 paper topic as well as pollen here (postmatch.m):

% get training/test class values
[ctrain,Ctrain]= file2class(ftrain);
[ctest, Ctest ]= file2class(ftest);

What if the categories in ftrain and ftest are different: are the mappings of categories to numbers still consistent across train/test?

Now I think I've fixed this in both postmatch and makesvm.
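The fix amounts to deriving one shared mapping for both sets. A rough sketch, assuming Ctrain/Ctest are the class-name lists and ctrain/ctest index into them:

allcats= union(Ctrain,Ctest);          % one shared, sorted name list
[tf,map]= ismember(Ctrain,allcats);
ctrain= map(ctrain);                   % renumber per-file training labels
[tf,map]= ismember(Ctest,allcats);
ctest= map(ctest);                     % renumber per-file test labels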

Appearance Model Results

svm= makesvm(50,10,1,'method',13); viewconfuse(svm);

Below: some test categories are entirely absent, yet the train/test category numbers seem to be consistent.

Right: Final confusion matrix

cd ~/pollen/inout/spm
postmatch(001,002,200,50,10,1);
loadmat('match',50,10,1);
top; hist(ctrain,0:29); xlim(0,30);
bot; hist(ctest,0:29); xlim(0,30);


f= find(sum(svm.conf)>0); d= diag(svm.conf); mean(d(f)), viewconfuse(svm.conf,svm.categories,f);

Actual performance across the categories we care about is more like 19%. Easier to see this way.


Learning a Bias for the Appearance Margins

Pollen Apr 22 2009 bias.png

The idea is that some kinds of pollen (pine??) get called a lot, but that some stragglers (ash?) need a little extra margin boost. Learning what this bias looks like:

[bias,perf]=makebias(5,50,10,1,'method',13); 

The danger here is: what if these biases don't extrapolate well to future datasets in different months?

Did this 4 times and averaged the results (see plot at left).


This can be found in ~/pollen/inout/spm/bias.mat:

bias= [0.950 0.850 1.225 1.000 1.000 1.378 0.825 0.900 1.000 1.150 0.988 0.869 1.000 0.900 0.800 1.091 1.150 1.207 0.972 1.150 1.000 1.184 1.000 1.038 0.944 1.000];


Now try randomizing over different samplings:

bias={}; perf=[];
for i=1:20,
  hostlabel(num2str(i));
  postmatch(001,002,200,50,10,i);
  [b,p]=makebias(5,50,10,1,'method',13);
  bias{i}= b; perf(i)= p;
end

Trying this on a 2nd machine (to see if unequal numbers of training examples per category is a factor).

bias={}; perf=[];
for i=1:20,
  hostlabel(num2str(i));
  postmatch(001,002,200,200,10,i);
  [b,p]=makebias(5,200,10,1,'method',13);
  bias{i}= b; perf(i)= p;
end

Question: are the categories that need biasing the same as the categories that had fewer than 50 samples (and are thus under-represented in training)?

Pollen Apr 22 2009 bias2.png
top;
load bias50_10
mean=bias{1}*0; 
for i=1:20, 
   plot(bias{i},'.'); hold on; mean=mean+bias{i}; 
end; 
plot(mean/20,'ro'); title('bias 50 10','fontsize',16)

bot
load bias200_10
mean=bias{1}*0; 
for i=1:20, 
   plot(bias{i},'.'); hold on; mean=mean+bias{i}; 
end; 
plot(mean/20,'ro'); title('bias 200 10','fontsize',16)

These correspond to the following categories (26 instead of 28 because rumex and unknown have no corresponding training category):

>>showstrings(categories([1:23 25 26 28]),5)
   1 alder         2 ash           3 birch         4 bubble        5 chenop
   6 chinelm       7 clutter1      8 clutter2      9 crepemyrtle  10 cypress
  11 eucalyp      12 fungus       13 ginkgo       14 grass        15 jacaran
  16 liqamber     17 mulberry     18 oak          19 olive        20 palm
  21 pecan        22 pine         23 poplar       24 sageb        25 sycam
  26 walnut

Keep in mind that the test set was pretty limited, and only the following 19 categories are represented:

>> load svm050_010_001.mat
>> showstrings(categories(unique(svm.ctest)),5)
   1 alder      2 ash        3 birch      4 chinelm    5 cypress
   6 eucalyp    7 ginkgo     8 grass      9 liqamber  10 mulberry
  11 oak       12 palm      13 pine      14 poplar    15 rumex
  16 sageb     17 sycam     18 unknown   19 walnut

This means that only the bias values for

>> useful= intersect([1:23 25 26 28],unique(svm.ctest));
>> showstrings(categories(useful),5)
   1 alder      2 ash        3 birch      4 chinelm    5 cypress
   6 eucalyp    7 ginkgo     8 grass      9 liqamber  10 mulberry
  11 oak       12 palm      13 pine      14 poplar    15 sageb
  16 sycam     17 walnut

are very useful. The others should probably be set to 1. At the end of the day, something like this:

>> useless= setdiff([1:23 25 26 28],unique(svm.ctest));
>> useless(3:4)= []; % clutter biases are useful
>> load bias50_10
>> mean(useless)= deal(1);
>> spmbias= mean;
>> sav spmbias.mat spmbias
My hope is that the above biases will improve the appearance model.

May 4

02-19-09

cd ~/pollen/data/02-19-09
load model.mat
load spmbias3.mat
pollen=classifypollen(findpollen,'possible',possible,'verbose',1,'spmbias',spmbias);

04-23-09

According to archived data from last year, the possible classifications are:
    weeds, grass, pine, oak, birch, poplar, plane (sycam), eucalypt, olive, ginkgo, palm, sagebrush, chenopod, ambrosia

The available classifications are:

   1 alder         2 ash           3 birch         4 bubble        5 chenop        6 chinelm
   7 clutter1      8 clutter2      9 crepemyrtle  10 cypress      11 eucalyp      12 fungus
  13 ginkgo       14 grass        15 jacaran      16 liqamber     17 mulberry     18 oak
  19 olive        20 palm         21 pecan        22 pine         23 poplar       24 sageb
  25 sycam        26 walnut

So

possible= sort([14 22 18 3 23 25 11 19 13 20 24 5 7 8]);

which is stored in ~/pollen/data/04-23-09/model.mat.

cd ~/pollen/data/04-23-09
load model.mat
load spmbias.mat
pollen=classifypollen(findpollen,'possible',possible,'verbose',1,'spmbias',spmbias);

To Try Next

  • Single-scale or 2-scale SIFT features (no range of randomized scales)
    • Or how about simpler, non-sift features?
    • Edges, perhaps?
  • Optimize shape stats
    • Stats act on blurred images now, so no "rattiness" metric
    • How about using edge detector instead
      • might help with bright edges ala 4-26-09_0430
  • Semi-supervised labelGUI
    • How many images can be totally avoided?
  • Optimize for parfor
    • What's taking the most time right now?
      • load (preload?)
      • convolve
      • (find loops)


May 28

Below, the xxx values tried are 050, 100, 200, 400, 800:

cd ~/pollen/16x16/sift;
cats=getcategories;
histpollen('~/pollen/hist',cats,9999,9999,xxx);
postmatch(9999,9999,xxx,50,20,1,'post',2);

   1:  701 => (  50   10)          2:  551 => (  50   10)          3:  595 => (  50   10)          4:   88 => (  50   10)
   5:   85 => (  50   10)          6: 2585 => (  50   10)          7: 3430 => (  50   10)          8:  795 => (  50   10)
   9:   16 => (  16    0)         10:  369 => (  50   10)         11:   64 => (  50   10)         12:   78 => (  50   10)
  13:  225 => (  50   10)         14:  497 => (  50   10)         15:   74 => (  50   10)         16:   57 => (  50    7)
  17:  573 => (  50   10)         18:  412 => (  50   10)         19:  811 => (  50   10)         20:   81 => (  50   10)
  21:   12 => (  12    0)         22: 1758 => (  50   10)         23:   53 => (  50    3)         24:  189 => (  50   10)
  25:  359 => (  50   10)         26:  525 => (  50   10)
bad= [9 16 21 23]; good= setdiff(1:26,bad); viewconfuse(svm,{},good);
showstrings(cats,4)
 
   1 alder         2 ash           3 birch         4 bubble
   5 chenop        6 chinelm       7 clutter1      8 clutter2
   9 crepemyrtle  10 cypress      11 eucalyp      12 fungus
  13 ginkgo       14 grass        15 jacaran      16 liqamber
  17 mulberry     18 oak          19 olive        20 palm
  21 pecan        22 pine         23 poplar       24 sageb
  25 sycam        26 walnut

trainpollen(50,10,1); 
svm= testpollen(50,10,1);

Since certain categories like olive hardly ever get called, try biasing. Note: changed dbias from 0.8 to 0.5 because it did not seem to have enough range to find the optimal bias.

[bias,perf]=makebias(10,50,10,1,'method',13);


Optimal value for nword

clf; plot(perf(:,1),'k'); hold on;
plot(perf(:,2),'b'); plot(perf(:,3),'g');
plot(perf(:,4),'r'); plot(perf(:,5),'m');
legend('50','100','200','400','800');
cd ~/pollen/16x16/hist;
nword=[50 100 200 400 800]; 
perf=zeros(10,5); 
for i=1:5, 
   cd(num2str(nword(i))); 
   p= scansvm; 
   perf(:,i)=p(:,3)'; 
   cd ..; 
end

[mean(perf); std(perf)]

ans =

   45.8730   49.0536   49.8591   50.5337   50.7877
    2.5587    2.5345    2.5546    2.2381    4.6325

So 400 is optimal (2nd best performance and least variance). But not significantly so.


Optimal Bias

Pollen May 28 2009 bias.png

Ran the following on 5 different machines for 5 seed values XXX=1..5:

cd ~/pollen/16x16/hist/400; 
[biasXXX,perfXXX]=makebias(20,50,10,XXX,'method',13);

Mean bias (and individual biases) saved in ~/pollen/16x16/hist/400/bias.mat

Runs

Today's runs (4-26-09) are as follows:

pollen1: uses pre-existing model1.mat and spmbias.mat
pollen2: uses corrected model (based on actual counts)
pollen3: removes old spmbias (now spmbias1.mat) and replaces it with 1.0 across the board
pollen4: still a null bias, but a new model based on a 16x16 grid of sift features (10, not 14 or 7)
pollen5: puts bias1 in from /common/greg/pollen/16x16/hist/final

pollen2 counts way too much ginkgo and oak. pollen3 counts even more (no bias in place).


Sept 20, 2009

Confusion Matrix For Jim's Poster

  • Build training model using 50 images per category (N largest categories)
  • Find test images not used in training set
    • Evaluate confusion

Training Model

Shape model:

cd ~/pollen/images
makestats(getcategories,200);

Appearance Model: use existing   /common/greg/pollen/spm_v1/24x24_1014/200/hist050_010_010.mat

Create testing set:

[greg@vision401 testing]$ cd ~/pollen/data
makepollentest #creates ./testing from ./training

Consolidate labels:

cd ~/pollen/data/training
label= mergelabels;
cd ~/pollen/data/testing
save label.mat label

Keep only images that contain pollen and aren't in the training set:

files= findpollen();
load /common/greg/pollen/spm_v1/24x24_1014/200/hist050_010_010.mat
for i=1:length(ftrain),
   indx= first(find(ftrain{i}=='_'));                       % locate the first underscore
   ftrain{i}= strrep(ftrain{i}(indx+1:end),'.mat','.tif');  % drop the prefix; .mat -> .tif
   ftrain{i}= strrep(ftrain{i},'_','');                     % strip remaining underscores
end
[files,ftrain]= apply(@sort,files,ftrain);
ftest= setdiff(files,ftrain);
[length(ftrain) length(files) length(ftest)]

This leaves 7068 files in ftest to be classified:

 pollen= classifypollen(ftest,'verbose',0);

Nov 16, 2009

Baseline appearance model

Baseline: 55.8%
/common/greg/pollen/spm_v1/24x24_1014/200
[svms,svm]= testpollens(50,10,1:10);
viewconfuse(svm)

Performance for crepemyrtle and pecan is zero because there are not enough training examples to construct independent train and test sets at Ntrain=50.

So the baseline performance of 55.8% is somewhat depressed due to this.