Greg's Notebook, Winter 2006-2007





Apr 13

tree= maketree6(2,1,[],10,25,200,plevels,2,2,1:256,1:256,[],11,100,100,2,5);

Apr 12 : Deeper Tree Testing (ICCV 2007)

On each machine running at depth=3, I'm running the following. Could write multitree5(d,plevels,'april_12a') to automate this? (See the sketch after the commands below.)

file='april_12a';
take all plevels ;
share(file,'plevels');
tree= maketree5(3,0,10,25,200,plevels,0,0.5,1:256,[],11,100,100,2,5); tree10_0=tree; share(file,'tree10_0');
tree= maketree5(3,0,10,25,200,plevels,1, 10,1:256,[],11,100,100,2,5); tree10_1=tree; share(file,'tree10_1');
tree= maketree5(3,0,10,25,200,plevels,2,  2,1:256,[],11,100,100,2,5); tree10  =tree; share(file,'tree10');
tree= maketree5(3,0,15,25,200,plevels,2,  2,1:256,[],11,100,100,2,5); tree15  =tree; share(file,'tree15');
tree= maketree5(3,0,20,25,200,plevels,2,  2,1:256,[],11,100,100,2,5); tree20  =tree; share(file,'tree20');
tree= maketree5(3,0,30,25,200,plevels,2,  2,1:256,[],11,100,100,2,5); tree30  =tree; share(file,'tree30');
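A rough sketch of what that hypothetical multitree5-style wrapper might look like, built from the six parameter sets above. It assumes share() can pick up a variable from the workspace by name, exactly as in the commands above; the wrapper itself does not exist yet.

% Sketch of a hypothetical multitree5-style wrapper (not checked-in code).
% Assumes plevels is already in the workspace.
file   = 'april_12a';
names  = {'tree10_0','tree10_1','tree10','tree15','tree20','tree30'};
params = { [10 0 0.5], [10 1 10], [10 2 2], [15 2 2], [20 2 2], [30 2 2] };  % positional stand-ins for the 3 arguments that change above
share(file,'plevels');
for k = 1:length(names)
  ntrain = params{k}(1);  gmode = params{k}(2);  gparam = params{k}(3);
  tree = maketree5(3,0,ntrain,25,200,plevels,gmode,gparam,1:256,[],11,100,100,2,5);
  eval([names{k} ' = tree;']);   % give the result the name used elsewhere in this entry
  share(file,names{k});          % assumes share() saves the named variable, as above
end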

Changes (v6) are required throughout the topdown software to accommodate multiple levels. The new syntax is:

tree= maketree6(2,0,[],10,25,200,0.0:0.25:1.0,2,2,1:256,[],11,100,25,2,5);

Apr 11 : Final Testing (ICCV 2007)

Tests (running on different machines):

308>> tree2_30_2_5b= maketree5(2,2,30,25,200,plevels,2,2,1:256,[],11,100,100,2,5); share april_11_morn tree2_30_2_5b
307>> tree2_20_2_5= maketree5(2,2,20,25,200,plevels,2,2,1:256,[],11,100,100,2,5); share april_11_morn tree2_20_2_5
123>> tree2_20_2_5b= maketree5(2,302,20,25,200,plevels,2,2,1:256,[],11,100,100,2,5); share april_11_morn tree2_20_2_5b
320>> tree2_10_2_5= maketree5(2,2,10,25,200,plevels,2,2,1:256,[],11,100,100,2,5); share april_11_morn tree2_10_2_5  
103>> tree2_10_2_5b= maketree5(2,2,10,25,200,plevels,2,2,1:256,[],11,100,100,2,5); share april_11_morn tree2_10_2_5b 
105>> tree1_10_2_5= maketree5(2,2,10,25,200,plevels,1,10,1:256,[],11,100,100,2,5); share april_11_morn tree1_10_2_5
303>> tree1_10_2_5b= maketree5(2,2,10,25,200,plevels,1,10,1:256,[],11,100,100,2,5); share april_11_morn tree1_10_2_5b  
317>> tree0_05_2_5d= maketree5(2,0,10,25,200,plevels,0,0.5,1:256,[],11,100,100,2,5); share april_11_morn tree0_05_2_5d
106>> tree0_05_2_5= maketree5(2,2,10,25,200,plevels,0,0.5,1:256,[],11,100,100,2,5); share april_11_morn tree0_05_2_5
302>> tree2_15_2_5= maketree5(2,0,15,25,200,plevels,2,2,1:256,[],11,100,100,2,5); share april_11_morn      tree2_15_2_5

Apr 10

Original Dataset + 4 New Categories (for Ryan)

This time the new categories are

  • 054.diamond-ring
  • 062.eiffel-tower
  • 131.lightbulb
  • 027.calculator
load ~/256/ryan/trainIndex.mat
load ~/256/hist8b/1/hist075_004_200.mat
ftrain= ftrain(trainImages);
newcats={'253.faces-easy-101','232.t-shirt','145.motorbikes-101','251.airplanes-101',...
'054.diamond-ring','062.eiffel-tower','131.lightbulb','027.calculator'};
cd ~/256/sift8
ftest= {};
for cat=newcats,
  x= pickfiles(cat,350,0);
  ftest= {ftest{:} x{:}};
end
% ftrain= {ftrain1{:} ftrain2{:}};
for i=1:length(ftrain), ftrain{i}((end-2):end)='mat'; end
makehist7({},'~/256/ryan','iccv',ftrain,ftest);

For some reason I had to change the match syntax from last time to: makematch10(1,'iccv.mat',{},{});

Running now on vision310.

Apr 6 : Tests

I need two tests more than any other at this point:

  • comparison of random, montecarlo and spectral clustering
    • at least 1 level of recursion
  • for best method above, comparison of Ntrain=10,20,30
    • at least 1 level of recursion

Apr 5 : Parameter Fishing

Assume you have first shared t={} and plevels in repository 1.

cd ~/256/hist8b/1/topdown/[machine name here]
take 1; count=0;
for i=2, for j=(i+2):10, 
count=count+1; [i,j,count], 
t{count}=maketree5(1,10,25,200,plevels,2,2,1:256,[],11,100,25,i,j);
end; end

Running i=1,2,...etc on different machines, it is hard to distinguish between them with all the noise. So I think 25,1,5 is probably an ok compromise: performance is pretty good and speed is ok too. 25,2,5 is probably also alright, and about 40% faster.

Scripts like checkall, checklist and roundup help me figure out where to go.

Apr 4 : Simplified Code (continued)

groups=cell(3,1); cascade=cell(3,1);
for i=1:3, i, groups{i}= makegroups5(10,25,200,2,2,1:256,[],11,100,10,2,3+i); end  
for i=1:3, i, cascade{i}= makecascade5(groups{i},.02:.02:1.0); end

Meanwhile I'm running 3 copies of this overnight, to check their consistency:

groups= cell(10,10);
cascade= cell(10,10);
for i=3:-1:1,
  for j=4:8,
    [i j],
    groups{i,j}=  makegroups5(10,25,200,2,2,1:256,[],11,100,25,i,j);
    cascade{i,j}= makecascade5(groups{i,j},.02:.02:1.0);
  end
end

In the current syntax this is viewed like so:

clf; labels={};
labels=viewlevels5(3,labels,cascade{3,4:8}); 
labels=viewlevels5(2,labels,cascade{2,4:8});
labels=viewlevels5(1,labels,cascade{1,4:8});

These plots indicate that ntrial,nver,nkeep=25,2,6 has better performance than 25,2,5.

Apr 3 : Simplified Code

maketrees5
Like makelevels4 except allows for recursion, and calls makegroups5, makecascade5 and makelevels5 separately.
makegroups5
Generates the 3rd "null" category, instead of makecascade5 doing it. Adds the phylogenetic method and probabilities as an option.
makecascade5
Use probabilities instead of margin to assign groups 1, 2 and null. Evaluate all the test files for their margins (probabilities?) but do not generate final performance stats.
makelevels5
Break out the 2nd part of makecascade5, which evaluates the performance for a range of leakage values (and the below cascade information).
makesvm14
Generates probabilities. Subdivide into makesvm14 and getsvm14, so that makecascade5/makegroups5 don't have to reload match information every time? (this may not be worth the effort)

One extra thing is that, for full-scale testing, it might be useful to have all of these routines generate sensible outputs for a range of ntrain,ntest,nwords.

Apr 2 : What's Left To Be Done For ICCV Paper

Tests

  • Caltech-101 vs. Caltech 256 comparison
    • Range of Ntrain
    • Range of Leakages
    • methods:
      • random
      • monte carlo
      • spectral clustering
      • phylogenetic?
      • human?
  • Phylogenetics
    • Generate trees I'm happy with
    • Decide how to subdivide
    • Add to tests above
  • Affinity matrix
    • Confusion vs. probability?
    • Optimal N_train?
  • SVM
    • one-vs-all
    • one-vs-one
    • DAG??

Plots

  • Which taxonomy method works best?
    • Range of Ntrain using Caltech-256
    • Random, Monte, Spectral, Human, Phylogenetic
    • For 1,2 and 3-level hierarchies - recursively
    • Color: identifies method
    • Axes
      • Top: leakage vs. performance
      • Mid: leakage vs. test speed
      • Bot: test speed vs. performance
  • How does speed/performance scale with Ntrain?
    • Fixed method (spectral) using Caltech-256
    • Range of Ntrain
    • Color: identifies Ntrain
    • Axes : normalized speed vs. performance
  • Different dataset sizes
    • Fixed leakage value
    • The more categories, the better this works
    • Caltech-256, Caltech-256(N)
    • Overlay computation speed
  • Overall performance 101/256
  • Hierarchies from confusion matrix
    • Optimal N_train?
  • Pretty phylogenetic taxonomy
    • Show a bad version too, and why it is bad?

Text

  • text for Alex!
  • Theoretical prediction of speed increase
  • Different methods explained
  • Conclusions
    • Dramatic speed increase with very little performance drop
    • Which hierarchy method works best?
    • For this method, the more categories, the better!

Schedule

Apr 4
Submit abstract. Test recursive hierarchy code. Tests for optimal Nver for different values of Ntrain. Create reasonable phylogenetic taxonomy. Test converting margins into probabilities (new version of makesvm). Add diagram illustrating cascade, with accompanying text.
Apr 5
Confirm abstract. Test: probabilities better than margin? Full-scale recursion tests on Caltech-256, Caltech-101 (only 3 out of 4 methods). Finalize phylogenetic taxonomy. How about Caltech-N?
April 6
Test Caltech-N. Figure out how to use phylogenetic taxonomy. Add taxonomy figures.
April 7
Full-scale Caltech-N tests. References, such as previous hierarchical approaches.
April 8
Finalize paper. Comments from Pietro.

Mar 24 : More Results (ICCV 2007)

Mar 23 : Double-checking the results.

Mar 22 : Results (ICCV 2007)

Mar 21 : Software For Large-Scale Testing

Rapid Testing Procedure

Replacing:
groupmode=2; % or groupmode=1
global train_mode train_fraction
train_mode=1;
train_fraction=0.33; 
indx= 1:25;
leakage(indx)= (100-4*indx)/100;
p=leakage*0;
t=leakage*0+9e9;
for i=indx;
   [p(i),x,t(i)]=makelevel4(1,10,25,200,...
      groupmode,leakage(i),1:256,100,2,5);
   fprintf('\n\nPLEVEL %d (%f) complete\n',i,leakage(i));
   clf;
   ax=plotyy(leakage,p*100,leakage,25*256./t);
   set(get(ax(1),'Ylabel'),'String',...
      'Performance (%)');
   set(get(ax(2),'Ylabel'),'String',....
      'Testing Speed (images/sec)');
   drawnow;
end
With:
trainfast(yes,0.333);
leakage(indx)= 0.00:0.04:1.00;
level=makelevel4(1,10,25,200,...
   groupmode,leakage(i),1:256,100,2,5);
viewlevel4(level);

Software Modifications Required

  • makesvm12
    • Can take a learner argument in place of classes
    • This way you can re-use the learner many times
    • In getlearner12 replace net with learner.
  • makecascade4
    • Train once, test many times using the new makesvm12
    • Regenerate performance over a range of leakage values automatically
  • makegroups4
    • mode=0 construct random groupings
    • Future groupings to add
      • Phylogenetic
      • Human (from Caltech-256)
      • Wordnet?
  • trainfast
    • Replace global variables (yuck) with persistent mex variables (see the sketch after this list)
  • makelevel4
    • Generate new match files at each level
    • Implement recursion for arbitrary number of levels
  • viewlevel4
    • Construct the above two-axis plotyy plot automatically
    • Two modes threshold vs. performance,speed or
    • Speed vs. Performance with threshold labels
    • Multiple experiments with automatic legend?
    • The details: what is happening at each level!
      • Make this a separate routine?
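For the trainfast item above, here is a sketch of how persistent state could replace the globals, using a plain MATLAB persistent variable rather than a mex-level cache. The signature and defaults are my assumptions.

function varargout = trainfast(mode, fraction)
% TRAINFAST (sketch): remember fast-training settings between calls, no globals.
% Call trainfast(1,0.33) to set; @maxwin/train.m would call trainfast() to read back.
persistent train_mode train_fraction
if isempty(train_mode), train_mode = 0; train_fraction = 1.0; end
if nargin >= 1, train_mode = mode; end
if nargin >= 2, train_fraction = fraction; end
if nargout > 0, varargout = {train_mode, train_fraction}; end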


Mar 20: Performance vs. Testing Speed (ICCV 2007: Paper #1)

Initial Testing Procedure

[l,t,p]= deal(zeros(101,1),zeros(101,1),zeros(101,1));
for i=101:-1:1
   l(i)= (i-1)/100;
   [y,x,cmats,c,ttest]=makelevel4(2,10,25,200,2,l(i));
   p(i)= mean(diag(c))*100;
   t(i)= ttest;
   fprintf('\n\nPLEVEL %d (%f) complete\n',i,l(i));
   plot(l,p,'bs'); hold on; 
   plot(l,t,'rs'); 
   xlim([-0.02 1.02]); drawnow;
end

Mar 17 : Better Taxonomies With Forced-Guessing?

The Idea

Instead of using the usual confusion matrix, always force it to make the nearest guess - but never the correct one. You could call this the Guessing Matrix.

This would look like a normal confusion matrix with all the diagonal elements equal to zero. One major difference is that you are now teasing more information out of the model by always forcing it to guess. You are telling the model, "If you couldn't get the best model, which would you choose instead?"

Implementation

The procedure would be just the same as normal, except for one thing. For each row of the matrix, the 256 one-vs-all classifiers are replaced with 255 one-vs-all classifiers. You always leave out the class corresponding to that row.
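The entry above describes actually retraining 255 one-vs-all classifiers per row. As a cheaper illustration of the same forced-guess idea, here is a sketch that simply masks out the correct class in already-computed one-vs-all scores; scores and truth are assumed, hypothetical variables.

% Sketch: forced-guess ("guessing") matrix from existing one-vs-all margins.
% scores(i,k) = margin of classifier k on test image i (assumed precomputed),
% truth(i)    = true class of test image i.
nclass = 256;
G = zeros(nclass);                       % the diagonal stays zero by construction
for c = 1:nclass
  idx = find(truth == c);                % test images whose true class is c
  s = scores(idx,:);
  s(:,c) = -Inf;                         % never allow the correct class
  [tmp,guess] = max(s,[],2);             % forced nearest guess
  G(c,:) = accumarray(guess,1,[nclass 1])' / length(idx);
end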

No More Orphans Running Around On The Tree

Thinking about this gives me a clue as to why easy categories sometimes tended to cluster together in the taxonomy. With these categories all the affinity information is absorbed into the diagonal element so that the affinity with other items is artificially reduced. You are saying, "these categories aren't close to anything but themselves", making them antisocial. They are left as taxonomic orphans with no place to fit in (but then why do they cluster together?).

Using the best-guess matrix instead of the confusion matrix would solve that, making the normalization of the rows more consistent.

Come back to this issue next time I'm converting confusion (affinity) to distance for the taxonomic software.
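When I do get to that conversion, one standard recipe would look something like the sketch below. This is just my assumption of a reasonable affinity-to-distance mapping, not necessarily what the taxonomic software does.

% Sketch: turn a confusion/affinity matrix C into a symmetric distance matrix D.
A = (C + C')/2;                  % symmetrize the affinity
A = A / max(A(:));               % normalize to [0,1]
D = 1 - A;                       % easily-confused categories end up close together
D(1:size(D,1)+1:end) = 0;        % force zero self-distance on the diagonal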

Mar 16 : Meeting with Pietro

Mar 15 : Problems with Blank Images

This was for Alex's ICCV paper:

Category-3 Duplicate Removal

Trying to remove duplicates in /common/holub/images256/*/3 but the nearly-blank images are causing dupkill to crash. I concatenate the dupkill stdout of several different processes into ~/tmp/dupkill and then:

cat dupkill | awk '
{ 
  if ($1=="entering") cat=$2; 
  if (x==1) {x=0; print cat,$3} 
  if ($2=="0") x=1; }' | sed "s/ '/\//" | sed "s/'//" > ~/tmp/blankfiles
cat ~/tmp/blankfiles | sed 's/\//\/3\//' > ~/tmp/dupblanks
cat  ~/tmp/dupblanks | xargs xv
cat  ~/tmp/dupblanks | xargs rm -rf 

I just wanted to save this syntax in case I need it later. For the record I deleted

fern/3/0118_goog_3.jpg                     frying-pan/3/image_0311_3.jpg
horse/3/image_0219_3.jpg                   human-skeleton/3/0117_goog_3.jpg
jesus-christ/3/image_0049_3.jpg            baseball-glove/3/image_0344_3.jpg
billiards/3/image_0444_3.jpg               butterfly/3/goog_0296_3.jpg
cactus/3/0066_goog_3.jpg                   CD/3/image_0255_3.jpg
ladder/3/image_0184_3.jpg                  lightning/3/goog_0172_3.jpg
megaphone/3/goog_0150_3.jpg                microscope/3/0039_goog_3.jpg
refrigerator/3/image_0093_3.jpg            sextant/3/0151_goog_3.jpg
sheet-music/3/0104_goog_3.jpg              tripod/3/0189_goog_3.jpg
chess-board/3/image_0275_3.jpg             cockroach/3/image_0350_3.jpg

Update: 2 days later I had to delete 10 more files.

Mar 13 : Meeting with Antonio Torralba

Some notes:

  • Try sorting 256 by most centered to least centered
    • if someone suspects centeredness is corrupting things, you can jackknife your tests
  • Using LabelMe dataset
    • They're up to about 130,000 images annotated
    • MATLAB toolbox is more extensive than their faq and documentation might suggest
    • No object is ever manually segmented twice by different users
      • Does this mean we can't get hierarchy info as with WhatWhere?
    • Useful WordNet tools??
  • Clutter
    • Not clear that the negative of whatever is labeled is really clutter
      • Different people label data more exhaustively
    • Is clutter always just the negative of the set of (things we care about)?
    • Must we always have examples of (things we don't care about)?
    • Or should we define it using only a set of (things we care about)
      • Exploit taxonomy
    • Is LabelMe dataset useful for defining clutter?
  • Hierarchy

Some references:

Mar 4 : Software

With experiments usually running on a dozen machines at once, it gets cumbersome to send all the data back to one place (where you can plot the results). I wrote routines look, share, take and wipe to finally deal with this once and for all. Also worked on makecascade.
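For reference, here is a minimal sketch of how a share/take pair could work. The repository location, the one-file-per-variable layout and the signatures are guesses on my part, not the actual routines; each function would live in its own .m file.

function share(file, varname)
% SHARE (sketch): copy one named variable from the caller into a shared directory.
repo = fullfile(getenv('HOME'),'256','shared');         % assumed repository location
s = struct();
s.(varname) = evalin('caller', varname);                % grab it from the caller's workspace
save(fullfile(repo,[file '_' varname '.mat']),'-struct','s');

function take(file)
% TAKE (sketch): pull every variable shared under this experiment name back in.
repo = fullfile(getenv('HOME'),'256','shared');
d = dir(fullfile(repo,[file '_*.mat']));
for k = 1:length(d)
  v = load(fullfile(repo,d(k).name));
  f = fieldnames(v);
  for j = 1:length(f), assignin('base', f{j}, v.(f{j})); end
end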

Mar 3 : Testing

Here's an example of testing the first level of the cascade, which guesses whether the image is in

  • group 1
  • group 2
  • group 3 = neither
global train_mode train_fraction
[train_mode,train_fraction]=deal(1,.33);
[y10,x]=makelevel4(1,10,25,200,2,0.01:0.01:0.99,1:256,25,2,5);

It is worth using ntrials=25 instead of ntrials=10 (noticeable improvement).

Mar 2 : Are Binary Trees OK? (ICCV 2007 paper #1)

My whole method for the ICCV2007 paper assumes that it is ok to restrict ourselves to a binary decision tree - for simplicity. My previous experiments suggest that this is fine: binary trees have the highest "goodness" value when using spectral clustering.

But it isn't so clear. Could we create more natural groupings with 3 or more branches at each level? This is what I'm trying to investigate with the new routine viewgroups4.

All in all, after inspecting the below groupings I'm somewhat encouraged that dividing into two groups recursively yields similar results to dividing into 4 groups in one shot.

Plus this new routine just gives me a much better tool for visualizing grouping.

Below: creating 2-way (left) and 4-way (right) groupings of confusion matrix, based on spectral clustering method.

makegroups4(10,25,200,2,2,1:256,[],11,100,25,3);
viewgroups4(10,25);
makegroups4(10,25,200,2,4,1:256,[],11,100,25,3);
viewgroups4(10,25);


Observations:

  • Using ntrials=25 gives much more reasonable-looking categories than ntrials=5.
  • If you unify groups 1+4 and 2+3 from the right, you get approximately the groups on the left.
    • The main difference is that some categories like transport and structures tend to calve off and defect to the other group.
    • Is this to make group sizes more uniform?


Mar 1 : More Speed Tests

See caption on yesterday's plot for details. This adds extra values of train_fraction, and also shows the residual of each curve with the baseline train_fraction=1.0.

New Software Syntax

Usage: makegroups4(10,25,200,2,2, 1:256,[],11,100,10,3); where the 4th and 5th arguments (here 2,2) are passed as the 2nd and 3rd arguments to getgroups. In this case, we're selecting spectral clustering and division into two groups, which will probably become the default.

Detailed Speedup Tests (dashed lines)

See (right) a slightly expanded version of yesterday's plot, where I try 25, 33, 50, 75 and 100%.

  • For quick runs the sweet spot seems to be about 33%
    • This yields 6 times the speed
    • Performance difference of < 1%
  • For a semi-final run 75% gives
    • Speed increase more than 2 times
    • Performance difference < 0.2%

Improvement (solid lines)

See solid lines. The difference is that I pick a single random set of negative examples instead of re-generating this 256 times. This actually performs better and takes less time.

Now 25% (red) looks quite tolerable for testing but it isn't much of a speed increase. The sweet spot still seems to be train_fraction=.33 (green).


Feb 28 : Speeding Up Training Phase Of SVM

Summary

I used to be able to test 12.6 images/sec. Now we're up to 372 images/sec using changes from the previous 2 days.

  • Testing is about 30 times faster and with no noticeable drop in performance.
  • Training is about 3-8 times faster with little degradation in performance
    • The exact speed increase depends on Ntrain
    • Absolute performance difference < 1.0 % for all Ntrain

This is pretty dramatic.

Don't Need To Use All The Training Data

Using only 1/3, 1/2 or all the negative training examples in the one-vs-all classifier yields 23.5, 24.0 and 24.5% performance in 40.1, 62.2 and 139.2 seconds.

In other words, at Ntrain=10 you can speed training up by 3.5 times and keep 96% of the original performance. For higher Ntrain the performance difference is even larger. So this should be a very useful mode, especially when I'm just testing some code out for the first time.

Currently using the global variables train_mode and train_fraction to control this behavior in matlab_code/svm_v0.55/@maxwin/train.m.
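Roughly, the train_fraction idea amounts to something like the following sketch; this is a paraphrase of the idea, not the actual @maxwin/train.m code.

% Sketch: keep all positive examples, keep a random train_fraction of the negatives.
% y = +1/-1 training labels for one one-vs-all classifier; K = training kernel matrix.
train_fraction = 1/3;
pos  = find(y(:) == +1);
neg  = find(y(:) == -1);
neg  = neg(randperm(length(neg)));
keep = sort([pos; neg(1:round(train_fraction*length(neg)))]);
% The SVM is then trained on K(keep,keep) and y(keep) only.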

Here's a more comprehensive benchmark. It is the usual Ntrain vs. performance curve for the usual Spatial Pyramid Matching (SPM) kernel trained on a one-vs-all SVM. I'm running [pmean,pstd,pall]= multisvm12('~/256/hist8b',1:8,5:5:30,25,200,1:256); with all the training data vs. 1/3 of the training data. Performance drops 0.9% while the speed improves 590%. Compute times include both the training and testing phase.
>> global train_mode train_fraction
>> train_mode=1; train_fraction=1/3; 
>> makesvm12(10,25,200,1:256);
Getting classes... automatically
Symmetrizing
Training... time:  22.9 s
Keeping...         18.7 % of support vectors
Testing...  time:  17.2 s

Performance= 0.234687

>> train_mode=1; train_fraction=1/2;
>> makesvm12(10,25,200,1:256);
Getting classes... automatically
Symmetrizing
Training... time:  41.0 s
Keeping...         14.1 % of support vectors
Testing...  time:  21.2 s

Performance= 0.240469

>> train_mode=0;
>> makesvm12(10,25,200,1:256);
Getting classes... automatically
Symmetrizing
Training... time: 113.9 s
Keeping...          8.6 % of support vectors
Testing...  time:  25.3 s

Performance= 0.244844

Feb 27 : Speeding Up Testing Phase Of SVM

I'm now getting dramatically (more than 10 times) faster testing speed with no drop in performance, just by stripping out support vectors whose weight is less than tolerance. This required creating a @maxwin/strip routine to call the corresponding @svc/strip routines.
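In sketch form the stripping looks like this; the field names alpha/sv and the tolerance value are stand-ins, not the actual @maxwin/@svc internals.

% Sketch: drop support vectors whose weight is below tolerance.
tolerance = 1e-4;                      % assumed value
keep      = abs(svm.alpha) > tolerance;
svm.alpha = svm.alpha(keep);
svm.sv    = svm.sv(keep,:);            % testing now touches far fewer kernel rows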

Feb 26 : Cascade Testing (ICCV 2007 paper #1)

Cascade Software : 3rd Step

Indexing has been half the battle...

This part will be much less complicated if I can redesign the SVM routines to deal with non-consecutive class numbers. I.e., classes can be "3,4,13,15,45" for a particular group. Otherwise it becomes the job of makegroups to keep repeatedly re-mapping the category numbers into consecutive integers - which you then have to untangle at the end.

Checked in a new routine: makesvm12 which builds on makesvm11.

Use replace and reindex, which are tested separately in reindex_test.m (this is a good example of how both routines work). Here is the relevant segment:

  x1= ceil( rand(256,1)*10 ); 
  x2= ceil(rand(256,1)*10);
  map= reindex(x1,x2);
  
  % new indices should be consecutive
  y1= replace(x1,map);
  y2= replace(x2,map);

  % reconstruct the original indices (as a test)
  rmap= fliplr(map);
  z1= replace(y1,rmap);
  z2= replace(y2,rmap);

This is run hundreds of times and then a series of tests follow, to make sure it worked.

Feb 23-24 : Cascade Testing (ICCV 2007 paper #1)

UNTIL tree is L levels deep
1. Make sub-groups 0,1 at current nodes
2. Pass test images down branches 0,1 using leakage threshold P
THEN
3. Classify test images in all nodes
END

There are 2 steps each time you call makelevel:

makegroups (160 sec @ Ntrain=10)
Create two mixed classifiers 0 and 1
makecascade (10 sec @ Ntrain=10)
Assign each test image to one of the two groups (0 or 1), or neither (2)

Finally when you're done recursively making as many levels you need,

makenodes (time/performance depend on parameters)
Now classify test images in each leaf separately (10 sec)
This is now faster since you only have to try roughly half the possibilities

Let's see how it is all working together, and see what the speed/performance looks like.

Note: the makelevel routine will run makecascade recursively. For now there is only one level, so makecascade and makelevel do the same thing.

Cascade Software: First 2 Steps

The results of makegroups4:
load group010_025_200.mat;
viewgroups(Ctrain,strains); 
colorbar;
strains
         [1x155 double]    [1x95 double]

The syntax from Jan 23 still seems to work after the change to makesvm11 that I made on Feb 13. Note the addition of a new 1st parameter, stage, which tells it how much it needs to re-run (saves time). With the optimal parameters I found before, that's:

[y10,x]=makelevel4(1,10,25,200,...
2,0.01:0.01:0.99,1:250,10,2,5);

What makelevel4 currently does:

  • Generate mixed groups 0 and 1
    • Runs makegroups4 in about 160/480/760 sec
    • In this example, group sizes were 155 and 95
    • ntrial=10: repeat and average ntrial times.
      • higher value of ntrial could help?
    • nver=2: leave-nver-out confusion matrices w/ training set
    • nskip=5/2/0: randomly throw nskip rows/columns out of each training category
  • Assign each test image to one of the two groups, or neither
    • Runs makecascade4 in about 10 sec
    • Uses a range of leakage thresholds plevel=0.01:0.01:0.99

Best parameters

Turns out there is no benefit to using anything less than nskip=5.

I just did two runs at each value so this might not be exactly true. Should do ~100 runs overnight and average them sometime. There is a fair amount of variance, ~2-3%.

For now I'll stick with the above syntax ending in 10,2,5. Good enough.

Feb 22 : ICCV + Tech Report

Ncategory2.png

ICCV 2007

More makecascade programming.

Tech Report

This illustrates how the Caltech-256 really is more challenging than the Caltech-101. Pyramid matching performance on the Caltech-256 is worse than on the Caltech-101 even when using the same number of Ntrain.

See Nov 16 entry below (in this notebook). Is this too confusing to explain?

Need to clean up axes.

Feb 21 : ICCV/NIPS 2007 Plan

Vision FebMar 2007.png

Made a chart of 4 papers I could work on, plus some other topics.

This lists two ICCV 2007 papers and two NIPS 2007 papers. Right now paper #1 is taking all my time. Need to finish this before thinking about paper #2.

Need advice from Pietro on where to go from here. Andrea set up meeting for Tues Feb 26 10am.


Feb 20 : Odds and Ends

Performance of all 256 object categories using a typical pyramid match kernel in a multi-class setting using Ntrain = 30. Performance for each category corresponds to the diagonal entry of the confusion matrix. The ten best performing categories are shown in blue at the top left. The ten worst performing categories are shown in red at the bottom left.

Email Requests To Work On

[Dave Martin]
questions about caltech-256 protocol. Send him tech report.
[Marcin Marszalek]
wants hand-made caltech-256 hierarchy from paper. Export to XML?
[Pranab Mohanty]
wants to use our spatial pyramid matching code for generating pairwise affinity matrices. In tar format. Gulp-- this is going to take a little documentation. I can start with the documentation I wrote for Alex.

Tech Report

Found this useful postscript command for changing gamma, that I want to remember:

{ 1.5 exp } settransfer

Put it right after %%Endprolog to brighten figure 9 in the 256 paper.

There's a new "performance by category plot" which hopefully Alex will like better (see right). It is generated with ~/256/catsort/catsort2.m.


Feb 19 : Feedback from Pietro and Alex

From Pietro:
Thank you Greg!

I took a quick scan and it looks fine.

Maybe just a couple of comments:

a) Measuring the performance of the `object' detector at 10% false accept rate is dangerous. We know that perhaps 1 in 100 or 1 in 1000 patches are objects. So: we should really measure the performance at 0.01% false alarm rate. In the conclusions we should say that this is a very tough challenge, but that it needs to be tackled in the next years.

b) presenting the confusion matrix for 250 categories is confusing: why not always stick to Caltech 256. Are the curves in Fig 12 describing results on 256 or 250 categories? Etc.

P.

From Alex:

Hi Greg! Looks great! Figure 8 took me a while to parse (maybe I'm just slow;) ) -- but maybe add a bit more to the caption? Might add something like this:

'Performance of all 256 object categories using a typical pyramid match kernel in a multi-class setting using Ntrain = 30. Performance for each category corresponds to the diagonal entry of the confusion matrix. The best performing categories are shown in the upper right. The worst performing categories are shown in the lower left. Vertical ... '

What's confusing is that at first I thought you were comparing the categories on the left to those on the right -- then I realized it was just a list of all categories ...

Response to Pietro's question a)

Agreed. But how about if we quote 0.1% instead?

For one thing, we have so few clutter files that 0.01% is really beyond our measurement resolution.

For another thing, the pyramid match algorithm is fairly scale-invariant so I suspect we would not have to scan over so many windows. Other algorithms might have to scan extensively in both position and scale. We could just pick one or two fairly conservative window sizes and overlap them a little bit.


I probably picked 10% because I wanted high performance! But not entirely.

The other way to think of this is as just the first, fastest layer of the cascade. It cleans up the obvious cases so you can focus on harder cases.

Here's what I'm imagining, at least for the future: Cascade Feb19.png

Response to Pietro's question b)

Note: in the end I just stuck with 256 categories and threw out the whole 250 thing, for simplicity

I too am worried that we'll confuse people too much with the whole 250/256 thing. Is it just too complicated?

I was originally trying to make things less confusing. There's something in the Caltech-101 which always bothered me, which I wanted to clear up with the Caltech-256.

Faces-easy and Faces are essentially the exact same pictures: should we include BOTH categories in the final performance figures, or not? Alex throws one out I think. I do too. Why penalize performance because the classifier can't discriminate between two things that are essentially the same?

But there is no agreement on this. Some other people probably keep both categories. Can we honestly compare our results with theirs? A small effect, but still bothersome. We're not all on the same page.

Ditto for cougar_body and cougar_face, and crocodile and crocodile_head, flamingo and flamingo_head. These we never throw out. But most of the flamingo pictures have a head, right? Is the classifier really wrong to choose flamingo_head?

In 256 I wanted to use the indices to clarify this, and separate the 6 confusers from the rest of the topics. The confusion of these 6 (i.e. the generality) is still an informative tool. It helps people imagine what their algorithm is really doing. It's a sanity check. But those 6 confusers just don't seem like a fair test of performance.

Will the 6 confusers confuse people so much that it causes benchmarking chaos? Maybe. I'm just pointing out that we already have confusers in the Caltech-101. We just mix them all up with the rest of the set.

So now I'm not sure what to do. Help??

Response to Alex

I've added your improved caption. I just couldn't fit so many labels on one side of the plot! It is a little confusing. Another option is to

  • arrange them all in a big circle
  • keep left and right side the same color (so that they don't seem like different things)

I'll fool with it a bit and see what happens.

Feb 15-17 (Thu-Sat) : What is the interest detector doing?

The Story So Far

Last time (Feb 12) I said I wanted to try these tests:

Is it overtraining on our choice of positive and negative examples?
Use Caltech-101 as positives and Caltech-6/Caltech-101 clutter as negatives.
Is it relying on geometry, such as centeredness of the objects?
Use bag of words instead of pyramid matching.
What do the false positives / false negatives look like?

Let's tackle that last one...

False Positives And False Negatives

Call "positive" something that is interesting and "negative" something that is uninteresting.

Continuing from the earlier syntax,

pos= 2-truths;
neg= ~pos;
true=  (rating>=0) == pos;
false= (rating< 0) == pos;
fpos= find( rating>=0 & neg );
fneg= find( rating< 0 & pos );
[tmp,indx]= sort(rating(fpos));
mosaic(4,10,files,fpos(indx),1.2,rating);
[tmp,indx]= sort(rating(fneg));
mosaic(4,10,files,fneg(indx(1:40)),1.2,rating);
Classifier mistakenly said it was interesting
Classifier mistakenly said it was just clutter

Analysis

There seems to be a pattern. Many of the false negatives bear a resemblance to one or more of the clutter images (see plot, above to the right):

cd ~/256/images
fclutter= pickfiles(getcategories(257));
fcompare= { fclutter{[85 128 145 396 471 617 647 734]} files{fneg(indx([39 15 36 2 25 17 38 4]))} };
mosaic(2,8,fcompare,1:16,1.2);
Clutter pictures on the top shown with some of the interesting pictures that the classifier mistakenly identified as clutter


With these sort of examples (top row) floating around in the clutter database, most of our false negatives (bottom row) are pretty understandable, no?

Would boosting help fix this (leave-one-out of training data)? Doubtful.

Ultimately we may need a human in the loop. Stop thinking of our training sets as the final word. The training set is merely a starting point from which to learn what is and isn't clutter. Then the algorithm needs to ask questions. Do children have fixed training sets?

I guess this is really what Alex's CVPR07 paper is all about?


Summary

For many of the above false negatives I can find direct analogs in our clutter category which may have caused the false detection.

To me this test implies that many of the misidentifications are really just due to imperfections in our clutter database. We call all these cropped images clutter even though there are examples of images that could be interesting or clutter depending on your point of view.

In other words: an interest classifier is only as good as its training data.

We get so wrapped up in these silly fixed training sets. Like they are the word of god. We expect our classifiers to (some day) generate >95% performance. But that's ridiculous. The above plot really makes it clear for me: you'll never get there. You can't get there. Not without a human to correct the (inevitable) defects that are in the training set.

Fixed training sets are OUT. Learning is IN.

Feb 14 (Wed): Comparing "interesting" detectors (con't)

Interest-Detector Shootout

For the 256 paper I settled on 3 examples that sort of tell the story:

Detector C - Bad
standard Ntrain=30 for all categories, multi-category classifier
Detector A - Good but too Slow
same as C except Ntrain=512 for background category. The classifier needs more examples of this to do a good job.
Detector B - Reasonable Performance and 200 times Faster than A and C
2-category classifier shows you can roll a bunch of "interesting" images into one big category. Some people have told me this wouldn't work! And maybe it doesn't. But I'm somewhat encouraged.

This from the 256 paper (see ~/256/paper/roc/roc2.m):

ROC Curve for three different interest classifiers described in section 4.5. These classifiers are designed to focus the attention of the multi-category detectors benchmarked in Figure 10. Because Detector B is roughly 200 times faster than A or C, it represents the best tradeoff between performance and speed. This detector can accurately detect 86% of the interesting (non-clutter) images with a 10% error rate. In other words, 10% of the images classified as interesting will instead contain clutter.
Extracted from The 256 Tech Report

Summary

Detector C is encouraging. But is it overtraining on something silly, like the average scale of the objects or the photography style of Stephen Shore?

Feb 13 (Tue): Comparing interest detectors

Code

Modified makesvm11 so that it could truncate match file. This makes the third line possible without fully regenerating the histograms/match files (see below).

Syntax (for the 3 methods above):

[FPOS,TPOS,marg12]= makeroc2(0,[2],'~/256/roc/1','30_4_512_256',1:256,257,30,4,512,256); % A==green
[FPOS,TPOS,marg12]= makeroc2(0,[1],'~/256/roc/1','512_256',     1:256,257, 2,1,512,256); % B==black
[FPOS,TPOS,marg12]= makeroc2(2,[2],'~/256/roc/1','30_4_512_256',1:256,257,30,4, 30,256); % C==blue

Modified makeroc2 syntax:

>> help makeroc2
  MAKEROC2

  usage: function [FPOS,TPOS,marg12,guess,truth]=
  makeroc1(stage,method,subdir,suffix,class_indx,bkgd_indx,ntrain1,ntest1,ntrain2,ntest2);

     stage 1 and above generates histograms and match files
     stage 2 and above generates svm:
             method 1 runs 2-class
             method 2 runs multi-class
     stage 3 generates the ROC curve 

Feb 12 (Mon) : What is the interest detector doing?

Here the red numbers are the difference in the margins, which we will call the interestingness. Least and most interesting images are sorted from top-left to bottom-right:

[FPOS,TPOS,marg12,guesses,truths]=makeroc1(2,[1],'~/256/roc/1','512_256',1:256,257, 2,1,512,256);
rating=marg12(:,1)-marg12(:,2); 
[r,i]= sort(rating);
load match512_256.mat ftest
files=ftest;
for j=1:length(ftest),
    files{j}(end-(2:-1:0))='jpg';
end
cd ~/256/images
mosaic(4,10,files,i(1:40),1.2,rating);
mosaic(4,10,files,i(end-fliplr(1:40)-1),1.2,rating);
Least interesting images, according to our classifier
Most interesting images, according to our classifier


Histogram Feb12.png

Here is a histogram of the interestingness rating. The red lines show where these two sets of images are taken from, in the overall distribution:


hist(rating,50);
hold on; vline(-2.38,'r-');
hold on; vline(3.01,'r-');


What Does It Look Like It's Detecting?

It is not yet clear what the interesting classifier is actually detecting. Here are a few guesses along with ways of testing each guess.

  • Low or high frequency? It doesn't look that way to me. Could measure how sensitive the classifier is to scale.
  • Common features? Ball-like edges, boxy corners, shadows... what?
  • Proximity to center? Try a bag-of-words model which does not know where the center is... do I still get the same results?
  • Specific object categories from the 256? Try it on 101 images instead.
  • Picture style of Stephen Shore? Things like brick, sky, etc? Try using Caltech-6 for clutter instead.
  • Artifacts such as white frame around object.
  • Any other theories?

Summary

The best things I can think of to try next are:

Is it overtraining on our choice of positive and negative examples?
Use Caltech-101 as positives and Caltech-6/Caltech-101 clutter as negatives.
Is it relying on geometry, such as centeredness of the objects?
Use bag of words instead of pyramid matching.
What do the false positives / false negatives look like?


Feb 8-9 (Thu-Fri): Generality

This from the 256 paper (see ~/256/paper/results/generality.m):

Selected rows and columns of the 256x256 confusion matrix M for Spatial Pyramid Matching and Ntrain=30. Matrix elements containing 0.0 have been left blank. The first 6 categories are chosen because they are likely to be confounded with the last 6 categories. The main diagonal shows the performance for just these 12 categories. The diagonals of the other 2 quadrants show whether the algorithm can detect categories which are similar but not exact.
Extracted from The 256 Tech Report

Feb 6 (Tue) : Caltech-256 Paper

Incorporating the below ROC curve into the 256-paper, and writing about it.

Feb 5 (Mon) : Interest detector seems to be working

ROC Curve for an interest super-classifier based on a Spatial Pyramid Matching kernel, which is trained on 512 positive examples (2 images each from categories 1...256) and 512 negative examples (from category 257).
[fpos,tpos]= makeroc1(1,'../roc/2','512_256',1:256,257,2,1);
semilogx(fpos*100,tpos*100);
set(gca,'xticklabel',{'0.1','1.0','10','100'});

By changing the first flag "1" to "2" you can keep from regenerating the histograms (if this has already been done).

I ran this on 6 different machines and averaged the results to get the plot to the right (which is now in the Caltech-256 tech report). See ~/256/paper/roc/roc.m.

This curve shows, for example, that we can accurately detect 86% of the interesting (non-clutter) images with a 10% error rate. In other words, 10% of the images classified as interesting will actually contain clutter instead.

Putting this plot in the Caltech-256 Tech Report.

So the interesting-ness classifier seems to be working, although it is not clear yet exactly how it defines interesting. Something intelligent, or something trivial?

Feb 2-3 (Fri-Sat) : Caltech-256 Paper

Finishing up Caltech-256 paper. New figures. Completely reordering the sections.

Jan 31 (Wed) : ROC Curves for Multi-Class Classification

The Code

Checked in new roc/makeroc1.m.

Jan 30 (Tue) : Using Clutter Category To Implement Attention

Here's the main idea.

Consider the example of a Mars rover that moves around in its environment while taking pictures. Raw performance only tells us the accuracy with which objects are identified. Just as important is the ability to identify where there is an object of interest and where there is only uninteresting background. If background is constantly misidentified as an object, the rover cannot begin to understand its environment.

The rover example also illustrates how the meaning of the word background is strongly dependent on the environment and the application. Our choice of background images for Caltech-256 is meant to reflect a variety of common (terrestrial!) environments.

Here we propose a metric that tests the ability of a classification algorithm to identify a lack of interesting categories. It is based on the familiar ROC curve where the percentage of true positives is plotted against the percentage of false positives. In single-category detection the meaning of true positive and false positive is unambiguous. Imagine that a search window of varied size scans across an image, employing some sort of bird classifier. Each true positive marks a successful detection of a bird inside the scan window while each false positive indicates an erroneous detection.

What do positive and negative mean in the context of multi-class classification? Consider a two-step process in which each search window is evaluated by a cascade of two classifiers. The interest classifier detects either 1) an object of interest or 2) background. The background images are thrown out. Finally, the usual multi-class classifier identifies the specific object category.

Our ROC curve measures the performance of the interest classifier only. A false positive is any clutter image which is misclassified as containing an object of interest. Likewise true positive refers to an object of interest that is correctly identified as such. Here object of interest means any classification besides clutter.

Our interest classifier is trained on an equal number of clutter and non-clutter images. First take Ntrain images at random from each category 1...256. These are assigned to class 1: interesting. Next a set of 256 × Ntrain images is chosen from background class 257. The threshold is adjusted by weighting the margins returned by the SVM classifier, resulting in the ROC curve. Different curves are generated by changing the value of Ntrain. We will refer to this classifier as a super-classifier because it uses examples from many different object categories. This corresponds to the super-category that is called object of interest.
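A small sketch of how such a curve can be traced by sweeping a threshold over the margins; this is a generic illustration, not the makeroc1 code, and margin and isobj are assumed variables.

% margin(i) = interest-classifier margin for test image i (high = object of interest),
% isobj(i)  = true if image i comes from categories 1..256, false if it is clutter.
thr  = sort(margin);
tpos = zeros(size(thr));  fpos = zeros(size(thr));
for k = 1:length(thr)
  call    = margin >= thr(k);                    % everything above threshold is "interesting"
  tpos(k) = sum(call &  isobj) / sum( isobj);    % true positive rate
  fpos(k) = sum(call & ~isobj) / sum(~isobj);    % false positive rate
end
semilogx(fpos*100, tpos*100);                    % same axes as the tech-report plot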

Jan 24 (Wed) : Slides Slides Slides

Behold my beautiful slides for Pietro. He didn't end up using them but they could come in handy for some future Caltech-256 talk.

Jan 22-23 (Mon-Tues) : Cascade Testing vs. Ntrain (ICCV07)

At this point I've discovered some things (and fixed some problems) in the topdown v4 and svm v11 software, which ends up speeding things up by an order of magnitude. So here we go with testing:

Comparison

Jan22 fig.png

How best to compare different values of Ntrain? Below I make an arbitrary choice that, regardless of the training set size, only 5 training images will be used to create the confusion matrix for each of the 10 trials.

As a result each makelevel routine takes ~2.5 minutes to execute.

[y05,x]=makelevel4(5,25,200,2,0.01:0.01:0.99,1:250,10,2,0);
[y10,x]=makelevel4(10,25,200,2,0.01:0.01:0.99,1:250,10,2,5);
[y15,x]=makelevel4(15,25,200,2,0.01:0.01:0.99,1:250,10,2,10);
[y20,x]=makelevel4(20,25,200,2,0.01:0.01:0.99,1:250,10,2,15);

Summary

I'm picking fairly optimal ntrial,nver,nskip for Ntrain=10 but not for 5,15, and 20.

The above comparison strategy just has to be changed.

Jan 20 (Sat) : Even Faster Training

Fixed Two Speed Bottlenecks

The 154 MB file match020_025_200.mat seems to load off the network in 15 seconds, so big files on the network drive are not the bottleneck. Could it be memory? Try loading only one of the two matrices at a time, then clearing them?

No, I finally discovered the bottlenecks:

  1. Was symmetrizing the training matrix in makesvm11 inside 2 for loops.
  2. getlearner11 was rescaling/offsetting (by 1.0, 0.0) when it didn't have to

This 2nd part turns out to have been the real time-waster.

Now the 3.3 min training takes 2.3 min instead.
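For the record, the fix for the first bottleneck amounts to replacing the double loop with one vectorized statement (a sketch, assuming Ktrain is the square training kernel matrix):

% Slow: (Ktrain(i,j)+Ktrain(j,i))/2 inside two nested for loops.
% Fast: the same symmetrization in a single vectorized statement.
Ktrain = (Ktrain + Ktrain')/2;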

Keep Going

Pushing even further, we can train in 1.5 minutes with about a 1% drop in performance. The parameters here are ntrial,nver,nskip=10,1,7 and performance is 86.4% @ 50% leakage. This probably is not worth it, since the makecascade execution time begins to dominate.

It is worth noting that the latest (tuned up) routines get 75% or better performance @ ntrain=10 for classifying test images into 1 of the 2 super-groups even when plevel=0, i.e. 100% leakage. This means that you force yourself to classify each and every test image. Of course as plevel increases we can get even better performance, because you are shunting the hard-to-classify test images by making no classification at all on them. So 75% is merely a lower bound on performance.

makesvm11

Instead of removing nskip training images just once, I now do it repeatedly (and randomly) for each trial. Ran a few trials; this seems to increase performance. But it is hard to say without dozens of trials (because of the variance in performance between trials).

Also, I don't know if this matters, but I was not executing randomize at the beginning.

Jan 19 (Fri) : Faster Training: Why Confusion Can Be Good

Each curve represents performance for an independent trial. Blue curves used the usual training procedure that takes 25 minutes to find the best 2 super-classifiers. Red curves required only 3.3 minutes of training, by randomly throwing away half the training images in makegroups. For black and green curves see code to left.

I mistakenly used nskip in makegroups4 instead of makecascade4.

This led to an interesting discovery: you can reduce the training set size from 10 to 5 when creating super-groups 1 and 2, improving the training speed by a factor of 7(!) with no reduction in performance.

See plot at right. Blue curves each required 25 minutes of training time, as compared to 3.3 min for the red curves. No noticeable drop in performance.

[y_blue,x]=makelevel4(10,25,200,2,0.01:0.01:0.99,1:250,10,3);
[y_red,x]=makelevel4(10,25,200,2,0.01:0.01:0.99,1:250,10,2,5);
[y_green,x]=makelevel4(10,25,200,2,0.01:0.01:0.99,1:250,10,1,5);
[y_black,x]=makelevel4(10,25,200,2,0.01:0.01:0.99,1:250,30,2,5);

Explanation

Perhaps you need a high number of categories (ie context) to establish a hierarchy, more than you need a high number of examples from each category.

In fact a high number of examples may even hurt you! The confusion matrix is our one and only tool for establishing the distance between categories (and thus the hierarchy). If you have more training examples, you are likely to become less confused so that the hierarchy is harder to establish.

Alternative explanation: as you increase the number of training examples you encourage the confusion matrix to lock on to finer and finer distinctions between images in categories. These details detract from the big picture. In other words, if our method is inherently imperfect, having a "blurry" picture keeps you from seeing these imperfections.

Summary

Drastic speed increase (training) with no noticeable drawback. How can this be?

I'm starting to understand something important here. We are used to thinking of confusion as being bad, because we want high performance. Thus a high number of training examples is good.

But for establishing a hierarchy, confusion is actually a good thing. So a high number of training examples can actually hurt you, and slow you down at the same time!

Here's an example. Suppose you had an infinite number of training examples, and your confusion began to approach 0. With no confusion, there would be nothing to use to construct the hierarchy!

Have yet to try this with, say, ntrain= 5, 15 or 30 but it should be interesting to see.

Jan 18 (Thu) : Software

makelevel4

cd ~/256/hist8b/1; makegroups4(10,25,200,2,1:250,[],11,100,10,3);
[y308,x]=makecascade4(10,25,200,0.01:0.01:0.99,1:250);

is equivalent to the new (simpler) syntax:

[y308,x]=makelevel4(10,25,200,2,0.01:0.01:0.99,1:250,10,3);

While adding this macro, it seemed like a good place to freeze the code and advance from version 3 to version 4.

So far the only functional difference is that makegroups4 will allow you to increase the speed by using only a fraction of the training set during super-classifier training. This is done with the new last parameter, nskip:

[y308,x]=makelevel4(10,25,200,2,0.01:0.01:0.99,1:250,10,3,2);

The behavior here would be to throw nskip=2 images out of each class in the super-classifier.

Jan 17 (Wed) : Software

makesvm11

I froze makesvm10.m and created makesvm11.m specifically for adding a new capability: using only a small fraction of the available training set. This is really only useful when training on super-classifiers, where there is a large pool of training data and using all of it is too slow.

More self-consistent getlearner11 and getclassifier11 function names, in place of svmlearner10 and svmclassifier10.

future

On my TODO list of other useful tools and modifications:

  • File load routine automatically searches parent directories
  • Save match files in a more compact format
    • Run times often dominated by time to read off network drive
    • Method 1: remap singles to 8-bit precision (see the sketch after this list)
    • Method 2: reindex symmetric matrices
    • Reduce size by 2 or more?
  • Routine to automate putting current MATLAB figure on wiki
  • Routine to empty/add/plot to a shared scratch space
    • Run results on many machines
    • Quickly gather and plot results on one machine
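A minimal sketch of Method 1 above, assuming the match matrix M is single precision and we are willing to quantize its value range:

% Quantize a single-precision match matrix to uint8 (4x smaller), then recover it.
lo = double(min(M(:)));  hi = double(max(M(:)));
M8 = uint8( round( 255*(double(M)-lo)/(hi-lo) ) );
save('match_compact.mat','M8','lo','hi');
% ... later:
S = load('match_compact.mat');
Mapprox = single( double(S.M8)/255*(S.hi-S.lo) + S.lo );   % good to ~0.2% of the full range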

Directory Issues

A small problem is that I cannot (currently) run all the routines in the same directory, because they all try to write and load their own version of group10_25_200.mat. For now I've created separate subdirectories for each machine, such as

cd ~/256/hist8b/1/topdown/308

At the moment, each sub-directory needs its own link to the match file:

for dir in 3* ; do 
   cd $dir
   echo $dir
   ln -s ../../match010_025_200.mat match010_025_200.mat
   cd .. 
done

Jan 16 (Tue) : Cascade Tuning (ICCV07)

More Parameter-space Searching

Blue, green and red represent parameters (10,3), (10,1) and (5,3) respectively.


ntrial   nver   Performance @ plevel=0.5
  10      3     ~89%
  10      1     ~86%
   5      3     ~83%


So far I'd use 10,3 when in a hurry and 30,3 for final results.

These take on the order of 25 min and 1hr 20min (respectively).

Next want to experiment with only training super-groups on a fraction of the total data.

Jan 15 (Mon) : Cascade Tuning (ICCV07)

Further tests show that ntrial,nver= 100,3 performs 1 or 2% better than 30,3 while 30,5 performs about 1% worse (on average).

Jan 12 (Fri) : Cascade Tuning (ICCV07)

Same plot as Dec 28, now with correct priors on super-groups 1 and 2. The very low-performance curves from Dec 28 are now gone. These indicated unbalanced super-groups.

The plot at right indicates that, for Caltech-256, splitting into one of 2 super-groups (level-1) can be achieved with about 91% accuracy at a leakage of 50%.

  • This is pretty encouraging. Certainly good enough performance to get started.
  • For Caltech-256 Spectral Clustering works best for choosing super-groups.
  • For Caltech-101 Monte Carlo Clustering works best. I'm assuming this is because Spectral Clustering is creating unbalanced groups?

Performance is now quite a bit improved, after properly accounting for the prior (ie. imbalanced sizes) of the two super-groups.

It would be interesting to analyze

  • Which categories have the most classification errors, and why?
  • Smaller training set (use a sub-set of the images from each category: speed? accuracy?)
  • Pairwise vs. one-vs-all
  • What if there is an n-way split instead of a 2-way split?

Jan 6-11 (Sat-Thu) : Studying

Studying for final part of quals.

I shall disappear.

Jan 4-5 (Thu-Fri) : Memory Problem Fixed

With the old matching kernels, I ended up having to flip them 90 degrees in makesvm10. They are so big that this causes the machines to swap or crash. This has been an annoyance for a long time.

Recomputing Kernel Match Matrices

Rerunning all the kernel matrices for every single experiment in ~/101/hist8b and ~/256/hist8b. Fortunately none of the machines are being used right now. Even on ~15 machines at once it's going to take about 3 days. But in the long-run it will speed up the experiments quite a lot.

Now makesvm10 will use half as much memory. This should eliminate the problem.

Jan 2,3 (Tue,Wed) : The Big Code Cleanup

Winter Cleaning

I'm up above 30,000 lines of matlab code at this point. Much of it was things I was trying but ended up not using, algorithms I replaced with something faster, etc.

Need to get rid of some dead wood.

Refactoring

Subversion comments, revision 307.

Note: ask Alex about pairwise vs. one-vs-all performance.

Main change was gutting svmclassifier10 and svmlearner10, getting rid of modes I don't use 
any more. I can always go back and pull these out of svm*7 if necessary, but for now it 
simplifies everything.

Changed makematch to create a 90-degree-rotated version of the two match matrices it made before. 
Trivial change... but it means that makesvm10 doesn't have to flip them anymore. They were so 
big that flipping was becoming impossible to fit in memory. Slow. Machines were swapping. Now 
hopefully I can handle N_train>50? 

Also it was just inconvenient to always keep track of whether you're indexing the pre-flipped or 
the post-flipped matrices.

So I have pairwise SVM mode working now, except it still doesn't work as well as maxwin and I'm 
not sure why. Not sure if dagsvm works: probably not, but I started looking at it. It would be 
nice to debug both these methods so that they can be compared against one-vs-all. They are also 
faster (most of the time).

Conceptually (in the paper) it is going to make more sense to compare tree comparisons to pairwise
(one-vs-one) comparisons. Then you're just saying, "these are blocks of the one-vs-one match that
we're just going to cut out, because they aren't on the same branch". What is going on with the 
clustering procedure has a direct analog with what pairs are compared.

With one-vs-all (which is what I'm doing now) this will take more words to describe. Anyway, I'd 
just be more comfortable if one-vs-all and one-vs-one didn't differ in performance by ~ 5%.

--This line, and those below, will be ignored--

M    matlab_code/topdown/makecascade3.m
M    matlab_code/misc/makesuffix.m
M    matlab_code/match/makematch10.m
M    matlab_code/svm_v0.55/@dagsvm/fwd.m
M    matlab_code/svm_v0.55/@pairwise/train.m
M    matlab_code/svm_v0.55/@pairwise/fwd.m
M    matlab_code/svm_v0.55/@maxwin/fwd.m
M    matlab_code/svm/makesvm10.m
M    matlab_code/svm/svmclassifier10.m
M    matlab_code/svm/svmlearner10.m

Dec 30 (Sat) : Improvement

Improvement

There was a potential problem with makecascade3.m. The plot below (Dec 29) uses 25% regions on both the left and right sides of the margin distribution. That's fine... if the group sizes are equal. In some Caltech-101 groupings this was not the case.

Now there is an extra bias parameter that sets the correct prior for super-groups 1 and 2.

This may explain why the results from Dec 28 were a bit erratic.

Dec 29 (Fri) : makecascade (ICCV07)

Leakage Threshold

Histogram of margins, when classifying test images into one of two super-groups. Here leakage= 0.5. High margin indicates a large likelihood of group 2, while a low margin indicates group 1. The middle 50% of test images are left undecided. When leakage=0.0, all test images are undecided. When leakage=1.0, no images are undecided.

I want to illustrate a little more clearly exactly what is going on with the x-axis in the below plot (from Dec 28). The term leakage here refers to the fraction of data that gets passed to a lower level of the cascade. For example if leakage=0.5 then exactly half the test images will be classified into one of the two sub-categories. This half will benefit from a speed boost, since we only need to check the image against the other categories in its superclass.

The other half of the test images are not passed down. They do not benefit from a speed increase - they must be checked against every single category.

So why not pass all the test images down one of the two branches (leakage=1.0)? The plot to the right shows the margins for classifying test images into one of the two branches. A large positive or negative margin indicates high confidence. A small margin indicates low confidence. Picking just the fraction with the highest confidence gives us a better chance of making accurate predictions. Only the middle region is left unclassified; we are just admitting that those test images are too close to call. As the leakage threshold gets smaller, makecascade becomes more and more picky about what test images it will try to classify. A leakage threshold of 0.0 tells makecascade to leave all test images unclassified. This is the same as having no cascade at all.
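In sketch form, the two cut points for a given leakage could be read off the sorted margins like this (assuming equal group priors; the Dec 30 entry adds a bias for unequal group sizes):

% margin = super-classifier margins (low = group 1, high = group 2), as in the histogram above.
leakage = 0.5;                                   % fraction of test images passed down a branch
m  = sort(margin);   n = length(m);
lo = m( max(1, round(     leakage/2 * n)) );     % bottom leakage/2 fraction -> group 1
hi = m( min(n, round((1 - leakage/2)* n)) );     % top    leakage/2 fraction -> group 2
undecided = (margin > lo) & (margin < hi);       % middle 1-leakage fraction stays undecided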

Summary

Changing the leakage parameter lets us select for higher speed or higher performance. There is a direct tradeoff.

Dec 28 (Thu) : makecascade

Today's tests show us that the super-classifiers work. At least they give results that are in the right ballpark. This is a good sanity check: otherwise there would be no point in going on with further tests.

First Test of Cascade Performance

Here is the test (on 8 different machines). Note that makecascade is linked to makecascade3:

 
cd ~/101/hist8b/1; makegroups3(10,21,200,1,2:101,[],11,100,10,3); [y111,x]=makecascade(10,21,200,0.01:0.01:0.99,2:101);
cd ~/101/hist8b/1; makegroups3(10,21,200,1,2:101,[],11,100,30,3); [y112,x]=makecascade(10,21,200,0.01:0.01:0.99,2:101);
cd ~/101/hist8b/1; makegroups3(10,21,200,2,2:101,[],11,100,10,3); [y121,x]=makecascade(10,21,200,0.01:0.01:0.99,2:101);
cd ~/101/hist8b/1; makegroups3(10,21,200,2,2:101,[],11,100,30,3); [y122,x]=makecascade(10,21,200,0.01:0.01:0.99,2:101);
cd ~/256/hist8b/1; makegroups3(10,25,200,1,1:250,[],11,100,10,3); [y211,x]=makecascade(10,25,200,0.01:0.01:0.99,1:250);
cd ~/256/hist8b/1; makegroups3(10,25,200,1,1:250,[],11,100,30,3); [y212,x]=makecascade(10,25,200,0.01:0.01:0.99,1:250);
cd ~/256/hist8b/1; makegroups3(10,25,200,2,1:250,[],11,100,10,3); [y221,x]=makecascade(10,25,200,0.01:0.01:0.99,1:250);
cd ~/256/hist8b/1; makegroups3(10,25,200,2,1:250,[],11,100,30,3); [y222,x]=makecascade(10,25,200,0.01:0.01:0.99,1:250);

I vary three things:

  • makecascade3(10,ntrain,200,classes)
    • 21, 2:101  :  Caltech-101 (blue line)
    • 25, 1:250  :  Caltech-256 (red line)
  • makegroups3(10,25,200,method,1:250,[],11,100,ntrial,nver)
    • 1  :  Monte Carlo Clustering (dashed line)
    • 2  :  Spectral Clustering (solid line)
  • makegroups3(10,25,200,method,1:250,[],11,100,ntrial,3)
    • 10  :  fast verification  (dim line)
    • 30  :  slow verification (bold line)
Dec28.png

The x-axis is the leakage threshold: the fraction of the data that leaks to the next level in a 1-level cascade. In other words, a leakage of 0 means there is no cascade, while a leakage of 1.0 forces all data to cascade (into either the left or right branch).


Transfer all the y's from all the computers onto one machine, and then:

h=plot(x,y111,x,y112,x,y121,x,y122,...
x,y211,x,y212,x,y221,x,y222);
label={};
for i=1:length(h)
    dataset=       bitand(bitshift(i-1,-2),1);
    clustering=    bitand(bitshift(i-1,-1),1);
    verification=  bitand(bitshift(i-1, 0),1);
    if dataset     == 0 
         color= 'b' ; label{i}= 'Caltech-101';
    else
         color= 'r';  label{i}= 'Caltech-256';
    end
    if clustering  == 0 
         style= '--'; label{i}= [ label{i} ' MC' ]; 
    else
         style= '-';  label{i}= [ label{i} ' SP' ];
    end
    if verification == 0
         label{i}= [ label{i} ' 10' ];
    else
         label{i}= [ label{i} ' 30' ];
    end               
    set(h(i),'linestyle',style,'color',color);
    set(h(i),'linewidth',verification+1);    
end
legend(label);

Summary

This plot is a little busy but a few things are becoming clear:

  • Cascade performance is already good enough to try full-blown tests

See solid red lines. About 90% of the super-category classifications are correct at a threshold of 50%. That is, when 50% of the data can be classified into 2 smaller (more efficient) subgroups. This sort of cascade is likely to have significant speed increases with only a 10% drop in performance. And we aren't even trying hard yet.

  • Performance can be improved even more.

There are lots of things I'm not even trying yet. How about:

  1. Explore more of the parameter space (Ntrain, Nver )
  2. 0-th super-category consisting of categories that we won't try to super-classify
    1. Find the background data that does not fall into either cluster
    2. Isn't this what Marco's paper is about?
    3. Do his algorithms apply?
  3. More than 2 branches, ie. more specific super-categories.
    1. Use lowest optimal number of clusters found by spectral clustering?
  4. Analyze what categories are being mis-classified the most
    1. Determine through a verification subset of the training set?
    2. Look for patterns, some way to predict the "trouble" categories.
  5. Different cascade geometry and/or algorithm?
  • The more categories the better the performance

In other words, context helps. Caltech-256 performed better than Caltech-101 in the above test. That is, a cascade made up of 2 super-categories is more accurate when there are more categories in the dataset. Again think of a 3-year-old child. Give her examples of 3 visual categories and it is hard to construct their hierarchical relationships. But give her 30 or 300 and it is more likely that she will begin to perceive the larger patterns that they fall into.

Dec 27 (Wed) : makecascade3

Overview of cascade structure

How Does Our Cascade Work?

To recap, makegroups starts by creating a confusion matrix (with the training data only) and clustering it into two super-groups. Spectral clustering seems to work better than Monte Carlo clustering for this - but not always.

If a human being created two super-groups, they would probably choose something like "animate objects" and "inanimate objects". makegroups does something similar: animate objects tend to be on a single branch. But the distinction is not so cut-and-dried. For example, kayaks may be grouped with dolphins. Or birdbaths may be grouped with birds. This isn't necessarily a bad thing - as long as the resulting super-groups correspond to useful classifiers.

These super-classifiers are constructed by pooling (a large number of) images from the two super-groups. The hope is that they would be able to correctly classify out-of-sample test images as belonging to one of the two super-groups. Then we could pass them down two separate branches, using makecascade.

It is convenient to wrap both steps inside a single function.

  • makelevel
    • makegroups : create super-groups using only the training data
    • makecascade : classify test images as belonging to one of the two groups

See figure (right). Further levels are now created by running makelevel recursively.
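
Roughly, the recursion looks like this (hypothetical wrapper with illustrative argument lists only; the real makegroups3/makecascade calls take many more parameters):

function makelevel(classes, leakage, depth)
% hypothetical recursive wrapper: group, cascade, then recurse on each branch
if depth < 1 || length(classes) < 2, return; end
gtrains= makegroups(classes);                 % super-groups from the training data only
makecascade(leakage, classes);                % classify test images into one of the two groups
for b= 1:2
  makelevel(gtrains{b}, leakage, depth-1);    % further levels by recursion
end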

Leakage Threshold

The tradeoff between speed and performance will come from adjusting the leakage threshold. This determines how many test images get passed to the next level. We only pass the test images we are most sure about, ie. those the SVM gave a very large margin.

So if leakage=1.0 then every image is passed down to one of the two super-groups. If leakage=0.0 nothing is passed and it is exactly like what we normally do.

Primary Goal: Speed

The reason for using such a cascade is to speed things up. Once you have a rough idea what super-group your test image is in, you no longer have to test for images from the other super-groups. The more recursive branches you have, the less testing you need to do down at the leaves.
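
A quick back-of-the-envelope on the speed-up (my own estimate, assuming a balanced tree, leakage=1.0 and one-vs-all classification at the leaves):

N= 256; d= 3;                    % number of categories, cascade depth
cost_flat=    N                  % flat one-vs-all: 256 SVM evaluations per test image
cost_cascade= d + N/2^d          % 3 super-classifier tests + 32 leaf tests = 35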

Secondary Goal: Performance

It is also possible that the performance (ie. accuracy) will be increased. Consider an individual test image which is passed down the cascade. There is some risk of it being misclassified at one of the branches, but the classification accuracy should be higher once we get down to the leaves (because we'll only need to test for a limited subset of categories). It is not yet clear how these two factors balance out.

Dec 26 (Tue) : makegroups3

See the Top-Down Classification Roadmap below if you want to know where this fits into the overall analysis pipeline. It is the half of makelevel that uses only the training data. Later, in makecascade3, two mega-classifiers are then constructed from these two groups.

Syntax:

makegroups3(10,25,200,1,1:250);

Have found defaults for nver and tver that seem reasonable in makegroups3, at least for testing:

    nver= ceil(100/ntrain);
    tver= round(ntrain/4);
    svmdefaults{5}= nver;
    svmdefaults{6}= tver;

In other words, 25% of the training data is treated as test data, and used to construct a confusion matrix that is used for grouping purposes only. This procedure is repeated about nver times. Note: this ratio is to be used for quick and dirty testing only. Remember to increase this for the final tests.

The variables that are getting stored in the group010_025_200.mat file are:

gtrains
2 cells containing indices of the categories used for each branch of the cascade.
itrain
For convenience, this is just gtrains{1} and gtrains{2} concatenated together. Useful, for example, for viewing ctrain(itrain,itrain).
ctrain
Confusion matrix constructed entirely from training data, which was the only data used to construct the groups
method
1=monte carlo clustering, 2=spectral clustering, 3=phylogenetic tree
categories
List of all category names (straight from match010_025_200.mat).

This is slightly different from what the Top-Down Classification Roadmap shows.

Dec 23 (Sat):

Addendum To Yesterday

It turns out that spectral clustering does yield balanced trees if you null out the diagonal. Now the groups are of length 134 and 116 which is fine.

So this means we can try both methods head-to-head and see how they do.

By the way, the new syntax seems more consistent:

[gtrains,itrain,error]= getgroups3( ctrain2 , 2 ); 
viewgroups(ctrain2,gtrains);
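
Sketch of the preprocessing (assuming ctrain is the raw training confusion matrix):

ctrain2= ctrain + ctrain';                    % symmetrize for spectral clustering
ctrain2(logical(eye(size(ctrain2))))= 0;      % null out the diagonal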

Dec 22 (Fri) : Cascade: Making The Decision Tree

Three methods are on the table: Monte Carlo Clustering, Spectral Clustering and Phylogenetic Trees. The first two are the simplest so which one to try first?

Monte Carlo vs. Spectral Clustering

Top-Down Classification Roadmap

method   Algorithm                Advantage                  Disadvantage
1        Monte Carlo Clustering   Asymmetric matrix is ok    Slower; inelegant
2        Spectral Clustering      Faster; elegant            Unbalanced? Ugly ad-hoc symmetrizing

Manual example (in ~/256/hist8b/1) just so I remember the syntax:

ctrain=makesvm10(10,25,200,1:250,[],11,100,5,2);
[gtrain1,gtrain2,gtrain]= getmonte(ctrain,1);
viewsplit(ctrain,gtrain1,gtrain2);

The resulting groups are of length 121 and 129. Pretty well balanced!

The problem I'm having with spectral clustering is that it has no incentive to make the groups balanced:

[gtrain1,gtrain2,gtrain]= getmonte(ctrain,2);

The resulting groups are of length 245 and 5! I'm not sure how to fix this problem... perhaps by creating n groups and then merging them in a bottom-up fashion. Or by adopting an algorithm that does not care if the trees are balanced. If you want performance, however, I'm guessing that balanced trees are essential, no matter what algorithm you use?

Summary

In order to get balanced trees, I may be stuck with monte carlo clustering for the time being. Need to discuss the problem with either Pietro or Lihi.

Dec 21 (Thu) : Software

Software Software Everywhere

I have finally documented the current state of the code. It is getting more complicated each day so it helped me to sit down and draw up the big picture in OmniGraffle (see right).

Dec 20 (Wed) : Top-Down

Which Taxonomies To Use Now For Top-Down Classification?

From the Dec 17 test (see below) it is looking pretty likely that spectral clustering will do the job. Need to make an easy-to-use function to do this, then get it into the top-down classification pipeline. That's the nuts and bolts part.

Some open questions:

  • Which is better: Spectral Clustering or Monte Carlo Clustering?
    • Ultimately this will probably be settled in the arena of top-down classification performance
    • Spectral Clustering is fast and efficient
    • But: Monte Carlo Clustering does not require ad-hoc symmetrizing of confusion matrix
  • Does spectral clustering correctly guess the total number of categories?
    • What are its 1st, 2nd, 3rd, etc. guesses?
    • Compare this to my naive hand-made taxonomy for Caltech-256.
    • Isn't there an interesting psychometric experiment here? Ask Pietro, Shin?
  • Can either method throw out a set of "background" categories that are hard to taxonomize?
    • Isn't this pretty much Marco's CVPR 2007 paper?
    • Ask Marco about this?
  • Which is better: confusion matrix or probability distribution matrix?
    • Probably the latter, but can SVM probability estimates be believed?
    • Does it make the paper too complicated?

We don't need all the answers above to crank out a top-down classification paper. But sooner or later we need to deal with these issues.

Dec 19 (Tue)

Two Papers?

I'm starting to think that the phylogenetic tree story is complicated enough that it should be handled in a separate paper from the top-down category recognition paper.

For now I'll finish up Top-down Category Recognition for ICCV2007 with little or no mention of the phylogenetic approach. Merge this in later, or keep it as a separate paper. Ask Pietro?

Final Tree Tweaking

Simplified the maketree2.m algorithm and arguments, now called maketree3. Ran the experiments below (the last one always crashed):

[d,w,x]=maketree3(conf(indx,indx),categories(indx),'test37',  0,1,  3,0,0 , '', '' );
[d,w,x]=maketree3(conf(indx,indx),categories(indx),'test37G',  0,1,  3,0,0 , 'G', '' );
[d,w,x]=maketree3(conf(indx,indx),categories(indx),'test37S',  0,1,  3,1,0 , '', '' );
[d,w,x]=maketree3(conf(indx,indx),categories(indx),'test37SG',  0,1,  3,1,0 , 'G', '' );
Resulting trees: test37, test37G (global optimization), test37S (subreplicates).

Other than differences in overall rotation, they seem very similar.

Raccoon and iguana are mostly in very lush vegetation, which probably explains the confusion with ferns and flowers?

This raises the important point that these pictures are being taxonomized based on background just as much as foreground. Aside from segmenting, is there anything that can be done about this?

How Does Taxonomy Quality Depend On Ntrain?

How do taxonomies vary with Ntrain = 5, 15, 30? This is an important thing to know. If Ntrain=5, for example, and we can't get a reasonable taxonomy, then the whole top-down approach is likely to break down.

Generate confusion matrices overnight, for Caltech-256 and Caltech-101...

vision309>> [pmean,pstd,pall,conf]= multisvm10('~/256/hist8b',1:8,5,25,200,1:256,[],11,100,125,1);
vision310>> [pmean,pstd,pall,conf]= multisvm10('~/256/hist8b',1:8,15,25,200,1:256,[],11,100,25,5);
vision313>> [pmean,pstd,pall,conf]= multisvm10('~/256/hist8b',1:4,30,25,200,1:256,[],11,100,25,5);
vision318>> [pmean,pstd,pall,conf]= multisvm10('~/101/hist8b',1:8,15,16,200,2:101,[],11,100,25,5);
vision310>> [pmean,pstd,pall,conf]= multisvm10('~/101/hist8b',1:8,5,31-5,200,2:101,[],11,100,125,1);
vision314>> [pmean,pstd,pall,conf]= multisvm10('~/101/hist8b',1:8,30,1,200,2:101,[],11,100,25,5);

...then check on this tomorrow. Also running match030_025_200.mat for 5:8 subdirectories, which do not yet exist.

Summary

  • Taxonomic relationships seem to be based as much on foreground as background
    • Example: iguana and raccoons tend to be in lush forest settings
    • Need to confront the segmentation problem?
  • Streamlined maketree3 routine seems to work reasonably well
  • Adding global optimization and subreplicates has little effect
    • Both these optimizations are slow, so that's good news
  • Running some final 256x256 taxonomies now
    • Then will construct trees and see if Ntrain matters.

Dec 18 (Mon) : Phylogenetic Trees

Importance of Context

On Friday, I found that a 37x37 confusion matrix made up of just the 37 test categories generated almost meaningless trees. My guess is that, just like a human, you need to see numerous categories in order to get a feel for their taxonomic relationships. For example, 3 categories clearly isn't enough. Apparently, for the computer, 37 isn't enough either.

Experiment

Instead, cut the 37 categories out of the larger context of a full 250x250 confusion matrix. In each of ~/256/hist8b/[1234] do:

[conf15,perf]= makesvm10(15,25,200,1:250,[],11,100,25,5);
[d,w]=maketree2(conf15(indx,indx),categories(indx),'test01',1,3,999,0,1);
[d,w]=maketree2(conf15(indx,indx),categories(indx),'test12',1,3,999,1,2);
[d,w]=maketree2(conf15(indx,indx),categories(indx),'test22',1,3,999,2,2);

Here's a comparison:

  1. 37x37 matrix (from Friday)
  2. 37x37 matrix in context
  3. added subreplicates
  4. added global rearrangements (slow!)


As already discussed, the 1st result is nonsense.

The 2nd result is the best. It puts flowers, mammals, structures, tools and appliances on separate branches with only occasional errors. My guess is that cactus ends up next to dog because it is confounded with porcupine? The other poorly classified categories are pyramid and iguana.

The 4th result is an improvement on the 3rd.

Summary

  • Computers can't learn taxonomies with only a few categories
    • Neither can a human!
    • Thus a small subset of categories can be tested, but...
    • ...only within the context of a larger confusion matrix
  • Subreplicates isn't helping (at least not in current form)
  • Global rearrangement helps a little? (but is very slow)
  • Best affinity-to-distance conversion algorithm still unknown
  • Need more experiments to see what works best. Try
    • Using branch lengths
    • Changing power-law of distance

Dec 17 (Sun) : Clustering

Start by getting the best hierarchy trees we can from the confusion matrix. Compare 3 different methods, head-to-head:

  • Monte Carlo Clustering
  • Spectral Clustering
  • Phylogenetic Algorithm (Fitch-Margoliash or Minimum Evolution)
  • (Sammon Mapping?)

There will be a few code improvements along the way.

Improved Confusion Estimate

So far a leave-one-out strategy has been used for generating a confusion matrix with nothing but the training data. I'll call this the confusion estimate from now on. A new argument tver in makesvm10 allows for a leave-n-out strategy instead. This generates a less noisy confusion estimate in less time:

cd ~/101/hist8b/1
[conf101,perf]= makesvm10(15,16,200,2:101,[],11,100,100,2);

results in conf015_016_200.mat. Likewise I ran 4 copies of

cd ~/256/hist8b/2
[conf256,perf]= makesvm10(15,25,200,1:250,[],11,100,25,2);

on 4 machines and combined the results.

Monte Carlo Clustering

Simple top-down monte-carlo block clustering is invoked like this:

[c101,indx1,indx2,err]= deconfuse2(conf101,20000,1);
[c256,indx1,indx2,err]= deconfuse2(conf256,20000,1);

where it doesn't usually help to go any higher than 20000. For this many iterations, rows and columns are exchanged so as to reduce the cross-variance between clusters. Multi-level trees come from invoking the same routine recursively on each cluster.

The confusion matrix need not be symmetric for this method to work (but deconfuse2 currently imposes this).
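
The core idea, stripped down (my own sketch, not the deconfuse2 implementation; here the off-diagonal block mass stands in for the cross-variance criterion):

% c: symmetrized confusion matrix;  g: group label (1 or 2) for each category
n= size(c,1);
g= (randperm(n) > n/2) + 1;                           % random initial 2-way split
cost= @(g) sum(sum(c(g==1,g==2))) + sum(sum(c(g==2,g==1)));
best= cost(g);
for it= 1:20000
  i= ceil(rand*n);
  gtry= g; gtry(i)= 3 - gtry(i);                      % move one category to the other group
  if cost(gtry) < best, g= gtry; best= cost(gtry); end
end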

Spectral Clustering

Using ZPclustering.zip code from Lihi's web page.

cd ~/256/hist8b/1
load conf.mat 
a= conf+conf';
for i=1:length(a), a(i,i)= 1; end;
[clusts,indx,q,v]= cluster_rotate(a,2);
indx= [clusts{1}{1} clusts{1}{2}]; 
aspec= a(indx,indx);
[amonte,indx1,indx2,err]= deconfuse2(a,20000,1); 
for i=1:length(aspec), aspec(i,i)= 0; amonte(i,i)=0; end

Here's a side-by-side comparison of these first two clustering methods:

Comparison of Monte Carlo Clustering (left) to Self-tuning Spectral Clustering (right) for Ncluster=2.
Largest clusters are 132 and 142 categories for the two methods, respectively.

With spectral clustering you can specify an arbitrary number of clusters. But the hierarchical relationships between clusters are not so clear?

Self-tuning Spectral Clustering for Ncluster=4 (left) and 8 (right).
Self-tuning Spectral Clustering has a slight preference for 2,11, or 17 clusters.

Preferred cluster sizes vs. Ncluster for some trial values. I'm just trying to get a feel for whether they form hierarchies:

Ncluster Cluster sizes
2 142,108
4 106,80,6,58
8 46,74,4,35,11,26,24,30
11 149,5,8,5,5,34,3,11,5,8,17

Investigate hierarchies: do the Ncluster= 2, 4 groups overlap? Is the group of size 108, for example, almost the same as the group of size 106? If so then spectral clustering is just taking the 142-member group and chopping it into smaller pieces.
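
A quick way to check (sketch, reusing the cluster_rotate call from above; which clusters come out first is not guaranteed, so the indexing is illustrative):

[clusts2]= cluster_rotate(a,2);                 % 2-way grouping
[clusts4]= cluster_rotate(a,4);                 % 4-way grouping
g2= clusts2{1}{1};                              % e.g. the 108-member group
g4= clusts4{1}{1};                              % e.g. the 106-member group
overlap= length(intersect(g2,g4)) / length(g4)  % close to 1 would mean g4 nests inside g2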

Summary

Spectral clustering looks preferable to monte carlo clustering, but I'm not sure how best to construct a tree. Need to investigate hierarchical relationships in Ncluster=2,4,8 clusters, for example. Do they hold common sub-categories?

Three possibilities for constructing the trees are

  1. top-down : recursively break groups in half as before
  2. bottom-up: use optimal number of clusters and then merge to larger groups
  3. other: create Ncluster=2,4,8... groups and look for common elements

Do we even need a tree? How about an embedding in a 2D or 3D space, constructing quad trees and then searching the space? For now a tree is so useful for top-down searching that I'd like to stick with it.

Dec 16 (Sat) : Phylogenetic Trees

Code Cleanup

I finally cleaned up the code a great deal, stripped away some old directories etc. that I don't use anymore, and committed to an svn repository. This had started in an svn repository but that got corrupted somewhere along the way and stopped working.

Old codebase is in ~/ee148/matlab_code.

New codebase is in ~/pyramid/matlab_code.

svn repository is accessible via svn+ssh://arbor.caltech.edu via my old 2005 default password.

Phylogenetic Trees: New Procedures To Try

Two improvements to try here:

Subreplicates
Different file format, and "use subreplicates" option selected. In English, this means that data can be left out where none is available. A lot of the confusion matrix is empty, so this should leave the tree free to find better configurations. And faster. In theory.
Global Branch-Swapping
Does this help moderate the bottom-up approach, and generate trees that are more useful for top-down purposes? Let's experiment and find out.

Bug: kitsch and fitch cannot seem to handle labels more than about 10 characters long, for some reason. And no capital letters. It has a very finicky input file parser.

Changed from kitsch to fitch because it allows global rearrangements.

Without the 2 improvements above:

[d,w]=maketree(conf,categories(2:101),100,1,3,999,0,1);

and with:

[d,w]=maketree(conf,categories(2:101),100,1,3,999,1,2);

Note: 3rd argument above really does nothing but determine the output file prefix.

Experiment

So the plan is to try these two methods on a small subset of categories only. This is a subset with very clear hierarchical relationships that I decide in advance. Then compare what is found to my "ground truth".

Parameters that are varied:

  1. Global Branch-Swapping: on and off
  2. Subreplicates: on and off
  3. Different methods for
    1. Converting affinity to distance
    2. generating subreplicate weights
  4. Probability matrix instead of confusion matrix
    1. Holds more information
    2. Estimate probabilities - use method of Platt? [1]

Whatever gives the best results, extend to larger category sets.

Test set:

plants= [ 154 15 25 68 103 204 118 ];
animals= [38 80 129 152 164 85 56 186 84 116 168];
appliances= [171 239 237 156 21 142 220];
structures= [187 143 167 188 86 245 132 225];
tools= [125 234 180 199 ];
c=categories;
c={c(animals),c(plants),c(appliances),c(tools),c(structures)};
showstrings(c,4,15)
    1  038.chimp           2  080.frog            3  129.leopards-10     4  152.owl
    5  164.porcupine       6  085.goat            7  056.dog             8  186.skunk  
    9  084.giraffe        10  116.iguana         11  168.raccoon

    1  154.palm-tree       2  015.bonsai-101      3  025.cactus          4  068.fern
    5  103.hibiscus        6  204.sunflower-1     7  118.iris

    1  171.refrigerato     2  239.washing-mac     3  237.vcr             4  156.paper-shred
    5  021.breadmaker      6  142.microwave       7  220.toaster

    1  125.knife           2  234.tweezer         3  180.screwdriver     4  199.spoon

    1  187.skyscraper      2  143.minaret         3  167.pyramid         4  188.smokestack
    5  086.golden-gate     6  245.windmill        7  132.light-house     8  225.tower-pisa

load hist005_025_200.mat; % just to get categories
indx= [animals plants appliances tools structures];
save treetesting.mat animals plants appliances tools structures conf categories indx

Also acceptable if the algorithm puts porcupine on the same branch with cactus!

And now the experiment:

cd ~/256/hist8b/1;
load treetesting.mat
[conf15,perf]= makesvm10(15,25,200,indx,[],11,100,100,5);
[d,w]=maketree(conf15,categories(indx),37,1,3,999,0,1); % method 1
[d,w]=maketree(conf15,categories(indx),37,1,3,999,1,2); % method 2 (better?) 

Failed attempt to generate taxonomies with only 37 categories. I think the computer hasn't seen enough categories yet to build meaningful taxonomic relationships.

These trees look like junk. It is as if the leaves were put next to each other at random - whether I use method 1 or method 2 above.

Summary

An interesting discovery today. Building a hierarchy seems to require large numbers of categories. I got junk trees (see figure to the right) when running just the 37 test categories, and just a 37x37 confusion matrix.

Hypothesis: you can't generate meaningful trees on a small subset of the data. This is the standard procedure most of the time: start with 5 categories, move up to 10, 50 and so on. But this won't work here.

Suppose you were a human infant who had never seen the world. Could you make a hierarchy after seeing just 5 categories? 10? 37? No. You would need to see a great many categories before you got a feel for their taxonomy. The computer is no different.

That's intuitive but I never really thought about it before.

Need to confirm the above hypothesis. To be continued...

Dec 15 (Fri)

Plan for next week

  • Tree algorithm
    • Use subreplicates : improvement?
    • Compare two different methods for affinity-to-distance conversion
    • Algorithm for finding logical branch points
    • Any top-down bottom-up hybrid algorithms out there?
  • SVM
    • Figure out why pairwise and dagsvm modes aren't working
    • This would speed things up considerably
  • Top-down classification experiments
    • Fix bugs in pipeline
      • Compare level=0 and level=1 performance
    • Simplify pipeline
      • Generalize to any number of levels
      • Compare level=2,3... performance
  • Finish up ICCV paper
    • Phylogenetic section
    • Results & Conclusions
    • Pietro, Eugene for comments

Plan for 2007

I sat down and tried to think about where all this is going over the next few years. See diagram below:

Vision Projects 2007.png


Dec 11-14 (Mon-Thu)

Home with gastroenteritis...

Dec 6 (Wed)

Original Dataset + 8 New Categories (for Ryan)

This is for a possible supplemental figure. There was a problem before where some of the file names in the category 138.matress were truncated. Ryan sent the index file and now we can fix that:

load ~/256/ryan/trainIndex.mat
load ~/256/hist8b/1/hist075_004_200.mat
ftrain1= ftrain(trainImages);
newcats={'129.leopards-101','253.faces-easy-101','232.t-shirt',...
'145.motorbikes-101','251.airplanes-101','240.watch-101','137.mars',...
'092.grapes'};
cd ~/256/sift8
ftrain2= {};
for cat=newcats,
  x= pickfiles(cat,350,0);
  ftrain2= {ftrain2{:} x{:}};
end
ftrain= {ftrain1{:} ftrain2{:}};
for i=1:length(ftrain), ftrain{i}((end-2):end)='mat'; end
makehist7({},'~/256/ryan','cvpr',ftrain,{});
makematch7(1,{},'~/256/ryan','cvpr');

Running now on vision310.

Resulting file should be ~/256/ryan/histcvpr.mat

Dec 3 (Sun)

Final Update

There are still some bugs to work out so (unfortunately) this is going to have to wait for ICCV 2007.


Dec 2 (Sat)

Speed And Performance As A Function Of Threshold

Need plot showing tradeoff between speed and performance as the threshold is adjusted.

Syntax Notes

makebranch(0.25,15,16,200,1:100,[]);

New File for Ryan Bombed

Oops- corrected a mistake: searching for .mat has to be done carefully since 138.matress is a category. There appear to be some errors in the filelist which I'll track down later (this is going in the supplemental section anyway).

Dec 1 (Fri)

Tree Structure, Algorithm and Code Overview

Current overall algorithm that appears to work best?

makebranch

  • Simple enough that it can be re-run recursively
  • Threshold yet to be determined.
    • Specifies how many files are selected for 1&2 vs. Neither
    • Selection based on their SVM margins.
  • Use training data only


makesummary

  • Traverse the tree gathering stats
    • Confirm how many files were declared 0/1/N at each level
    • training/testing execution times at each level
    • final performance and confusion

Exampletree2.png

In more detail, makebranch runs the following:

  • makesvm
    • hist*_200.mat => match*_200.mat => conf*_200.mat
    • Inputs: overall match file
    • Outputs: "training" confusion matrix
      • Uses only training set, leave-one-out, nver>0
  • makegroups2
    • conf*_200.mat => group*_200.mat => match*_001.mat
    • Inputs: "training" confusion matrix
    • Outputs:
      • two groupings that minimize confusion matrix crossterms
      • match file classed by these 2 groups, ready to test
        • with all the original classes & file info & margins
        • the leaves will need this
  • makecascade
    • Takes threshold as argument
    • Assign each training image to either 0, 1 or 2 (Neither)
    • Train on match*_001.mat
    • Build 3 new match matrices:
      • N0train x N0test , N1train x N1test , N2train x N2test
    • Where:
      • N0test+ N1test+N2test = Ntest
      • N0train+ N1train= N2train= Ntrain
    • Inputs: match*_001.mat
    • Output: match*_0[012]1.mat

New Match File for Ryan

See new makehist7 which is a trivial modification of makehist6. The only difference is that you can manually input any ftrain, ftest files you want. Just so I remember the syntax:

cd ~/256/sift8;
load ~/256/ryan/trainNames.mat
ftrain1= trainNames;   % set of 5097 files
newcats={'129.leopards-101','253.faces-easy-101','232.t-shirt','145.motorbikes-101','251.airplanes-101','240.watch-101','137.mars','092.grapes'};
cd ~/256/sift 
ftrain2= {};
for cat=newcats,
  x= pickfiles(cat,350,0);
  ftrain2= {ftrain2{:} x{:}};
end
ftrain= {ftrain1{:} ftrain2{:}};
for i=1:length(ftrain), ftrain{i}((end-2):end)='mat'; end
makehist7({},'~/256/ryan','cvpr',ftrain,{});

Remind Ryan to look for duplicates: I'm picking the ~350 files at random, so they may overlap with the fileset already in his trainNames. Should be able to see this as 1's in the match matrix.
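
One quick way to flag them (sketch; m stands for the train-vs-train block of the match matrix, assuming matches are normalized so that identical images score 1):

[i,j]= find(triu(m,1) > 0.999);     % off-diagonal entries of ~1 are suspected duplicates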

See results in ~/256/ryan.

Nov 30 (Thu)

Finished the Hierarchy-Based Analysis Pipeline

  • Divide each conf file into 2/4/8 sets of categories
  • Train initial part of cascade
  • Train/test final part of cascade
  • Run more conf files if machines are available.

This has the following steps:

  1. makesvm : confusion matrix using training set only (leave-one-out)
  2. makegroups: make 2/4/8/etc. groups based on confusion matrix
  3. makesvm stage 1: sort based on upper-level categories
  4. Results in conf1,conf2... with corresponding file lists
  5. makesvm stage 2: one-vs all on the leaves
  6. maketree: makes a little tree (just to help visualize this)
  7. performance plots

Finished all this today.

Now... spend all Friday generating first performance plots!

Nov 29 (Wed)

Started Writing

Began writing... just Abstract and Intro and Bibliography (while some stuff was running).

Setting Up Top-Down Category Benchmarks

vision312>> cd ~/101/hist8b/1;  mat=makesvm(5:5:30,31-(5:5:30),200,2:101,[],11,100,25);
vision320>> cd ~/101/hist8b/2;  mat=makesvm(5:5:30,31-(5:5:30),200,2:101,[],11,100,25);
vision315>> cd ~/256/hist8b/1;  mat=makesvm(10:10:30,25,200,1:250,[],11,100,25);
vision309>> cd ~/256/hist8b/2;  mat=makesvm(10:10:30,25,200,1:250,[],11,100,25);
vision310>> cd ~/256/hist8b/3;  mat=makesvm(10:10:30,25,200,1:250,[],11,100,25);
vision123>> cd ~/256/hist8b/1;  mat=makesvm(5:10:30,25,200,1:250,[],11,100,25);
vision123>> cd ~/256/hist8b/2;  mat=makesvm(5:10:30,25,200,1:250,[],11,100,25);
vision301>> cd ~/256/hist8b/3;  mat=makesvm(5:10:30,25,200,1:250,[],11,100,25);

Should now save conf files along the way, which is the first step in the pipeline. These are created entirely with the training data (leave-one-out strategy).

Nov 28 (Tue)

CVPR Madness

Deadline approaching. The plan:

Nov 28 (Tue)
Final trees, run benchmarks. Write: title and intro, add refs.
Nov 29 (Wed)
More benchmarks. Write: method section w/ phylogenetic algorithms. Meet with Pietro
Nov 30 (Thu)
Time to try an alternate tree geometry? Write: top-down section.
Dec 1 (Fri)
Write: results, conclusions. Meet again w/ Pietro
Dec 2 (Sat)
Final writing. Feedback?

Syntax

I want to remember this two-line incantation for generating trees by hand (not with makesvm):

mats= makesvm10(5,26,200,2:101,[],11,100,10);
[d,x,x1]= maketree(mats,categories(2:101),5,0,3,999,1);

Tree Code Fixes

Found a few bugs. One was that the phylogenetic software is awful at parsing, and it will not (for some reason?) allow category labels with capital letters. It crashes in a strange way, claiming that the distance matrix diagonal element is not zero.

I'm re-running the trees...

Nov 27 (Mon)

If this goes ok, these will be the final trees before running final top-down benchmarks for paper:

Final Tree Generation for Caltech 101/256 for CVPR Taxonomy Paper

Simplified maketree algorithm to make paper easier to write. See maketree, normflag=0. Trees automatically generated within makesvm now when nver is negative.

For Caltech 256:

cd ~/256/hist8b/1;
makesvm([5 10 15 20 25 30],25,200,1:250,[],11,100,-25);

Currently running ~/256/hist8b/{1,2,3,4} on vision309,315,317,318

Likewise for Caltech 101:

cd ~/101/hist8b/1;
makesvm(5:5:30,31-(5:5:30),200,1:101,[],11,100,-25);

Which is running on vision312. Notes:

  • Caltech-101 run is a bit of an experiment, since I'm including clutter
    • See where this ends up in the tree?
  • These all use method=11 which stands for one-vs-all.

I should have resulting trees by Tuesday morning.

More Boosting Comparisons: Scene Database

So this test assumes ~/scene/hist8/1/match100_100_200.mat is the input, and that mtrain gets divided in half (randomly) 100 times so that Ntrain = 50 and Ntest = 50 each time:

alex1(100,'alex',100,'match100_100_200.mat','match050_050_200.mat');
p05= multisvm('~/scene/hist8/1/alex',1:100,50,50,200,1:5);
p10= multisvm('~/scene/hist8/1/alex',1:100,50,50,200,1:10);
p15= multisvm('~/scene/hist8/1/alex',1:100,50,50,200,1:15);

Minor note: realized as I was doing this that ~/scene/hist8b match files are 1500x1500x3 instead of 1500x1500 because I used to store the 3 match levels separately (just sum over the 3rd dimension). I forgot to tell Alex about this. Final numbers:


Ncat Performance
5 94.73%
10 87.62%
15 77.95%

Nov 26 (Sun)

Color Pyramid Matching for the CVPR Boosting Paper

This may help us with the boosting paper. Found some references to a LAB color space conversion that sort of sounds like what I had in mind.

rgb2lab.m


The idea would be to use the exact same algorithm I'm using now, except to run it separately on the L, A and B images and exploit all 3 kernels instead of just the L kernel (which is, more or less, what I'm using now).
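
Roughly what I have in mind (illustrative sketch only; channelhist and pyramidmatch are stand-ins for whatever per-channel histogram and kernel steps we would actually plug into the existing pipeline):

lab1= rgb2lab(im1);  lab2= rgb2lab(im2);   % convert both images to L,A,B planes
K12= 0;
for ch= 1:3
  h1= channelhist(lab1(:,:,ch));           % hypothetical per-channel feature histogram
  h2= channelhist(lab2(:,:,ch));
  K12= K12 + pyramidmatch(h1,h2);          % sum (or weight) the L, A and B kernels
end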

Alex and I discussed trying this on the scene database first.

Nov 22

Abstract Submission for CVPR Taxonomy Paper

Submitted this to CVPR from Florida where I'm currently working.

Here is where I can click to submit the final paper.

Rewrite this to emphasize what is novel.


Learning and Using Taxonomies For Visual Category Recognition

G. Griffin, E. Bart and P. Perona

Humans understand the world around them, in part, by learning hierarchical relationships between categories of objects. Someone playing the game of "20 Questions" is likely to exploit such a mental taxonomy in order to guess the unknown object with the fewest possible queries. As computers learn to recognize and classify hundreds or thousands of categories, taxonomies become more and more important. Yet it is impractical for a human operator to continually generate them by hand. We use the Pyramid Matching algorithm of Lazebnik et al. to define an affinity between categories of images in the Caltech-256. Phylogenetic algorithms are then used to automatically construct a full taxonomy. We then use a simple combination of decision trees and multi-class SVMs to show how efficiency can be drastically improved, while performance remains comparable with the best current algorithms. We conclude with a discussion of potential applications.

Nov 20 (Mon)

More coding today...

Planning for the CVPR Taxonomy Paper

Greg CVPR 2007 planner.png

Nov 18 (Sat)

Comparison To Alex's Boosting Results : Caltech 256

Alex is using ~/256/hist8b/1/match050_025_200.mat and (randomly) dividing into two pieces. I want to use the exact same dataset for a more accurate comparison.

He gets around 72/63% for 5/10 classes at Ntrain=25, at least as of yesterday?

Results50 25.png

The routine alex1(100) takes the above match file and constructs 100 sub-directories in ~/256/hist8b/1/alex. In each directory is a new match file created by choosing 25 members of each class in mtrain to be train and the remaining 25 to be test. Then:

>> p5= multisvm('~/256/hist8b/1/alex',1:100,25,25,200,1:5);
>> p10= multisvm('~/256/hist8b/1/alex',1:100,25,25,200,1:10);
>> p25= multisvm('~/256/hist8b/1/alex',1:100,25,25,200,1:25);
>> p50= multisvm('~/256/hist8b/1/alex',1:100,25,25,200,1:50);
>> plot([5 10 25 50],[p5 p10 p25 p50],'x');

Results:

Ncat Performance
5 74.0%
10 63.8%
25 59.2%
50 47.8%

Nov 17 (Fri)

Constructing Tree Directly From Kernel : Not Working So Far

Outtree030 0.50.gif

The tree to the left was the best I was able to get, using the first 100 categories only (because it is much faster this way). The idea here is, instead of using the confusion matrix produced by the SVM, find the distances between clusters of categories in the kernel. By distance we mean: histogram all the distances between the images in those two categories and find the p-percentile high or low values. For example if p=0.50 we would be using the median.
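
In other words, something like this (sketch; K is the normalized kernel, idx1 and idx2 index the images in the two categories, and 1-K is just one possible affinity-to-distance conversion):

p= 0.50;                           % 0.50 = use the median
D= 1 - K(idx1,idx2);               % pairwise distances between images of the two categories
d12= prctile(D(:), 100*p);         % category-to-category distance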

IMO the results are not very good.

d=maketree(.5,1,5,100,2,30,25,200);
!treedraw outtree030_0.50 90 0.5

Note that the maketree code is currently on my macbook, and is different from the maketree code in the vision cluster (need to merge these two codesets soon).

Perhaps there is some sort of problem with normalizing the kernel in a way that makes sense for the phylogenetic software. I tried several different methods of converting affinities into distances. I also tried p=.95, .90, .50, .10, .05.

Maybe there is a more fundamental problem here.

There is a reason we use an SVM, instead of just clustering from the kernel itself. The SVM bumps the data up into a higher dimensional space where the categories are easily separable. The raw kernel does not. If the raw kernel clustered well, we would just skip the SVM and use a GMM instead.

So when we try to use the kernel itself to build the tree, aren't we essentially throwing away all the advantages of using an SVM?

I am still trying to get this to work. But I am more inclined to use the confusion matrix for building the trees, instead of the affinity matrix - at least for the CVPR2007 paper, since this has been shown to work fine.

Nov 16 (Thu)

Performance as function of Ncategory

This is of interest to Alex and me (for comparison to his active learning results): What sort of performance does Pyramid Matching get on Caltech101/Caltech256 if you use only a limited number of categories?

The green cross shows the performance Alex is currently getting (if I'm interpreting his plot correctly). We are both using Ntrain=25.

The wide regions represent multiple different random choices of categories. This shows that the Caltech 101 categories are less challenging than the Caltech 256 categories: performance on Caltech 256 is worse, even when the number of categories is the same.

Ncategory2.png
rand('state',sum(100*clock)); 
x=randperm(100)+1; 
p=[0]; n=[5 10 15 20 25 50 100]; 
for i=1:length(n), 
n(i)
[pmean]= multisvm10('~/101/hist8b',1:4,25,6,200,x(1:n(i))); 
p(i)=pmean; 
end
rand('state',sum(100*clock)); 
x=randperm(256); 
p=[0]; n=[5 10 15 20 25 50 100 150 200 256]; 
for i=1:length(n), 
n(i)
[pmean]= multisvm10('~/256/hist8b',1,25,25,200,x(1:n(i))); 
p(i)=pmean; 
end

LabelMe Database

It will be useful for Ryan and me to extract sift features for the LabelMe Database. Want to download and run this today.

Oops... we have to do some data segmentation before it will let us download the whole database. This is sort of a pain and I didn't finish yet.

Update: Ryan says they're not going to use it anyway.

Nov 15 (Wed)

Top-down Category Creation

Phylogenetic trees are not balanced. A more balanced binary tree may be created as follows ( Ntrain = 25 ). Below just generates the first 2 branches, but of course this can be done recursively.

Caltech-101:

cd ~/101/hist8b/1
mat=makesvm(15,31-(15),200,2:101,[],11,100,25);
c=mat+mat';

Caltech-256:

cd ~/256/hist8b/1
mat=makesvm(15,25,200,1:250,[],11,100,25);
c= mat+mat';

Then rewrote deconfuse as deconfuse2. It uses a much more efficient search algorithm: it starts by taking the rows with the biggest out-of-block variance and swapping them. This does the bulk of the work. Then it does ~10000 random swaps for minor improvement:

[c,i1,i2,improvement,f]= deconfuse2(mat+mat', 100000); [length(i1) length(i2) f], imagesc(c); 

EPCA: Entropic Principal Component Analysis?

The idea is that it is not good enough to just pick axes with a lot of variance. You also want to pick axes with low entropy, ie. lots of clumping. Especially in the unsupervised case.

Is this idea reasonable at all? Quick test:

Cluster EPCA.jpg
cd ~/256/hist4
load hist350_000_200.mat
bins= bagofwords2(htrain);
[eigenvectors,z,eigenvalues]= princomp(bins);
bins2= bins*eigenvectors;
class= ceil((1:1400)/350);
scatter3(bins2(:,1),bins2(:,2),bins2(:,3), 25, class');

Now plot each of the first 6 axes against one another:

XY EPCA.png
x=-.15:.005:.15; y=-.15:.005:.15;
for i=1:6,for j=1:6, 
z=hist3( bins2(:,[i j]), {x, y}); 
subplot(6,6,i+(j-1)*6); imagesc(z); 
if j==6, xlabel(num2str(i)); end; 
if i==1, ylabel(num2str(j)); end; 
end; end

In this case it is fairly unambiguous that the PCA space (ie. the space in which the features are uncorrelated) is also the optimal space for clustering. But with more categories, is it still clear that PCA is what you really want to do?

Maybe there are axes with very little variance that, nonetheless, are informative - ie. they form compact, well defined clusters.

Above we had very easy categories to work with: motorbikes, airplanes, t-shirt and faces-easy. Let's try some less obvious classes instead:

XY EEPCA2.png
load hist075_004_200.mat
bins= bagofwords2(htrain);
[eigenvectors,z,eigenvalues]= princomp(bins);
bins2= bins*eigenvectors;
n= length(bins2);
class= ceil((1:n)/75);
%f= find(class<=4);
f= find( (class>=6) & (class<=10) );
a=10;  
for i=1:a, 
for j=1:a, 
xstd= std(bins2(f,i))*3;
ystd= std(bins2(f,j))*3;
x=-xstd:xstd/10:xstd; 
y=-ystd:ystd/10:ystd;
z=hist3( bins2(f,[i j]), {x, y});  
subplot(a,a,i+(j-1)*a); imagesc(z); 
if j==a, xlabel(num2str(i)); end;  
if i==1, ylabel(num2str(j)); end;  
end; 
end;
scalesubplots(1.2);

So far I'm not very optimistic: if your eye can't see any "blobs", it is unlikely that the mutual entropy calculation will turn up anything either.

Still it may be worth computing a mutual entropy matrix on the original bins and diagonalizing this. This isn't quite the same thing as doing PCA. At the end of PCA the new set of 200 "eigenfeatures" are uncorrelated. At the end of EPCA, the features would be statistically independent. This is not the same thing. And after all, information is what we ultimately care about isn't it?

But I'm still not sure how to formally define EPCA. I think we would have to do away with the scalar entropy -sum( p log p ) and replace it with a matrix of pairwise joint (or mutual) entropies. Then we have something that has the same dimensionality as the covariance matrix. The convention of using -sum( p log p ) is just one particular way of defining entropy, not the only way.
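
To make that concrete, here is the kind of thing I mean (my own rough sketch, not an existing routine; slow for 200 bins but fine as a sanity check):

d= size(bins,2);                           % e.g. 200 histogram bins
H= zeros(d,d);
nb= 8;                                     % coarse 8x8 quantization per feature pair
for i=1:d
  for j=1:d
    p= hist3(bins(:,[i j]), [nb nb]);
    p= p/sum(p(:));  p= p(p>0);
    H(i,j)= -sum(p.*log(p));               % joint entropy of features i and j
  end
end
[v,e]= eig((H+H')/2);                      % candidate "entropic" axes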

Even if there is such a method, I doubt it will be ready for CVPR2007. Drop it for now.

Need to think about: relationship to spectral clustering.

Nov 14 (Tue)

How to Use Hierarchies: Top-down approaches

I'm implementing a simple decision tree at the moment, because it is easiest. It is not clear what we ultimately will use.

There are at least 4 different approaches to the top-down category search that have come up so far in discussions. Summary below. For CVPR2007, we will not have time to try them all.

  • Simple decision tree (suggested by Greg)
    • Advantages: very fast, simplest implementation
    • Disadvantages: performance
    • Parameters to tune: Nlevels
  • Fleuret and Geman approach (suggested by Pietro)
    • Advantages: good speed, good performance
    • Parameters to tune: threshold
  • Viola and Jones approach, similar to what Eugene implemented for Scene recognition (suggested by Eugene)
    • Advantages: good performance
    • Disadvantages: speed, a little harder to implement
    • Parameters to tune: cascade levels (how many? how leaky? how asymmetric?)
  • DAG for Hierarchical Classification (suggested by Chih-Jen Lin)
    • Advantages: elegant, fast, good performance
    • Disadvantages: implementation?

Nov 13 (Mon)

Notes from meeting w/ Pietro

See our notes.

We discussed, among other things, using the affinity matrix itself to build the taxonomy instead of the confusion matrix. This is more direct and less complicated (we don't need to repeatedly leave one out of the training set).

Spectral clustering may eventually be worth a try here. But we want to keep things as simple as possible for now.

This suggests the following change to yesterday's list:

  1. Compute top 4 sub-groups from affinity matrix
    1. Using a pure phylogenetic approach, not a hybrid block-diagonalization approach
      1. Harder to explain in paper
      2. Optimization is premature
    2. Generate affinity matrix using only the training set
    3. Generate tree from affinity matrix
      1. How does the tree compare to the confusion-matrix approach?
    4. Make this an automatic part of the analysis pipeline

Nov 11 (Sat)

Top-down category recognition

See web page here with more details.

Here's a brief breakdown of experiment #1:

  1. Automatically compute top 4 sub-groups (w/ training set only)
  2. Top level: evaluate which "branch" each test image is on
  3. Bottom level: evaluate which "leaf" each image agrees with
  4. Compare to the standard one vs all testing scheme

To start, let's break item 1 down into its algorithmic parts, and implement:

  1. Automatically compute top 4 sub-groups (w/ training set only)
    1. Generate confusion matrix using only the training set
      1. Otherwise the resulting tree would be cheating
      2. Repeatedly leave-one-out of training set and use these for "testing"
    2. Generate tree from confusion matrix
      1. Just the top 4 branches
      2. Phylogenetic approach is overkill for this
      3. Simply re-group the confusion matrix into 4 well-correlated blocks
    3. Make this an automatic part of the analysis pipeline


Coding today...

Nov 10 (Friday)

Plan for CVPR 2007 Papers

Made some initial notes on the following.

Need to discuss/modify these topics with Alex, Eugene and Pietro.

Which paper is more worthwhile, if I can only write one?

Or how about some other topic?

Some notes after talking w/ Ryan

I wonder if KPCA alone is good enough. Clutter-like features have tons of variance overall, but they don't cluster well (ie. they aren't distinctive to a particular class). We would like to find a basis that maximizes variance AND "clumpiness" at the same time. Isn't this just another way of saying: maximize the ratio of variance to entropy?

This would be like an unsupervised version of Fisher Discriminants. Could this be a good general-purpose tool?

I need to think more about this:

Diagonalize Matrix of Mutual Covariance    Ordinary PCA
Diagonalize Matrix of Mutual Entropies     Maximally "informative" axes???