Greg's Notebook, Winter 2006-2007
Apr 13
tree= maketree6(2,1,[],10,25,200,plevels,2,2,1:256,1:256,[],11,100,100,2,5);
Apr 12 : Deeper Tree Testing (ICCV 2007)
On each machine, I'm running the following at depth=3. Could write multitree5(d,plevels,'april_12a') to automate this? (See the sketch after the commands below.)
file='april_12a'; take all plevels; share(file,'plevels');
tree= maketree5(3,0,10,25,200,plevels,0,0.5,1:256,[],11,100,100,2,5); tree10_0=tree; share(file,'tree10_0');
tree= maketree5(3,0,10,25,200,plevels,1, 10,1:256,[],11,100,100,2,5); tree10_1=tree; share(file,'tree10_1');
tree= maketree5(3,0,10,25,200,plevels,2,  2,1:256,[],11,100,100,2,5); tree10  =tree; share(file,'tree10');
tree= maketree5(3,0,15,25,200,plevels,2,  2,1:256,[],11,100,100,2,5); tree15  =tree; share(file,'tree15');
tree= maketree5(3,0,20,25,200,plevels,2,  2,1:256,[],11,100,100,2,5); tree20  =tree; share(file,'tree20');
tree= maketree5(3,0,30,25,200,plevels,2,  2,1:256,[],11,100,100,2,5); tree30  =tree; share(file,'tree30');
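A minimal sketch of what such a multitree5 wrapper could look like (the routine doesn't exist yet; the parameter sets are copied from the commands above, everything else is an assumption):
function multitree5(d, plevels, file)
% Hypothetical wrapper for the runs above: sweep (method, margin-arg, ntrain)
% and share each tree. Assumes share() saves the named variable from this workspace.
share(file,'plevels');
runs = { 0, 0.5, 10, 'tree10_0' ;
         1, 10,  10, 'tree10_1' ;
         2, 2,   10, 'tree10'   ;
         2, 2,   15, 'tree15'   ;
         2, 2,   20, 'tree20'   ;
         2, 2,   30, 'tree30'   };
for k = 1:size(runs,1)
    [method, marg, ntrain, name] = deal(runs{k,:});
    tree = maketree5(d,0,ntrain,25,200,plevels,method,marg,1:256,[],11,100,100,2,5);
    eval([name ' = tree;']);      % give the result the name that share() expects
    share(file, name);
end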
Changes (v6) were required for all the topdown software to accommodate multiple levels. New syntax is:
tree= maketree6(2,0,[],10,25,200,0.0:0.25:1.0,2,2,1:256,[],11,100,25,2,5);
Apr 11 : Final Testing (ICCV 2007)
Tests (running on different machines):
308>> tree2_30_2_5b= maketree5(2,2,30,25,200,plevels,2,2,1:256,[],11,100,100,2,5); share april_11_morn tree2_30_2_5b
307>> tree2_20_2_5 = maketree5(2,2,20,25,200,plevels,2,2,1:256,[],11,100,100,2,5); share april_11_morn tree2_20_2_5
123>> tree2_20_2_5b= maketree5(2,302,20,25,200,plevels,2,2,1:256,[],11,100,100,2,5); share april_11_morn tree2_20_2_5b
320>> tree2_10_2_5 = maketree5(2,2,10,25,200,plevels,2,2,1:256,[],11,100,100,2,5); share april_11_morn tree2_10_2_5
103>> tree2_10_2_5b= maketree5(2,2,10,25,200,plevels,2,2,1:256,[],11,100,100,2,5); share april_11_morn tree2_10_2_5b
105>> tree1_10_2_5 = maketree5(2,2,10,25,200,plevels,1,10,1:256,[],11,100,100,2,5); share april_11_morn tree1_10_2_5
303>> tree1_10_2_5b= maketree5(2,2,10,25,200,plevels,1,10,1:256,[],11,100,100,2,5); share april_11_morn tree1_10_2_5b
317>> tree0_05_2_5d= maketree5(2,0,10,25,200,plevels,0,0.5,1:256,[],11,100,100,2,5); share april_11_morn tree0_05_2_5d
106>> tree0_05_2_5 = maketree5(2,2,10,25,200,plevels,0,0.5,1:256,[],11,100,100,2,5); share april_11_morn tree0_05_2_5
302>> tree2_15_2_5 = maketree5(2,0,15,25,200,plevels,2,2,1:256,[],11,100,100,2,5); share april_11_morn tree2_15_2_5
Apr 10
Original Dataset + 4 New Categories (for Ryan)
This time the new categories are
- 054.diamond-ring
- 062.eiffel-tower
- 131.lightbulb
- 027.calculator
load ~/256/ryan/trainIndex.mat
load ~/256/hist8b/1/hist075_004_200.mat
ftrain= ftrain(trainImages);
newcats={'253.faces-easy-101','232.t-shirt','145.motorbikes-101','251.airplanes-101',...
         '054.diamond-ring','062.eiffel-tower','131.lightbulb','027.calculator'};
cd ~/256/sift8
ftest= {};
for cat=newcats, x= pickfiles(cat,350,0); ftest= {ftest{:} x{:}}; end
% ftrain= {ftrain1{:} ftrain2{:}};
for i=1:length(ftrain), ftrain{i}((end-2):end)='mat'; end
makehist7({},'~/256/ryan','iccv',ftrain,ftest);
For some reason I had to change the match syntax from last time to:
makematch10(1,'iccv.mat',{},{});
Running now on vision310.
Apr 6 : Tests
I need two tests more than any other at this point:
- comparison of random, Monte Carlo and spectral clustering
  - at least 1 level of recursion
- for best method above, comparison of Ntrain=10,20,30
  - at least 1 level of recursion
Apr 5 : Parameter Fishing
Assume you have first shared t={} and plevels in repository 1.
cd ~/256/hist8b/1/topdown/[machine name here]
take 1; count=0;
for i=2, for j=(i+2):10,
  count=count+1; [i,j,count],
  t{count}=maketree5(1,10,25,200,plevels,2,2,1:256,[],11,100,25,i,j);
end; end
Running i=1,2,... etc. on different machines, it is hard to distinguish between them with all the noise. So I think 25,1,5 is probably an OK compromise: performance is pretty good and speed is OK too. 25,2,5 is probably also alright, and about 40% faster.
Scripts like checkall, checklist and roundup help me figure out where to go.
Apr 4 : Simplified Code (continued)
groups=cell(3,1); cascade=cell(3,1);
for i=1:3, i, groups{i}= makegroups5(10,25,200,2,2,1:256,[],11,100,10,2,3+i); end
for i=1:3, i, cascade{i}= makecascade5(groups{i},.02:.02:1.0); end
Meanwhile I'm running 3 copies of this overnight, to check their consistency:
groups= cell(10,10); cascade= cell(10,10);
for i=3:-1:1, for j=4:8,
  [i j],
  groups{i,j}= makegroups5(10,25,200,2,2,1:256,[],11,100,25,i,j);
  cascade{i,j}= makecascade5(groups{i,j},.02:.02:1.0);
end; end
In the current syntax this is viewed like so:
clf; labels={};
labels=viewlevels5(3,labels,cascade{3,4:8});
labels=viewlevels5(2,labels,cascade{2,4:8});
labels=viewlevels5(1,labels,cascade{1,4:8});
These plots indicate that ntrial,nver,nkeep=25,2,6 has better performance than 25,2,5.
Apr 3 : Simplified Code
- maketree5 : Like makelevels4 except it allows for recursion, and calls makegroups5, makecascade5 and makelevels5 separately.
- makegroups5 : Generates a 3rd "null" category, instead of makecascade5. Add phylogenetic method and probabilities as an option.
- makecascade5 : Use probabilities instead of margin to assign groups 1, 2 and null. Evaluate all the test files for their margins (probabilities?) but do not generate final performance stats.
- makelevels5 : Break out the 2nd part of makecascade5, which evaluates the performance for a range of leakage values (and the below cascade information).
- makesvm14 : Generates probabilities. Subdivide into makesvm14 and getsvm14, so that makecascade5/makegroups5 don't have to reload match information every time? (this may not be worth the effort)
One extra thing is that, for full-scale testing, it might be useful to have all of these routines generate sensible outputs for a range of ntrain,ntest,nwords.
Apr 2 : What's Left To Be Done For ICCV Paper
Tests, Plots, Text

Schedule
- Apr 4 : Submit abstract. Test recursive hierarchy code. Tests for optimal Nver for different values of Ntrain. Create reasonable phylogenetic taxonomy. Test converting margins into probabilities (new version of makesvm). Add diagram illustrating cascade, with accompanying text.
- Apr 5 : Confirm abstract. Test: probabilities better than margin? Full-scale recursion tests on Caltech-256, Caltech-101 (only 3 out of 4 methods). Finalize phylogenetic taxonomy. How about Caltech-N?
- Apr 6 : Test Caltech-N. Figure out how to use phylogenetic taxonomy. Add taxonomy figures.
- Apr 7 : Full-scale Caltech-N tests. References, such as previous hierarchical approaches.
- Apr 8 : Finalize paper. Comments from Pietro.
Mar 24 : More Results (ICCV 2007)
Mar 23 : Double-checking the results.
Mar 22 : Results (ICCV 2007)
Mar 21 : Software For Large-Scale Testing
Rapid Testing Procedure
Replacing:
groupmode=2; % or groupmode=1
global train_mode train_fraction
train_mode=1; train_fraction=0.33;
indx= 1:25; leakage(indx)= (100-4*indx)/100;
p=leakage*0; t=leakage*0+9e9;
for i=indx;
  [p(i),x,t(i)]=makelevel4(1,10,25,200,...
      groupmode,leakage(i),1:256,100,2,5);
  fprintf('\n\nPLEVEL %d (%f) complete\n',i,leakage(i));
  clf; ax=plotyy(leakage,p*100,leakage,25*256./t);
  set(get(ax(1),'Ylabel'),'String',...
      'Performance (%)');
  set(get(ax(2),'Ylabel'),'String',...
      'Testing Speed (images/sec)');
  drawnow;
end
With:
trainfast(yes,0.333);
leakage(indx)= 0.00:0.04:1.00;
level=makelevel4(1,10,25,200,...
    groupmode,leakage(i),1:256,100,2,5);
viewlevel4(level);
Software Modifications Required
Mar 20: Performance vs. Testing Speed (ICCV 2007: Paper #1)
Initial Testing Procedure
[l,t,p]= deal(zeros(101,1),zeros(101,1),zeros(101,1));
for i=101:-1:1
  l(i)= (i-1)/100;
  [y,x,cmats,c,ttest]=makelevel4(2,10,25,200,2,l(i));
  p(i)= mean(diag(c))*100; t(i)= ttest;
  fprintf('\n\nPLEVEL %d (%f) complete\n',i,l(i));
  plot(l,p,'bs'); hold on; plot(l,t,'rs'); xlim([-0.02 1.02]);
  drawnow;
end
Mar 17 : Better Taxonomies With Forced-Guessing?
The Idea
Instead of using the usual confusion matrix, always force it to make the nearest guess - but never the correct one. You could call this the Guessing Matrix.
This would look like a normal confusion matrix with all the diagonal elements equal to zero. One major difference is that you are now teasing more information out of the model by always forcing it to guess. You are telling the model, "If you couldn't get the best model, which would you choose instead?"
Implementation
The procedure would be just the same as normal, except for one thing. For each row of the matrix, the 256 one-vs-all classifiers are replaced with 255 one-vs-all classifiers. You always leave out the class corresponding to that row.
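A minimal sketch of the bookkeeping, assuming a per-image score matrix from the one-vs-all classifiers is available (variable names are hypothetical, not from the current makesvm code):
% scores(i,c): one-vs-all score of test image i for class c (assumed available)
% truth(i):    true class of test image i
nclass = 256;
G = zeros(nclass);                    % the "guessing matrix"
for i = 1:length(truth)
    s = scores(i,:);
    s(truth(i)) = -Inf;               % never allow the correct class: 255 classifiers per row
    [tmp, guess] = max(s);            % forced nearest guess
    G(truth(i), guess) = G(truth(i), guess) + 1;
end
G = G ./ repmat(sum(G,2), 1, nclass); % normalize rows; diagonal is zero by construction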
No More Orphans Running Around On The Tree
Thinking about this gives me a clue as to why easy categories sometimes tended to cluster together in the taxonomy. With these categories all the affinity information is absorbed into the diagonal element so that the affinity with other items is artificially reduced. You are saying, "these categories aren't close to anything but themselves", making them antisocial. They are left as taxonomic orphans with no place to fit in (but then why do they cluster together?).
Using the best-guess matrix instead of the confusion matrix would solve that, making the normalization of the rows more consistent.
Come back to this issue next time I'm converting confusion (affinity) to distance for the taxonomic software.
Mar 16 : Meeting with Pietro
Mar 15 : Problems with Blank Images
This was for Alex's ICCV paper:
Category-3 Duplicate Removal
Trying to remove duplicates in /common/holub/images256/*/3 but the nearly-blank images are causing dupkill to crash. I concatenate the dupkill stdout of several different processes to ~/tmp/dupkill and then:
cat dupkill | awk ' { if ($1=="entering") cat=$2; if (x==1) {x=0; print cat,$3} if ($2=="0") x=1; }' | sed "s/ '/\//" | sed "s/'//" > ~/tmp/blankfiles
cat ~/tmp/blankfiles | sed 's/\//\/3\//' > ~/tmp/dupblanks
cat ~/tmp/dupblanks | xargs xv
cat ~/tmp/dupblanks | xargs rm -rf
I just wanted to save this syntax in case I need it later. For the record I deleted:
fern/3/0118_goog_3.jpg
frying-pan/3/image_0311_3.jpg
horse/3/image_0219_3.jpg
human-skeleton/3/0117_goog_3.jpg
jesus-christ/3/image_0049_3.jpg
baseball-glove/3/image_0344_3.jpg
billiards/3/image_0444_3.jpg
butterfly/3/goog_0296_3.jpg
cactus/3/0066_goog_3.jpg
CD/3/image_0255_3.jpg
ladder/3/image_0184_3.jpg
lightning/3/goog_0172_3.jpg
megaphone/3/goog_0150_3.jpg
microscope/3/0039_goog_3.jpg
refrigerator/3/image_0093_3.jpg
sextant/3/0151_goog_3.jpg
sheet-music/3/0104_goog_3.jpg
tripod/3/0189_goog_3.jpg
chess-board/3/image_0275_3.jpg
cockroach/3/image_0350_3.jpg
Update: 2 days later I had to delete 10 more files.
Mar 13 : Meeting with Antonio Torralba
Some notes:
- Try sorting 256 by most centered to least centered
- if someone suspects centeredness is corrupting things, you can jackknife your tests
- Using LabelMe dataset
- They're up to about 130,000 images annotated
- MATLAB toolbox is more extensive than their faq and documentation might suggest
- No object is ever manually segmented twice by different users
- Does this mean we can't get hierarchy info as with WhatWhere?
- Useful WordNet tools??
- Clutter
- Not clear that the negative of whatever is labeled is really clutter
- Different people label data more exhaustively
- Is clutter always just the negative of the set of (things we care about)?
- Must we always have examples of (things we don't care about)?
- Or should we define it using only a set of (things we care about)
- Exploit taxonomy
- Is LabelMe dataset useful for defining clutter?
- Not clear that the negative of whatever is labeled is really clutter
- Hierarchy
- Remember to cite Sharing visual features for multiclass and multiview object detection in my ICCV 2007 paper. This was one of the papers we read in vision class. He computes a small hierarchy of only 20 categories using shared features in a boosting context.
Some references:
- Wordnet hierarchy
- LabelMe and, in particular, the "Search objects" feature which is wordnet-aware.
Mar 4 : Software
With experiments usually running on a dozen machines at once, it gets cumbersome to send all the data back to one place (where you can plot the results). I wrote routines look, share, take and wipe to finally deal with this once and for all. Also worked on makecascade.
Mar 3 : Testing
Here's an example of testing the first level of the cascade, which guesses whether the image is in
- group 1
- group 2
- group 3 = neither
global train_mode train_fraction
[train_mode,train_fraction]=deal(1,.33);
[y10,x]=makelevel4(1,10,25,200,2,0.01:0.01:0.99,1:256,25,2,5);
It is worth using ntrials=25 instead of ntrials=10 (noticeable improvement).
Mar 2 : Are Binary Trees OK? (ICCV 2007 paper #1)
My whole method for the ICCV 2007 paper assumes that it is ok to restrict ourselves to a binary decision tree - for simplicity. My previous experiments suggest that this is fine: binary trees have the highest "goodness" value when using spectral clustering.
But it isn't so clear. Could we create more natural groupings with 3 or more branches at each level? This is what I'm trying to investigate with the new routine viewgroups4.
All in all, after inspecting the below groupings I'm somewhat encouraged that dividing into two groups recursively yields similar results to dividing into 4 groups in one shot.
Plus this new routine just gives me a much better tool for visualizing grouping.
Below: creating 2-way (left) and 4-way (right) groupings of confusion matrix, based on spectral clustering method.
Observations:
- Using ntrials=25 gives much more reasonable-looking categories than ntrials=5.
- If you unify groups 1+4 and 2+3 from the right, you get approximately the groups on the left.
  - Main difference is that some categories like transport and structures tend to calve off and defect to the other group.
  - Is this to make group sizes more uniform?
Mar 1 : More Speed Tests
New Software Syntax
Usage: makegroups4(10,25,200,2,2,1:256,[],11,100,10,3);
where the two highlighted arguments (2,2) are passed as the 2nd and 3rd arguments to getgroups.
In this case, we're selecting spectral clustering and division into two groups, which
will probably become the default.
Detailed Speedup Tests (dashed lines)
See (right) a slightly expanded version of yesterday's plot, where I try 25, 33, 50, 75 and 100%.
- For quick runs the sweet spot seems to be about 33%
- This yields 6 times the speed
- Performance difference of < 1%
- For a semi-final run 75% gives
- Speed increase more than 2 times
- Performance difference < 0.2%
Improvement (solid lines)
See solid lines. The difference is that I pick a single random set of negative examples instead of re-generating this 256 times. This actually performs better and takes less time.
Now 25% (red) looks quite tolerable for testing but it isn't much of a speed increase.
The sweet spot still seems to be train_fraction=.33
(green).
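A minimal sketch of the change (hypothetical variable names; the real change lives inside the SVM training code): draw one random subset of negatives and reuse it for every one-vs-all classifier instead of redrawing it 256 times.
% labels(i): class of training example i; train_fraction: e.g. 0.33
n      = length(labels);
p      = randperm(n);
shared = p(1:round(train_fraction*n));     % one shared random subset
for c = 1:256
    pos = find(labels == c);
    neg = setdiff(shared, pos);            % reuse the same negatives every time
    % ... train one-vs-all classifier c on [pos, neg] ...
end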
Feb 28 : Speeding Up Training Phase Of SVM
Summary
I used to be able to test 12.6 images/sec. Now we're up to 372 images/sec using changes from the previous 2 days.
- Testing is about 30 times faster and with no noticeable drop in performance.
- Training is about 3-8 times faster with little degradation in performance
- The exact speed increase depends on Ntrain
- Absolute performance difference < 1.0 % for all Ntrain
This is pretty dramatic.
Don't Need To Use All The Training Data
Using only 1/3, 1/2 or all the negative training examples in the one-vs-all classifier yields 23.5, 24.0 and 24.5% performance in 40.1, 62.2 and 139.2 seconds.
In other words, at Ntrain=10 you can speed training up by 3.5 times and keep 96% of the original performance. For higher Ntrain the performance differences is even larger. So this should be a very useful mode, especially when I'm just testing some code out for the first time.
Currently using global variables train_mode
and train_fraction
to control this behavior of matlab_code/svm_v0.55/@maxwin/train.m. Example:
>> global train_mode train_fraction
>> train_mode=1; train_fraction=1/3;
>> makesvm12(10,25,200,1:256);
Getting classes... automatically
Symmetrizing
Training... time: 22.9 s
Keeping... 18.7 % of support vectors
Testing... time: 17.2 s
Performance= 0.234687
>> train_mode=1; train_fraction=1/2;
>> makesvm12(10,25,200,1:256);
Getting classes... automatically
Symmetrizing
Training... time: 41.0 s
Keeping... 14.1 % of support vectors
Testing... time: 21.2 s
Performance= 0.240469
>> train_mode=0;
>> makesvm12(10,25,200,1:256);
Getting classes... automatically
Symmetrizing
Training... time: 113.9 s
Keeping... 8.6 % of support vectors
Testing... time: 25.3 s
Performance= 0.244844
Feb 27 : Speeding Up Testing Phase Of SVM
I'm now getting dramatically (more than 10 times)
faster testing speed with no drop in performance, just by stripping
out support vectors whose weight is less than tolerance.
This required creating a @maxwin/strip
routine to call the
corresponding @svc/strip
routines.
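The core of such a strip routine is presumably just a threshold; a minimal sketch, with the field names svcoef and sv being assumptions about the svm_v0.55 object layout rather than its actual API:
function net = strip(net, tol)
% Sketch: drop support vectors whose weight is below tolerance.
% Assumes the svc object stores weights in net.svcoef and support vectors in net.sv.
if nargin < 2, tol = 1e-4; end
keep       = abs(net.svcoef) > tol;
net.svcoef = net.svcoef(keep);
net.sv     = net.sv(keep,:);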
Feb 26 : Cascade Testing (ICCV 2007 paper #1)
Cascade Software : 3rd Step
Indexing has been half the battle...
This part will be much less complicated if I can redesign SVM routines to deal with non-consecutive class numbers. Ie, classes can be "3,4,13,15,45" for a particular group. Otherwise it becomes the job of makegroups to keep repeatedly re-mapping the category numbers into consecutive integers - which you then have to untangle at the end.
Checked in a new routine: makesvm12
which builds on makesvm11
.
Use replace
and reindex
, which is tested separately in reindex_test.m
(this is a good example of how both routines work).
Here is the relevant segment:
x1= ceil( rand(256,1)*10 ); x2= ceil(rand(256,1)*10);
map= reindex(x1,x2);
% new indices should be consecutive
y1= replace(x1,map); y2= replace(x2,map);
% reconstruct the original indices (as a test)
rmap= fliplr(map);
z1= replace(y1,rmap); z2= replace(y2,rmap);
This is run hundreds of times and then a series of tests follow, to make sure it worked.
Feb 23-24 : Cascade Testing (ICCV 2007 paper #1)
There are 2 steps each time you call makelevel
:
- makegroups (160 sec @ Ntrain=10) : Create two mixed classifiers 0 and 1
- makecascade (10 sec @ Ntrain=10) : Assign each test image to one of the two groups (0 or 1), or neither (2)
Finally, when you're done recursively making as many levels as you need,
- makenodes (time/performance depend on parameters) : Now classify test images in each leaf separately (10 sec). This is faster since you only have to try roughly half the possibilities.
Let's see how it is all working together, and see what the speed/performance looks like.
Note: the makelevel
routine will run makecascade
recursively. For now there is only one level, so makecascade
and makelevel
do the same thing.
Cascade Software: First 2 Steps
Syntax from Jan 23 seems to still work after the change to makesvm11 that I made on Feb 13. Note the addition of a new 1st parameter, stage, which tells it how much it needs to re-run (saves time). With the optimal parameters I found before, that's:
[y10,x]=makelevel4(1,10,25,200,... 2,0.01:0.01:0.99,1:250,10,2,5);
What makelevel4
currently does:
- Generate mixed groups 0 and 1
  - Runs makegroups4 in about 160/480/760 sec
  - In this example, group sizes were 155 and 95
  - ntrial=10 : repeat and average ntrial times (would a higher value of ntrial help?)
  - nver=2 : leave-nver-out confusion matrices w/ training set
  - nskip=5/2/0 : randomly throw nskip rows/columns out of each training category
- Assign each test image to one of the two groups, or neither
  - Runs makecascade4 in about 10 sec
  - Uses a range of leakage thresholds plevel=0.01:0.01:0.99
Best parameters
Turns out there is no benefit to using anything less than nskip=5.
I just did two runs at each value so this might not be exactly true. Should do ~100 runs overnight and average them some time. There is a fair amount of variance, ~2-3%.
For now I'll stick with the above syntax ending in 10,2,5. Good enough.
Feb 22 : ICCV + Tech Report
ICCV 2007
More makecascade
programming.
Tech Report
This illustrates how the Caltech-256 really is more challenging than the Caltech-101. Pyramid matching performance on the Caltech-256 is worse than on the Caltech-101 even when using the same Ntrain.
See Nov 16 entry below (in this notebook). Is this too confusing to explain?
Need to clean up axes.
Feb 21 : ICCV/NIPS 2007 Plan
Made a chart of 4 papers I could work on, plus some other topics.
This lists two ICCV 2007 papers and two NIPS 2007 papers. Right now paper #1 is taking all my time. Need to finish this before thinking about paper #2.
Need advice from Pietro on where to go from here. Andrea set up meeting for Tues Feb 26 10am.
Feb 20 : Odds and Ends
Email Requests To Work On
- [Dave Martin]
- questions about caltech-256 protocol. Send him tech report.
- [Marcin Marszalek]
- wants hand-made caltech-256 hierarchy from paper. Export to XML?
- [Pranab Mohanty]
- wants to use our spatial pyramid matching code for generating pairwise affinity matrices. In tar format. Gulp-- this is going to take a little documentation. I can start with the documentation I wrote for Alex.
Tech Report
Found this useful postscript command for changing gamma, that I want to remember:
{ 1.5 exp } settransfer
Put it right after %%Endprolog
to brighten figure 9 in the
256 paper.
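A minimal sketch of patching a figure this way from MATLAB (the filenames are hypothetical, and strncmpi is used in case the file's prolog comment is capitalized differently):
% insert the settransfer line right after the %%EndProlog comment
fin  = fopen('figure9.ps','r');  fout = fopen('figure9_bright.ps','w');
while true
    line = fgetl(fin);
    if ~ischar(line), break; end
    fprintf(fout,'%s\n',line);
    if strncmpi(line,'%%EndProlog',11)
        fprintf(fout,'{ 1.5 exp } settransfer\n');
    end
end
fclose(fin); fclose(fout);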
There's a new "performance by category plot" which hopefully Alex will like better (see right). It is generated with ~/256/catsort/catsort2.m.
Feb 19 : Feedback from Pietro and Alex
From Pietro:
Thank you Greg! I took a quick scan and it looks fine. Maybe just a couple of comments: a) Measuring the performance of the `object' detector at 10% false accept rate is dangerous. We know that perhaps 1 in 100 or 1 in 1000 patches are objects. So: we should really measure the performance at 0.01% false alarm rate. In the conclusions we should say that this is a very tough challenge, but that it needs to be tackled in the next years. b) presenting the confusion matrix for 250 categories is confusing: why not always stick to Caltech 256. Are the curves in Fig 12 describing results on 256 or 250 categories? Etc. P.

From Alex:
Hi Greg! Looks great! Figure 8 took me a while to parse (maybe I'm just slow;) ) -- but maybe add a bit more to the caption? Might add something like this: 'Performance of all 256 object categories using a typical pyramid match kernel in a multi-class setting using Ntrain = 30. Performance for each category corresponds to the diagonal entry of the confusion matrix. The best performing categories are shown in the upper right. The worst performing categories are shown in the lower left. Vertical ... ' What's confusing is that at first I thought you were comparing the categories on the left to those on the right -- then I realized it was just a list of all categories ...
Response to Pietro's question a)
Agreed. But how about if we quote 0.1% instead?
For one thing, we have so few clutter files that 0.01% is really beyond our measurement resolution.
For another thing, the pyramid match algorithm is fairly scale-invariant so I suspect we would not have to scan over so many windows. Other algorithms might have to scan extensively in both position and scale. We could just pick one or two fairly conservative window sizes and overlap them a little bit.
I probably picked 10% because I wanted high performance! But not entirely.
The other way to think of this is as just the first, fastest layer of the cascade. It cleans up the obvious cases so you can focus on harder cases.
Here's what I'm imagining, at least for the future:
Response to Pietro's question b)
Note: in the end I just stuck with 256 categories and threw out the whole 250 thing, for simplicity
I too am worried that we'll confuse people too much with the whole 250/256 thing. Is it just too complicated?
I was originally trying to make things less confusing. There's something in the Caltech-101 which always bothered me, which I wanted to clear up with the Caltech-256.
Faces-easy and Faces are essentially the exact same pictures: should we include BOTH categories in the final performance figures, or not? Alex throws one out I think. I do too. Why penalize performance because the classifier can't discriminate between two things that are essentially the same?
But there is no agreement on this. Some other people probably keep both categories. Can we honestly compare our results with theirs? A small effect, but still bothersome. We're not all on the same page.
Ditto for cougar_body and cougar_face, and crocodile and crocodile_head, flamingo and flamingo_head. These we never throw out. But most of the flamingo pictures have a head, right? Is the classifier really wrong to choose flamingo_head?
In 256 I wanted to use the indices to clarify this, and separate the 6 confusers from the rest of the topics. The confusion of these 6 (ie. the generality) is still an informative tool. It helps people imagine what their algorithm is really doing. It's a sanity check. But those 6 confusers just don't seem like a fair test of performance.
Will the 6 confusers confuse people so much that it causes benchmarking chaos? Maybe. I'm just pointing out that we already have confusers in the Caltech-101. We just mix them all up with the rest of the set.
So now I'm not sure what to do. Help??
Response to Alex
I've added your improved caption. I just couldn't fit so many labels on one side of the plot! It is a little confusing. Another option is to
- arrange them all in a big circle
- keep left and right side the same color (so that they don't seem like different things)
I'll fool with it a bit and see what happens.
Feb 15-17 (Thu-Sat) : What is the interest detector doing?
The Story So Far
Last time (Feb 12) I said I wanted to try these tests:
- Is it overtraining on our choice of positive and negative examples?
- Use Caltech-101 as positives and Caltech-6/Caltech-101 clutter as negatives.
- Is it relying on geometry, such as centeredness of the objects?
- Use bag of words instead of pyramid matching.
- What do the false positives / false negatives look like?
Let's tackle that last one...
False Positives And False Negatives
Call "positive" something that is interesting and "negative" something that is uninteresting.
Continuing from the earlier syntax,
pos= 2-truths; neg= ~pos;
true=  (rating>=0) == pos;
false= (rating< 0) == pos;
fpos= find( rating>=0 & neg );
fneg= find( rating< 0 & pos );
[tmp,indx]= sort(rating(fpos)); mosaic(4,10,files,fpos(indx),1.2,rating);
[tmp,indx]= sort(rating(fneg)); mosaic(4,10,files,fneg(indx(1:40)),1.2,rating);
Analysis
There seems to be a pattern. Many of the false negatives bear a resemblance to one or more of the clutter images (see plot, above to the right):
cd ~/256/images
fclutter= pickfiles(getcategories(257));
fcompare= { fclutter{[85 128 145 396 471 617 647 734]} ...
            files{fneg(indx([39 15 36 2 25 17 38 4]))} };
mosaic(2,8,fcompare,1:16,1.2);
With these sort of examples (top row) floating around in the clutter database, most of our false negatives (bottom row) are pretty understandable, no?
Would boosting help fix this (leave-one-out of training data)? Doubtful.
Ultimately we may need a human in the loop. Stop thinking of our training sets as the final word. The training set is merely a starting point from which to learn what is and isn't clutter. Then the algorithm needs to ask questions. Do children have fixed training sets?
I guess this is really what Alex's CVPR07 paper is all about?
Summary
For many of the above false negatives I can find direct analogs in our clutter category which may have caused the false detection.
To me this test implies that many of the misidentifications are really just due to imperfections in our clutter database. We call all these cropped images clutter even though there are examples of images that could be interesting or clutter depending on your point of view.
In other words: an interest classifier is only as good as its training data .
We get so wrapped up in these silly fixed training sets. Like they are the word of god. We expect our classifiers to (some day) generate >95% performance. But that's ridiculous. The above plot really makes it clear for me: you'll never get there. You can't get there. Not without a human to correct the (inevitable) defects that are in the training set.
Fixed training sets are OUT. Learning is IN.
Feb 14 (Wed): Comparing "interesting" detectors (con't)
Interest-Detector Shootout
For the 256 paper I settled on 3 examples that sort of tell the story:
- Detector C - Bad : standard Ntrain=30 for all categories, multi-category classifier
- Detector A - Good but too Slow : same as C except Ntrain=512 for background category. The classifier needs more examples of this to do a good job.
- Detector B - Reasonable Performance and 200 times Faster than A and C : 2-category classifier shows you can roll a bunch of "interesting" images into one big category. Some people have told me this wouldn't work! And maybe it doesn't. But I'm somewhat encouraged.
This from the 256 paper (see ~/256/paper/roc/roc2.m):
Summary
Detector B is encouraging. But is it overtraining on something silly, like the average scale of the objects or the photography style of Stephen Shore?
Feb 13 (Tue): Comparing interest detectors
Code
Modified makesvm11
so that it could truncate match file. This makes the third
line possible without fully regenerating the histograms/match files (see below).
Syntax (for the 3 methods above):
[FPOS,TPOS,marg12]= makeroc2(0,[2],'~/256/roc/1','30_4_512_256',1:256,257,30,4,512,256); % A==green
[FPOS,TPOS,marg12]= makeroc2(0,[1],'~/256/roc/1','512_256',    1:256,257, 2,1,512,256); % B==black
[FPOS,TPOS,marg12]= makeroc2(2,[2],'~/256/roc/1','30_4_512_256',1:256,257,30,4, 30,256); % C==blue
Modified makeroc2
syntax:
>> help makeroc2
MAKEROC2 usage:
  function [FPOS,TPOS,marg12,guess,truth]= makeroc1(stage,method,subdir,suffix,class_indx,bkgd_indx,ntrain1,ntest1,ntrain2,ntest2);
  stage 1 and above generates histograms and match files
  stage 2 and above generates svm:
    method 1 runs 2-class
    method 2 runs multi-class
  stage 3 generates the ROC curve
Feb 12 (Mon) : What is the interest detector doing?
Here the red numbers are the difference in the margins, which we will call the interestingness. Least and most interesting images are sorted from top-left to bottom-right:
[FPOS,TPOS,marg12,guesses,truths]=makeroc1(2,[1],'~/256/roc/1','512_256',1:256,257, 2,1,512,256);
rating=marg12(:,1)-marg12(:,2);
[r,i]= sort(rating);
load match512_256.mat ftest
files=ftest; for j=1:length(ftest), files{j}(end-(2:-1:0))='jpg'; end
cd ~/256/images
mosaic(4,10,files,i(1:40),1.2,rating);
mosaic(4,10,files,i(end-fliplr(1:40)-1),1.2,rating);
Here is a histogram of the interestingness rating. The red lines show where these two sets of images are taken from, in the overall distribution:
hist(rating,50); hold on; vline(-2.38,'r-'); hold on; vline(3.01,'r-');
What Does It Look Like It's Detecting?
It is not yet clear what the interesting classifier is actually detecting. Here are a few guesses along with ways of testing each guess.
- Low or high frequency? It doesn't look that way to me. Could measure how sensitive the classifier is to scale.
- Common features? Ball-like edges, boxy corners, shadows... what?
- Proximity to center? Try a bag-of-words model which does not know where the center is... do I still get the same results?
- Specific object categories from the 256? Try it on 101 images instead.
- Picture style of Stephen Shore? Things like brick, sky, etc? Try using Caltech-6 for clutter instead.
- Artifacts such as white frame around object.
- Any other theories?
Summary
The best things I can think of to try next are:
- Is it overtraining on our choice of positive and negative examples?
- Use Caltech-101 as positives and Caltech-6/Caltech-101 clutter as negatives.
- Is it relying on geometry, such as centeredness of the objects?
- Use bag of words instead of pyramid matching.
- What do the false positives / false negatives look like?
Feb 8-9 (Thu-Fri): Generality
This from the 256 paper (see ~/256/paper/results/generality.m):
Feb 6 (Tue) : Caltech-256 Paper
Incorporating the below ROC curve into the 256-paper, and writing about it.
Feb 5 (Mon) : Interest detector seems to be working
[fpos,tpos]= makeroc1(1,'../roc/2','512_256',1:256,257,2,1);
semilogx(fpos*100,tpos*100);
set(gca,'xticklabel',{'0.1','1.0','10','100'});
By changing the first flag "1" to "2" you can keep from regenerating the histograms (if this has already been done).
I ran this on 6 different machines and averaged the results to get the plot to the right (which is now in the Caltech-256 tech report). See ~/256/paper/roc/roc.m.
This curve shows, for example, that we can accurately detect 86% of the interesting (non-clutter) images with a 10% error rate. In other words, 10% of the images classified as interesting will actually contain clutter instead.
Putting this plot in the Caltech-256 Tech Report.
So the interesting-ness classifier seems to be working, although it is not clear yet exactly how it defines
interesting. Something intelligent, or something trivial?
Feb 2-3 (Fri-Sat) : Caltech-256 Paper
Finishing up Caltech-256 paper. New figures. Completely reordering the sections.
Jan 31 (Wed) : ROC Curves for Multi-Class Classification
The Code
Checked in new roc/makeroc1.m
.
Jan 30 (Tue) : Using Clutter Category To Implement Attention
Here's the main idea.
Consider the example of a Mars rover that moves around in its environment while taking pictures. Raw performance only tells us the accuracy with which objects are identified. Just as important is the ability to identify where there is an object of interest and where there is only uninteresting background. If background is constantly misidentified as an object, the rover cannot begin to understand its environment.
The rover example also illustrates how the meaning of the word background is strongly dependent on the environment and the application. Our choice of background images for Caltech-256 is meant to reflect a variety of common (terrestrial!) environments.
Here we propose a metric that tests the ability of a classification algorithm to identify a lack of interesting categories. It is based on the familiar ROC curve where the percentage of true positives is plotted against the percentage of false positives. In single-category detection the meaning of true positive and false positive is unambiguous. Imagine that a search window of varied size scans across an image, employing some sort of bird classifier. Each true positive marks a successful detection of a bird inside the scan window while each false positive indicates an erroneous detection.
What do positive and negative mean in the context of multi-class classification? Consider a two-step process in which each search window is evaluated by a cascade of two classifiers. The interest classifier detects either 1) an object of interest or 2) background. The background images are thrown out. Finally, the usual multi-class classifier identifies the specific object category.
Our ROC curve measures the performance of the interest classifier only. A false positive is any clutter image which is misclassified as containing an object of interest. Likewise true positive refers to an object of interest that is correctly identified as such. Here object of interest means any classification besides clutter.
Our interest classifier is trained on an equal number of clutter and non-clutter images. First take Ntrain images at random from each category 1...256. These are assigned to class 1: interesting. Next a set of 256 × Ntrain images are chosen from background class 257. The threshold is adjusted by weighting the margins returned by the SVM classifier, resulting in the ROC curve. Different curves are generated by changing the value of Ntrain. We will refer to this classifier as a super-classifier because it uses examples from many different object categories. This corresponds to the super-category that is called object of interest.
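A minimal sketch of how the curve could be traced from the margins (hypothetical variable names; margin > 0 means the interest classifier votes "object of interest", truth is 1 for non-clutter and 0 for clutter):
% Sweep a bias over the margins; each bias gives one (false positive, true positive) point.
biases = linspace(min(margin), max(margin), 200);
FPOS = zeros(size(biases)); TPOS = zeros(size(biases));
for k = 1:length(biases)
    interesting = (margin - biases(k)) > 0;
    TPOS(k) = mean( interesting(truth==1) );   % objects of interest kept
    FPOS(k) = mean( interesting(truth==0) );   % clutter mistakenly kept
end
plot(FPOS*100, TPOS*100);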
Jan 24 (Wed) : Slides Slides Slides
Behold my beautiful slides for Pietro. He didn't end up using them but they could come in handy for some future Caltech-256 talk.
Jan 22-23 (Mon-Tues) : Cascade Testing vs. Ntrain (ICCV07)
At this point I've discovered some things (and fixed some problems) in the topdown v4
and svm v11
software, which ends up speeding things up by an order of magnitude. So here we go with testing:
Comparison
How best to compare different values of Ntrain? Below I make an arbitrary choice that, regardless of the training set size, only 5 training images will be used to create the confusion matrix for each of the 10 trials.
As a result each makelevel
routine takes ~2.5 minutes to execute.
[y05,x]=makelevel4( 5,25,200,2,0.01:0.01:0.99,1:250,10,2, 0);
[y10,x]=makelevel4(10,25,200,2,0.01:0.01:0.99,1:250,10,2, 5);
[y15,x]=makelevel4(15,25,200,2,0.01:0.01:0.99,1:250,10,2,10);
[y20,x]=makelevel4(20,25,200,2,0.01:0.01:0.99,1:250,10,2,15);
Summary
I'm picking fairly optimal ntrial,nver,nskip
for Ntrain=10 but not for 5,15, and 20.
The above comparison strategy just has to be changed.
Jan 20 (Sat) : Even Faster Training
Fixed Two Speed Bottlenecks
The 154 MB file match020_025_200.mat seems to load off the network in 15 seconds, so big files on the network drive are not the bottleneck. How about memory? Try loading only one of the two matrices at a time, then clearing them?
No, I finally discovered the bottlenecks:
- Was symmetrizing the training matrix in makesvm11 inside 2 for loops.
- getlearner11 was rescaling/offsetting (by 1.0, 0.0) when it didn't have to.
This 2nd part turns out to have been the real time-waster.
Now the 3.3 min training takes 2.3 min instead.
Keep Going
Pushing even further, we can train in 1.5 minutes with about a 1% drop in performance. The
parameters here are ntrial,nver,nskip=10,1,7
and performance is 86.4% @ 50% leakage.
This probably is not worth it, since the makecascade
execution time begins to dominate.
It is worth noting that the latest (tuned up) routines get 75% or better performance @ ntrain=10 for classifying test images into 1 of the 2 super-groups even when plevel=0 ie. 100% leakage. This means that you force yourself to classify each and every test image. Of course as plevel increases we can get even better performance, because you are shunting the hard-to-classify test images by making no classification at all on them. So 75% is merely a lower bound on performance.
makesvm11
Instead of removing nskip
training images just once, I now do it repeatedly
(and randomly) for each trial. Ran a few trials; this seems to increase performance. But it
is hard to say without dozens of trials (because of the variance in performance between
trials).
Also, I don't know if this matters, but I was not executing randomize
at the beginning.
Jan 19 (Fri) : Faster Training: Why Confusion Can Be Good
I mistakenly used nskip in makegroups4 instead of makecascade4.
This led to an interesting discovery: you can reduce the training set size from 10 to 5 when creating super-groups 1 and 2, improving the training speed by a factor of 7(!) with no reduction in performance.
See plot at right. Blue curves each required 25 minutes of training time, as compared to 3.3 min for the red curves. No noticeable drop in performance.
[y_blue,x]= makelevel4(10,25,200,2,0.01:0.01:0.99,1:250,10,3);
[y_red,x]=  makelevel4(10,25,200,2,0.01:0.01:0.99,1:250,10,2,5);
[y_green,x]=makelevel4(10,25,200,2,0.01:0.01:0.99,1:250,10,1,5);
[y_black,x]=makelevel4(10,25,200,2,0.01:0.01:0.99,1:250,30,2,5);
Explanation
Perhaps you need a high number of categories (ie context) to establish a hierarchy, more than you need a high number of examples from each category.
In fact a high number of examples may even hurt you! The confusion matrix is our one and only tool for establishing the distance between categories (and thus the hierarchy). If you have more training examples, you are likely to become less confused so that the hierarchy is harder to establish.
Alternative explanation: as you increase the number of training examples you encourage the confusion matrix to lock on to finer and finer distinctions between images in categories. These details detract from the big picture. In other words, if our method is inherently imperfect, having a "blurry" picture keeps you from seeing these imperfections.
Summary
Drastic speed increase (training) with no noticeable drawback. How can this be?
I'm starting to understand something important here. We are used to thinking of confusion as being bad, because we want high performance. Thus a high number of training examples is good.
But for establishing a hierarchy, confusion is actually a good thing. So a high number of training examples can actually hurt you, and slow you down at the same time!
Here's an example. Suppose you had an infinite number of training examples, and your confusion began to approach 0. With no confusion, there would be nothing to use to construct the hierarchy!
Have yet to try this with, say, ntrain= 5, 15 or 30 but it should be interesting to see.
Jan 18 (Thu) : Software
makelevel4
cd ~/256/hist8b/1;
makegroups4(10,25,200,2,1:250,[],11,100,10,3);
[y308,x]=makecascade4(10,25,200,0.01:0.01:0.99,1:250);
is equivalent to the new (simpler) syntax:
[y308,x]=makelevel4(10,25,200,2,0.01:0.01:0.99,1:250,10,3);
While adding this macro, it seemed like a good place to freeze the code and advance from version 3 to version 4.
So far the only functional difference is that makegroups4 will allow you to increase the speed by using only a fraction of the training set during super-classifier training. This is done with the new last parameter, nskip:
[y308,x]=makelevel4(10,25,200,2,0.01:0.01:0.99,1:250,10,3,2);
The behavior here would be to throw nskip=2
images out of each class in
the super-classifier.
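A minimal sketch of that behavior (hypothetical variable names; not the actual makegroups4 code):
% randomly drop nskip training images per class for this trial
keep = true(size(train_labels));
for c = classes
    idx = find(train_labels == c);
    p   = randperm(length(idx));
    keep(idx(p(1:nskip))) = false;   % these nskip images are skipped
end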
Jan 17 (Wed) : Software
makesvm11
I froze makesvm10.m
and created makesvm11.m
specifically for
adding a new capability: using only a small fraction of the available training set.
This is really only useful when training on super-classifiers, where there is a large pool
of training data and using all of it is too slow.
More self-consistent getlearner11
and getclassifier11
function
names, in place of svmlearner10
and svmclassifier10
.
future
On my TODO list of other useful tools and modifications:
- File load routine automatically searches parent directories
- Save match files in a more compact format
  - Run times often dominated by time to read off network drive
  - Method 1: remap singles to 8-bit precision (see the sketch after this list)
  - Method 2: reindex symmetric matrices
  - Reduce size by 2 or more?
- Routine to automate putting current MATLAB figure on wiki
- Routine to empty/add/plot to a shared scratch space
  - Run results on many machines
  - Quickly gather and plot results on one machine
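A minimal sketch of Method 1 (the file and variable names are assumptions; it only helps if 8-bit resolution is enough for the kernel values):
% compress a single-precision match matrix to uint8 plus a scale/offset
m  = load('match010_025_200.mat');        % assumed to contain mtrain
M  = m.mtrain;
lo = min(M(:));  hi = max(M(:));
M8 = uint8( round( 255*(double(M)-double(lo)) / double(hi-lo) ) );
save('match010_025_200_8bit.mat','M8','lo','hi');
% to reconstruct (approximately): M ~ lo + (hi-lo)*double(M8)/255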
Directory Issues
A small problem is that I cannot (currently) run all the routines in the same directory, because they all try to write and load their own version of group10_25_200.mat. For now I've created separate subdirectories for each machine, such as
cd ~/256/hist8b/1/topdown/308
At the moment, each sub-directory needs its own link to the match file:
for dir in 3* ; do
  cd $dir
  echo $dir
  ln -s ../../match010_025_200.mat match010_025_200.mat
  cd ..
done
Jan 16 (Tue) : Cascade Tuning (ICCV07)
More Parameter-space Searching
ntrial | nver | Performance @ plevel=0.5
---|---|---
10 | 3 | ~89%
10 | 1 | ~86%
5 | 3 | ~83%
So far I'd use 10,3 when in a hurry and 30,3 for final results.
These take on the order of 25 min and 1hr 20min (respectively).
Next want to experiment with only training super-groups on a fraction of the total data.
Jan 15 (Mon) : Cascade Tuning (ICCV07)
Further tests show that ntrial,nver= 100,3 performs 1 or 2% better than 30,3, while 30,5 performs about 1% worse (on average).
Jan 12 (Fri) : Cascade Tuning (ICCV07)
Plot at right indicates that, for Caltech-256, splitting into one of 2 super-groups (level-1) can be achieved with about 91% accuracy at a leakage of 50%.
- This is pretty encouraging. Certainly good enough performance to get started.
- For Caltech-256 Spectral Clustering works best for choosing super-groups.
- For Caltech-101 Monte Carlo Clustering works best. I'm assuming this is because Spectral Clustering is creating unbalanced groups?
Performance is now quite a bit improved, after properly accounting for the prior (ie. imbalanced sizes) of the two super-groups.
It would be interesting to analyze
- Which categories have the most classification errors, and why?
- Smaller training set (use a sub-set of the images from each category: speed? accuracy?)
- Pairwise vs. one-vs-all
- What if there is an n-way split instead of a 2-way split?
Jan 6-11 (Sat-Thu) : Studying
Studying for final part of quals.
I shall disappear.
Jan 4-5 (Thu-Fri) : Memory Problem Fixed
With the old matching kernels, I ended up having to
flip them 90 degrees in makesvm10
. They are so
big that this causes the machines to swap or crash. This
has been an annoyance for a long time.
Recomputing Kernel Match Matrices
Rerunning all the kernel matrices for every single experiment in
~/101/hist8b
and
~/256/hist8b
. Fortunately none of the machines are
being used right now. Even on ~15 machines at once it's going to
take about 3 days. But in the long-run it will speed up the
experiments quite a lot.
Now makesvm10
will use half as much memory.
This should eliminate the problem.
Jan 2,3 (Tue,Wed) : The Big Code Cleanup
Winter Cleaning
I'm up above 30,000 lines of matlab code at this point. Much of it was things I was trying but ended up not using, algorithms I replaced with something faster, etc.
Need to get rid of some dead wood.
Refactoring
Subversion comments, revision 307.
Note: ask Alex about pairwise vs. one-vs-all performance.
Main change was gutting svmclassifier10 and svmlearner10, getting rid of modes I don't use any more. I can always go back and pull these out of svm*7 if necessary, but for now it simplifies everything.

Changed makematch to create a 90-degree-rotated version of the two match matrices it made before. Trivial change... but it means that makesvm10 doesn't have to flip them anymore. They were so big that flipping was becoming impossible to fit in memory. Slow. Machines were swapping. Now hopefully I can handle N_train>50? Also it was just inconvenient to always keep track of whether you're indexing the pre-flipped or the post-flipped matrices.

So I have pairwise SVM mode working now, except it still doesn't work as well as maxwin and I'm not sure why. Not sure if dagsvm works: probably not, but I started looking at it. It would be nice to debug both these methods so that they can be compared against one-vs-all. They are also faster (most of the time). Conceptually (in the paper) it is going to make more sense to compare tree comparisons to pairwise (one-vs-one) comparisons. Then you're just saying, "these are blocks of the one-vs-one match that we're just going to cut out, because they aren't on the same branch". What is going on with the clustering procedure has a direct analog with what pairs are compared. With one-vs-all (which is what I'm doing now) this will take more words to describe. Anyway, I'd just be more comfortable if one-vs-all and one-vs-one didn't differ in performance by ~ 5%.

--This line, and those below, will be ignored--
M matlab_code/topdown/makecascade3.m
M matlab_code/misc/makesuffix.m
M matlab_code/match/makematch10.m
M matlab_code/svm_v0.55/@dagsvm/fwd.m
M matlab_code/svm_v0.55/@pairwise/train.m
M matlab_code/svm_v0.55/@pairwise/fwd.m
M matlab_code/svm_v0.55/@maxwin/fwd.m
M matlab_code/svm/makesvm10.m
M matlab_code/svm/svmclassifier10.m
M matlab_code/svm/svmlearner10.m
Dec 30 (Sat) : Improvement
Improvement
There was a potential problem with makecascade3.m. The plot below (Dec 29) uses 25% regions on both the left and right sides of the margin distribution. That's fine... if the group sizes are equal. In some Caltech-101 groupings this was not the case.
Now there is an extra bias
parameter that sets the correct prior for super-groups 1 and 2.
This may explain why the results from Dec 28 were a bit erratic.
Dec 29 (Fri) : makecascade (ICCV07)
Leakage Threshold
I want to illustrate a little more clearly exactly what is going on with the x-axis in the below plot (from Dec 28). The term leakage here refers to the fraction of data that gets passed to a lower level of the cascade. For example if leakage=0.5 then exactly half the test images will be classified into one of the two sub-categories. This half will benefit from a speed boost, since we only need to check the image against the other categories in its superclass.
The other half of the test images are not passed down. They do not benefit from a speed increase - they must be checked against every single category.
So why not pass all the test images down one of the two branches (leakage=1.0)?
The plot to the right shows the margins for classifying test images into one of the two
branches. A large positive or negative margin indicates high confidence. A small
margin indicates low confidence. Picking just the fraction with the
highest confidence gives us a better chance of making accurate predictions.
Only the middle region is left unclassified; we are just admitting that those test
images are too close to call. As the leakage threshold gets smaller, makecascade
becomes more and more picky about what test images it will try to classify.
A leakage threshold of 0.0 tells makecascade
to leave all test images
unclassified. This is the same as having no cascade at all.
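A minimal sketch of that selection step (hypothetical variable names; not the actual makecascade code):
% keep only the fraction "leakage" of test images with the largest |margin|
[tmp, order] = sort(abs(margin), 'descend');
npass  = round(leakage * length(margin));
passed = order(1:npass);              % these get sent down a branch
branch = sign(margin(passed));        % +1 -> group 1, -1 -> group 2
% the remaining images stay at this level and are tested against every category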
Summary
Changing the
leakage parameter lets us select for higher speed or higher performance.
There is a direct tradeoff.
Dec 28 (Thu) : makecascade
Today's tests show us that the super-classifiers work. At least they give results that
are in the right ballpark. This is a good sanity check: otherwise there would be no
point in going on with further tests.
First Test of Cascade Performance
Here is the test (on 8 different machines). Note that makecascade
is linked to makecascade3
:
cd ~/101/hist8b/1; makegroups3(10,21,200,1,2:101,[],11,100,10,3); [y111,x]=makecascade(10,21,200,0.01:0.01:0.99,2:101);
cd ~/101/hist8b/1; makegroups3(10,21,200,1,2:101,[],11,100,30,3); [y112,x]=makecascade(10,21,200,0.01:0.01:0.99,2:101);
cd ~/101/hist8b/1; makegroups3(10,21,200,2,2:101,[],11,100,10,3); [y121,x]=makecascade(10,21,200,0.01:0.01:0.99,2:101);
cd ~/101/hist8b/1; makegroups3(10,21,200,2,2:101,[],11,100,30,3); [y122,x]=makecascade(10,21,200,0.01:0.01:0.99,2:101);
cd ~/256/hist8b/1; makegroups3(10,25,200,1,1:250,[],11,100,10,3); [y211,x]=makecascade(10,25,200,0.01:0.01:0.99,1:250);
cd ~/256/hist8b/1; makegroups3(10,25,200,1,1:250,[],11,100,30,3); [y212,x]=makecascade(10,25,200,0.01:0.01:0.99,1:250);
cd ~/256/hist8b/1; makegroups3(10,25,200,2,1:250,[],11,100,10,3); [y221,x]=makecascade(10,25,200,0.01:0.01:0.99,1:250);
cd ~/256/hist8b/1; makegroups3(10,25,200,2,1:250,[],11,100,30,3); [y222,x]=makecascade(10,25,200,0.01:0.01:0.99,1:250);
I vary three things:
- ntrain and classes in makecascade3(10,ntrain,200,classes)
  - 21, 2:101 : Caltech-101 (blue line)
  - 25, 1:250 : Caltech-256 (red line)
- method in makegroups3(10,25,200,method,1:250,[],11,100,ntrial,nver)
  - 1 : Monte Carlo Clustering (dashed line)
  - 2 : Spectral Clustering (solid line)
- ntrial in makegroups3(10,25,200,method,1:250,[],11,100,ntrial,3)
  - 10 : fast verification (dim line)
  - 30 : slow verification (bold line)
where leakage is the fraction of the data that leaks to the next level, in a 1-level cascade. In other words, if leakage=0 then there is no cascade, while leakage=1 forces all data to cascade (into either the left or right branch).
Transfer all the y's from all the computers onto one machine, and then:
h=plot(x,y111,x,y112,x,y121,x,y122,...
x,y211,x,y212,x,y221,x,y222);
label={};
for i=1:length(h)
dataset= bitand(bitshift(i-1,-2),1);
clustering= bitand(bitshift(i-1,-1),1);
verification= bitand(bitshift(i-1, 0),1);
if dataset == 0
color= 'b' ; label{i}= 'Caltech-101';
else
color= 'r'; label{i}= 'Caltech-256';
end
if clustering == 0
style= '--'; label{i}= [ label{i} ' MC' ];
else
style= '-'; label{i}= [ label{i} ' SP' ];
end
if verification == 0
label{i}= [ label{i} ' 10' ];
else
label{i}= [ label{i} ' 30' ];
end
set(h(i),'linestyle',style,'color',color);
set(h(i),'linewidth',verification+1);
end
legend(label);
Summary
This plot is a little busy but a few things are becoming clear:
- Cascade performance is already good enough to try full-blown tests
See solid red lines. About 90% of the super-category classifications are correct
at a threshold of 50%. That is, when 50% of the data can be classified
into 2 smaller (more efficient) subgroups. This sort of cascade is
likely to have significant speed increases with only a 10% drop
in performance. And we aren't even trying hard yet.
- Performance can be improved even more.
There are lots of things I'm not even trying yet. How about:
- Explore more of the parameter space (Ntrain, Nver )
- 0-th super-category consisting of categories that we won't try to super-classify
- Find the background data that does not fall into either cluster
- Isn't this what Marco's paper is about?
- Do his algorithms apply?
- More than 2 branches, i.e. more specific super-categories.
- Use lowest optimal number of clusters found by spectral clustering?
- Analyze what categories are being mis-classified the most
- Determine through verification subset of the training set?
- Look for patterns, some way to predict the "trouble" categories.
- Different cascade geometry and/or algorithm?
- The more categories the better the performance
In other words, context helps.
Caltech-256 performed better than Caltech-101 in the above test.
That is, a cascade made up of 2 super-categories is more accurate when
there are more categories in the dataset.
Again think of a 3-year-old child. Give her examples of 3 visual categories
and it is hard to construct their hierarchical relationships. But give her 30 or 300
and it is more likely that she will begin to perceive the larger patterns
that they fall into.
Dec 27 (Wed) : makecascade3
How Does Our Cascade Work?
To recap, makegroups
starts by creating a confusion matrix (with the training
data only) and clustering it into two super-groups.
Spectral clustering seems to work
better than Monte Carlo clustering for this - but not always.
If a human being created two super-groups, they would probably be something like "animate objects" and "inanimate objects". The makegroups routine does something similar:
animate objects tend to be on a single branch. But the distinction is not so cut-and-dry. For
example, kayaks may be grouped with dolphins. Or birdbaths may be
grouped with birds. This isn't necessarily a bad thing - as long as the resulting
super-groups correspond to useful classifiers.
These
super-classifiers are constructed by pooling (a large number of) images from the two super-groups.
The hope is that they would be able to correctly classify out-of-sample test images as belonging to
one of the two super-groups. Then we could pass them down two separate branches, using makecascade.
It is convenient to wrap both steps inside a single function.
makelevel
- makegroups : create super-groups using only the training data
- makecascade : classify test images as belonging to one of the two groups
See figure (right). Further levels are now created by running makelevel
recursively.
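In MATLAB-flavored pseudocode, the wrapper amounts to something like this (the argument lists are simplified; the real makelevel4 takes the full parameter set shown in the Jan 18 entry):
function [y,x] = makelevel(ntrain, ntest, nwords, method, plevels, classes)
% Sketch: makelevel = makegroups (training data only) followed by makecascade.
makegroups(ntrain, ntest, nwords, method, classes);            % build the two super-groups
[y,x] = makecascade(ntrain, ntest, nwords, plevels, classes);  % assign test images to them
% Further levels: call makelevel again, once per super-group, on that group's categories.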
Leakage Threshold
Tradeoff between speed and performance will come from adjusting the leakage threshold. This
determines how many test images get passed to the next level. We only pass the test images we are
most sure about, ie. the SVM gave it a very large margin.
So if leakage=1.0 then every image is passed down to one of the two super-groups. If leakage=0.0 nothing is passed and it is exactly like what we normally do.
Primary Goal: Speed
The reason for using such a cascade is to speed things up. Once you have a rough idea what
super-group your test image is in, you no longer have to test for images from the other super-groups.
The more recursive branches you have, the less testing you need to do down at the leaves.
Secondary Goal: Performance
It is also possible that the performance (ie. accuracy) will be increased. Consider an individual
test image which is
passed down the cascade. Even if it is misclassified at one of the branches, the classification
accuracy should be higher once we get down to the leaves (because we'll only need to test for
a limited subset of categories). It is not yet clear how these two factors balance out.
Dec 26 (Tue) : makegroups3
See the Top-Down Classification Roadmap below if you want
to know where this fits into the overall analysis pipeline.
It is the half of makelevel
that uses only
the training data. Later, in makecascade3,
two mega-classifiers are constructed from these two groups.
Syntax:
makegroups3(10,25,200,1,1:250);
Have found defaults for nver and tver that
seem reasonable in makegroups3, at least for testing:
nver= ceil(100/ntrain);
tver= round(ntrain/4);
svmdefaults{5}= nver;
svmdefaults{6}= tver;
In other words, 25% of the training data is treated as test data,
and used to construct a confusion matrix that is used for grouping
purposes only. This procedure is repeated nver
times. Note: this ratio is to be used for quick and dirty testing only.
Remember to increase this for the final tests.
The variables that are getting stored in the group010_025_200.mat
file are:
- gtrains
- 2 cells containing indices of the categories used for each branch of the cascade.
- itrain
- For convenience, this is just gtrains{1} and gtrains{2}
concatenated together. Useful, for example, for viewing ctrain(itrain,itrain).
- ctrain
- Confusion matrix constructed entirely from training data, which was the only data used to construct the groups
- method
- 1=monte carlo clustering, 2=spectral clustering, 3=phylogenetic tree
- categories
- List of all category names (straight from match010_025_200.mat).
This is slightly different from what the
Top-Down Classification Roadmap shows.
Dec 23 (Sat):
Addendum To Yesterday
It turns out that spectral clustering does yield balanced trees if
you null out the diagonal. Now the groups are of length 134 and 116
which is fine.
So this means we can try both methods head-to-head and see
how they do.
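For the record, a sketch of the diagonal-nulling step (I'm assuming ctrain2 below is just the symmetrized training confusion matrix, matching the getgroups3 call that follows):
ctrain2 = ctrain + ctrain';                   % ad-hoc symmetrizing of the training confusion matrix
ctrain2(logical(eye(size(ctrain2)))) = 0;     % null out the diagonal before clustering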
By the way, the new syntax seems more consistent:
[gtrains,itrain,error]= getgroups3( ctrain2 , 2 );
viewgroups(ctrain2,gtrains);
Dec 22 (Fri) : Cascade: Making The Decision Tree
Three methods are on the table: Monte Carlo Clustering, Spectral Clustering
and Phylogenetic Trees. The first two are the simplest so which one
to try first?
Monte Carlo vs. Spectral Clustering
method | Algorithm | Advantage | Disadvantage
1 | Monte Carlo Clustering | Asymmetric matrix is ok | Slower; inelegant
2 | Spectral Clustering | Faster; elegant | Unbalanced? Ugly ad-hoc symmetrizing
Manual example (in ~/256/hist8b/1
) just so I remember the syntax:
ctrain=makesvm10(10,25,200,1:250,[],11,100,5,2);
[gtrain1,gtrain2,gtrain]= getmonte(ctrain,1);
viewsplit(ctrain,gtrain1,gtrain2);
The resulting groups are of length 121 and 129. Pretty well balanced!
The problem I'm having with spectral clustering is that it
has no incentive to make the groups balanced:
[gtrain1,gtrain2,gtrain]= getmonte(ctrain,2);
The resulting groups are of length 245 and 5! I'm not sure how to fix
this problem... perhaps by creating n groups and then merging them
in a bottom-up fashion. Or by adopting an algorithm that does not
care if the trees are balanced. If you want performance
however I'm guessing that balanced trees are essential, no matter what
algorithm you use?
Summary
In order to get balanced trees, I may be stuck with monte carlo clustering for
the time being. Need to discuss the problem with either Pietro or Lihi.
Dec 21 (Thu) : Software
Software Software Everywhere
I have finally documented the current state of the code. It is getting more complicated each day so it helped me to sit down and draw up the big picture in OmniGraffle (see right).
Dec 20 (Wed) : Top-Down
Which Taxonomies To Use Now For Top-Down Classification?
From the Dec 17 test (see below) it is looking pretty likely that spectral clustering will do the job.
Need to make an easy-to-use
function to do this, then get it into the top-down classification pipeline.
That's the nuts and bolts part.
Some open questions:
- Which is better: Spectral Clustering or Monte Carlo Clustering?
- Ultimately this will probably be settled in the arena of top-down classification performance
- Spectral Clustering is fast and efficient
- But: Monte Carlo Clustering does not require ad-hoc symmetrizing of confusion matrix
- Does spectral clustering correctly guess the total number of categories?
- What are its 1st, 2nd, 3rd, etc. guesses?
- Compare this to my naive hand-made taxonomy for Caltech-256.
- Isn't there an interesting psychometric experiment here? Ask Pietro, Shin?
- Can either method throw out a set of "background" categories that are hard to taxonomize?
- Isn't this pretty much Marco's CVPR 2007 paper?
- Ask Marco about this?
- Which is better: confusion matrix or probability distribution matrix?
- Probably the latter, but can SVM probability estimates be believed?
- Does it make the paper too complicated?
We don't need all the answers above to crank out a top-down classification paper.
But sooner or later we need to deal with these issues.
Dec 19 (Tue)
Two Papers?
I'm starting to think that the phylogenetic tree story is complicated enough that it should
be handled in a separate paper from the top-down category recognition paper.
For now I'll finish up Top-down Category Recognition for ICCV2007 with little or no
mention of the phylogenetic approach. Merge this in later, or keep it as a separate paper.
Ask Pietro?
Final Tree Tweaking
Simplified maketree2.m
algorithm and arguments, now called maketree3
.
Ran the experiments below (the last one always crashed).
[d,w,x]=maketree3(conf(indx,indx),categories(indx),'test37', 0,1, 3,0,0 , '', '' );
[d,w,x]=maketree3(conf(indx,indx),categories(indx),'test37G', 0,1, 3,0,0 , 'G', '' );
[d,w,x]=maketree3(conf(indx,indx),categories(indx),'test37S', 0,1, 3,1,0 , '', '' );
[d,w,x]=maketree3(conf(indx,indx),categories(indx),'test37SG', 0,1, 3,1,0 , 'G', '' );
Other than differences in overall rotation, they seem very similar.
Raccoon and iguana are mostly in very lush vegetation, which probably explains the
confusion with ferns and flowers?
This raises the important point that these pictures are being taxonomized based on background
just as much as foreground. Aside from segmenting, is there anything that can be done about this?
How Does Taxonomy Quality Depend On Ntrain?
How do taxonomies vary with Ntrain = 5, 15, 30? This is an important thing to know.
If Ntrain=5, for example, and we can't get a reasonable taxonomy, then the
whole top-down approach is likely to break down.
Generate confusion matrices overnight, for Caltech-256 and Caltech-101...
vision309>> [pmean,pstd,pall,conf]= multisvm10('~/256/hist8b',1:8,5,25,200,1:256,[],11,100,125,1);
vision310>> [pmean,pstd,pall,conf]= multisvm10('~/256/hist8b',1:8,15,25,200,1:256,[],11,100,25,5);
vision313>> [pmean,pstd,pall,conf]= multisvm10('~/256/hist8b',1:4,30,25,200,1:256,[],11,100,25,5);
vision318>> [pmean,pstd,pall,conf]= multisvm10('~/101/hist8b',1:8,15,16,200,2:101,[],11,100,25,5);
vision310>> [pmean,pstd,pall,conf]= multisvm10('~/101/hist8b',1:8,5,31-5,200,2:101,[],11,100,125,1);
vision314>> [pmean,pstd,pall,conf]= multisvm10('~/101/hist8b',1:8,30,1,200,2:101,[],11,100,25,5);
...then check on this tomorrow. Also running match030_025_200.mat for 5:8 subdirectories, which do not yet exist.
Summary
- Taxonomic relationships seem to be based as much on background as foreground
- Example: iguana and raccoons tend to be in lush forest settings
- Need to confront the segmentation problem?
- Streamlined
maketree3
routine seems to work reasonably well
- Adding global optimization and subreplicates has little effect
- Both these optimizations are slow, so that's good news
- Running some final 256x256 taxonomies now
- Then will construct trees and see if Ntrain matters.
Dec 18 (Mon) : Phylogenetic Trees
Importance of Context
On Friday, I found that a 37x37 confusion matrix made up of just the 37 test categories generated
almost meaningless trees. My guess is that, just like a human, you need to see
numerous categories in order to get a feel for their taxonomic relationships.
For example, 3 categories clearly isn't enough. Apparently, for the computer,
37 isn't enough either.
Experiment
Instead, cut the 37 categories out of the larger context of a full 250x250 confusion
matrix. In each of ~/256/hist8b/[1234]
do:
[conf15,perf]= makesvm10(15,25,200,1:250,[],11,100,25,5);
[d,w]=maketree2(conf15(indx,indx),categories(indx),'test01',1,3,999,0,1);
[d,w]=maketree2(conf15(indx,indx),categories(indx),'test12',1,3,999,1,2);
[d,w]=maketree2(conf15(indx,indx),categories(indx),'test22',1,3,999,2,2);
Here's a comparison:
As already discussed, the 1st result is nonsense.
The 2nd result is the best. It puts flowers,
mammals, structures, tools and appliances on separate branches with only occasional
errors. My guess is that cactus ends up next to dog because it is confounded with
porcupine? The other poorly classified categories are pyramid and iguana.
The 4th result is an improvement on the 3rd.
Summary
- Computers can't learn taxonomies with only a few categories
- Neither can a human!
- Thus a small subset of categories can be tested, but...
- ...only within the context of a larger confusion matrix
- Subreplicates isn't helping (at least not in current form)
- Global rearrangement helps a little? (but is very slow)
- Best affinity-to-distance conversion algorithm still unknown
- Need more experiments to see what works best. Try
- Using branch lengths
- Changing power-law of distance
Dec 17 (Sun) : Clustering
Start by getting the best hierarchy trees we can from the confusion matrix.
Compare 3 different methods, head-to-head:
- Monte Carlo Clustering
- Spectral Clustering
- Phylogenetic Algorithm (Fitch-Margoliash or Minimum Evolution)
- (Sammon Mapping?)
There will be a few code improvements along the way.
Improved Confusion Estimate
So far a leave-one-out strategy has been used for generating a confusion matrix
with nothing but the training data. I'll call this the confusion estimate
from now on. A new argument tver
in makesvm10
allows
for a leave-n-out strategy instead. This generates a less noisy confusion estimate
in less time:
cd ~/101/hist8b/1
[conf101,perf]= makesvm10(15,16,200,2:101,[],11,100,100,2);
results in conf015_016_200.mat
. Likewise I ran 4 copies of
cd ~/256/hist8b/2
[conf256,perf]= makesvm10(15,25,200,1:250,[],11,100,25,2);
on 4 machines and combined the results.
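The leave-n-out idea is simple enough to sketch with a toy stand-in classifier; nothing below is the actual makesvm10 code, and a nearest-mean rule stands in for the SVM:
% Toy leave-n-out confusion estimate: hold out tver images per category,
% classify them with a model trained on the rest, repeat nver times, average.
ncat = 5;  ntrain = 20;  tver = 5;  nver = 4;
X = randn(ncat*ntrain,10) + kron(3*(1:ncat)', ones(ntrain,10));  % toy features, one blob per category
y = kron((1:ncat)', ones(ntrain,1));                             % category labels
conf = zeros(ncat);
for rep = 1:nver
  held = [];
  for c = 1:ncat
    idx = find(y == c);  r = randperm(ntrain);
    held = [held; idx(r(1:tver))];                               % hold out tver images of category c
  end
  rest = setdiff((1:ncat*ntrain)', held);
  mu = zeros(ncat, size(X,2));
  for c = 1:ncat
    mu(c,:) = mean(X(rest(y(rest) == c),:), 1);                  % class means from the remaining images
  end
  for i = held'
    [m, pred] = min(sum((mu - repmat(X(i,:),ncat,1)).^2, 2));    % nearest-mean stand-in for the SVM
    conf(y(i), pred) = conf(y(i), pred) + 1;
  end
end
conf = conf / nver;                                              % averaged confusion estimate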
Monte Carlo Clustering
Simple top-down monte-carlo block clustering is invoked like this:
[c101,indx1,indx2,err]= deconfuse2(conf101,20000,1);
[c256,indx1,indx2,err]= deconfuse2(conf256,20000,1);
where it doesn't usually help to go any higher than 20000. For this many
iterations, rows and columns are exchanged so as to reduce the cross-variance
between clusters. Multi-level trees come from invoking the same routine recursively
on each cluster.
Confusion matrix need not be symmetric for this method to work (but deconfuse2
currently imposes this).
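A minimal sketch of the kind of swap step this refers to (not the actual deconfuse2 code; the objective here just minimizes cross-block confusion mass, whereas the real one may use a cross-variance):
% c is the confusion matrix, g1/g2 are the current group index vectors.
cross = @(g1,g2) sum(sum(c(g1,g2))) + sum(sum(c(g2,g1)));
i = g1(ceil(rand*length(g1)));              % pick a random member of group 1
g1new = g1(g1 ~= i);  g2new = [g2 i];       % tentatively move it to group 2
if cross(g1new,g2new) < cross(g1,g2)        % keep the swap only if cross-block confusion drops
  g1 = g1new;  g2 = g2new;
end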
Spectral Clustering
Using ZPclustering.zip code from Lihi's web page.
cd ~/256/hist8b/1
load conf.mat
c= conf+conf';
a= c;                                           % symmetrized affinity matrix (the code below operates on a)
for i=1:length(a), a(i,i)= 1; end;
[clusts,indx,q,v]= cluster_rotate(a,2);
indx= [clusts{1}{1} clusts{1}{2}];
aspec= a(indx,indx);
[amonte,indx1,indx2,err]= deconfuse2(a,20000,1);
for i=1:length(aspec), aspec(i,i)= 0; amonte(i,i)=0; end
Here's a side-by-side comparison of these first two clustering methods:
With spectral clustering you can specify an arbitrary number of clusters. But the hierarchical relationships between clusters are not so clear?
Preferred cluster sizes vs. Ncluster for some trial values. I'm just trying
to get a feel for whether they form hierarchies:
Ncluster | Cluster sizes
2 | 142,108
4 | 106,80,6,58
8 | 46,74,4,35,11,26,24,30
11 | 149,5,8,5,5,34,3,11,5,8,17
Investigate hierarchies: do the Ncluster= 2, 4 groups overlap?
Is the group of size 108, for example, almost the same as the group of size 106?
If so then spectral clustering is just taking the 142-member group and chopping it into
smaller pieces.
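A quick way to check this might be the following, assuming clusts2 and clusts4 hold the category index sets returned for the Ncluster=2 and Ncluster=4 runs of cluster_rotate:
% How much of each Ncluster=4 group falls inside each Ncluster=2 group?
% Overlaps near 100% would mean the clusters are nested (i.e. hierarchical).
for i = 1:2
  for j = 1:4
    ov = length(intersect(clusts2{i}, clusts4{j})) / length(clusts4{j});
    fprintf('2-way group %d vs 4-way group %d : %.0f%% overlap\n', i, j, 100*ov);
  end
end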
Summary
Spectral clustering looks preferable to monte carlo clustering,
but I'm not sure how best to construct a tree.
Need to investigate hierarchical relationships in Ncluster=2,4,8
clusters, for example. Do they hold common sub-categories?
Some possibilities for constructing the trees are
- top-down : recursively break groups in half as before
- bottom-up: use optimal number of clusters and then merge to larger groups
- other: create Ncluster=2,4,8... groups and look for common elements
Do we even need a tree? How about an embedding in a 2D or 3D space,
constructing quad trees and then searching the space? For now a tree is so
useful for top-down searching that I'd like to stick with it.
Dec 16 (Sat) : Phylogenetic Trees
Code Cleanup
I finally cleaned up the code a great deal, stripped away some old
directories etc. that I don't use anymore, and committed it to an svn repository.
This had started in an svn repository but that got corrupted somewhere
along the way and stopped working.
Old codebase is in ~/ee148/matlab_code
.
New codebase is in ~/pyramid/matlab_code
.
svn repository is accessible via svn+ssh://arbor.caltech.edu
via my old 2005 default password.
Phylogenetic Trees: New Procedures To Try
Two improvements to try here:
- Subreplicates
- Different file format, and "use subreplicates" option selected. In English, this means that data can be left out where none is available. A lot of the confusion matrix is empty, so this should leave the tree free to find better configurations. And faster. In theory.
- Global Branch-Swapping
- Does this help moderate the bottom-up approach, and generate trees that are more useful for top-down purposes? Let's experiment and find out.
Bug: kitsch and fitch cannot seem to handle labels more than about 10 characters long, for some reason. And no capital letters. It has a very finicky input file parser.
Changed from kitsch to fitch because it allows global rearrangements.
Without the 2 improvements above:
[d,w]=maketree(conf,categories(2:101),100,1,3,999,0,1);
and with:
[d,w]=maketree(conf,categories(2:101),100,1,3,999,1,2);
Note: 3rd argument above really does nothing but determine the output file prefix.
Experiment
So the plan is to try these two methods on a small subset of categories only.
This is a subset with very clear hierarchical relationships that I decide in
advance. Then compare what is found to my "ground truth".
Parameters that are varied:
- Global Branch-Swapping: on and off
- Subreplicates: on and off
- Different methods for
- Converting affinity to distance
- generating subreplicate weights
- Probability matrix instead of confusion matrix
- Holds more information
- Estimate probabilities - use method of Platt? [1]
Whatever gives the best results, extend to larger category sets.
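As a reminder of what "converting affinity to distance" could mean here, a few generic candidate conversions (not necessarily what maketree currently does; a is assumed to be a symmetric affinity matrix scaled so that max(a(:)) <= 1):
d1 = 1 - a;               d1 = d1 - diag(diag(d1));   % linear
d2 = -log(a + eps);       d2 = d2 - diag(diag(d2));   % logarithmic (eps avoids log(0))
p  = 2;                                               % power-law exponent to experiment with
d3 = (1 - a).^p;          d3 = d3 - diag(diag(d3));   % power-law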
Test set:
plants= [ 154 15 25 68 103 204 118 ];
animals= [38 80 129 152 164 85 56 186 84 116 168];
appliances= [171 239 237 156 21 142 220];
structures= [187 143 167 188 86 245 132 225];
tools= [125 234 180 199 ];
c=categories;
c={c(animals),c(plants),c(appliances),c(tools),c(structures)};
showstrings(c,4,15)
1 038.chimp 2 080.frog 3 129.leopards-10 4 152.owl
5 164.porcupine 6 085.goat 7 056.dog 8 186.skunk
9 084.giraffe 10 116.iguana 11 168.raccoon
1 154.palm-tree 2 015.bonsai-101 3 025.cactus 4 068.fern
5 103.hibiscus 6 204.sunflower-1 7 118.iris
1 171.refrigerato 2 239.washing-mac 3 237.vcr 4 156.paper-shred
5 021.breadmaker 6 142.microwave 7 220.toaster
1 125.knife 2 234.tweezer 3 180.screwdriver 4 199.spoon
1 187.skyscraper 2 143.minaret 3 167.pyramid 4 188.smokestack
5 086.golden-gate 6 245.windmill 7 132.light-house 8 225.tower-pisa
load hist005_025_200.mat; % just to get categories
indx= [animals plants appliances tools structures];
save treetesting.mat animals plants appliances tools structures conf categories indx
Also acceptable if the algorithm puts porcupine on the same branch with cactus!
And now the experiment:
cd ~/256/hist8b/1;
load treetesting.mat
[conf15,perf]= makesvm10(15,25,200,indx,[],11,100,100,5);
[d,w]=maketree(conf15,categories(indx),37,1,3,999,0,1); % method 1
[d,w]=maketree(conf15,categories(indx),37,1,3,999,1,2); % method 2 (better?)
These trees look like junk, whether I use method 1 or method 2 above.
It is as if the leaves were put next to each other at random.
Summary
An interesting discovery today.
Building a hierarchy seems to require large numbers of categories. I got junk trees (see figure to the right) when running just the 37 test categories, and just a 37x37 confusion matrix.
Hypothesis: you can't generate meaningful trees on a small subset of the data.
This is the standard procedure most of the time: start with 5 categories, move up to 10,
50 and so on. But this won't work here.
Suppose you were a human infant who had never seen the world.
Could you make a hierarchy after seeing just 5 categories? 10? 37?
No.
You would need to see a great many categories before you
got a feel for their taxonomy. The computer is no different.
That's intuitive but I never really thought about it before.
Need to confirm the above hypothesis. To be continued...
Dec 15 (Fri)
Plan for next week
- Tree algorithm
- Use subreplicates : improvement?
- Compare two different methods for affinity-to-distance conversion
- Algorithm for finding logical branch points
- Any top-down bottom-up hybrid algorithms out there?
- SVM
- Figure out why pairwise and dagsvm modes aren't working
- This would speed things up considerably
- Top-down classification experiments
- Fix bugs in pipeline
- Compare level=0 and level=1 performance
- Simplify pipeline
- Generalize to any number of levels
- Compare level=2,3... performance
- Finish up ICCV paper
- Phylogenetic section
- Results & Conclusions
- Pietro, Eugene for comments
Plan for 2007
I sat down and tried to think about where all this is going over the next few years.
See diagram below:
Dec 11-14 (Mon-Thu)
Home with gastroenteritis...
Dec 6 (Wed)
Original Dataset + 8 New Categories (for Ryan)
This is for a possible supplemental figure. There was a problem before
where some of the file names in the matress category
were truncated.
Ryan sent the index file and now we can fix that:
load ~/256/ryan/trainIndex.mat
load ~/256/hist8b/1/hist075_004_200.mat
ftrain1= ftrain(trainImages);
newcats={'129.leopards-101','253.faces-easy-101','232.t-shirt',...
'145.motorbikes-101','251.airplanes-101','240.watch-101','137.mars',...
'092.grapes'};
cd ~/256/sift8
ftrain2= {};
for cat=newcats,
x= pickfiles(cat,350,0);
ftrain2= {ftrain2{:} x{:}};
end
ftrain= {ftrain1{:} ftrain2{:}};
for i=1:length(ftrain), ftrain{i}((end-2):end)='mat'; end
makehist7({},'~/256/ryan','cvpr',ftrain,{});
makematch7(1,{},'~/256/ryan','cvpr');
Running now on vision310
.
Resulting file should be ~/256/ryan/histcvpr.mat
Dec 3 (Sun)
Final Update
There are still some bugs to work out so (unfortunately) this is going to have
to wait for ICCV 2007.
Dec 2 (Sat)
Speed And Performance As A Function Of Threshold
Need plot showing tradeoff between speed and performance as the threshold is adjusted.
Syntax Notes
makebranch(0.25,15,16,200,1:100,[]);
New File for Ryan Bombed
Oops- corrected a mistake: searching for .mat has to be done carefully
since 138.matress
is a category. There appear to be some
errors in the filelist which I'll track down later (this is going in
the supplemental section anyway).
Dec 1 (Fri)
Tree Structure, Algorithm and Code Overview
Current overall algorithm that appears to work best?
makebranch
- Simple enough that it can be re-run recursively
- Threshold yet to be determined.
- Specifies how many files are selected for 1&2 vs. Neither
- Selection based on their SVM margins.
- Use training data only
makesummary
- Traverse the tree gathering stats
- Confirm how many files were declared 0/1/N at each level
- training/testing execution times at each level
- final performance and confusion
In more detail, makebranch runs the following:
makesvm
- hist*_200.mat => match*_200.mat => conf*_200.mat
- Inputs: overall match file
- Outputs: "training" confusion matrix
- Uses only training set, leave-one-out,
nver
>0
makegroups2
- conf*_200.mat => group*_200.mat => match*_001.mat
- Inputs: "training" confusion matrix
- Outputs:
- two groupings that minimize confusion matrix crossterms
- match file classed by these 2 groups, ready to test
- with all the original classes & file info & margins
- the leaves will need this
makecascade
- Takes threshold as argument
- Assign each training image to either 0, 1 or 2 (Neither)
- Train on match*_001.mat
- Build 3 new match matrices:
- N0train x N0test , N1train x N1test , N2train x N2test
- Where:
- N0test+ N1test+N2test = Ntest
- N0train+ N1train= N2train= Ntrain
- Inputs: match*_001.mat
- Output: match*_0[012]1.mat
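A minimal sketch of how the three branch match matrices described above could be carved out of the full one; variable names are invented and this reflects my reading of the counts above, not the actual makecascade code:
% trainclass/testclass hold the 0/1/2 assignments; match is Ntrain-by-Ntest.
% The 0 and 1 branches keep only their own training rows, while the
% "Neither" branch keeps every training row.
tr0 = find(trainclass == 0);   tr1 = find(trainclass == 1);
te0 = find(testclass  == 0);   te1 = find(testclass == 1);   te2 = find(testclass == 2);
match0 = match(tr0, te0);      % N0train x N0test
match1 = match(tr1, te1);      % N1train x N1test
match2 = match(:,   te2);      % Ntrain  x N2test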
New Match File for Ryan
See new makehist7
which is a trivial modification of makehist6
. The only difference is that you can manually input any ftrain, ftest
files
you want. Just so I remember the syntax:
cd ~/256/sift8;
load ~/256/ryan/trainNames.mat
ftrain1= trainNames; % set of 5097 files
newcats={'129.leopards-101','253.faces-easy-101','232.t-shirt','145.motorbikes-101','251.airplanes-101','240.watch-101','137.mars','092.grapes'};
cd ~/256/sift
ftrain2= {};
for cat=newcats,
x= pickfiles(cat,350,0);
ftrain2= {ftrain2{:} x{:}};
end
ftrain= {ftrain1{:} ftrain2{:}};
for i=1:length(ftrain), ftrain{i}((end-2):end)='mat'; end
makehist7({},'~/256/ryan','cvpr',ftrain,{});
Remind Ryan to look for duplicates: I'm picking the ~350 files at random, so they
may overlap with the fileset already in his trainNames.
Should be able to see this as 1's in the match matrix.
See results in ~/256/ryan
.
Nov 30 (Thu)
Finished the Hierarchy-Based Analysis Pipeline
- Divide each conf file into 2/4/8 sets of categories
- Train initial part of cascade
- Train/test final part of cascade
- Run more conf files if machines are available.
This has the following steps:
- makesvm : confusion matrix using training set only (leave-one-out)
- makegroups: make 2/4/8/etc. groups based on confusion matrix
- makesvm stage 1: sort based on upper-level categories
- Results in
conf1,conf2...
with corresponding file lists
- makesvm stage 2: one-vs all on the leaves
- maketree: makes a little tree (just to help visualize this)
- performance plots
Finished all this today.
Now... spend all Friday generating first performance plots!
Nov 29 (Wed)
Started Writing
Began writing... just Abstract and Intro and Bibliography (while some stuff was running).
Setting Up Top-Down Category Benchmarks
vision312>> cd ~/101/hist8b/1; mat=makesvm(5:5:30,31-(5:5:30),200,2:101,[],11,100,25);
vision320>> cd ~/101/hist8b/2; mat=makesvm(5:5:30,31-(5:5:30),200,2:101,[],11,100,25);
vision315>> cd ~/256/hist8b/1; mat=makesvm(10:10:30,25,200,1:250,[],11,100,25);
vision309>> cd ~/256/hist8b/2; mat=makesvm(10:10:30,25,200,1:250,[],11,100,25);
vision310>> cd ~/256/hist8b/3; mat=makesvm(10:10:30,25,200,1:250,[],11,100,25);
vision123>> cd ~/256/hist8b/1; mat=makesvm(5:10:30,25,200,1:250,[],11,100,25);
vision123>> cd ~/256/hist8b/2; mat=makesvm(5:10:30,25,200,1:250,[],11,100,25);
vision301>> cd ~/256/hist8b/3; mat=makesvm(5:10:30,25,200,1:250,[],11,100,25);
Should now save conf
files along the way, which is the first step
in the pipeline. These are created entirely with the training data (leave-one-out
strategy).
Nov 28 (Tue)
CVPR Madness
Deadline approaching. The plan:
- Nov 28 (Tue)
- Final trees, run benchmarks. Write: title and intro, add refs.
- Nov 29 (Wed)
- More benchmarks. Write: method section w/ phylogenetic algorithms. Meet with Pietro
- Nov 30 (Thu)
- Time to try an alternate tree geometry? Write: top-down section.
- Dec 1 (Fri)
- Write: results, conclusions. Meet again w/ Pietro
- Dec 2 (Sat)
- Final writing. Feedback?
Syntax
I want to remember this two-line incantation for generating trees by hand
(not with makesvm
):
mats= makesvm10(5,26,200,2:101,[],11,100,10);
[d,x,x1]= maketree(mats,categories(2:101),5,0,3,999,1);
Tree Code Fixes
Found a few bugs. One was that the phylogenetic software is awful at parsing, and
it will not (for some reason?) allow category labels with capital letters. It crashes
in a strange way, claiming that the distance matrix diagonal element is not zero.
I'm re-running the trees...
Nov 27 (Mon)
If this goes ok, these will be the final trees before running final top-down
benchmarks for paper:
Final Tree Generation for Caltech 101/256 for CVPR Taxonomy Paper
Simplified maketree
algorithm to make paper easier to write.
See maketree
, normflag=0.
Trees automatically generated within makesvm
now when nver
is negative.
For Caltech 256:
cd ~/256/hist8b/1;
makesvm([5 10 15 20 25 30],25,200,1:250,[],11,100,-25);
Currently running ~/256/hist8b/{1,2,3,4}
on vision309,315,317,318
Likewise for Caltech 101:
cd ~/101/hist8b/1;
makesvm(5:5:30,31-(5:5:30),200,1:101,[],11,100,-25);
Which is running on vision312
. Notes:
- Caltech-101 run is a bit of an experiment, since I'm including clutter
- See where this ends up in the tree?
- These all use
method=11
which stands for one-vs-all.
I should have resulting trees by Tuesday morning.
More Boosting Comparisons: Scene Database
So this test assumes ~/scene/hist8/1/match100_100_200.mat
is the input, and that mtrain
gets divided in half (randomly) 100 times so that Ntrain = 50 and Ntest = 50
each time:
alex1(100,'alex',100,'match100_100_200.mat','match050_050_200.mat');
p05= multisvm('~/scene/hist8/1/alex',1:100,50,50,200,1:5);
p10= multisvm('~/scene/hist8/1/alex',1:100,50,50,200,1:10);
p15= multisvm('~/scene/hist8/1/alex',1:100,50,50,200,1:15);
Minor note: realized as I was doing this that ~/scene/hist8b
match
files are 1500x1500x3 instead of 1500x1500 because I used to store the 3 match
levels separately (just sum over the 3rd dimension). I forgot to tell Alex about this.
Final numbers:
Ncat | Performance
5 | 94.73%
10 | 87.62%
15 | 77.95%
Nov 26 (Sun)
Color Pyramid Matching for the CVPR Boosting Paper
This may help us with the boosting paper. Found some references to a LAB color space
approach that sort of sounds like what I had in mind.
The idea would be to use the exact same algorithm I'm using now, except
to run it separately on the L, A and B images and exploit all 3 kernels
instead of just the L kernel (which is, more or less, what I'm using now).
Alex and I discussed trying this on the scene database first.
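A minimal sketch of the kernel combination being proposed; kernelL, kernelA and kernelB are assumed to be N-by-N pyramid-match kernels already computed from the L, A and B channel images (i.e. the existing grayscale pipeline run once per channel), and rgb_image is any input image:
K = kernelL + kernelA + kernelB;                     % simplest combination; per-channel weights could also be tuned
% Splitting an RGB image into L, A, B channels (Image Processing Toolbox):
lab = applycform(im2double(rgb_image), makecform('srgb2lab'));
L = lab(:,:,1);  A = lab(:,:,2);  B = lab(:,:,3);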
Nov 22
Abstract Submission for CVPR Taxonomy Paper
Submitted this to CVPR from Florida where I'm currently working.
Here is where I can click
to submit the final paper.
Rewrite this to emphasize what is novel.
Learning and Using Taxonomies For Visual Category Recognition
G. Griffin, E. Bart and P. Perona
Humans understand the world around them, in part, by learning hierarchical relationships between categories of objects. Someone playing the game of "20 Questions" is likely to exploit such a mental taxonomy in order to guess the unknown object with the fewest possible queries. As computers learn to recognize and classify hundreds or thousands of categories, taxonomies become more and more important. Yet it is impractical for a human operator to continually generate them by hand. We use the Pyramid Matching algorithm of Lazebnik et al. to define an affinity between categories of images in the Caltech-256. Phylogenetic algorithms are then used to automatically construct a full taxonomy. We then use a simple combination of decision trees and multi-class SVMs to show how efficiency can be drastically improved, while performance remains comparable with the best current algorithms. We conclude with a discussion of potential applications.
Nov 20 (Mon)
More coding today...
Planning for the CVPR Taxonomy Paper
Nov 18 (Sat)
Comparison To Alex's Boosting Results : Caltech 256
Alex is using ~/256/hist8b/1/match050_025_200.mat
and (randomly)
dividing into two pieces. I want to use the exact same dataset for a more
accurate comparison.
He gets around 72/63% for 5/10 classes at Ntrain=25, at least as of yesterday?
The routine alex1(100)
takes the above match file
and constructs 100 sub-directories in ~/256/hist8b/1/alex
.
In each directory is a new match file created by choosing 25
members of each class in mtrain to be train and the remaining 25 to be test.
Then:
respectively. Then:
>> p5= multisvm('~/256/hist8b/1/alex',1:100,25,25,200,1:5);
>> p10= multisvm('~/256/hist8b/1/alex',1:100,25,25,200,1:10);
>> p25= multisvm('~/256/hist8b/1/alex',1:100,25,25,200,1:25);
>> p50= multisvm('~/256/hist8b/1/alex',1:100,25,25,200,1:50);
>> plot([5 10 25 50],[p5 p10 p25 p50],'x');
Results:
Ncat | Performance
5 | 74.0%
10 | 63.8%
25 | 59.2%
50 | 47.8%
Nov 17 (Fri)
Constructing Tree Directly From Kernel : Not Working So Far
The tree to the left was the best I was able to get, using the first 100 categories
only (because it is much faster this way). The idea here is, instead of using the
confusion matrix produced by the SVM, find the distances between clusters of categories
in the kernel. By distance we mean: histogram all the distances between the images in those two categories and find
the p-percentile high or low values. For example if p=0.50 we would be using the median.
IMO the results are not very good.
d=maketree(.5,1,5,100,2,30,25,200);
!treedraw outtree030_0.50 90 0.5
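For the record, a minimal sketch of the percentile distance described above; K is the raw image-level affinity matrix, cat1 and cat2 are image index vectors for the two categories, and none of this is the actual maketree code:
block = K(cat1, cat2);                    % affinities between every image pair across the two categories
p = 0.50;                                 % p=0.50 corresponds to the median, as in the text
s = sort(block(:));
aff = s(ceil(p*numel(s)));                % p-percentile affinity
d = 1 - aff/max(K(:));                    % one possible affinity-to-distance conversion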
Note that the maketree
code is currently on my macbook,
and is different from the maketree
code in the vision
cluster (need to merge these two codesets soon).
Perhaps there is some sort of problem with normalizing the kernel in a way that makes sense for the phylogenetic software. I tried several different methods of converting affinities into distances. I also tried p=.95,.90,.50,.10,0.5.
Maybe there is a more fundamental problem here.
There is a reason we use an SVM, instead of just clustering from the kernel itself. The SVM bumps the data up into a higher dimensional space where the categories are easily separable. The raw kernel does not. If the raw kernel clustered well, we would just skip the SVM and use a GMM instead.
So when we try to use the kernel itself to build the tree, aren't we essentially throwing away all the advantages of using an SVM?
I am still trying to get this to work. But
I am more inclined to use the confusion matrix for building the trees,
instead of the affinity matrix, at least for the CVPR2007 paper, since the
confusion matrix has been
shown to work fine.
Nov 16 (Thu)
Performance as function of Ncategory
This is of interest to Alex and me (for comparison to his active learning results):
What sort of performance does Pyramid Matching get on Caltech101/Caltech256
if you use only a limited number of categories?
The green cross shows the performance Alex is currently getting (if I'm
interpreting his plot correctly). We are both using Ntrain=25.
The wide regions represent multiple different random choices of categories.
This shows that the Caltech 101 categories are less challenging than the
Caltech 256 categories: performance on Caltech 256 is worse, even when the number of
categories is the same.
rand('state',sum(100*clock));
x=randperm(100)+1;
p=[0]; n=[5 10 15 20 25 50 100];
for i=1:length(n),
n(i)
[pmean]= multisvm10('~/101/hist8b',1:4,25,6,200,x(1:n(i)));
p(i)=pmean;
end
rand('state',sum(100*clock));
x=randperm(256);
p=[0]; n=[5 10 15 20 25 50 100 150 200 256];
for i=1:length(n),
n(i)
[pmean]= multisvm10('~/256/hist8b',1,25,25,200,x(1:n(i)));
p(i)=pmean;
end
LabelMe Database
It will be useful for Ryan and me to extract sift features for the
LabelMe Database. Want to download
and run this today.
Oops... we have to do some data segmentation before it will let us download
the whole database. This is sort of a pain and I didn't finish yet.
Update: Ryan says they're not going to use it anyway.
Nov 15 (Wed)
Top-down Category Creation
Phylogenetic trees are not balanced. A more balanced binary tree may be created as follows ( Ntrain = 25 ). Below just generates the first 2 branches,
but of course this can be done recursively.
Caltech-101:
cd ~/101/hist8b/1
mat=makesvm(15,31-(15),200,2:101,[],11,100,25);
c=mat+mat';
Caltech-256:
cd ~/256/hist8b/1
mat=makesvm(15,25,200,1:250,[],11,100,25);
c= mat+mat';
Then rewrote deconfuse
as deconfuse2
. A much more
efficient search algorithm: starts by taking the rows with the biggest
out-of-block variance and swapping them. This does the bulk of the work.
Then do ~10000 random swaps for minor improvement:
[c,i1,i2,improvement,f]= deconfuse2(mat+mat', 100000); [length(i1) length(i2) f], imagesc(c);
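A rough sketch of the greedy first pass described above (not the actual deconfuse2 code; c, i1 and i2 follow the names in the call above, and the out-of-block measure here is a fraction rather than a variance):
out1 = sum(c(i1,i2),2) ./ sum(c(i1,[i1 i2]),2);   % out-of-block confusion fraction for each group-1 row
out2 = sum(c(i2,i1),2) ./ sum(c(i2,[i1 i2]),2);   % likewise for group 2
[m1,w1] = max(out1);  [m2,w2] = max(out2);        % worst offender in each group
tmp = i1(w1);  i1(w1) = i2(w2);  i2(w2) = tmp;    % swap them, then repeat or fall back to random swaps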
EPCA: Entropic Principal Component Analysis?
The idea is that it is not good enough to just pick axes with a lot of variance.
You also want to pick axes with low entropy, ie. lots of clumping. Especially
in the unsupervised case.
Is this idea reasonable at all? Quick test:
cd ~/256/hist4
load hist350_000_200.mat
bins= bagofwords2(htrain);
[eigenvectors,z,eigenvalues]= princomp(bins);
bins2= bins*eigenvectors;
class= ceil((1:1400)/350);
scatter3(bins2(:,1),bins2(:,2),bins2(:,3), 25, class');
Now plot each of the first 6 axes against one another:
x=-.15:.005:.15; y=-.15:.005:.15;
for i=1:6,for j=1:6,
z=hist3( bins2(:,[i j]), {x, y});
subplot(6,6,i+(j-1)*6); imagesc(z);
if j==6, xlabel(num2str(i)); end;
if i==1, ylabel(num2str(j)); end;
end; end
In this case it is fairly unambiguous that the PCA space (ie. the space in which
the features are uncorrelated) is also the optimal space for clustering.
But with more categories, is it still clear that PCA is what you really
want to do?
Maybe there are axes with very little variance that, nonetheless, are
informative - ie. they form compact, well defined clusters.
Above we had very easy categories to work with: motorbikes, airplanes,
t-shirt and faces-easy. Let's try some less obvious classes instead:
load hist075_004_200.mat
bins= bagofwords2(htrain);
[eigenvectors,z,eigenvalues]= princomp(bins);
bins2= bins*eigenvectors;
n= length(bins2);
class= ceil((1:n)/75);
%f= find(class<=4);
f= find( (class>=6) & (class<=10) );
a=10;
for i=1:a,
for j=1:a,
xstd= std(bins2(f,i))*3;
ystd= std(bins2(f,j))*3;
x=-xstd:xstd/10:xstd;
y=-ystd:ystd/10:ystd;
z=hist3( bins2(f,[i j]), {x, y});
subplot(a,a,i+(j-1)*a); imagesc(z);
if j==a, xlabel(num2str(i)); end;
if i==1, ylabel(num2str(j)); end;
end;
end;
scalesubplots(1.2);
So far I'm not very optimistic: if your eye can't see any "blobs", it is
unlikely that the mutual entropy calculation
will turn up anything either.
Still it may be worth computing a mutual entropy matrix on the
original bins
and diagonalizing this. This isn't quite the
same thing as doing PCA. At the end of PCA the new set of 200 "eigenfeatures"
are uncorrelated. At the end of EPCA, the features would be statistically
independent. This is not the same thing. And after all, information
is what we ultimately care about isn't it?
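A minimal sketch of what "computing a mutual entropy matrix on the original bins and diagonalizing it" might look like; the bin count and the crude quantization are arbitrary choices, not a worked-out method:
% Estimate a pairwise mutual-information matrix between the histogram features.
% 'bins' is the Nimages x Nfeatures matrix from bagofwords2 above.
nb = 8;  [n, nf] = size(bins);
q = zeros(n, nf);
for k = 1:nf
  edges = linspace(min(bins(:,k)), max(bins(:,k)) + 1e-12, nb+1);
  [counts, q(:,k)] = histc(bins(:,k), edges);              % quantized feature values in 1..nb
end
M = zeros(nf);
for i = 1:nf
  for j = 1:nf
    joint = accumarray([q(:,i) q(:,j)], 1, [nb nb]) / n;   % joint distribution of features i and j
    marg  = sum(joint,2) * sum(joint,1);                   % product of the marginals
    nz    = joint > 0;
    M(i,j) = sum(joint(nz) .* log(joint(nz) ./ marg(nz))); % mutual information
  end
end
[V, E] = eig((M + M')/2);                                  % "mutual entropy" eigen-axes
bins_epca = bins * V;                                      % project onto them, analogous to bins*eigenvectors above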
But I'm still not sure how to formally define EPCA. I think we would have to
do away with the covariances cov(x_i,x_j) and replace them with pairwise mutual
informations I(x_i;x_j). Then
we have something that has the same dimensionality as the covariance matrix.
The convention of using H = -sum_k p_k log p_k is just one particular way of
defining entropy, not the only way.
Even if there is such a method, I doubt it will be ready for CVPR2007.
Drop it for now.
Need to think about: relationship to spectral clustering.
Nov 14 (Tue)
How to Use Hierarchies: Top-down approaches
I'm implementing a simple decision tree at the moment, because it is easiest.
It is not clear what we ultimately will use.
There are at least 4 different approaches to the top-down category search, that have come up so far in discussions. Summary below. For CVPR2007, we will not have time to try them all.
Method | Suggested by | Advantages | Disadvantages | Parameters to tune
Simple decision tree | Greg | Very fast; simplest implementation | Performance | Nlevels
Fleuret and Geman approach | Pietro | Good speed; good performance | | Threshold
Viola and Jones approach (similar to what Eugene implemented for Scene recognition) | Eugene | Good performance | Speed; a little harder to implement | Cascade levels: how many? how leaky? how asymmetric?
DAG for Hierarchical Classification | Chin-Jen Lin | Elegant, fast, performance | Implementation? |
Nov 13 (Mon)
Notes from meeting w/ Pietro
See our notes.
We discussed, among other things, using the affinity matrix itself to build the taxonomy instead of the confusion matrix. This is more direct and less complicated (no need to repeatedly leave one out of the training set).
Spectral clustering may eventually be worth a try here. But we want to keep things as simple as possible for now.
This suggests the following change to yesterday's list:
- Compute top 4 sub-groups from affinity matrix
- Using a pure phylogenetic approach, not a hybrid block-diagonalization approach
- Harder to explain in paper
- Optimization is premature
- Generate affinity matrix using only the training set
- Generate tree from affinity matrix
- How does tree compare to the confusion-matrix approach?
- Make this an automatic part of the analysis pipeline
Nov 11 (Sat)
Top-down category recognition
See web page here with more details.
Here's a brief breakdown of the experiment #1:
- Automatically compute top 4 sub-groups (w/ training set only)
- Top level: evaluate which "branch" each test image is on
- Bottom level: evaluate which "leaf" each image agrees with
- Compare to the standard one vs all testing scheme
To start, let's break item 1 down into its algorithmic parts, and implement:
- Automatically compute top 4 sub-groups (w/ training set only)
- Generate confusion matrix using only the training set
- Otherwise the resulting tree would be cheating
- Repeatedly leave-one-out of training set and use these for "testing"
- Generate tree from confusion matrix
- Just the top 4 branches
- Phylogenetic approach is overkill for this
- Simply re-group the confusion matrix into 4 well-correlated blocks
- Make this an automatic part of the analysis pipeline
Coding today...
Nov 10 (Friday)
Plan for CVPR 2007 Papers
Made some initial notes on the following.
- CVPR2007 Paper Topic #1: Top-down category recognition
- CVPR2007 Paper Topic #2: Generating taxonomies
Need to discuss/modify these topics with Alex, Eugene and Pietro.
Which paper is more worthwhile, if I can only write one?
Or how about some other topic?
Some notes after talking w/ Ryan
I wonder if KPCA alone is good enough. Clutter-like features have tons of variance overall, but they don't cluster well (ie. they aren't distinctive to a particular class). We would like to find a basis that maximizes variance AND "clumpiness" at the same time. Isn't this just another way of saying: maximize the ratio of variance to entropy?
This would be like an unsupervised version of Fisher Discriminants. Could this be a good general-purpose tool?
I need to think more about this:
- Diagonalize Matrix of Mutual Covariance => Ordinary PCA
- Diagonalize Matrix of Mutual Entropies => Maximally "informative" axes???