CVPR 2011 Human Object Detection


Datasets

Chosen for paper

Candidates

Questions

  • Should we try combining binary verification with the bounding box tasks?

Chosen Experiments

High Priority

  • Pedestrian experiment: Pietro's pedestrian dataset bounding boxes
  • Noisy signal: Hollywood hills clicks
  • Crowding: cell counting, clicks

Low Priority

  • Multimodal labels: legs of herons, sport-fishing boats, fluffy hair, hummingbirds, clicks? (maybe something elongated with a distinct head?)
  • Accumulation of evidence
  • Ambiguous Signal:


Suggested experiments

Peter, I am listing here a few ideas for experiments. I am trying to explore the different 'regimes' that the system will be facing. Let's discuss the ideas ASAP and prioritize them. Perona 08:41, 5 Oct 2010 (PDT)

Baseline experiment - Draw boxes around pedestrians in a collection of pedestrian images. I have one such collection. We will be able to count false alarms and missed pedestrians, and to measure the accuracy of the boxes (e.g. is the accuracy different for different annotators?). A scoring sketch follows below.
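
A quick sketch of how the box scoring could work, assuming boxes come as (x, y, w, h) tuples; the greedy matching and the 0.5 IoU threshold are placeholder choices, not part of the plan:

```python
# Sketch: match each annotator's boxes to ground truth by IoU and count
# hits, false alarms, and misses. Boxes are (x, y, w, h); the greedy
# matching and the 0.5 threshold are illustration choices.

def iou(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def score_annotator(annotator_boxes, gt_boxes, thresh=0.5):
    unmatched_gt = list(gt_boxes)
    hits = false_alarms = 0
    for box in annotator_boxes:
        best = max(unmatched_gt, key=lambda g: iou(box, g), default=None)
        if best is not None and iou(box, best) >= thresh:
            hits += 1
            unmatched_gt.remove(best)  # each ground-truth box is matched once
        else:
            false_alarms += 1
    return hits, false_alarms, len(unmatched_gt)  # misses = unmatched GT
```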

Multimodal labels - I would imagine that sometimes the same object may be annotated in multiple ways. E.g. if you ask annotators to draw boxes, they may sometimes decide to ignore some thin, semi-irrelevant part of the object, e.g. the legs of a heron or the outstretched arm of a pedestrian. Can the system handle multi-modal labels? We have to think of an appropriate dataset to demonstrate this.
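
One way to probe for multi-modality, sketched below: collect all boxes drawn around a single object as 4-d vectors and compare one- vs. two-component Gaussian mixtures by BIC. scikit-learn is used for illustration, and this assumes a reasonable number of annotators per object:

```python
# Sketch: test whether the boxes drawn around one object are multi-modal
# (e.g. heron legs included vs. excluded) by comparing 1- and 2-component
# Gaussian mixtures with BIC. Assumes enough annotators per object.
import numpy as np
from sklearn.mixture import GaussianMixture

def looks_multimodal(boxes):
    """boxes: (n_annotators, 4) array of (x, y, w, h) for one object."""
    X = np.asarray(boxes, dtype=float)
    bic = {}
    for k in (1, 2):
        gm = GaussianMixture(n_components=k, covariance_type='full',
                             random_state=0).fit(X)
        bic[k] = gm.bic(X)
    return bic[2] < bic[1]  # lower BIC for 2 components -> multi-modal
```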

Noisy signal - The 'objects' might be very difficult to see. In this regime you would expect lots of misses and false alarms from each annotator. Can the system figure out the 'correct' detections if enough annotator labels are present? Can we plot the quality of the annotation as a function of the number of annotators? For this experiment we could ask the annotators to click on swimming pools in this picture of the Hollywood hills. We could use annotations on this higher resolution version as ground truth. Another experiment could be run on pictures containing birds that are perched on leafy trees and thus may escape detection.

  • A slightly less noisy signal might be looking for taxis in NYC
  • If possible, carry out an experiment with synthetic signals where we can plot performance as a function of noise, e.g. detecting white disks on a black background plus white noise (see the sketch below)
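
A sketch of the synthetic stimulus, so the performance-vs-noise curve can be swept; the disk count, radius, image size, and noise level below are arbitrary illustration values:

```python
# Sketch: white disks on a black background plus additive white Gaussian
# noise. Sweeping sigma gives the performance-vs-noise curve.
import numpy as np

def make_stimulus(n_disks=10, radius=5, size=256, sigma=0.5, seed=None):
    rng = np.random.default_rng(seed)
    img = np.zeros((size, size))
    centers = rng.uniform(radius, size - radius, size=(n_disks, 2))
    yy, xx = np.mgrid[0:size, 0:size]
    for cy, cx in centers:
        img[(yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2] = 1.0
    noisy = img + rng.normal(0.0, sigma, img.shape)
    return noisy, centers  # (row, col) centers serve as ground-truth clicks
```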

Ambiguous signal - What if the objects were easy to see, but somewhat ambiguous, e.g. counting the Joshua trees in this picture of Joshua Tree National Park? Is this the same case as the 'Noisy signal' above?

Accumulation of evidence - What happens when you show the previous annotators' clicks to each new annotator? Faster convergence? Risk of more misses? Find out by repeating the noisy-signal experiment above with an overlay of the previous annotators' clicks. In order to avoid clobbering the signal with dots, we could instead superimpose thin circles centered on the clicks.
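
A minimal sketch of the overlay, using matplotlib for illustration; the circle radius and styling are placeholder choices:

```python
# Sketch: superimpose thin circles centered on previous annotators' clicks
# so the underlying signal is not clobbered by opaque dots.
import matplotlib.pyplot as plt

def show_with_prior_clicks(image, prior_clicks, radius=8):
    fig, ax = plt.subplots()
    ax.imshow(image, cmap='gray')
    for x, y in prior_clicks:
        ax.add_patch(plt.Circle((x, y), radius, fill=False,
                                edgecolor='red', linewidth=0.8))
    ax.set_axis_off()
    plt.show()
```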

Needle in a haystack - Jeremy Wolfe discovered that human detection rates plummet when the 'object' to be detected is rare. We showed that this is because observers optimize 'expected reward' (Homo economicus in visual search, JOV 2009, PDF). We could work on needle-in-a-haystack search, give the annotators different reward structures, and see what happens. This is a very useful piece of knowledge for the machine vision community; I do not think that many people know about it. The task could be finding tennis courts in Google Earth pictures of Venice (there is only one). Or perhaps counting police cars in overhead pictures of Los Angeles (many, but probably sparse).
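
For intuition, a toy version of the expected-reward account under standard equal-variance Gaussian signal-detection assumptions; the payoff values are placeholders for the reward structures we would vary:

```python
# Sketch: the criterion that maximizes expected reward rises as targets
# become rare, so the hit rate plummets. Equal-variance Gaussian SDT;
# payoffs are placeholder values.
import numpy as np
from scipy.stats import norm

def optimal_criterion(prior, d_prime, v_hit=1.0, c_miss=1.0,
                      v_cr=1.0, c_fa=1.0):
    beta = ((1 - prior) / prior) * ((v_cr + c_fa) / (v_hit + c_miss))
    return d_prime / 2 + np.log(beta) / d_prime

def hit_rate(prior, d_prime=2.0):
    c = optimal_criterion(prior, d_prime)
    return norm.sf(c - d_prime)  # P(respond "target" | target present)

for p in (0.5, 0.1, 0.01):
    print(p, round(hit_rate(p), 3))  # 0.841, 0.461, 0.097
```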

Crowding - What happens when the things to be clicked are very close to each other? Will the software try to merge neighboring clicks? (A merging sketch follows after the list below.) For this experiment one could use cell cultures. We could also count the people in this picture of St. Mark's Square.

  • The cell detection dataset from Zisserman's group may also be appropriate
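
A sketch of how one could check the merging behavior: cluster clicks with a distance threshold and see how many crowded cells collapse into single detections. The single-link clustering and the 10-pixel threshold are illustration choices:

```python
# Sketch: merge clicks that fall within merge_dist pixels of each other
# and return one centroid per cluster.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def merge_clicks(clicks, merge_dist=10.0):
    """clicks: (n, 2) array of (x, y) positions."""
    clicks = np.asarray(clicks, dtype=float)
    if len(clicks) < 2:
        return clicks
    labels = fcluster(linkage(clicks, method='single'),
                      t=merge_dist, criterion='distance')
    return np.array([clicks[labels == k].mean(axis=0)
                     for k in np.unique(labels)])
```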

Different annotation primitives - We should experiment with labels that are intrinsically different. At a minimum: single clicks (e.g. in the center of each one of the cells in a dish), lines (e.g. clicks on the two ends of elongated cells such as E. coli), and boxes (e.g. around pedestrians). It would also be nice to have experiments on contours; will there be enough time to do them?
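
A minimal sketch of the three primitives as shared data structures; the field names are assumptions, not an agreed interface:

```python
# Sketch: the three annotation primitives. Contours would add a list of
# points if there is time for those experiments.
from dataclasses import dataclass

@dataclass
class Click:   # e.g. the center of a cell in a dish
    x: float
    y: float

@dataclass
class Line:    # e.g. the two ends of an elongated E. coli cell
    x1: float
    y1: float
    x2: float
    y2: float

@dataclass
class Box:     # e.g. around a pedestrian
    x: float
    y: float
    w: float
    h: float
```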

Combining AMT workers with automata - Suppose that one had an automaton with a given ROC curve; how would one use it in the different scenarios (needle in a haystack, noisy signal, ...)? Would one first run the automaton and use its output to prioritize the work of the human annotators? Other ideas?
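
One possible scheme, sketched under the assumption that the automaton outputs a confidence score per candidate: run it at a high-recall operating point, auto-accept the confident detections, and send only the uncertain ones to the annotators. The two thresholds are placeholders:

```python
# Sketch: triage candidate detections by automaton confidence so human
# effort is spent where the automaton is uncertain.

def triage(detections, hi=0.9, lo=0.2):
    """detections: list of (item, score) pairs with scores in [0, 1]."""
    accepted = [d for d, s in detections if s >= hi]
    for_humans = [d for d, s in detections if lo <= s < hi]
    rejected = [d for d, s in detections if s < lo]
    return accepted, for_humans, rejected
```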

General comments:

  • Working on Google Earth (or Google Maps) pictures has the advantage that we can use higher resolution versions for providing ground truth.