Zanetti CVPR08 outline


Title

  • "A walk through the videos one finds on the internet"
  • "A walk through YouTube videos"
  • "Exploring generic videos"
  • "Exploring YouTube videos"
  • "An exploration of web video clips"
  • "A walk through the web's video clips"

comment: I know we want something more general than YouTube but right now, YouTube is all we've got, so maybe we shouldn't claim more than that.

Title votes:

  • Lihi: I like the one Sara proposed. Pietro, is it correct English?
  • Sara: I like the first one but I think the last one is more "professional". What about "A walk through the web's video clips" ?
  • Pietro:

Abstract

Millions of video clips have been posted on the web every day over the last couple of years. The popularity of web-based video databases poses a number of challenges to machine vision scientists: how do we organize, index and search such a wealth of data? Content-based video search and classification have been proposed in the literature and applied successfully to analyzing movies, TV broadcasts and lab-made videos. However, we find that the statistics of web video clips are different: their quality is lower and their subject matter is more varied. Consequently, algorithms we have come to trust no longer work well. We carry out our exploration by analyzing a large data set of a few thousand videos that we collected from the web. We describe our data collection techniques and make the data publicly available to encourage further research.

1. Intro

(See the Wikipedia page on video clips.)

  • Millions of video clips are posted on the web every day.
  • Current search techniques are based on keywords.
  • Single-frame thumbnails are not sufficient.
  • Searching video is even harder than searching images.
  • Can we utilize these enormous amounts of data?
  • Can we classify such videos?
  • Can we search these databases?
  • We explore the characteristics of web videos
  • We try standard algorithms (and show they mostly fail)
  • We point at what needs to be done

2. The collected database

2.1 The methodology of data collection

  • keywords used for collecting the data and why we chose them
  • which videos were downloaded for each keyword (i.e., did we take the top 10 or was there any pruning)

2.2 Technical details about collecting the data

  • which software we used
  • the degradation in quality
  • how long it takes to collect data
  • what is the required storage space
  • how many corrupt videos (a crude tally is sketched after this list)
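
A crude way to tally storage and obviously broken downloads; a minimal sketch that assumes one downloaded file per video sitting in a flat directory named downloads/, which is a placeholder for whatever layout we end up with:

 import os
 
 # Hypothetical layout: one downloaded file per video under DATA_DIR.
 DATA_DIR = "downloads"
 
 names = os.listdir(DATA_DIR)
 total_bytes = 0
 corrupt = 0
 for name in names:
     size = os.path.getsize(os.path.join(DATA_DIR, name))
     total_bytes += size
     # Crude check: empty files count as corrupt; a real check
     # would try to decode the container.
     if size == 0:
         corrupt += 1
 
 print("videos: %d" % len(names))
 print("storage: %.1f GB" % (total_bytes / 1e9))
 print("corrupt (zero-byte): %d" % corrupt)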

2.3 Hand-labeling the data

  • Which keywords were chosen to label the data by hand.
  • Multiple keywords per video
  • Unclassifiable videos: add one label for these, show examples
  • Frequency of each label in the collection
  • Consolidating infrequent labels
 Figure: (a) histogram of number of videos per final label, (b) histogram of how many labels each video has
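
Both histograms can be computed directly from the label table; a minimal sketch, where the mapping video_labels is a hypothetical stand-in for however the hand labels end up being stored:

 from collections import Counter
 
 # Hypothetical hand-labeling result: video id -> list of labels.
 video_labels = {
     "vid_0001": ["music", "concert"],
     "vid_0002": ["cartoon"],
     "vid_0003": ["unclassifiable"],
 }
 
 # (a) number of videos per final label
 videos_per_label = Counter(l for labels in video_labels.values() for l in labels)
 # (b) number of labels per video
 labels_per_video = Counter(len(labels) for labels in video_labels.values())
 
 print(videos_per_label.most_common())
 print(sorted(labels_per_video.items()))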

3. Web video characteristics

To develop algorithms that handle generic videos, we first need to study the characteristics of the data.

3.1 Categories

There is large variability among web videos. They include the familiar TV broadcasts and movies, but also:

  • presentations
  • cartoons
  • animations
  • graphics
  • advertisements
  • home videos
  • lots of crazy stuff.

3.2 Basic statistics

  • Video length
  • Number of cuts
  • Length of shots
  • Resolution (all are 320x240?)
  • Quality

Can we compare this to the statistics of standard TV shows or movies?

 Figure: (a) histogram of video lengths (log log?), (b) histogram of length of shots
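
A minimal plotting sketch for this figure, assuming the per-video lengths (in seconds) and per-shot lengths (in frames) have already been extracted into plain text files; the file names and units are placeholders:

 import numpy as np
 import matplotlib.pyplot as plt
 
 video_lengths = np.loadtxt("video_lengths_sec.txt")   # one value per video
 shot_lengths = np.loadtxt("shot_lengths_frames.txt")  # one value per shot
 
 fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
 
 # (a) video lengths on log-log axes
 counts, edges = np.histogram(video_lengths, bins=np.logspace(0, 4, 40))
 centers = np.sqrt(edges[:-1] * edges[1:])
 ax1.loglog(centers, counts, "o-")
 ax1.set_xlabel("video length (s)")
 ax1.set_ylabel("number of videos")
 
 # (b) shot lengths
 ax2.hist(shot_lengths, bins=50)
 ax2.set_xlabel("shot length (frames)")
 ax2.set_ylabel("number of shots")
 
 plt.tight_layout()
 plt.savefig("basic_stats.png")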

3.3 Degradation

We could take short clips with cameras, cellphones and video cameras, upload them to YouTube and download them again, then compare the quality of the originals against the YouTube-compressed versions.
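
One way to quantify the loss is frame-wise PSNR between an original clip and its re-downloaded version. A minimal sketch using OpenCV, assuming the two clips are already temporally aligned and have the same resolution (in practice the YouTube version would have to be rescaled first); the file names are hypothetical:

 import cv2
 import numpy as np
 
 def mean_psnr(original_path, degraded_path, max_frames=500):
     """Average PSNR over corresponding frames of two videos."""
     cap_a = cv2.VideoCapture(original_path)
     cap_b = cv2.VideoCapture(degraded_path)
     psnrs = []
     for _ in range(max_frames):
         ok_a, frame_a = cap_a.read()
         ok_b, frame_b = cap_b.read()
         if not (ok_a and ok_b):
             break
         diff = frame_a.astype(np.float64) - frame_b.astype(np.float64)
         mse = np.mean(diff ** 2)
         psnrs.append(10 * np.log10(255.0 ** 2 / mse) if mse > 0 else np.inf)
     return float(np.mean(psnrs))
 
 print(mean_psnr("original.avi", "youtube_version.flv"))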

4. Do known algorithms work for generic video?

4.1 Cut detection

Show that performance degrades compared to experiments on clean TV broadcasts, and show a few examples of what goes wrong.
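
For reference, the kind of baseline detector we would run is a standard color-histogram-difference test between consecutive frames; a minimal sketch (not necessarily the detector we will report, and the threshold is arbitrary rather than tuned to our data):

 import cv2
 
 def detect_cuts(path, threshold=0.5):
     """Flag frame indices where the HSV color histogram changes abruptly."""
     cap = cv2.VideoCapture(path)
     cuts, prev_hist, idx = [], None, 0
     while True:
         ok, frame = cap.read()
         if not ok:
             break
         hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
         hist = cv2.calcHist([hsv], [0, 1], None, [16, 16], [0, 180, 0, 256])
         hist = cv2.normalize(hist, None).flatten()
         if prev_hist is not None:
             # Bhattacharyya distance between consecutive frame histograms.
             d = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
             if d > threshold:
                 cuts.append(idx)
         prev_hist, idx = hist, idx + 1
     return cuts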


 Figure: show examples of sequences where automatic cut detection has trouble.

4.2 Supervised classification

Since a given video may have multiple labels, we should classify videos independently with respect to each label, i.e., treat the labels as independent and run L binary classifications, where L is the number of labels.

The experiment could be run using nearest-neighbor classification. If we use k-nearest neighbors, we could obtain ROC curves.

We could initially try Fisher's linear discriminant, to graphically represent the performance of linear classification thresholds.

So: for each label, classification will be attempted both with Fisher's linear discriminant and with nearest-neighbor classification.
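
A sketch of the per-label protocol; the descriptors X and the binary label matrix Y below are random placeholders for whatever video representation we settle on, and scikit-learn is used only for convenience:

 import numpy as np
 from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
 from sklearn.neighbors import KNeighborsClassifier
 from sklearn.model_selection import train_test_split
 from sklearn.metrics import roc_auc_score
 
 # Placeholder data: N videos, D-dimensional descriptors, L binary labels per video.
 N, D, L = 200, 50, 5
 rng = np.random.RandomState(0)
 X = rng.rand(N, D)
 Y = rng.rand(N, L) > 0.7          # multi-label: a video may be positive for several labels
 
 for l in range(L):                # one independent binary problem per label
     X_tr, X_te, y_tr, y_te = train_test_split(X, Y[:, l], test_size=0.3, random_state=0)
     for clf in (LinearDiscriminantAnalysis(), KNeighborsClassifier(n_neighbors=9)):
         clf.fit(X_tr, y_tr)
         scores = clf.predict_proba(X_te)[:, 1]   # for k-NN: fraction of positive neighbors
         print("label %d, %s: AUC = %.2f"
               % (l, clf.__class__.__name__, roc_auc_score(y_te, scores)))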

 Figure: for each label, show the ROC for the first Fisher dimension and the ROC for k-nearest neighbors, thresholding at 1 out of 9, 2 out of 9, etc.

4.3 Unsupervised clustering

Since supervised classification is problematic (we don't have good labels), we try exploring the data with unsupervised methods; a rough sketch follows the list below.

  • what do we discover?
  • what is the distribution of cluster sizes? (e.g., do we get a few large clusters and a few small ones, or do they all have the same size?)
  • is there any correspondence with the manual labels we have?
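
A minimal sketch of that exploration: k-means over placeholder descriptors, the distribution of cluster sizes, and a simple purity score against a hypothetical primary manual label per video. None of these choices is final.

 import numpy as np
 from collections import Counter
 from sklearn.cluster import KMeans
 
 # Placeholder descriptors and one (primary) manual label per video.
 rng = np.random.RandomState(0)
 X = rng.rand(300, 50)
 manual = rng.randint(0, 8, size=300)
 
 km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)
 
 # Distribution of cluster sizes: a few big ones, or all roughly equal?
 sizes = Counter(km.labels_)
 print(sorted(sizes.values(), reverse=True))
 
 # Purity: for each cluster, the fraction of members sharing its most common manual label.
 agree = sum(Counter(manual[km.labels_ == c]).most_common(1)[0][1] for c in sizes)
 print("purity = %.2f" % (agree / float(len(X))))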

4.4 Nearest neighbor query

Both classification and clustering seem too ambitious at this point. Can we at least search for "similar" videos in the database? (A sketch of such a query follows the list below.)

  • show results for visual-words and space-time-histograms
  • try a few "easy" videos.
  • show failures for randomly selected videos
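
The query itself can be very simple once each video is reduced to a fixed-length histogram (visual words or space-time bins); a minimal sketch using histogram intersection as the similarity (chi-squared or L2 would do as well), with random placeholder data:

 import numpy as np
 
 def nearest_videos(query, features, k=5):
     """Return indices of the k most similar videos under histogram intersection."""
     # Normalize each histogram to sum to 1 so clips of different length compare fairly.
     f = features / features.sum(axis=1, keepdims=True)
     q = query / query.sum()
     similarity = np.minimum(f, q).sum(axis=1)   # histogram intersection
     return np.argsort(-similarity)[:k]
 
 # Placeholder database: 1000 videos, 500-bin histograms.
 rng = np.random.RandomState(0)
 features = rng.rand(1000, 500)
 print(nearest_videos(features[0], features))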



5. What's next?

We've shown that nothing works well, but is there any hope for future success?

  • we have collected a dataset which we're going to make available
  • we need tools for video summarization and fast video browsing to enable faster labeling
  • most videos can get more than one label; we thus need to develop clustering and classification algorithms that allow assigning a video to more than one cluster.

6. Summary