These are my notes on this paper.


Tracking datasets have issues:

  • Videos are only 30 seconds long
  • No examples of the object disappearing
  • etc

The researchers present a new dataset they built and propose it as a new benchmark.

New Dataset Description

  • Videos have avg length of 2.4 minutes
  • Half of videos have targets disappear
  • Dev/test splits provided
  • Splits formed by partitioning object classes (dev and test contain different classes).
  • New eval method that takes into account presence of the target
  • Selected 337 videos from YouTube Bounding Box dataset (YTBB) and manually combined short tracks
  • Manually selected initialization frame + discarded preceding frames

5 people did the work. 2 people verified each completed example.

Grad student lyfe.


They propose the geometric mean of true positive rate and true negative rate (sqrt(TPR * TNR)) as the evaluation measure. Pretty graphs to explain why.

Conclusion: we have a new large tracking dataset w/ sparse 1Hz labels. “It’s the largest ever compared to other tracking datasets.”

No other major insights beyond “other datasets don’t measure true negative rate.” No new methods. A bit disappointing.

Note: Paper hasn’t been accepted anywhere yet so maybe not that interesting.


DAVIS dataset is really well designed and could be used for tracking even though it’s not specifically designed for tracking.