Learning a Visual Tracker from a Single Movie without Annotation

Authors: Lingxiao Yang, David Zhang, Lei Zhang (pp. 9095-9102)

AAAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our approach is insensitive to the employed movies, and the trained visual tracker achieves leading performance among existing unsupervised learning approaches. Even compared with the same network trained with human labeled bounding boxes, our tracker achieves similar results on many tracking benchmarks. From the Experiments section: In this section, we firstly conduct several experiments to study the effect of various options to train a DCFNet.
Researcher Affiliation | Academia | Lingxiao Yang, David Zhang, Lei Zhang. Department of Computing, The Hong Kong Polytechnic University, Hong Kong. {cslyang,csdzhang,cslzhang}@comp.polyu.edu.hk
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at: https://github.com/ZjjConan/UL-Tracker-AAAI2019.
Open Datasets | Yes | Testing datasets: The experiments are evaluated on OTB-2013 (Wu, Lim, and Yang 2013), OTB-2015 (Wu, Lim, and Yang 2015) and VOT-2015 (Kristan, Matas, and Leonardis 2015) datasets. The paper also contrasts models pre-trained on ImageNet (Deng et al. 2009), with human annotated object labels, with the Siamese networks (Bertinetto et al. 2016; Wang et al. 2017; Valmadre et al. 2017; Li et al. 2018) that are pre-trained on ILSVRC2015 videos (Russakovsky et al. 2015) with given object bounding boxes.
Dataset Splits | No | The paper describes using standard benchmark datasets for testing (OTB, VOT) and training on a single movie, but it does not specify explicit training/validation/test splits (e.g., percentages or counts) needed to reproduce the partitioning of its training source.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed machine specifications) used to run its experiments.
Software Dependencies | No | We implement on MATLAB using MatConvNet toolbox (Vedaldi and Lenc 2015). The paper names the software but gives no version numbers, which are needed for reproducibility.
Experiment Setup | Yes | Parameters for DCFNet: For fair comparison to supervised learning, all parameters in DCFNet are the same as in (Wang et al. 2017) to tease apart that effect. In detail, the maximal interval L for frame-pair sampling is 10. λ and γ in Eq. (2) and Eq. (4) are set to 1e-4 and 5e-4 respectively. For online tracking, λ is the same as in off-line learning, and the importance factor αt is set to 0.01. Parameters for off-line learning: We set I = 4 and V = 400 for constructing the database D. For each sampled frame pair, we track all object proposals in the first frame and then use FBA to select at most K = 16 examples to construct D. The NMS threshold in FBA is set to 0.3, the same as in most object detection systems. D typically consists of around N = 25,000 region pairs per optimization round. For network optimization, the initial weights are randomly generated using the improved Xavier technique (He et al. 2015). We train the network using Stochastic Gradient Descent with a mini-batch size of M = 32. Momentum and weight decay are set to 0.9 and 5e-4 respectively. We repeat the above steps until all clips from a movie are reviewed and then start a new epoch. The network is trained for 10 epochs with a learning rate exponentially decaying from 1e-2 to 1e-3.
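As a reading aid, the hyper-parameters quoted above can be gathered in one place. The following is a minimal Python sketch, not the authors' MATLAB/MatConvNet code; all identifiers are hypothetical, only the numeric values come from the quoted setup, and the learning-rate function assumes one plausible reading of "exponentially decaying from 1e-2 to 1e-3" over 10 epochs.

```python
# Hypothetical collection of the training hyper-parameters quoted above.
# Illustrative sketch only; the paper's implementation uses MATLAB/MatConvNet.

# DCFNet / tracking parameters
MAX_FRAME_INTERVAL = 10     # L: maximal interval for frame-pair sampling
LAMBDA = 1e-4               # lambda in Eq. (2); also used for online tracking
GAMMA = 5e-4                # gamma in Eq. (4)
ALPHA_T = 0.01              # importance factor for online model updates

# Off-line database construction
NUM_INTERVALS = 4           # I
NUM_CLIPS = 400             # V
MAX_EXAMPLES_PER_PAIR = 16  # K, selected via forward-backward analysis (FBA)
NMS_THRESHOLD = 0.3         # NMS threshold used inside FBA
APPROX_DB_SIZE = 25_000     # N: region pairs per optimization round (approx.)

# SGD optimization
BATCH_SIZE = 32             # M
MOMENTUM = 0.9
WEIGHT_DECAY = 5e-4
NUM_EPOCHS = 10
LR_START, LR_END = 1e-2, 1e-3


def learning_rate(epoch: int) -> float:
    """Exponential decay from LR_START to LR_END across NUM_EPOCHS epochs
    (assumed schedule; the paper only states the start and end values)."""
    t = epoch / max(NUM_EPOCHS - 1, 1)
    return LR_START * (LR_END / LR_START) ** t


if __name__ == "__main__":
    for epoch in range(NUM_EPOCHS):
        print(f"epoch {epoch:2d}: lr = {learning_rate(epoch):.4f}")
```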