Two-Stream Convolutional Networks for Action Recognition in Videos

Authors: Karen Simonyan, Andrew Zisserman

NeurIPS 2014

Reproducibility assessment (variable: result, followed by the supporting LLM response):

Research Type: Experimental
  "Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art."

Researcher Affiliation: Academia
  "Karen Simonyan, Andrew Zisserman. Visual Geometry Group, University of Oxford. {karen,az}@robots.ox.ac.uk"

Pseudocode: No
  The paper describes its methods and architectures in text and figures, but includes no explicitly labelled "Pseudocode" or "Algorithm" blocks, nor structured, code-like steps.

Open Source Code: No
  "Our implementation is derived from the publicly available Caffe toolbox [13], but contains a number of significant modifications, including parallel training on multiple GPUs installed in a single system." (There is no explicit statement that their code is released, and no link to it.)

Open Datasets: Yes
  "The evaluation is performed on UCF-101 [24] and HMDB-51 [16] action recognition benchmarks, which are among the largest available annotated video datasets."

Dataset Splits: No
  "The evaluation protocol is the same for both datasets: the organisers provide three splits into training and test data, and the performance is measured by the mean classification accuracy across the splits." (The paper only states the train/test splits provided by the organisers; no separate validation split is described for their own UCF-101/HMDB-51 experiments, although a validation set is mentioned for ImageNet pre-training and is implicit for fine-tuning.)

Hardware Specification: Yes
  "Training a single temporal ConvNet takes 1 day on a system with 4 NVIDIA Titan cards, which constitutes a 3.2 times speed-up over single-GPU training."

Software Dependencies: No
  "Our implementation is derived from the publicly available Caffe toolbox [13]... Optical flow is computed using the off-the-shelf GPU implementation of [2] from the OpenCV toolbox." (Only the toolbox names are given; no version numbers for Caffe or OpenCV are provided.)

Experiment Setup: Yes
  "The network weights are learnt using the mini-batch stochastic gradient descent with momentum (set to 0.9). At each iteration, a mini-batch of 256 samples is constructed... The learning rate is initially set to 10^-2, and then decreased according to a fixed schedule... when training a ConvNet from scratch, the rate is changed to 10^-3 after 50K iterations, then to 10^-4 after 70K iterations, and training is stopped after 80K iterations. In the fine-tuning scenario, the rate is changed to 10^-3 after 14K iterations, and training stopped after 20K iterations."
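The fixed learning-rate schedule quoted in the Experiment Setup entry can be sketched as a simple step function. This is a minimal illustration using only the iteration thresholds stated in the paper, not the authors' actual Caffe solver configuration:

```python
def learning_rate(iteration, fine_tuning=False):
    """Step schedule from the quoted setup: lr starts at 1e-2 and is
    divided by 10 at the fixed iteration counts given in the paper."""
    if fine_tuning:
        # Fine-tuning: drop to 1e-3 after 14K iterations (stop at 20K).
        return 1e-2 if iteration < 14_000 else 1e-3
    # From scratch: 1e-3 after 50K, 1e-4 after 70K (stop at 80K).
    if iteration < 50_000:
        return 1e-2
    if iteration < 70_000:
        return 1e-3
    return 1e-4
```

The stopping points (80K from scratch, 20K when fine-tuning) are enforced by the training loop, not by the schedule itself.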
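The evaluation protocol noted under Dataset Splits (mean classification accuracy over the three official train/test splits) amounts to a simple average. A small sketch, with hypothetical per-split accuracies chosen purely for illustration:

```python
def mean_split_accuracy(split_accuracies):
    """Mean classification accuracy over a dataset's official splits;
    the quoted protocol uses the three splits provided by the organisers."""
    return sum(split_accuracies) / len(split_accuracies)

# Hypothetical per-split accuracies (not results from the paper).
overall = mean_split_accuracy([0.88, 0.87, 0.89])
```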