Two-Stream Convolutional Networks for Action Recognition in Videos
Authors: Karen Simonyan, Andrew Zisserman
NeurIPS 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art. |
| Researcher Affiliation | Academia | Karen Simonyan Andrew Zisserman Visual Geometry Group, University of Oxford {karen,az}@robots.ox.ac.uk |
| Pseudocode | No | The paper describes methods and architectures in text and figures, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor structured, code-like steps. |
| Open Source Code | No | Our implementation is derived from the publicly available Caffe toolbox [13], but contains a number of significant modifications, including parallel training on multiple GPUs installed in a single system. (No explicit statement of their code being released or a link to it.) |
| Open Datasets | Yes | The evaluation is performed on UCF-101 [24] and HMDB-51 [16] action recognition benchmarks, which are among the largest available annotated video datasets |
| Dataset Splits | No | The evaluation protocol is the same for both datasets: the organisers provide three splits into training and test data, and the performance is measured by the mean classification accuracy across the splits. (The paper only states the organiser-provided train/test splits; it does not specify a validation split for its own UCF-101/HMDB-51 experiments, although a validation set is mentioned for ImageNet pre-training.) |
| Hardware Specification | Yes | Training a single temporal ConvNet takes 1 day on a system with 4 NVIDIA Titan cards, which constitutes a 3.2 times speed-up over single-GPU training. |
| Software Dependencies | No | Our implementation is derived from the publicly available Caffe toolbox [13]... Optical flow is computed using the off-the-shelf GPU implementation of [2] from the OpenCV toolbox. (Specific version numbers for Caffe or OpenCV are not provided, only the names of the toolboxes.) |
| Experiment Setup | Yes | The network weights are learnt using the mini-batch stochastic gradient descent with momentum (set to 0.9). At each iteration, a mini-batch of 256 samples is constructed... The learning rate is initially set to 10^-2, and then decreased according to a fixed schedule... when training a ConvNet from scratch, the rate is changed to 10^-3 after 50K iterations, then to 10^-4 after 70K iterations, and training is stopped after 80K iterations. In the fine-tuning scenario, the rate is changed to 10^-3 after 14K iterations, and training stopped after 20K iterations. |
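The fixed learning-rate schedule quoted in the Experiment Setup row can be sketched as a simple step function. This is a minimal illustrative sketch, not the authors' code: the function name `lr_schedule` and the use of a plain Python function (rather than Caffe's solver configuration, which the paper's implementation would have used) are assumptions for illustration.

```python
def lr_schedule(iteration, fine_tuning=False):
    """Step learning-rate schedule as described in the paper.

    From scratch: 1e-2, then 1e-3 after 50K iterations, then 1e-4
    after 70K; training stops at 80K iterations.
    Fine-tuning: 1e-2, then 1e-3 after 14K; training stops at 20K.
    (Momentum is fixed at 0.9 and mini-batch size at 256 throughout.)
    """
    if fine_tuning:
        return 1e-2 if iteration < 14_000 else 1e-3
    if iteration < 50_000:
        return 1e-2
    if iteration < 70_000:
        return 1e-3
    return 1e-4
```

For example, `lr_schedule(60_000)` returns `1e-3`, matching the from-scratch schedule after the first drop at 50K iterations.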