Unsupervised Learning of Video Representations using LSTMs
Authors: Nitish Srivastava, Elman Mansimov, Ruslan Salakhutdinov
ICML 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment with two kinds of input sequences... We analyze the outputs of the model qualitatively... We further evaluate the representations by finetuning them for a supervised learning problem: human action recognition on the UCF-101 and HMDB-51 datasets. We show that the representations help improve classification accuracy, especially when there are only a few training examples. |
| Researcher Affiliation | Academia | Nitish Srivastava (NITISH@CS.TORONTO.EDU), Elman Mansimov (EMANSIM@CS.TORONTO.EDU), Ruslan Salakhutdinov (RSALAKHU@CS.TORONTO.EDU), University of Toronto, 6 King's College Road, Toronto, ON M5S 3G4, Canada |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not mention releasing source code for the methodology described. |
| Open Datasets | Yes | We use the UCF-101 and HMDB-51 datasets for supervised tasks. The UCF-101 dataset (Soomro et al., 2012)... The HMDB-51 dataset (Kuehne et al., 2011)... To train the unsupervised models, we used a subset of the YouTube videos from the Sports-1M dataset (Karpathy et al., 2014). |
| Dataset Splits | Yes | The UCF-101 dataset (Soomro et al., 2012) contains 13,320 videos... The dataset has 3 standard train/test splits with the training set containing around 9,500 videos in each split (the rest are test). The HMDB-51 dataset (Kuehne et al., 2011) contains 5,100 videos... This also has 3 train/test splits with 3,570 videos in the training set and the rest in test. |
| Hardware Specification | No | The acknowledgments thank 'NVIDIA Corporation with the donation of a GPU used for this research', but the paper does not specify the GPU model or any other hardware details (e.g., CPU, memory). |
| Software Dependencies | No | The paper mentions that 'Our implementation of LSTMs follows closely the one discussed by Graves (2013)' and 'Percepts were extracted using the convolutional neural net model of Simonyan & Zisserman (2014b)'. However, it does not provide specific version numbers for any software libraries, frameworks, or tools used. |
| Experiment Setup | Yes | We first trained the models on a dataset of moving MNIST digits. The LSTM had 2048 units. The encoder took 10 frames as input. The decoder tried to reconstruct these 10 frames and the future predictor attempted to predict the next 10 frames. We used logistic output units with a cross entropy loss function... we trained a two layer Composite Model, with each layer having 2048 units. |
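
The 'moving MNIST digits' data in the Experiment Setup row can be reconstructed from the paper's description: two MNIST digits drifting inside a 64×64 patch and bouncing off its edges, with the 20-frame length following from the quoted 10 input + 10 future frames. The generator below is a minimal hypothetical sketch under those assumptions; the velocity range and the use of a pixelwise maximum for overlapping digits are our guesses, not details stated in the excerpt.

```python
import numpy as np

def make_moving_mnist(digits, seq_len=20, size=64, num_digits=2, rng=None):
    """digits: array of (28, 28) MNIST images with values in [0, 1]."""
    rng = rng if rng is not None else np.random.default_rng()
    video = np.zeros((seq_len, size, size), dtype=np.float32)
    for _ in range(num_digits):
        img = digits[rng.integers(len(digits))]
        # Random start position and velocity (velocity range is a guess).
        x, y = rng.uniform(0, size - 28, size=2)
        vx, vy = rng.uniform(-3, 3, size=2)
        for t in range(seq_len):
            # Bounce off the patch borders.
            if not 0 <= x + vx <= size - 28:
                vx = -vx
            if not 0 <= y + vy <= size - 28:
                vy = -vy
            x, y = x + vx, y + vy
            xi, yi = int(x), int(y)
            # Overlay the digit; overlapping digits keep the brighter pixel.
            np.maximum(video[t, yi:yi + 28, xi:xi + 28], img,
                       out=video[t, yi:yi + 28, xi:xi + 28])
    return video  # first 10 frames -> encoder input, last 10 -> future target
```

With the quoted setup, `video[:10]` (flattened per frame) would feed the encoder and `video[10:]` would serve as the future-prediction target.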
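
The Composite Model described in that row is concrete enough to sketch. Below is a minimal, hypothetical PyTorch reconstruction, not the authors' code (none was released): an encoder LSTM reads the 10 input frames, and two decoder branches copy its final state, one reconstructing the inputs and one predicting the next 10 frames, with logistic output units and a per-pixel cross-entropy loss as quoted. The framework choice and the unconditioned (zero-input) decoding are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompositeLSTM(nn.Module):
    """Encoder LSTM plus two decoder branches (input reconstruction and
    future prediction), following the setup quoted in the table above."""

    def __init__(self, frame_dim=64 * 64, hidden=2048):
        super().__init__()
        self.encoder = nn.LSTM(frame_dim, hidden, batch_first=True)
        self.recon_decoder = nn.LSTM(frame_dim, hidden, batch_first=True)
        self.future_decoder = nn.LSTM(frame_dim, hidden, batch_first=True)
        self.recon_out = nn.Linear(hidden, frame_dim)
        self.future_out = nn.Linear(hidden, frame_dim)

    def _unroll(self, decoder, readout, state, steps, frame_dim, device):
        # Unconditioned decoding (an assumption): feed zeros at every step
        # and let the copied encoder state drive the generation.
        batch = state[0].size(1)
        inp = torch.zeros(batch, 1, frame_dim, device=device)
        frames = []
        for _ in range(steps):
            out, state = decoder(inp, state)
            frames.append(torch.sigmoid(readout(out)))  # logistic output units
        return torch.cat(frames, dim=1)

    def forward(self, x):
        # x: (batch, 10, frame_dim) with pixel values in [0, 1].
        _, state = self.encoder(x)
        recon = self._unroll(self.recon_decoder, self.recon_out,
                             state, x.size(1), x.size(2), x.device)
        future = self._unroll(self.future_decoder, self.future_out,
                              state, x.size(1), x.size(2), x.device)
        return recon, future


def composite_loss(recon, future, inputs, targets):
    # Cross entropy treating each pixel as an independent Bernoulli variable.
    return (F.binary_cross_entropy(recon, inputs)
            + F.binary_cross_entropy(future, targets))
```

A single-layer model is shown for brevity; the quoted two-layer Composite Model would stack a second 2048-unit LSTM (`num_layers=2`) in the encoder and in both decoder branches.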