Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning
Authors: William Lotter, Gabriel Kreiman, David Cox
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here, we explore prediction of future frames in a video sequence as an unsupervised learning rule for learning about the structure of the visual world. We show that these networks are able to robustly learn to predict the movement of synthetic (rendered) objects, and that in doing so, the networks learn internal representations that are useful for decoding latent object parameters (e.g. pose) that support object recognition with fewer training views. We also show that these networks can scale to complex natural image streams (car-mounted camera videos), capturing key aspects of both egocentric movement and the movement of objects in the visual scene, and the representation learned in this setting is useful for estimating the steering angle. Altogether, these results suggest that prediction represents a powerful framework for unsupervised learning, allowing for implicit learning of object and scene structure. |
| Researcher Affiliation | Academia | William Lotter, Gabriel Kreiman & David Cox, Harvard University, Cambridge, MA 02215, USA {lotter,davidcox}@fas.harvard.edu gabriel.kreiman@tch.harvard.edu |
| Pseudocode | Yes | Algorithm 1 Calculation of PredNet states (the update order is sketched after this table) |
| Open Source Code | Yes | Code and video examples can be found at: https://coxlab.github.io/prednet/ |
| Open Datasets | Yes | Models were trained using the raw videos from the KITTI dataset (Geiger et al., 2013) [...] We tested on the CalTech Pedestrian dataset (Dollár et al., 2009) [...] We used a dataset released by Comma.ai (Biasini et al., 2016) [...] Human3.6M (Ionescu et al., 2014) dataset |
| Dataset Splits | Yes | We used 16K sequences for training and 800 for both validation and testing. [...] 57 recording sessions used for training and 4 used for validation. [...] we used 5% of each video for validation and testing, chosen as a random continuous chunk, and discarded the 10 frames before and after the chosen segments from the training set. (the per-video split logic is sketched after this table) |
| Hardware Specification | No | The paper does not specify the hardware used for training or inference; the only hardware it mentions is the "car-mounted camera" used for data collection. |
| Software Dependencies | No | The paper acknowledges Keras ("We would also like to thank the developers of Keras (Chollet, 2016).") but gives no version number, and it lists no other software dependencies. |
| Experiment Setup | Yes | the best performing models tended to have a loss solely concentrated at the lowest layer (i.e. λ0 = 1, λl>0 = 0) [...] the model shown has 5 layers with 3x3 filter sizes for all convolutions, max-pooling of stride 2, and number of channels per layer, for both Al and Rl units, of (1, 32, 64, 128, 256). Model weights were optimized using the Adam algorithm (Kingma & Ba, 2014). [...] a 4 layer model with 3x3 convolutions and layer channel sizes of (3, 48, 96, 192). Models were again trained with Adam [...] Adam parameters were initially set to their default values (α = 0.001, β1 = 0.9, β2 = 0.999) with the learning rate, α, decreasing by a factor of 10 halfway through training. (the learning-rate schedule is sketched after this table) |
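The three sketches below illustrate, in Python, the mechanics quoted in the Pseudocode, Dataset Splits, and Experiment Setup rows. First, Algorithm 1's update order: a top-down pass updates the representation units R_l, followed by a bottom-up pass computing predictions Â_l, targets A_l, and split positive/negative errors E_l. This is our own minimal NumPy rendering of that order, not the authors' code; the per-layer callables (`conv_lstm`, `conv_ahat`, `conv_a`, `upsample`, `maxpool`) are abstract placeholders standing in for trained layers.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def prednet_step(x, R, E, conv_lstm, conv_ahat, conv_a, upsample, maxpool):
    """One time step of the PredNet state update (order per Algorithm 1).

    x: input frame; R, E: lists of per-layer states (layers 0 .. L-1),
    holding the previous time step's values on entry.
    """
    L = len(R)
    # Top-down pass: update R_l from the previous step's E_l and R_l,
    # plus the upsampled representation of the layer above (if any).
    for l in reversed(range(L)):
        if l == L - 1:
            R[l] = conv_lstm[l](E[l], R[l])
        else:
            R[l] = conv_lstm[l](E[l], R[l], upsample(R[l + 1]))
    # Bottom-up pass: predictions, targets, and errors.
    A = x  # the target at layer 0 is the frame itself
    for l in range(L):
        A_hat = relu(conv_ahat[l](R[l]))          # prediction of A_l
        E[l] = np.concatenate([relu(A - A_hat),   # split error into
                               relu(A_hat - A)],  # +/- populations
                              axis=-1)
        if l < L - 1:
            # Target for the next layer, computed from this layer's error.
            A = maxpool(relu(conv_a[l](E[l])))
    return R, E
```

Note that the R update reads E from the previous time step, because the bottom-up pass that overwrites E runs only afterwards; this ordering is the essential point of Algorithm 1.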
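Second, the Comma.ai split described in the Dataset Splits row: a random continuous chunk of each video is held out for validation/testing, and the 10 frames on either side of it are discarded from the training set. A sketch under those stated parameters (the function name and return format are ours):

```python
import numpy as np

def split_video(n_frames, holdout_frac=0.05, buffer=10, rng=None):
    """Hold out one random continuous chunk of a video and drop a
    buffer of frames on each side from the training indices."""
    rng = rng or np.random.default_rng()
    chunk = int(round(holdout_frac * n_frames))
    start = int(rng.integers(0, n_frames - chunk + 1))
    holdout = list(range(start, start + chunk))
    # Discard `buffer` frames before and after the held-out segment so
    # that training sequences never overlap the evaluation chunk.
    excluded = set(range(max(0, start - buffer),
                         min(n_frames, start + chunk + buffer)))
    train = [i for i in range(n_frames) if i not in excluded]
    return train, holdout

train_idx, holdout_idx = split_video(10_000)  # e.g. one 10k-frame video
```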
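Third, the optimization recipe in the Experiment Setup row: Adam at its default parameters (α = 0.001, β1 = 0.9, β2 = 0.999), with the learning rate dropped by a factor of 10 halfway through training. Since the paper acknowledges Keras, a `LearningRateScheduler` callback is a natural way to express this; the epoch count below is an assumption, as the quoted text does not state the training length.

```python
from tensorflow import keras

EPOCHS = 150  # assumed training length; not stated in the quoted setup

def lr_schedule(epoch, lr):
    # Learning rate drops by a factor of 10 halfway through training.
    return 0.001 if epoch < EPOCHS // 2 else 0.0001

optimizer = keras.optimizers.Adam(learning_rate=0.001,
                                  beta_1=0.9, beta_2=0.999)
lr_callback = keras.callbacks.LearningRateScheduler(lr_schedule)

# Wiring into training is left commented out, since the model itself
# is not part of this sketch:
# model.compile(optimizer=optimizer, loss="mean_absolute_error")
# model.fit(x, y, epochs=EPOCHS, callbacks=[lr_callback])
```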