Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning

Authors: William Lotter, Gabriel Kreiman, David Cox

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental
"Here, we explore prediction of future frames in a video sequence as an unsupervised learning rule for learning about the structure of the visual world. We show that these networks are able to robustly learn to predict the movement of synthetic (rendered) objects, and that in doing so, the networks learn internal representations that are useful for decoding latent object parameters (e.g. pose) that support object recognition with fewer training views. We also show that these networks can scale to complex natural image streams (car-mounted camera videos), capturing key aspects of both egocentric movement and the movement of objects in the visual scene, and the representation learned in this setting is useful for estimating the steering angle. Altogether, these results suggest that prediction represents a powerful framework for unsupervised learning, allowing for implicit learning of object and scene structure."
Researcher Affiliation | Academia
"William Lotter, Gabriel Kreiman & David Cox
Harvard University, Cambridge, MA 02215, USA
{lotter,davidcox}@fas.harvard.edu
gabriel.kreiman@tch.harvard.edu"
Pseudocode | Yes
"Algorithm 1: Calculation of PredNet states"
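The row only names Algorithm 1 without reproducing it. The two-pass update it refers to (a top-down sweep refreshing the representation units R_l, then a bottom-up sweep forming predictions Â_l and stacked positive/negative errors E_l) can be sketched as below. This is a minimal NumPy sketch of the control flow only: the learned ConvLSTM and convolution updates are replaced by toy placeholder arithmetic, and all shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def prednet_step(frame, R, E):
    """One simplified PredNet time step (control flow of Algorithm 1 only).

    The real model updates R_l with a ConvLSTM fed by E_l and the upsampled
    R_{l+1}, and computes A_hat_l and A_{l+1} with learned convolutions;
    those learned maps are stubbed out here with toy arithmetic.
    """
    L = len(R)
    # Top-down pass: update representations from the highest layer down.
    for l in reversed(range(L)):
        top_down = R[l + 1].mean() if l + 1 < L else 0.0
        R[l] = 0.5 * R[l] + E[l].mean() + top_down  # ConvLSTM stand-in

    # Bottom-up pass: predict, compare to the target, and propagate errors.
    A = frame  # A_0 is the actual frame
    for l in range(L):
        A_hat = relu(R[l])  # stand-in for A_hat_l = ReLU(conv(R_l))
        E[l] = np.concatenate([relu(A - A_hat),   # positive error
                               relu(A_hat - A)])  # negative error
        A = E[l]  # stand-in for A_{l+1} = maxpool(conv(E_l))
    return R, E

# Toy usage: two layers on an 8x8 frame; error tensors stack the +/- halves,
# so each layer's shapes double along the first axis.
H, W, L = 8, 8, 2
R = [np.zeros((H * 2 ** l, W)) for l in range(L)]
E = [np.zeros((H * 2 ** (l + 1), W)) for l in range(L)]
R, E = prednet_step(np.random.rand(H, W), R, E)
```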
Open Source Code | Yes
"Code and video examples can be found at: https://coxlab.github.io/prednet/"
Open Datasets | Yes
"Models were trained using the raw videos from the KITTI dataset (Geiger et al., 2013) [...] We tested on the CalTech Pedestrian dataset (Dollár et al., 2009) [...] We used a dataset released by Comma.ai (Biasini et al., 2016) [...] Human3.6M (Ionescu et al., 2014) dataset"
Dataset Splits | Yes
"We used 16K sequences for training and 800 for both validation and testing. [...] 57 recording sessions used for training and 4 used for validation. [...] we used 5% of each video for validation and testing, chosen as a random continuous chunk, and discarded the 10 frames before and after the chosen segments from the training set."
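The last split in the quote (a random continuous held-out chunk per video, with a 10-frame buffer discarded from training on each side) is the easiest to mis-implement. A minimal sketch, assuming frames are held in a Python list; the function name and defaults are illustrative, not taken from the released code:

```python
import random

def hold_out_chunk(frames, frac=0.05, buffer=10, seed=0):
    """Hold out one random continuous chunk of a video (frac of its length)
    and drop `buffer` frames on each side of the chunk from training, as in
    the Human3.6M split quoted above. Names and defaults are illustrative."""
    n = len(frames)
    chunk = max(1, int(n * frac))
    start = random.Random(seed).randrange(n - chunk + 1)
    held_out = frames[start:start + chunk]
    # Training keeps everything outside the chunk plus its buffer zones.
    train = frames[:max(0, start - buffer)] + frames[start + chunk + buffer:]
    return train, held_out

# e.g. a 400-frame video yields a 20-frame held-out chunk, and at least
# 10 buffer frames on each in-range side are discarded from training.
train, held = hold_out_chunk(list(range(400)))
assert len(held) == 20 and len(train) < 400 - 20
```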
Hardware Specification | No
The paper does not specify the hardware used for training or inference of the models. It only mentions "car-mounted camera videos" for data collection.
Software Dependencies | No
The paper mentions Keras ("We would also like to thank the developers of Keras (Chollet, 2016).") but does not provide a specific version number. It does not list any other software dependencies with version numbers.
Experiment Setup | Yes
"the best performing models tended to have a loss solely concentrated at the lowest layer (i.e. λ_0 = 1, λ_{l>0} = 0) [...] the model shown has 5 layers with 3x3 filter sizes for all convolutions, max-pooling of stride 2, and number of channels per layer, for both A_l and R_l units, of (1, 32, 64, 128, 256). Model weights were optimized using the Adam algorithm (Kingma & Ba, 2014). [...] a 4 layer model with 3x3 convolutions and layer channel sizes of (3, 48, 96, 192). Models were again trained with Adam [...] Adam parameters were initially set to their default values (α = 0.001, β_1 = 0.9, β_2 = 0.999) with the learning rate, α, decreasing by a factor of 10 halfway through training."
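The quoted setup combines a layer-weighted error loss (concentrated entirely at the lowest layer) with Adam at default settings and a single 10x learning-rate drop at the halfway point; both fit in a few lines. A minimal sketch with illustrative function names and an assumed 150-epoch budget, since the row does not quote the epoch count:

```python
def weighted_error_loss(layer_errors, lambdas=(1.0, 0.0, 0.0, 0.0, 0.0)):
    """Loss = sum_l lambda_l * mean error activation of layer l; the quote's
    best models use lambda_0 = 1, lambda_{l>0} = 0 (lowest layer only)."""
    return sum(lam * err for lam, err in zip(lambdas, layer_errors))

def adam_alpha(epoch, total_epochs=150, base_alpha=1e-3):
    """Adam starts at its defaults (alpha=0.001, beta1=0.9, beta2=0.999);
    alpha drops by a factor of 10 halfway through training.
    total_epochs=150 is an assumed placeholder, not a quoted value."""
    return base_alpha if epoch < total_epochs // 2 else base_alpha / 10.0

# With lambda_0 = 1 only, the loss reduces to the lowest layer's mean error;
# the schedule returns the base rate for the first half, then a tenth of it.
assert weighted_error_loss([0.5, 0.2, 0.1, 0.0, 0.0]) == 0.5
assert adam_alpha(100) == adam_alpha(0) / 10.0
```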