Visual Representation Learning with Stochastic Frame Prediction
Authors: Huiwon Jang, Dongyoung Kim, Junsu Kim, Jinwoo Shin, Pieter Abbeel, Younggyo Seo
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we show that RSP can effectively learn image representations from a large real-world video dataset. Pre-trained on Kinetics-400 dataset (Kay et al., 2017), RSP achieves competitive or superior performance to various self-supervised learning baselines on a variety of tasks from vision-based robot learning benchmarks (James et al., 2020; Majumdar et al., 2023) and video label propagation benchmarks (Pont-Tuset et al., 2017; Zhou et al., 2018; Jhuang et al., 2013). In particular, RSP achieves a 36.0% average success rate in challenging robotic manipulation tasks from RLBench (James et al., 2020), while MAE baseline only achieves a 13.5% success rate. |
| Researcher Affiliation | Collaboration | 1KAIST 2UC Berkeley 3Now at Dyson Robot Learning Lab. Correspondence to: Huiwon Jang <huiwoen0516@kaist.ac.kr>, Younggyo Seo <seo0gyo@gmail.com>. |
| Pseudocode | Yes | Algorithm 1 RSP: PyTorch-like Pseudocode (a hedged sketch of such a training step is given below the table) |
| Open Source Code | Yes | Code is available on the project webpage: https://sites.google.com/view/2024rsp. |
| Open Datasets | Yes | Pre-trained on Kinetics-400 dataset (Kay et al., 2017), RSP achieves competitive or superior performance to various self-supervised learning baselines on a variety of tasks from vision-based robot learning benchmarks (James et al., 2020; Majumdar et al., 2023) and video label propagation benchmarks (Pont-Tuset et al., 2017; Zhou et al., 2018; Jhuang et al., 2013). |
| Dataset Splits | No | The paper mentions pre-training on Kinetics-400 and evaluating on various benchmarks (Cortex Bench, RLBench, Franka Kitchen, DAVIS, VIP, JHMDB) but does not provide specific train/validation/test split percentages or sample counts for these datasets that were used in their experiments. It states "We train the imitation learning agents using 100 demos for each task" but doesn't specify how validation was handled for these demos. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It only mentions general training setup like 'We train the imitation learning agents...' |
| Software Dependencies | No | The paper mentions using "AdamW optimizer (Loshchilov & Hutter, 2019)" and building the framework "upon the official implementation of MAE (He et al., 2022)" and provides "PyTorch-like Pseudocode." However, it does not specify version numbers for PyTorch, Python, or any other critical software libraries or dependencies required to replicate the experiments. |
| Experiment Setup | Yes | Pre-training For a fair comparison, we report all the experimental results using the ViT-S/16 model pre-trained on Kinetics-400 datasets (Kay et al., 2017) for 400 epochs. We use the repeated sampling of 2 and count the epochs as effective epochs (Hoffer et al., 2020; Feichtenhofer et al., 2022). For sampling frames xt and xt+k, we follow Gupta et al. (2023) that randomly samples k from 4 to 48. We implement our decoder block to sequentially have self-attention, cross-attention, and feedforward layers. For the MAE objective, we use a 75% masking ratio (He et al., 2022). We use AdamW optimizer (Loshchilov & Hutter, 2019) with a batch size of 1536. For all baselines, we use the default hyperparameters. We provide more details in Appendix A. (See also Table 5a: learning rate 1.5e-4, warmup epochs 40, batch size 1536, etc. A hedged sketch of the optimizer and frame-sampling setup follows below.) |
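
The Pseudocode row only names Algorithm 1. As a rough illustration of what such a stochastic frame-prediction step might look like, here is a minimal PyTorch-style sketch. All module and helper names (`encoder`, `prior_net`, `posterior_net`, `decoder`, `mae_decoder`, `patchify`) and the KL weight `beta` are illustrative assumptions, not the authors' released code; only the overall structure (prior/posterior over a latent, future-frame prediction, auxiliary MAE loss with a 75% masking ratio) follows the paper's description.

```python
# Hedged sketch of an RSP-style training step, loosely following Algorithm 1.
# Module names and the loss weighting `beta` are assumptions for illustration.
import torch
import torch.nn.functional as F


def rsp_training_step(x_t, x_tk, encoder, prior_net, posterior_net,
                      decoder, mae_decoder, patchify, beta=1e-4, mask_ratio=0.75):
    """One step of stochastic frame prediction with an auxiliary MAE objective."""
    # Encode the current frame x_t and the future frame x_{t+k} into token sequences.
    h_t = encoder(x_t)        # [B, N, D]
    h_tk = encoder(x_tk)      # [B, N, D]

    # Prior p(z | x_t) and posterior q(z | x_t, x_{t+k}), modeled here as diagonal Gaussians.
    prior_mu, prior_logvar = prior_net(h_t)
    post_mu, post_logvar = posterior_net(h_t, h_tk)

    # Reparameterized sample from the posterior.
    z = post_mu + torch.randn_like(post_mu) * (0.5 * post_logvar).exp()

    # Decode the future frame from current-frame tokens conditioned on z.
    pred_tk = decoder(h_t, z)
    recon_loss = F.mse_loss(pred_tk, patchify(x_tk))

    # KL(q || p) between two diagonal Gaussians, averaged over the batch.
    kl = 0.5 * (
        prior_logvar - post_logvar
        + (post_logvar.exp() + (post_mu - prior_mu) ** 2) / prior_logvar.exp()
        - 1.0
    ).sum(dim=-1).mean()

    # Auxiliary MAE objective: reconstruct the future frame from 25% visible patches.
    # `mae_decoder` is assumed to return per-patch predictions and a binary mask.
    mae_pred, mask = mae_decoder(x_tk, mask_ratio=mask_ratio)
    mae_loss = (((mae_pred - patchify(x_tk)) ** 2).mean(dim=-1) * mask).sum() / mask.sum()

    return recon_loss + beta * kl + mae_loss
```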
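
The Experiment Setup row reports the pre-training hyperparameters but not the surrounding code. The sketch below shows one plausible way to wire them up: frame-pair sampling with a random gap k in [4, 48], and AdamW with linear warmup followed by cosine decay. Only the values stated above (learning rate 1.5e-4, batch size 1536, 40 warmup epochs out of 400) come from the paper; the betas, weight decay, and the `sample_frame_pair` / `lr_at_epoch` helpers are assumptions.

```python
# Hedged sketch of the reported pre-training setup; helper names and the
# AdamW betas/weight decay are assumptions, not the official implementation.
import math
import random
import torch


def sample_frame_pair(video, min_gap=4, max_gap=48):
    """Sample (x_t, x_{t+k}) with k drawn uniformly from [min_gap, max_gap]."""
    k = random.randint(min_gap, max_gap)
    t = random.randint(0, len(video) - 1 - k)
    return video[t], video[t + k]


def build_optimizer(model, base_lr=1.5e-4, weight_decay=0.05):
    # AdamW as cited in the paper; betas/weight decay follow common MAE-style
    # defaults and are assumptions here.
    return torch.optim.AdamW(model.parameters(), lr=base_lr,
                             betas=(0.9, 0.95), weight_decay=weight_decay)


def lr_at_epoch(epoch, base_lr=1.5e-4, warmup_epochs=40, total_epochs=400, min_lr=0.0):
    """Linear warmup for 40 epochs, then cosine decay over the remaining epochs."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```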