A Control-Centric Benchmark for Video Prediction
Authors: Stephen Tian, Chelsea Finn, Jiajun Wu
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find empirically that for planning robotic manipulation, existing metrics can be unreliable at predicting execution success. To address this, we propose a benchmark for action-conditioned video prediction in the form of a control benchmark that evaluates a given model for simulated robotic manipulation through sampling-based planning. Our benchmark, Video Prediction for Visual Planning (VP2), includes simulated environments with 11 task categories and 310 task instance definitions, a full planning implementation, and training datasets containing scripted interaction trajectories for each task category. |
| Researcher Affiliation | Academia | Stephen Tian, Chelsea Finn, & Jiajun Wu, Stanford University |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | We will open source the code and environments in the benchmark in an easy-to-use interface, in hopes that it will help drive research in video prediction for downstream control applications. |
| Open Datasets | Yes | Each environment in VP2 comes with datasets for video prediction model training. Each training dataset consists of trajectories with 35 timesteps, each containing 256 × 256 RGB image observations and the action taken at each step. Specifics for each environment dataset are as follows, with additional details in Appendix D: robosuite Tabletop environment: We include 50K trajectories of interactions collected with a hand-scripted policy to push a random object in the environment in a random direction. Object textures are randomized in each trajectory. RoboDesk environment: For each task instance, we include 5K trajectories collected with a hand-scripted policy, for a total of 35K trajectories. To encourage the dataset to contain trajectories of varying success rates, we apply independent Gaussian noise to each dimension of every action from the scripted policy before executing it. We have also released the training datasets and pre-trained cost function weights at https://github.com/s-tian/vp2. (See the data-collection and trajectory-loading sketches below the table.) |
| Dataset Splits | Yes | Currently, models are first trained on video datasets widely adopted by the community (Ionescu et al., 2014; Geiger et al., 2013; Dasari et al., 2019) and then evaluated on held-out test sets using a variety of perceptual metrics. Those include metrics developed for image and video comparisons (Wang et al., 2004), as well as recently introduced deep perceptual metrics (Zhang et al., 2018; Unterthiner et al., 2018). However, it is an open question whether perceptual metrics are predictive of other qualities, such as planning abilities for an embodied agent. In this work, we take a step towards answering this question for one specific situation: how can we compare action-conditioned video prediction models in downstream robotic control? We propose a benchmark for video prediction that is centered around robotic manipulation performance. Our benchmark, which we call the Video Prediction for Visual Planning Benchmark (VP2), evaluates predictive models on manipulation planning performance by standardizing all elements of a control setup except the video predictor. It includes simulated environments, specific start/goal task instance specifications, training datasets of noisy expert video interaction data, and a fully configured model-based control algorithm. For control, our benchmark uses visual foresight (Finn & Levine, 2017; Ebert et al., 2018), a model-predictive control method previously applied to robotic manipulation. Visual foresight performs planning towards a specified goal by leveraging a video prediction model to simulate candidate action sequences and then scoring them based on the similarity between their predicted futures and the goal. After optimizing with respect to the score (Rubinstein, 1999; de Boer et al., 2005; Williams et al., 2016), the best action sequence is executed for a single step, and replanning is performed at each step (see the planning-loop sketch below the table). This is a natural choice for our benchmark for two reasons: first, it is goal-directed, enabling a single model to be evaluated on many tasks, and second, it interfaces with models only by calling forward prediction, which avoids prescribing any particular model class or architecture. The main contribution of this work is a set of benchmark environments, training datasets, and control algorithms to isolate and evaluate the effects of prediction models on simulated robotic manipulation performance. Specifically, we include two simulated multi-task robotic manipulation settings with a total of 310 task instance definitions, datasets containing 5000 noisy expert demonstration trajectories for each of 11 tasks, and a modular and lightweight implementation of visual foresight. Figure 1: Models that score well on perceptual metrics may generate crisp but physically infeasible predictions that lead to planning failures. Here, Model A predicts that the slide will move on its own. Through our experiments, we find that models that score well on frequently used metrics can suffer when used in the context of control, as shown in Figure 1. Then, to explore how we can develop better models for control, we leverage our benchmark to analyze other questions such as the effects of model size, data quantity, and modeling uncertainty. We empirically test recent video prediction models, including recurrent variational models as well as a diffusion modeling approach.
We will open source the code and environments in the benchmark in an easy-to-use interface, in hopes that it will help drive research in video prediction for downstream control applications. 2 RELATED WORK Evaluating video prediction models. Numerous evaluation procedures have been proposed for video prediction models. One widely adopted approach is to train models on standardized datasets (Geiger et al., 2013; Ionescu et al., 2014; Srivastava et al., 2015; Cordts et al., 2016; Finn et al., 2016; Dasari et al., 2019) and then compare predictions to ground truth samples across several metrics on a held-out test set. These metrics include several image metrics adapted to the video case, such as the widely used ℓ2 per-pixel Euclidean distance and peak signal-to-noise ratio (PSNR). Image metrics developed to correlate more specifically with human perceptual judgments include structural similarity (SSIM) (Wang et al., 2004), as well as recently introduced deep perceptual metrics (Zhang et al., 2018; Unterthiner et al., 2018). FVD (Unterthiner et al., 2018) extends FID to the video domain via a pre-trained 3D convolutional network. While these metrics have been shown to correlate well with human perception, it is not clear whether they are indicative of performance on control tasks. Geng et al. (2022) develop correspondence-wise prediction losses, which use optical flow estimates to make losses robust to positional errors. These losses may improve control performance and are an orthogonal direction to our benchmark. Another class of evaluation methods judges a model's ability to make predictions about the outcomes of particular physical events, such as whether objects will collide or fall over (Sanborn et al., 2013; Battaglia et al., 2013; Bear et al., 2021). This excludes potentially extraneous information from the rest of the frame. Our benchmark similarly measures only task-relevant components of predicted videos, but does so through the lens of overall control success rather than hand-specified questions. Oh et al. (2015) evaluate action-conditioned video prediction models on Atari games by training a Q-learning agent using predicted data. We evaluate learned models for planning rather than policy learning, and extend our evaluation to robotic manipulation domains. Table 1 (perceptual metrics and control performance for models trained using an MSE objective, as well as with added perceptual losses; entries are FVD / LPIPS* / SSIM / control success): (a) robosuite pushing tasks: FitVid MSE 30.7 / 3.4 / 87.8 / 65%, +LPIPS=1 18.0 / 2.8 / 89.3 / 67%, +LPIPS=10 24.3 / 4.1 / 84.6 / 35%; SVG′ MSE 51.7 / 5.1 / 82.7 / 80%, +LPIPS=1 40.7 / 4.4 / 83.2 / 80%, +LPIPS=10 45.1 / 4.8 / 81.8 / 37%. (b) RoboDesk, push red button: FitVid MSE 9.0 / 0.62 / 97.4 / 58%, +LPIPS=1 5.9 / 0.63 / 97.5 / 82%, +LPIPS=10 6.8 / 0.70 / 97.3 / 32%; SVG′ MSE 10.6 / 0.97 / 95.3 / 70%, +LPIPS=1 7.2 / 0.89 / 95.5 / 73%, +LPIPS=10 24.2 / 1.1 / 94.0 / 10%. (c) RoboDesk, upright block off table: FitVid MSE 20.5 / 1.25 / 94.4 / 50%, +LPIPS=1 9.8 / 1.26 / 93.3 / 75%, +LPIPS=10 7.3 / 1.30 / 92.8 / 83%; SVG′ MSE 18.3 / 1.68 / 91.3 / 47%, +LPIPS=1 11.0 / 1.58 / 90.9 / 68%, +LPIPS=10 18.9 / 1.76 / 90.4 / 20%. (d) RoboDesk, open slide: FitVid MSE 15.1 / 1.08 / 95.8 / 38%, +LPIPS=1 10.2 / 1.08 / 94.9 / 36%, +LPIPS=10 9.8 / 1.39 / 93.6 / 13%; SVG′ MSE 22.5 / 1.88 / 90.6 / 58%, +LPIPS=1 4.9 / 2.06 / 89.7 / 10%, +LPIPS=10 22.6 / 2.48 / 88.2 / 10%.
For each metric, the bolded number shows the best value for that task. *LPIPS scores are scaled by 100 for convenient display. Full results can be found in Appendix G. Benchmarks for model-based and offline RL. Many works in model-based reinforcement learning evaluate on simulated RL benchmarks (Brockman et al., 2016; Tassa et al., 2018; Ha & Schmidhuber, 2018; Rajeswaran et al., 2018; Yu et al., 2019; Ahn et al., 2019; Zhu et al., 2020; Kannan et al., 2021), while real-world evaluation setups are often unstandardized. Offline RL and imitation learning benchmarks (Zhu et al., 2020; Fu et al., 2020; Gulcehre et al., 2020; Lu et al., 2022) provide training datasets along with environments. Our benchmark includes environments based on the infrastructure of robosuite (Zhu et al., 2020) and RoboDesk (Kannan et al., 2021), but it further includes task specifications in the form of goal images, cost functions for planning, as well as implementations of planning algorithms. Additionally, offline RL benchmarks mostly analyze model-free algorithms, while in this paper we focus on model-based methods. Because planning using video prediction models is sensitive to details such as control frequency, planning horizon, and cost function, our benchmark supplies all aspects other than the predictive model itself. 3 THE MISMATCH BETWEEN PERCEPTUAL METRICS AND CONTROL In this section, we present a case study that analyzes whether existing metrics for video prediction are indicative of performance on downstream control tasks. We focus on two variational video prediction models that have competitive prediction performance and are fast enough for planning: FitVid (Babaeizadeh et al., 2021) and the modified version of the SVG model (Denton & Fergus, 2018) introduced by Villegas et al. (2019), which contains convolutional as opposed to fully-connected LSTM cells and uses the first four blocks of VGG19 as the encoder/decoder architecture. We denote this model as SVG′. We perform experiments on two tabletop manipulation environments, robosuite and RoboDesk, which each admit multiple potential downstream task goals. Additional environment details are in Section 4. When selecting models to analyze, our goal is to train models that have varying performance on existing metrics. One strategy for learning models that align better with human perceptions of realism is to add auxiliary perceptual losses such as LPIPS (Zhang et al., 2018). Thus, for each environment, we train three variants of both the FitVid and SVG′ video prediction models. One variant is trained with a standard pixel-wise ℓ2 reconstruction loss (MSE), while the other two are trained using an additional perceptual loss in the form of adding the LPIPS score with VGG features implemented by Kastryulin et al. (2022) between the predicted and ground truth images at weightings 1 and 10 (see the training-loss sketch below the table). We train each model for 150K gradient steps. We then evaluate each model in terms of FVD (Unterthiner et al., 2018), LPIPS (Zhang et al., 2018), and SSIM (Wang et al., 2004) on held-out validation sets, as well as planning performance on robotic manipulation via visual foresight (Finn & Levine, 2017; Ebert et al., 2018). |
| Hardware Specification | Yes | We use a batch size of 200 samples and one NVIDIA Titan RTX GPU. |
| Software Dependencies | No | No specific version numbers for software dependencies were mentioned, only names like "PyTorch". |
| Experiment Setup | Yes | We train FitVid at 64 × 64 image resolution to predict 10 future frames given 2 context frames. The training hyperparameters are shown in Table 3. Table 9 details the hyperparameters that we use for planning for each task category. The architecture is described in Table 10. |
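
The noisy scripted data collection described in the Open Datasets row can be illustrated with a short sketch. This is an assumption-heavy illustration, not the benchmark's collection script: `env`, `scripted_policy`, and the observation format are placeholders; the only details taken from the paper are the 35-step trajectory length and the independent Gaussian noise applied to each action dimension before execution.

```python
# Hypothetical data-collection loop; `env` and `scripted_policy` are placeholders.
import numpy as np


def collect_noisy_trajectory(env, scripted_policy, horizon=35, noise_std=0.1):
    obs = env.reset()                      # assumed to return a 256x256 RGB image
    images, actions = [], []
    for _ in range(horizon):
        action = scripted_policy(obs)
        # Perturb every action dimension with independent Gaussian noise so that
        # the resulting trajectories vary in how successful they are.
        action = action + np.random.normal(0.0, noise_std, size=action.shape)
        obs = env.step(action)             # assumed to return the next image observation
        images.append(obs)
        actions.append(action)
    return np.stack(images), np.stack(actions)
```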
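
Each training dataset consists of 35-step trajectories of 256 × 256 RGB observations paired with per-step actions. Below is a minimal loading sketch that assumes each trajectory is stored as an `.npz` file with `images` and `actions` arrays; the actual on-disk format and loaders in https://github.com/s-tian/vp2 may differ.

```python
import glob
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class TrajectoryDataset(Dataset):
    """Loads (video, actions) windows from trajectories stored as .npz files."""

    def __init__(self, root, context_frames=2, future_frames=10):
        self.paths = sorted(glob.glob(f"{root}/*.npz"))
        self.seq_len = context_frames + future_frames

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        traj = np.load(self.paths[idx])
        images = traj["images"]      # (35, 256, 256, 3) uint8, assumed key
        actions = traj["actions"]    # (35, action_dim) float, assumed key
        # Sample a random window of context + future frames from the trajectory.
        start = np.random.randint(0, len(images) - self.seq_len + 1)
        clip = images[start:start + self.seq_len]
        acts = actions[start:start + self.seq_len]
        video = torch.from_numpy(clip).permute(0, 3, 1, 2).float() / 255.0
        return video, torch.from_numpy(acts).float()


# Example usage (path is illustrative):
# loader = DataLoader(TrajectoryDataset("data/robosuite"), batch_size=16, shuffle=True)
```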
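
For control, the benchmark's visual foresight planner scores sampled action sequences by rolling them out with the video prediction model, comparing predicted futures against a goal image, refitting the sampling distribution, executing only the first action, and replanning at every step. The sketch below shows that loop with a cross-entropy-method style update; `predict_fn`, `cost_fn`, and all hyperparameters are illustrative stand-ins, not the benchmark's configured planner.

```python
import numpy as np


def plan_action(context_frames, goal_image, predict_fn, cost_fn,
                horizon=10, action_dim=4, samples=200, elites=20, iters=3):
    """One step of sampling-based visual MPC: returns the first action to execute."""
    mean = np.zeros((horizon, action_dim))
    std = 0.5 * np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        candidates = mean + std * np.random.randn(samples, horizon, action_dim)
        # Roll out the video prediction model and score futures against the goal.
        futures = predict_fn(context_frames, candidates)   # (samples, horizon, H, W, 3)
        costs = np.array([cost_fn(f, goal_image) for f in futures])
        # Refit the sampling distribution to the lowest-cost (elite) sequences.
        elite = candidates[np.argsort(costs)[:elites]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean[0]  # execute a single action, then replan from the new observation


# Example with stub predictor and cost (replace with a real model and goal cost):
predict_stub = lambda ctx, acts: np.random.rand(len(acts), 10, 64, 64, 3)
cost_stub = lambda future, goal: float(np.mean((future[-1] - goal) ** 2))
first_action = plan_action(np.zeros((2, 64, 64, 3)), np.zeros((64, 64, 3)),
                           predict_stub, cost_stub)
```

Because the planner interacts with the model only through forward prediction, any architecture that exposes an action-conditioned rollout can be plugged in, which is the property the benchmark relies on to compare models.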
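
The case study in the Dataset Splits row trains each model with a pixel-wise MSE loss, optionally adding an LPIPS perceptual term at weight 1 or 10. A minimal sketch of such a combined objective follows, using the `lpips` package with VGG features (the paper uses the piq implementation of Kastryulin et al., 2022); `model` is a hypothetical stand-in for FitVid or SVG′ that predicts 10 future frames from 2 context frames at 64 × 64 resolution.

```python
import torch.nn.functional as F
import lpips

# LPIPS with VGG features; expects 4D image batches scaled to [-1, 1].
lpips_fn = lpips.LPIPS(net="vgg")


def prediction_loss(model, video, actions, context=2, future=10, lpips_weight=1.0):
    # video: (B, context + future, 3, 64, 64) in [0, 1]; actions: (B, context + future, action_dim)
    target = video[:, context:context + future]
    pred = model(video[:, :context], actions)   # assumed interface: context frames + actions -> futures
    mse = F.mse_loss(pred, target)
    # Merge batch and time dimensions before computing the perceptual term.
    perceptual = lpips_fn(pred.flatten(0, 1) * 2 - 1,
                          target.flatten(0, 1) * 2 - 1).mean()
    return mse + lpips_weight * perceptual
```

Setting `lpips_weight` to 0, 1, or 10 corresponds to the three training variants compared in Table 1; in the paper's setup each variant is trained for 150K gradient steps.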