LIV: Language-Image Representations and Rewards for Robotic Control
Authors: Yecheng Jason Ma, Vikash Kumar, Amy Zhang, Osbert Bastani, Dinesh Jayaraman
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform extensive experimental evaluations on several simulated and real-world household robotic manipulation settings. Our experiments evaluate LIV vision-language representations not only in their capacity as input state representations for language-conditioned behavior cloning of task policies, but also to directly ground language-based task specifications into visual state-based rewards for robot trajectory optimization. In many cases, the pre-trained LIV model, without ever seeing robots in its pre-training human video dataset, can zero-shot produce dense language-conditioned reward on unseen robot videos. |
| Researcher Affiliation | Collaboration | University of Pennsylvania; Meta AI. |
| Pseudocode | Yes | Pseudocode is presented in Algorithm 1. |
| Open Source Code | Yes | LIV model and training code are released: github.com/penn-pal-lab/LIV |
| Open Datasets | Yes | We pre-train LIV on Epic Kitchen (Damen et al., 2018), a text-annotated ego-centric video dataset of humans completing tasks in diverse household kitchens; this dataset consists of 90k video segments, totalling 20M frames and 20k unique text annotations, and offers diverse camera views and action-centric videos, making it an ideal choice for vision-language pre-training. |
| Dataset Splits | No | The paper mentions evaluating on the 'test split' of Epic Kitchen and using 'best training checkpoints', which implies a validation procedure, but it does not specify how validation sets were constructed (e.g., percentages, sample counts, or an explicit splitting methodology) for the experiments across the datasets used. |
| Hardware Specification | Yes | The pre-training takes place on a node of 8 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions using the 'CLIP architecture', 'ResNet50', 'CLIP Transformer', and the 'Adam' optimizer, but it does not specify version numbers for any software libraries or frameworks (e.g., PyTorch, TensorFlow), nor for the CLIP implementation itself. |
| Experiment Setup | Yes | Table 2. VIP Architecture & Pre-Training Hyperparameters. Table 4. LCBC Hyperparameters. |
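The Experiment Setup and Research Type rows above refer to language-conditioned behavior cloning (LCBC) with the pre-trained vision-language representation as the input state. The sketch below illustrates that setup only in broad strokes: a small MLP policy head over frozen image and text embeddings, trained with a mean-squared-error behavior cloning loss. The embedding dimension, network sizes, and class names are illustrative assumptions, not the configuration from Table 4 of the paper.

```python
# Illustrative sketch (assumed architecture, not the paper's exact LCBC setup):
# a policy head over frozen, pre-trained image/language embeddings.
import torch
import torch.nn as nn

class LCBCPolicy(nn.Module):
    def __init__(self, emb_dim: int = 1024, action_dim: int = 7):
        super().__init__()
        # Policy head on top of concatenated (image, language) embeddings.
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # Condition the policy on the language goal by concatenation.
        return self.net(torch.cat([img_emb, txt_emb], dim=-1))

def bc_loss(policy: LCBCPolicy, img_emb, txt_emb, expert_actions):
    # Behavior cloning step: regress demonstrated actions from frozen embeddings.
    return nn.functional.mse_loss(policy(img_emb, txt_emb), expert_actions)
```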
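The Research Type row also notes that LIV grounds language-based task specifications into dense, visual state-based rewards, zero-shot on unseen robot videos. As an illustration only (not the authors' released code), the sketch below scores each frame of a video against a text goal by cosine similarity in a shared vision-language embedding space; the encoder arguments and function name are hypothetical stand-ins, and the released repository (github.com/penn-pal-lab/LIV) provides the actual pre-trained model and reward computation.

```python
# Illustrative sketch of grounding a language goal into a dense per-frame
# reward via embedding similarity. The encoders are assumed stand-ins for a
# pre-trained vision-language model such as LIV.
import torch
import torch.nn.functional as F

def language_conditioned_rewards(frames, goal_text, image_encoder, text_encoder):
    """Score each video frame against a text goal in a shared embedding space.

    frames:       (T, C, H, W) tensor of RGB frames from a robot video
    goal_text:    tokenized goal description, e.g. "open the drawer"
    image_encoder / text_encoder: modules mapping inputs to a shared D-dim space
    Returns a (T,) tensor of dense rewards (higher = closer to the goal).
    """
    with torch.no_grad():
        img_emb = F.normalize(image_encoder(frames), dim=-1)    # (T, D)
        txt_emb = F.normalize(text_encoder(goal_text), dim=-1)  # (1, D)
    # Cosine similarity between every frame and the goal text serves as a
    # dense reward; a trajectory making progress toward the described goal
    # should see this value rise over time.
    return (img_emb @ txt_emb.T).squeeze(-1)
```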