LIV: Language-Image Representations and Rewards for Robotic Control

Authors: Yecheng Jason Ma, Vikash Kumar, Amy Zhang, Osbert Bastani, Dinesh Jayaraman

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform extensive experimental evaluations on several simulated and real-world household robotic manipulation settings. Our experiments evaluate LIV vision-language representations not only in their capacity as input state representations for language-conditioned behavior cloning of task policies, but also to directly ground language-based task specifications into visual state-based rewards for robot trajectory optimization (see the reward sketch after this table). In many cases, the pre-trained LIV model, without ever seeing robots in its pre-training human video dataset, can zero-shot produce dense language-conditioned reward on unseen robot videos.
Researcher Affiliation | Collaboration | University of Pennsylvania and Meta AI.
Pseudocode | Yes | Pseudocode is presented in Algorithm 1.
Open Source Code | Yes | The LIV model and training code are released at github.com/penn-pal-lab/LIV (a hypothetical usage sketch follows this table).
Open Datasets | Yes | We pre-train LIV on Epic Kitchen (Damen et al., 2018), a text-annotated ego-centric video dataset of humans completing tasks in diverse household kitchens; this dataset consists of 90k video segments, totalling 20M frames and 20k unique text annotations, and offers diverse camera views and action-centric videos, making it an ideal choice for vision-language pre-training.
Dataset Splits | No | The paper mentions evaluating on the 'test split' of Epic Kitchen and using the 'best training checkpoints', which implies a validation procedure, but it does not specify how validation sets were constructed (e.g., split percentages, sample counts, or an explicit splitting methodology) for the datasets used in its experiments.
Hardware Specification | Yes | The pre-training takes place on a node of 8 NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions using the CLIP architecture, a ResNet-50 image encoder, the CLIP Transformer text encoder, and the Adam optimizer, but it does not specify version numbers for any software libraries or frameworks (e.g., PyTorch, TensorFlow) or for the CLIP implementation itself (see the dependency sketch after this table).
Experiment Setup | Yes | Table 2: VIP Architecture & Pre-Training Hyperparameters; Table 4: LCBC Hyperparameters.
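
To make the reward-grounding claim in the Research Type row concrete, below is a minimal sketch of turning shared vision-language embeddings into a dense, language-conditioned reward signal. The `language_conditioned_rewards` helper and the difference-of-similarities shaping are illustrative assumptions; the paper's exact reward definition may differ.

```python
import torch
import torch.nn.functional as F

def language_conditioned_rewards(frame_embs: torch.Tensor,
                                 text_emb: torch.Tensor) -> torch.Tensor:
    """Dense per-frame rewards from embedding similarity.

    frame_embs: (T, D) embeddings of the T frames of a robot video.
    text_emb:   (D,)   embedding of the language task description.
    Both are assumed to live in a shared vision-language embedding space
    such as the one LIV pre-trains.
    """
    # Cosine similarity between every frame and the task description.
    sims = F.cosine_similarity(frame_embs, text_emb.unsqueeze(0), dim=-1)  # (T,)
    # One common shaping choice: reward the increase in similarity between
    # consecutive frames (potential-based shaping). This is an assumption,
    # not necessarily the exact formulation used in the paper.
    return sims[1:] - sims[:-1]  # (T-1,)

# Illustrative usage with random tensors standing in for LIV outputs.
T, D = 50, 1024
frame_embs = torch.randn(T, D)
text_emb = torch.randn(D)
print(language_conditioned_rewards(frame_embs, text_emb).shape)  # torch.Size([49])
```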
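
For the released repository noted in the Open Source Code row, a hypothetical usage sketch follows. The `load_liv` import path, the `modality` keyword, and the tokenization step are assumptions modeled on the authors' earlier VIP release and may not match the repository's actual API.

```python
# Hypothetical usage sketch: load_liv() and the modality keyword are
# assumptions about the repo's interface, not a confirmed API.
import torch
import clip
from liv import load_liv  # assumed to be installable from github.com/penn-pal-lab/LIV

liv = load_liv()  # assumed helper that downloads/loads the pre-trained checkpoint
liv.eval()

frames = torch.zeros(1, 3, 224, 224)                 # a preprocessed RGB frame
tokens = clip.tokenize(["open the microwave door"])  # task description
with torch.no_grad():
    img_emb = liv(input=frames, modality="vision")   # assumed call signature
    txt_emb = liv(input=tokens, modality="text")     # assumed call signature
```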
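
For the Software Dependencies row, the sketch below shows one way to instantiate the kind of backbone the paper names (a CLIP model with a ResNet-50 image encoder and Transformer text encoder, trained with Adam) while recording the library versions the paper omits. It uses the open-source OpenAI CLIP package; the learning rate and dummy inputs are placeholders, not the paper's values.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

# Record the framework versions that the paper leaves unspecified.
print("torch:", torch.__version__, "| cuda:", torch.version.cuda)

device = "cuda" if torch.cuda.is_available() else "cpu"
# "RN50" pairs a ResNet-50 image encoder with the CLIP Transformer text encoder.
model, preprocess = clip.load("RN50", device=device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # placeholder learning rate

image = torch.zeros(1, 3, 224, 224, device=device)         # preprocessed dummy frame
tokens = clip.tokenize(["open the microwave door"]).to(device)
with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(tokens)
print(img_emb.shape, txt_emb.shape)  # both project to the shared 1024-d space for RN50
```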