What Matters to You? Towards Visual Representation Alignment for Robot Learning

Authors: Thomas Tian, Chenfeng Xu, Masayoshi Tomizuka, Jitendra Malik, Andrea Bajcsy

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Across experiments in X-MAGICAL and in robotic manipulation, we find that RAPL's reward consistently generates preferred robot behaviors with high sample efficiency, and shows strong zero-shot generalization when the visual representation is learned from a different embodiment than the robot's.
Researcher Affiliation | Academia | Ran Tian¹, Chenfeng Xu¹, Masayoshi Tomizuka¹, Jitendra Malik¹, Andrea Bajcsy² (¹UC Berkeley, ²Carnegie Mellon University)
Pseudocode | No | The paper describes methodological steps in prose, but does not include any formally labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | No | The paper does not include an explicit statement about releasing the source code for the methodology described, nor does it provide a link to a code repository.
Open Datasets | Yes | We first experiment in the toy X-Magical environment (Zakka et al., 2022), and then move to the realistic Isaac Gym simulator.
Dataset Splits | No | The paper mentions running '5 trials with different random seeds' but does not specify explicit training, validation, and test dataset splits with percentages or sample counts.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU, CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions various software components and models such as 'Soft Actor-Critic', 'ResNet-18', and 'Isaac Gym', but does not specify their version numbers or any other software dependencies with version information.
Experiment Setup | Yes | For all policy learning experiments, we use 10 expert demonstrations as the demonstration set D⁺ for generating the reward (more details in Appendix A.3). We use the same preference dataset with 150 triplets for training RLHF and RAPL. We use the same setup as in (Zakka et al., 2022) with the ResNet-18 backbone pre-trained on ImageNet. The original classification head is replaced with a linear layer that outputs a 32-dimensional vector as our embedding space, Φ_R := ℝ^32. The TCC representation model is trained with 500 demonstrations using the code from (Zakka et al., 2022). Both RAPL and RLHF only fine-tune the last linear layer. All representation models are frozen during policy learning.
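Since the paper releases no code, the following is only a minimal sketch, assuming PyTorch/torchvision, of the visual encoder configuration quoted above: an ImageNet-pretrained ResNet-18 whose classification head is swapped for a 32-dimensional linear embedding (Φ_R := ℝ^32), with every parameter except that final linear layer frozen. The function name `build_rapl_style_encoder` is hypothetical, not from the paper.

```python
# Hedged sketch (not the authors' released code), assuming PyTorch/torchvision.
import torch
import torch.nn as nn
from torchvision.models import resnet18

def build_rapl_style_encoder(embed_dim: int = 32) -> nn.Module:
    """ResNet-18 pre-trained on ImageNet, with its classification head replaced
    by a linear layer mapping into a 32-dimensional embedding space."""
    backbone = resnet18(weights="IMAGENET1K_V1")                  # ImageNet-pretrained backbone
    backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)   # replace the original head

    # Per the quoted setup, only the last linear layer is fine-tuned;
    # all other backbone parameters stay frozen.
    for name, param in backbone.named_parameters():
        param.requires_grad = name.startswith("fc.")
    return backbone

# Example usage: the encoder is frozen at policy-learning time, so embeddings
# are computed without gradients.
encoder = build_rapl_style_encoder()
with torch.no_grad():
    z = encoder(torch.randn(1, 3, 224, 224))   # -> tensor of shape (1, 32)
print(z.shape)
```

Freezing everything but the final layer keeps the fine-tuning step cheap and matches the report's note that all representation models are held fixed during policy learning; the 32-dimensional output size follows the embedding dimension stated in the quoted setup.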