Learning Reward for Robot Skills Using Large Language Models via Self-Alignment

Authors: Yuwei Zeng, Yao Mu, Lin Shao

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The method was validated on 9 tasks across 2 simulation environments. It demonstrates a consistent improvement in training efficacy and efficiency while consuming significantly fewer GPT tokens than the alternative mutation-based method.
Researcher Affiliation | Academia | ¹National University of Singapore, ²The University of Hong Kong.
Pseudocode | Yes | Algorithm 1: Self-Alignment Reward Update.
Open Source Code | No | The paper lists a project website ('Project website: https://sites.google.com/view/rewardselfalign'), but the website does not contain direct links to the source code for the methodology described in the paper, stating 'Contact for Code' instead.
Open Datasets | Yes | Following the objective, we evaluated our framework on 6 manipulation tasks in ManiSkill2 (Gu et al., 2023) as illustrated in Fig. 2. The tasks include rigid and articulated object manipulation with a fixed-base manipulator, and with single-arm and dual-arm mobile manipulators. We also compare training and token efficiency with the alternative unsupervised method proposed in Eureka for H3 on 3 tasks implemented with Isaac Gym (Makoviychuk et al., 2021).
Dataset Splits | No | The paper describes training policies in simulation environments and evaluating success rates over exploration steps. It does not provide explicit dataset splits (e.g., percentages or counts for training, validation, or testing sets) as is common in supervised learning contexts, because the data is generated dynamically through interaction with the environment.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU models, or memory specifications) used for running its experiments.
Software Dependencies | No | The paper mentions using 'stable-baselines3', 'rl_games', and 'APReL' for policy training and reward updates, and GPT-4 (API model name 'gpt-4-0613') for LLM interaction. However, it does not provide version numbers for the stable-baselines3, rl_games, or APReL libraries, which are required for a reproducible description of the ancillary software (see the version-capture sketch after this table).
Experiment Setup | Yes | For the reward update, we set the rationality coefficient β = 0.9 and give feedback every 10,000 training steps for ManiSkill2 and every 100 epochs for Isaac Gym, with M = 5 roll-out samples from the latest policy and N = 5 samples from the reward histogram. The training step equals the exploration step in our setup. For the reward update with the Metropolis-Hastings algorithm, our customized implementation is built upon APReL (Bıyık et al., 2022b). The burn-in period is 200 iterations and the number of samples is 100. The proposal distribution follows a Gaussian N(θ, 0.2) on normalized parameters and is then clipped to [0, 1].
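To make the Metropolis-Hastings reward update in the Experiment Setup row concrete, here is a minimal Python sketch, not the authors' APReL-based implementation. The Bradley-Terry-style preference likelihood, the feature dimensionality, and the helper names (preference_loglik, metropolis_hastings_update) are illustrative assumptions; the hyperparameters (β = 0.9, 200 burn-in iterations, 100 kept samples, Gaussian proposal N(θ, 0.2) clipped to [0, 1]) follow the paper's description.

```python
# Minimal sketch (not the authors' code) of a Metropolis-Hastings reward
# update over preference feedback. Likelihood form, feature dimension, and
# helper names are assumptions; hyperparameters follow the paper.
import numpy as np

BETA = 0.9          # rationality coefficient from the paper
BURN_IN = 200       # burn-in iterations
NUM_SAMPLES = 100   # posterior samples kept after burn-in
PROPOSAL_STD = 0.2  # std of the Gaussian proposal on normalized parameters
FEATURE_DIM = 8     # illustrative number of reward features (assumption)

rng = np.random.default_rng(0)


def preference_loglik(theta, preferences):
    """Bradley-Terry style log-likelihood of pairwise preferences.

    `preferences` is a list of (phi_winner, phi_loser) feature vectors;
    this likelihood form is a common choice, assumed here for illustration.
    """
    total = 0.0
    for phi_w, phi_l in preferences:
        diff = BETA * (theta @ phi_w - theta @ phi_l)
        total += -np.log1p(np.exp(-diff))  # log sigmoid(diff)
    return total


def metropolis_hastings_update(preferences, theta_init=None):
    """Sample reward parameters from the preference posterior via MH."""
    theta = (np.full(FEATURE_DIM, 0.5) if theta_init is None
             else np.asarray(theta_init, dtype=float))
    loglik = preference_loglik(theta, preferences)
    samples = []
    for it in range(BURN_IN + NUM_SAMPLES):
        # Gaussian proposal around the current parameters, clipped to [0, 1].
        proposal = np.clip(theta + rng.normal(0.0, PROPOSAL_STD, FEATURE_DIM),
                           0.0, 1.0)
        proposal_loglik = preference_loglik(proposal, preferences)
        # Accept with the standard MH ratio (uniform prior over [0, 1]^d).
        if np.log(rng.uniform()) < proposal_loglik - loglik:
            theta, loglik = proposal, proposal_loglik
        if it >= BURN_IN:
            samples.append(theta.copy())
    return np.mean(samples, axis=0), samples


# Synthetic pairwise preferences, purely illustrative; in the paper's setting
# the pairs would presumably come from feedback over the M roll-out samples.
prefs = [(rng.uniform(size=FEATURE_DIM), rng.uniform(size=FEATURE_DIM))
         for _ in range(20)]
theta_hat, _ = metropolis_hastings_update(prefs)
print("posterior-mean reward weights:", theta_hat.round(3))
```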
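Since the Software Dependencies row notes that no version numbers are reported, the sketch below shows one way the missing environment information could be captured. The PyPI distribution names ('stable-baselines3', 'rl-games', 'aprel', 'openai') are assumptions about how the named dependencies are packaged; the paper names the libraries but not how they were installed or which versions were used.

```python
# Sketch of recording the library versions missing from the paper's
# reproducibility description. The distribution names are assumptions;
# adjust them to the actual environment being documented.
from importlib import metadata

for dist in ("stable-baselines3", "rl-games", "aprel", "openai"):
    try:
        print(f"{dist}=={metadata.version(dist)}")
    except metadata.PackageNotFoundError:
        print(f"{dist}: not installed")
```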