Guide Your Agent with Adaptive Multimodal Rewards

Authors: Changyeon Kim, Younggyo Seo, Hao Liu, Lisa Lee, Jinwoo Shin, Honglak Lee, Kimin Lee

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We design our experiments to investigate the following questions: 1. Can our method prevent agents from pursuing undesired goals in test environments? (see Section 4.1 and Section 4.2) 2. Can ARP follow unseen text instructions? (see Section 4.3) 3. Is ARP comparable to a goal image-conditioned policy? (see Section 4.4) 4. Can ARP induce well-aligned representations in test environments? (see Section 4.5) 5. What is the effect of each component in our framework? (see Section 4.6)
Researcher Affiliation | Collaboration | Changyeon Kim (KAIST), Younggyo Seo (Dyson Robot Learning Lab), Hao Liu (UC Berkeley), Lisa Lee (Google DeepMind), Jinwoo Shin (KAIST), Honglak Lee (University of Michigan; LG AI Research), Kimin Lee (KAIST)
Pseudocode | No | The paper describes its models and architectures using mathematical equations and textual descriptions but does not include any explicitly labeled “Pseudocode” or “Algorithm” blocks.
Open Source Code | Yes | Source code and expert demonstrations used for our experiments are available at https://github.com/csmile-1006/ARP.git.
Open Datasets | Yes | We evaluate our method on three different environments proposed in Di Langosco et al. [15], which are variants derived from the OpenAI Procgen benchmark [10]. We also demonstrate the effectiveness of our framework on RLBench [34], which serves as a standard benchmark for vision-based robotic manipulation. Our code and datasets are available at https://github.com/csmile-1006/ARP.git.
Dataset Splits | No | The paper mentions using the “lowest validation loss” during CLIP fine-tuning, but it does not specify explicit training/validation/test dataset splits (e.g., percentages or counts) for the main experiments.
Hardware Specification | Yes | We use 24 CPU cores (Intel Xeon CPU @ 2.2GHz) and 2 GPUs (NVIDIA A100 40GB GPU) for training the return-conditioned policy.
Software Dependencies | No | The paper mentions using the “open-sourced pre-trained CLIP model” and the Hugging Face transformers library, but it does not specify exact version numbers for these or for other key software dependencies such as Python or PyTorch.
Experiment Setup | Yes | All models are trained for 50 epochs on two GPUs with a batch size of 64 and a context length of 4. Reported policy hyperparameters: batch size 64; epochs 50; context length 4; learning rate 0.0005; optimizer AdamW [47] with momentum β1 = 0.9, β2 = 0.999; weight decay 0.00005.
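
The sketches that follow are illustrative additions tied to the rows above; none of them is code from the ARP repository. First, since the paper contains no pseudocode or algorithm block, here is a minimal sketch of the multimodal reward it describes: the cosine similarity between a CLIP embedding of the visual observation and a CLIP embedding of the text instruction. The checkpoint name, the function signature, and the use of the Hugging Face transformers API are assumptions, not the authors' implementation.

```python
# Minimal sketch of a CLIP-similarity multimodal reward (illustrative, not the authors' code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the paper's CLIP fine-tuning step is omitted here.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def multimodal_reward(observation: Image.Image, instruction: str) -> float:
    """Cosine similarity between the visual observation and the text instruction."""
    inputs = processor(text=[instruction], images=observation,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    # Normalize so the dot product equals cosine similarity.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb * text_emb).sum(dim=-1).item()
```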
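
For the Open Datasets row: the test environments are modified Procgen variants from Di Langosco et al. [15] and therefore require the corresponding released code. Purely as a point of reference, a stock Procgen environment can be created through gym as sketched below; the environment ID and arguments are standard illustrative values, not the paper's variants.

```python
# Stock OpenAI Procgen environment via gym (reference only; the paper uses
# modified goal-misgeneralization variants, not this standard setup).
import gym

env = gym.make("procgen:procgen-coinrun-v0",
               num_levels=200, start_level=0, distribution_mode="hard")
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
env.close()
```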
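
For the Software Dependencies row: because exact versions are not reported, a reproduction will need to record its own environment. The snippet below logs installed versions; the package names are an assumption about the underlying stack.

```python
# Log installed versions of the (assumed) core dependencies, since the paper does not pin them.
from importlib.metadata import PackageNotFoundError, version

for pkg in ["torch", "transformers", "gym"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```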
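
For the Experiment Setup row: a minimal sketch of wiring the reported optimizer settings into PyTorch's AdamW. Only the numeric values come from the paper; the placeholder policy module and variable names are illustrative.

```python
# Reported policy hyperparameters plugged into AdamW (placeholder network, illustrative only).
import torch

policy_hparams = {
    "batch_size": 64,
    "epochs": 50,
    "context_length": 4,
    "learning_rate": 5e-4,
    "betas": (0.9, 0.999),
    "weight_decay": 5e-5,
}

policy = torch.nn.Linear(512, 15)  # stand-in module for the return-conditioned policy
optimizer = torch.optim.AdamW(
    policy.parameters(),
    lr=policy_hparams["learning_rate"],
    betas=policy_hparams["betas"],
    weight_decay=policy_hparams["weight_decay"],
)
```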