ELLA: Exploration through Learned Language Abstraction
Authors: Suvir Mirchandani, Siddharth Karamcheti, Dorsa Sadigh
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate our abstraction-based reward shaping framework on a series of tasks via the BabyAI platform [11]. We compare against a standard RL baseline as well as a strong language-based reward shaping approach [17], and find that our method leads to substantial gains in sample efficiency across a variety of instruction following tasks. Results: Figure 2 presents learning curves for ELLA, LEARN, and PPO (without shaping) across the six environments. |
| Researcher Affiliation | Academia | Suvir Mirchandani, Computer Science, Stanford University (suvir@cs.stanford.edu); Siddharth Karamcheti, Computer Science, Stanford University (skaramcheti@cs.stanford.edu); Dorsa Sadigh, Computer Science and Electrical Engineering, Stanford University (dorsa@cs.stanford.edu) |
| Pseudocode | Yes | Algorithm 1 Reward Shaping via ELLA |
| Open Source Code | Yes | Our code is available at https://github.com/Stanford-ILIAD/ELLA. |
| Open Datasets | Yes | We use BabyAI [11] as the platform for our tasks. BabyAI's language is synthetic, similar to prior work examining RL for instruction following at scale [20, 21]. |
| Dataset Splits | No | The paper uses the BabyAI platform for generating environments and training RL agents. While it mentions 'validation accuracy' for a learned classifier (fρ), it does not specify explicit train/validation/test splits (percentages, counts, or predefined citations) for the environment instances or collected trajectories used in the main RL experiments. |
| Hardware Specification | Yes | We used a combination of NVIDIA Titan and Tesla T4 GPUs to train our models. |
| Software Dependencies | No | The paper mentions using Proximal Policy Optimization (PPO) and the BabyAI platform, but it does not specify version numbers for any software dependencies, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | Rewards in BabyAI are sparse: agents receive a reward of 1 − 0.9(t/H), where t is the time step at which the high-level goal is achieved and H is the episode horizon. If the goal is not reached, the reward is 0. By default, all rewards are scaled up by a constant factor of 20. The results in Section 5 use λ = 0.25 for PUTNEXT, UNLOCK, and COMBO, and λ = 0.5 for OPEN&PICK and SEQUENCE (which have longer horizons H). |
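
For concreteness, the sparse-reward formula quoted in the Experiment Setup row can be written as a short helper. This is a minimal sketch under the stated conventions (success reward 1 − 0.9(t/H), zero otherwise, scaled by 20); the function name `sparse_reward` and its signature are illustrative, not taken from the authors' released code.

```python
def sparse_reward(succeeded: bool, t: int, H: int, scale: float = 20.0) -> float:
    """Scaled sparse episode reward in the BabyAI-style setup described above.

    succeeded: whether the high-level goal was reached
    t:         time step at which the goal was reached
    H:         episode horizon (maximum number of steps)
    scale:     constant factor applied to all rewards (20 in the paper)
    """
    if not succeeded:
        return 0.0
    # Reward decays linearly with the fraction of the horizon used: 1 - 0.9 * (t / H)
    return scale * (1.0 - 0.9 * (t / H))

# Example: reaching the goal at step 30 of a 100-step horizon
# sparse_reward(True, 30, 100) == 20 * (1 - 0.9 * 0.3) == 14.6
```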