ELLA: Exploration through Learned Language Abstraction

Authors: Suvir Mirchandani, Siddharth Karamcheti, Dorsa Sadigh

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate our abstraction-based reward shaping framework on a series of tasks via the BabyAI platform [11]. We compare against a standard RL baseline as well as a strong language-based reward shaping approach [17], and find that our method leads to substantial gains in sample efficiency across a variety of instruction following tasks. Results: Figure 2 presents learning curves for ELLA, LEARN, and PPO (without shaping) across the six environments.
Researcher Affiliation | Academia | Suvir Mirchandani, Computer Science, Stanford University (suvir@cs.stanford.edu); Siddharth Karamcheti, Computer Science, Stanford University (skaramcheti@cs.stanford.edu); Dorsa Sadigh, Computer Science and Electrical Engineering, Stanford University (dorsa@cs.stanford.edu)
Pseudocode | Yes | Algorithm 1: Reward Shaping via ELLA (a hedged sketch of this shaping loop appears after the table).
Open Source Code | Yes | Our code is available at https://github.com/Stanford-ILIAD/ELLA.
Open Datasets | Yes | We use BabyAI [11] as the platform for our tasks. BabyAI's language is synthetic, similar to prior work examining RL for instruction following at scale [20, 21].
Dataset Splits | No | The paper uses the BabyAI platform for generating environments and training RL agents. While it mentions 'validation accuracy' for a learned classifier (fρ), it does not specify explicit train/validation/test splits (percentages, counts, or citations to predefined splits) for the environment instances or collected trajectories used in the main RL experiments.
Hardware Specification | Yes | We used a combination of NVIDIA Titan and Tesla T4 GPUs to train our models.
Software Dependencies | No | The paper mentions using Proximal Policy Optimization (PPO) and the BabyAI platform, but it does not specify version numbers for any software dependencies, libraries, or frameworks used in the experiments.
Experiment Setup | Yes | Rewards in BabyAI are sparse: agents receive a reward of 1 − 0.9·(t/H), where t is the time step upon succeeding at the high-level goal. If the goal is not reached, the reward is 0. By default, all rewards are scaled up by a constant factor of 20. The results in Section 5 use λ = 0.25 for PUTNEXT, UNLOCK, and COMBO, and λ = 0.5 for OPEN&PICK and SEQUENCE (which have longer horizons H). A small numerical example of this reward follows the table.
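
To make the Experiment Setup row concrete, here is a minimal sketch of the sparse BabyAI-style reward it quotes. Only the 1 − 0.9·(t/H) formula, the ×20 scaling, and the λ values are taken from the excerpt; the function name, the horizon used in the usage example, and the task-to-λ mapping format are illustrative.

```python
def sparse_task_reward(t: int, H: int, succeeded: bool, scale: float = 20.0) -> float:
    """Sparse BabyAI-style reward: scale * (1 - 0.9 * t / H) on success at step t, else 0.

    The 1 - 0.9*(t/H) formula and the x20 scaling come from the excerpt above;
    the function and argument names are illustrative, not the authors' code.
    """
    if not succeeded:
        return 0.0
    return scale * (1.0 - 0.9 * (t / H))


# Shaping weights lambda reported in the excerpt (longer-horizon tasks use 0.5).
LAMBDA_BY_TASK = {
    "PutNext": 0.25,
    "Unlock": 0.25,
    "Combo": 0.25,
    "Open&Pick": 0.5,
    "Sequence": 0.5,
}

if __name__ == "__main__":
    # e.g. success at step 30 of a hypothetical 64-step horizon:
    print(sparse_task_reward(t=30, H=64, succeeded=True))  # 20 * (1 - 0.9 * 30/64) ≈ 11.56
```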
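
The Pseudocode row points to the paper's Algorithm 1, which is not reproduced on this page. The following is only a rough, illustrative sketch of relevance-based reward shaping in the spirit ELLA describes: award a bonus λ when a learned relevance classifier marks a just-completed low-level instruction as relevant to the high-level instruction, and offset the accumulated bonuses at termination so the shaped return stays anchored to the sparse task reward. Every name here (shape_rewards, is_relevant, etc.) is invented for illustration, and the offset scheme is a simplification; see the released code at https://github.com/Stanford-ILIAD/ELLA for the actual algorithm.

```python
from typing import Callable, List, Optional


def shape_rewards(
    high_instr: str,
    completed_low_instrs: List[Optional[str]],  # per step: a just-completed low-level instruction, or None
    task_success_step: Optional[int],           # step at which the high-level task succeeded, if any
    is_relevant: Callable[[str, str], bool],    # assumed interface for a learned relevance classifier
    sparse_reward: Callable[[int], float],      # step -> sparse task reward on success
    lam: float = 0.25,
) -> List[float]:
    """Illustrative shaping loop in the spirit of ELLA; not the authors' Algorithm 1.

    Awards a bonus `lam` the first time a relevant low-level instruction is
    completed, then subtracts the accumulated bonuses at the terminal step so
    the episode return on success still equals the sparse task reward.
    """
    rewards = [0.0] * len(completed_low_instrs)
    total_bonus, rewarded = 0.0, set()
    for t, low_instr in enumerate(completed_low_instrs):
        if low_instr is not None and low_instr not in rewarded and is_relevant(low_instr, high_instr):
            rewards[t] += lam
            total_bonus += lam
            rewarded.add(low_instr)
    if task_success_step is not None:
        # Terminal correction: keep the shaped return tied to the sparse reward.
        rewards[task_success_step] += sparse_reward(task_success_step) - total_bonus
    return rewards
```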