ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
Authors: Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, Matthew Hausknecht
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We address this limitation by introducing ALFWorld, a simulator that enables agents to learn abstract, text-based policies in TextWorld (Côté et al., 2018) and then execute goals from the ALFRED benchmark (Shridhar et al., 2020) in a rich visual environment. ALFWorld enables the creation of a new BUTLER agent whose abstract knowledge, learned in TextWorld, corresponds directly to concrete, visually grounded actions. In turn, as we demonstrate empirically, this fosters better agent generalization than training only in the visually grounded environment. |
| Researcher Affiliation | Collaboration | University of Washington; Microsoft Research, Montréal; Carnegie Mellon University; Microsoft Research |
| Pseudocode | No | The paper describes the architecture and components in detail with figures, but it does not include any formal pseudocode blocks or algorithms labeled as such. |
| Open Source Code | Yes | ALFWorld.github.io |
| Open Datasets | Yes | We build ALFWorld by extending two prior works: TextWorld (Côté et al., 2018), an engine for interactive text-based games, and ALFRED (Shridhar et al., 2020), a large-scale dataset for vision-language instruction following in embodied environments. |
| Dataset Splits | Yes | For each task type we construct a larger train set, as well as seen and unseen validation evaluation sets: seen consists of known task instances {task-type, object, receptacle, room} in rooms seen during training, but with different instantiations of object locations, quantities, and visual appearances (e.g. two blue pencils on a shelf instead of three red pencils in a drawer seen in training). |
| Hardware Specification | No | The paper notes that "THOR instances use 100MB of GPU memory for rendering, whereas TextWorld instances are CPU-only and are thus much easier to scale up." However, it does not specify any particular GPU or CPU models, or other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions several software components and tools such as "Mask R-CNN (He et al., 2017)", "Adam (Kingma and Ba, 2014)", "TextWorld (Côté et al., 2018)", "ALFRED (Shridhar et al., 2020)", "THOR simulator (Kolve et al., 2017)", "Fast Downward (Helmert, 2006)", and "BERT embeddings (Sanh et al., 2019)". While these are cited, the paper does not provide explicit version numbers for the software stack (e.g., Python 3.X, PyTorch 1.Y, or a specific Mask R-CNN release), which are required for full reproducibility. |
| Experiment Setup | Yes | For all experiments, we use Adam (Kingma and Ba, 2014) as the optimizer. The learning rate is set to 0.001 with a clip gradient norm of 5. During training with DAgger, we use a batch size of 10 to collect transitions... We set the max number of steps per episode to be 50... We linearly anneal the fraction of the expert's assistance from 100% to 1% across a window of 50K episodes. The agent is updated after every 5 steps of data collection. We sample a batch of 64 data points from the replay buffer... When using the beam search heuristic to recover from failed actions... we use a beam width of 10, and take the top-5 ranked outputs as candidates... All experiment settings in TextWorld are run with 8 random seeds. All text agents are trained for 50,000 episodes. |
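
The quoted setup pins down the optimizer, the DAgger annealing schedule, and the beam-search recovery heuristic precisely enough to sketch in code. Below is a minimal PyTorch sketch of those pieces, assuming a placeholder module in place of the real BUTLER agent; names such as `expert_fraction`, `dagger_update`, and `recover_candidates` are illustrative, not taken from the ALFWorld codebase.

```python
import torch
import torch.nn as nn

# Hyperparameters quoted in the paper's experiment setup.
LEARNING_RATE = 0.001
CLIP_GRAD_NORM = 5.0
MAX_STEPS_PER_EPISODE = 50
ANNEAL_WINDOW = 50_000       # expert assistance annealed 100% -> 1% over 50K episodes
UPDATE_EVERY = 5             # agent is updated after every 5 steps of data collection
REPLAY_BATCH_SIZE = 64
TOTAL_EPISODES = 50_000
BEAM_WIDTH = 10
NUM_CANDIDATES = 5


def expert_fraction(episode: int) -> float:
    """Linearly anneal the fraction of expert assistance from 100% down to 1%."""
    progress = min(episode, ANNEAL_WINDOW) / ANNEAL_WINDOW
    return 1.0 - progress * (1.0 - 0.01)


def recover_candidates(scored_beams: list[tuple[float, str]]) -> list[str]:
    """Beam-search recovery heuristic: from a beam of width 10, keep the
    top-5 ranked action strings as retry candidates after a failed action."""
    top = sorted(scored_beams, key=lambda pair: pair[0], reverse=True)[:NUM_CANDIDATES]
    return [action for _, action in top]


# Stand-in for the BUTLER text agent; this placeholder only shows the
# optimizer wiring, not the paper's actual encoder/decoder architecture.
agent = nn.Linear(128, 128)
optimizer = torch.optim.Adam(agent.parameters(), lr=LEARNING_RATE)


def dagger_update(batch_inputs: torch.Tensor, batch_targets: torch.Tensor) -> float:
    """One gradient step on a replay-buffer batch of 64 transitions,
    with the quoted gradient-norm clipping at 5."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(agent(batch_inputs), batch_targets)  # illustrative loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(agent.parameters(), CLIP_GRAD_NORM)
    optimizer.step()
    return loss.item()
```

As a sanity check, `expert_fraction(0)` returns 1.0 and `expert_fraction(50_000)` returns 0.01, matching the quoted 100% to 1% linear anneal; each `dagger_update` call corresponds to one sample of 64 transitions drawn from the replay buffer after every 5 steps of collection.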