BAGEL: Bootstrapping Agents by Guiding Exploration with Language

Authors: Shikhar Murty, Christopher D. Manning, Peter Shaw, Mandar Joshi, Kenton Lee

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This work presents BAGEL, a method for bootstrapping LM agents without human supervision. BAGEL converts a seed set of randomly explored trajectories or synthetic instructions into demonstrations via round-trips between two noisy LM components: an LM labeler, which converts a trajectory into a synthetic instruction, and a zero-shot LM agent, which maps the synthetic instruction into a refined trajectory. By performing these round-trips iteratively, BAGEL quickly converts the initial distribution of trajectories towards those that are well-described by natural language. We use BAGEL demonstrations to adapt a zero-shot LM agent at test time via in-context learning over retrieved demonstrations, and find improvements of over 2-13% absolute on ToolQA and MiniWoB++, with up to a 13x reduction in execution failures.
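
The round-trip procedure quoted above is compact enough to sketch. Below is a minimal Python sketch, assuming hypothetical helpers explore_randomly (exploration policy), label_trajectory (LM labeler), and run_agent (zero-shot LM agent); the actual prompts, filtering, and environment interfaces are not specified in the excerpt.

```python
# Hedged sketch of BAGEL's iterative round-trips (trajectory-first variant).
# All helper functions are hypothetical stand-ins for the paper's components.

def bagel_roundtrips(env, explore_randomly, label_trajectory, run_agent,
                     num_seeds=60, num_iters=5):
    """Bootstrap (instruction, trajectory) demonstrations without supervision."""
    demonstrations = []
    for _ in range(num_seeds):
        # Seed: a randomly explored trajectory in the environment.
        trajectory = explore_randomly(env)
        instruction = None
        for _ in range(num_iters):  # T_iter round-trips per seed
            # LM labeler: trajectory -> synthetic instruction.
            instruction = label_trajectory(trajectory)
            # Zero-shot LM agent: synthetic instruction -> refined trajectory.
            trajectory = run_agent(env, instruction)
        demonstrations.append((instruction, trajectory))
    return demonstrations
```

The instruction-first variant would simply swap the seeding step, sampling synthetic goals first and letting the agent produce the initial trajectories.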
Researcher Affiliation | Collaboration | 1 Department of Computer Science, Stanford University; 2 Google DeepMind.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Figure 2 illustrates a process flow, but it is not in pseudocode form.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. There is no specific repository link or explicit statement about code release in supplementary materials.
Open Datasets | Yes | Our experiments are based on two environments, MiniWoB++ (Shi et al., 2017; Liu et al., 2018) and ToolQA (Zhuang et al., 2023).
Dataset Splits | No | The paper does not provide specific dataset split information for a validation set (e.g., percentages, sample counts, or citations to predefined validation splits). It mentions evaluation on a subset of tasks for MiniWoB++ and test evaluation for ToolQA, but no explicit validation split.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running its experiments. It only mentions using an "instruction tuned PaLM-2" model.
Software Dependencies | No | The paper mentions using an "instruction tuned PaLM-2" and a "T5-XXL model" for embedding, and refers to a "Selenium WebDriver method" and a "Python function". However, it does not provide specific version numbers for these software components or any other libraries/solvers.
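
Although versions are not given, the retrieval step implied by the T5-XXL embedder is straightforward to sketch. The snippet below assumes a hypothetical embed function standing in for the T5-XXL encoder; cosine similarity and k=3 retrieved demonstrations are assumptions, not details confirmed by the paper.

```python
import numpy as np

# Hedged sketch of test-time demonstration retrieval for in-context learning.
# `embed` is a hypothetical wrapper around the paper's T5-XXL embedding model.

def retrieve_demonstrations(test_instruction, demonstrations, embed, k=3):
    """Return the k demonstrations whose instructions are closest to the query."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    query = embed(test_instruction)
    scored = [(cosine(query, embed(instruction)), (instruction, trajectory))
              for instruction, trajectory in demonstrations]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [demo for _, demo in scored[:k]]
```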
Experiment Setup | Yes | We use an instruction tuned PaLM-2 (Anil et al., 2023) as the base LM for all our experiments, and sample with a fixed temperature of 1.0. We set the max episode length T to 15 for all datasets and models. We also set T_iter to 5 when performing multiple iterations in BAGEL. For MiniWoB++, we start by sampling 60 trajectories in the exploration phase for trajectory-first variants of BAGEL, and sample 60 synthetic goals for instruction-first variants. For ToolQA, we sample 200 trajectories for BAGEL (trajectory-first), and 200 synthetic goals for BAGEL (instruction-first).
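
For reference, the quoted hyperparameters collected into a single illustrative config; the key names are hypothetical, but every value comes directly from the excerpt above.

```python
# Illustrative summary of the reported experiment setup (not from the paper's code).
BAGEL_CONFIG = {
    "base_lm": "instruction-tuned PaLM-2",
    "sampling_temperature": 1.0,
    "max_episode_length_T": 15,
    "roundtrip_iterations_T_iter": 5,
    "seed_samples": {
        "MiniWoB++": 60,   # trajectories (trajectory-first) or synthetic goals (instruction-first)
        "ToolQA": 200,     # same convention for both variants
    },
}
```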