BAGEL: Bootstrapping Agents by Guiding Exploration with Language
Authors: Shikhar Murty, Christopher D Manning, Peter Shaw, Mandar Joshi, Kenton Lee
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work presents BAGEL, a method for bootstrapping LM agents without human supervision. BAGEL converts a seed set of randomly explored trajectories or synthetic instructions into demonstrations via round-trips between two noisy LM components: an LM labeler, which converts a trajectory into a synthetic instruction, and a zero-shot LM agent, which maps the synthetic instruction into a refined trajectory. By performing these round-trips iteratively, BAGEL quickly shifts the initial distribution of trajectories towards those that are well described by natural language. We use BAGEL demonstrations to adapt a zero-shot LM agent at test time via in-context learning over retrieved demonstrations, and find improvements of 2-13% absolute on ToolQA and MiniWoB++, with up to a 13% reduction in execution failures. |
| Researcher Affiliation | Collaboration | ¹Department of Computer Science, Stanford University; ²Google DeepMind. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Figure 2 illustrates a process flow, but it is not in pseudocode format. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. There is no specific repository link or explicit statement about code release in supplementary materials. |
| Open Datasets | Yes | Our experiments are based on two environments, MiniWoB++ (Shi et al., 2017; Liu et al., 2018) and ToolQA (Zhuang et al., 2023). |
| Dataset Splits | No | The paper does not provide specific dataset split information for a validation set (e.g., percentages, sample counts, or citations to predefined validation splits). It mentions evaluation on a subset of tasks for MiniWoB++ and test evaluation for ToolQA, but no explicit validation split. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running its experiments. It only mentions using an "instruction tuned PaLM-2" model. |
| Software Dependencies | No | The paper mentions using an "instruction tuned PaLM-2" and a "T5-XXL model" for embedding, and refers to a "Selenium web-driver method" and a "Python function". However, it does not provide specific version numbers for these software components or any other libraries/solvers. |
| Experiment Setup | Yes | We use an instruction tuned PaLM-2 (Anil et al., 2023) as the base LM for all our experiments, and sample with a fixed temperature of 1.0. We set the max episode length T to 15 for all datasets and models. We also set T_iter to 5 when performing multiple iterations in BAGEL. For MiniWoB++, we start by sampling 60 trajectories in the exploration phase for trajectory-first variants of BAGEL, and sample 60 synthetic goals for instruction-first variants. For ToolQA, we sample 200 trajectories for BAGEL (trajectory-first), and 200 synthetic goals for BAGEL (instruction-first). |
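The round-trip bootstrapping described in the Research Type row can be sketched roughly as follows. This is a minimal illustration, not the authors' released implementation (none is available): `label_trajectory`, `run_agent`, and `is_feasible` are hypothetical stand-ins for the paper's LM labeler, zero-shot LM agent, and demonstration filter, while `T_ITER = 5` and `MAX_EPISODE = 15` mirror the settings quoted under Experiment Setup.

```python
# Hypothetical sketch of BAGEL's labeler/agent round-trips.
# The two callables below stand in for the paper's two noisy LM components.

T_ITER = 5        # number of round-trip iterations (paper's T_iter)
MAX_EPISODE = 15  # max episode length (paper's T)

def bagel_round_trips(seed_trajectories, label_trajectory, run_agent, is_feasible):
    """Convert seed trajectories into (instruction, trajectory) demonstrations.

    Each round trip relabels the current trajectory into a synthetic
    instruction, then re-executes that instruction with the zero-shot agent,
    gradually pulling trajectories towards ones well described by language.
    """
    demonstrations = []
    for traj in seed_trajectories:
        instruction = None
        for _ in range(T_ITER):
            # LM labeler: trajectory -> synthetic instruction
            instruction = label_trajectory(traj)
            # Zero-shot LM agent: instruction -> refined trajectory
            traj = run_agent(instruction, max_steps=MAX_EPISODE)
        if instruction is not None and is_feasible(traj):
            demonstrations.append((instruction, traj))
    return demonstrations
```

For the instruction-first variants mentioned in the table, the loop would instead start from sampled synthetic goals and skip the initial exploration phase; the round-trip structure is otherwise the same.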