BAGEL: Bootstrapping Agents by Guiding Exploration with Language

Authors: Shikhar Murty, Christopher D. Manning, Peter Shaw, Mandar Joshi, Kenton Lee

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This work presents BAGEL, a method for bootstrapping LM agents without human supervision. BAGEL converts a seed set of randomly explored trajectories or synthetic instructions into demonstrations via round-trips between two noisy LM components: an LM labeler, which converts a trajectory into a synthetic instruction, and a zero-shot LM agent, which maps the synthetic instruction into a refined trajectory. By performing these round-trips iteratively, BAGEL quickly converts the initial distribution of trajectories towards those that are well-described by natural language. We use BAGEL demonstrations to adapt a zero-shot LM agent at test time via in-context learning over retrieved demonstrations, and find improvements of over 2-13% absolute on ToolQA and MiniWoB++, with up to a 13x reduction in execution failures.
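
The round-trip procedure quoted above is compact enough to sketch. Below is a minimal Python sketch, assuming hypothetical helpers explore_randomly (exploration policy), label_trajectory (LM labeler), and run_agent (zero-shot LM agent); the actual prompts, filtering, and environment interfaces are not specified in the excerpt.

```python
# Hedged sketch of BAGEL's iterative round-trips (trajectory-first variant).
# All helper functions are hypothetical stand-ins for the paper's components.

def bagel_roundtrips(env, explore_randomly, label_trajectory, run_agent,
                     num_seeds=60, num_iters=5):
    """Bootstrap (instruction, trajectory) demonstrations without supervision."""
    demonstrations = []
    for _ in range(num_seeds):
        # Seed: a randomly explored trajectory in the environment.
        trajectory = explore_randomly(env)
        instruction = None
        for _ in range(num_iters):  # T_iter round-trips per seed
            # LM labeler: trajectory -> synthetic instruction.
            instruction = label_trajectory(trajectory)
            # Zero-shot LM agent: synthetic instruction -> refined trajectory.
            trajectory = run_agent(env, instruction)
        demonstrations.append((instruction, trajectory))
    return demonstrations
```

The instruction-first variant would simply swap the seeding step, sampling synthetic goals first and letting the agent produce the initial trajectories.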
Researcher Affiliation | Collaboration | 1 Department of Computer Science, Stanford University; 2 Google DeepMind.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Figure 2 illustrates a process flow, but it is not in pseudocode form.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. There is no specific repository link or explicit statement about code release in supplementary materials.
Open Datasets | Yes | Our experiments are based on two environments, MiniWoB++ (Shi et al., 2017; Liu et al., 2018) and ToolQA (Zhuang et al., 2023).
Dataset Splits | No | The paper does not provide specific dataset split information for a validation set (e.g., percentages, sample counts, or citations to predefined validation splits). It mentions evaluation on a subset of tasks for MiniWoB++ and test evaluation for ToolQA, but no explicit validation split.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running its experiments. It only mentions using an "instruction tuned PaLM-2" model.
Software Dependencies | No | The paper mentions using an "instruction tuned PaLM-2" and a "T5-XXL model" for embedding, and refers to a "Selenium WebDriver method" and a "Python function". However, it does not provide specific version numbers for these software components or any other libraries/solvers.
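
Although versions are not given, the retrieval step implied by the T5-XXL embedder is straightforward to sketch. The snippet below assumes a hypothetical embed function standing in for the T5-XXL encoder; cosine similarity and k=3 retrieved demonstrations are assumptions, not details confirmed by the paper.

```python
import numpy as np

# Hedged sketch of test-time demonstration retrieval for in-context learning.
# `embed` is a hypothetical wrapper around the paper's T5-XXL embedding model.

def retrieve_demonstrations(test_instruction, demonstrations, embed, k=3):
    """Return the k demonstrations whose instructions are closest to the query."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    query = embed(test_instruction)
    scored = [(cosine(query, embed(instruction)), (instruction, trajectory))
              for instruction, trajectory in demonstrations]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [demo for _, demo in scored[:k]]
```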
Experiment Setup | Yes | We use an instruction tuned PaLM-2 (Anil et al., 2023) as the base LM for all our experiments, and sample with a fixed temperature of 1.0. We set the max episode length T to 15 for all datasets and models. We also set T_iter to 5 when performing multiple iterations in BAGEL. For MiniWoB++, we start by sampling 60 trajectories in the exploration phase for trajectory-first variants of BAGEL, and sample 60 synthetic goals for instruction-first variants. For ToolQA, we sample 200 trajectories for BAGEL (trajectory-first), and 200 synthetic goals for BAGEL (instruction-first).
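
For reference, the quoted hyperparameters collected into a single illustrative config; the key names are hypothetical, but every value comes directly from the excerpt above.

```python
# Illustrative summary of the reported experiment setup (not from the paper's code).
BAGEL_CONFIG = {
    "base_lm": "instruction-tuned PaLM-2",
    "sampling_temperature": 1.0,
    "max_episode_length_T": 15,
    "roundtrip_iterations_T_iter": 5,
    "seed_samples": {
        "MiniWoB++": 60,   # trajectories (trajectory-first) or synthetic goals (instruction-first)
        "ToolQA": 200,     # same convention for both variants
    },
}
```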