Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL
Authors: Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, Aviral Kumar
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we find that ArCHer significantly improves efficiency and performance on agent tasks, attaining a sample efficiency of about 100x over existing methods, while also improving with larger model capacity (up to the 7 billion scale). ... The goal of our experiments is to evaluate the efficacy of hierarchical RL algorithms derived from ArCHer. |
| Researcher Affiliation | Collaboration | University of California, Berkeley; Google DeepMind. |
| Pseudocode | Yes | The algorithms derived from the ArCHer framework so far are summarized in Algorithm 1. Algorithm 1: ArCHer Practical Framework. |
| Open Source Code | Yes | The project page is https://yifeizhou02.github.io/archer.io/ and code can be found at https://github.com/YifeiZhou02/ArCHer. |
| Open Datasets | Yes | Detective Game (Hausknecht et al., 2019)... Twenty Questions and Twenty Questions Subset (Abdulhai et al., 2023)... Guess My City (Abdulhai et al., 2023)... WebShop (Yao et al., 2023a). We use the official offline dataset provided by Abdulhai et al. (2023) with 100K simulated episodes. |
| Dataset Splits | No | The paper mentions using "SFT dataset" for initialization and "offline dataset" for some tasks, but does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts) for reproducibility. |
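For context on what this row flags as missing, the following is a minimal sketch (not from the paper) of the kind of explicit, seeded split specification that would make such a partition reproducible; the 80/10/10 fractions and the `split_episodes` helper are assumptions for illustration, while the 100K episode count echoes the offline dataset size quoted above.

```python
# Hypothetical illustration of explicit, reproducible dataset splits.
# A fixed seed plus stated fractions fully determine the partition.
import random

def split_episodes(episodes, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle once with a fixed seed, then cut into train/val/test."""
    rng = random.Random(seed)
    shuffled = list(episodes)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_episodes(range(100_000))
print(len(train), len(val), len(test))  # 80000 10000 10000
```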
| Hardware Specification | Yes | TRC TPU credit donations from Google Cloud, and compute credits from the Center for AI Safety (CAIS). |
| Software Dependencies | No | The paper mentions specific models used (e.g., 'GPT-2', 'RoBERTa-base model', 'flan-t5-small') but does not provide specific version numbers for software dependencies such as programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries. |
| Experiment Setup | Yes | Table 2: Hyperparameters for All Experiments. This table lists specific values for 'actor lr', 'critic lr', 'batch size', 'rollout trajectories', 'replay buffer size', 'critic updates per iteration', 'discount', 'polyak alpha', 'PPO epochs', 'GAE lambda', 'clip range'. |
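As a sketch of how the hyperparameter fields named in Table 2 could be collected into a single reproducible configuration: every value below is a placeholder assumption, not a number from the paper, and the `ARCHER_CONFIG` name is hypothetical.

```python
# Hypothetical config mirroring the fields listed in Table 2; all values
# are illustrative placeholders, NOT the paper's actual hyperparameters.
ARCHER_CONFIG = {
    "actor_lr": 1e-4,                    # placeholder
    "critic_lr": 1e-4,                   # placeholder
    "batch_size": 64,                    # placeholder
    "rollout_trajectories": 32,          # placeholder
    "replay_buffer_size": 10_000,        # placeholder
    "critic_updates_per_iteration": 10,  # placeholder
    "discount": 0.99,                    # placeholder
    "polyak_alpha": 0.995,               # placeholder
    "ppo_epochs": 4,                     # placeholder
    "gae_lambda": 0.95,                  # placeholder
    "clip_range": 0.2,                   # placeholder
}

# Logging the full dict alongside results is one way to make a run auditable.
print(sorted(ARCHER_CONFIG))
```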