Efficient Exploration for LLMs
Authors: Vikranth Dwaracherla, Seyed Mohammad Asghari, Botao Hao, Benjamin Van Roy
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present evidence of substantial benefit from efficient exploration in gathering human feedback to improve large language models. In our experiments, an agent sequentially generates queries while fitting a reward model to the feedback received. |
| Researcher Affiliation | Collaboration | Google DeepMind; Stanford University. |
| Pseudocode | Yes | Algorithm 1 (learning interface) |
| Open Source Code | No | The paper references third-party tools and libraries used (e.g., 'enn library'), but does not explicitly state that its own source code is released or provide a link to it. |
| Open Datasets | Yes | Each prompt is sampled uniformly from the Anthropic Helpfulness Base train dataset. |
| Dataset Splits | No | The paper mentions using 'Anthropic Helpfulness Base train dataset' and 'Anthropic Helpfulness Base eval dataset' but does not explicitly describe a distinct 'validation' split or provide specific train/validation/test split percentages or sample counts. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'ADAM' for optimization and the 'enn library', but it does not specify version numbers for these or other software dependencies. |
| Experiment Setup | Yes | At the start of each epoch of interaction, each agent receives a batch of B = 32 prompts... The replay buffers are first-in-first-out (FIFO) buffers with a maximum capacity of C = 3200 data points. In our experiments, we set N = 100. |
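
For concreteness, here is a minimal Python sketch of the interaction loop the quoted setup describes: each epoch samples B = 32 prompts uniformly, collects one preference label per prompt into a FIFO replay buffer capped at C = 3200 data points, and refits the reward model on the buffer. The `propose`, `rate`, and `fit` callables are hypothetical stand-ins (the paper does not release its agent or reward-model code), and the role of the N = 100 epistemic indices is indicated only by a comment.

```python
import random
from collections import deque
from typing import Callable, Deque, List, Tuple

# Reported constants: B = 32 prompts per epoch, FIFO buffer capacity
# C = 3200 data points, N = 100 epistemic indices.
B, C, N = 32, 3200, 100

# One data point: (prompt, (response_a, response_b), index of preferred response).
Datum = Tuple[str, Tuple[str, str], int]

def run_epoch(
    prompts: List[str],
    propose: Callable[[str, int], Tuple[str, str]],  # hypothetical: agent's query generator
    rate: Callable[[str, Tuple[str, str]], int],     # hypothetical: human preference oracle
    fit: Callable[[List[Datum]], None],              # hypothetical: reward-model training step
    buffer: Deque[Datum],
) -> None:
    """One epoch: generate queries, gather feedback, refit the reward model."""
    for prompt in random.sample(prompts, B):  # uniform sampling from the prompt set
        pair = propose(prompt, N)             # exploration scheme may use N epistemic indices
        label = rate(prompt, pair)            # 0 or 1: which response the rater preferred
        buffer.append((prompt, pair, label))  # deque(maxlen=C) evicts oldest points first
    fit(list(buffer))                         # refit on the current buffer contents

# A deque with maxlen=C gives the first-in-first-out eviction described above.
buffer: Deque[Datum] = deque(maxlen=C)
```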