Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models
Authors: Marwa Abdulhai, Isadora White, Charlie Victor Snell, Charles Sun, Joey Hong, Yuexiang Zhai, Kelvin Xu, Sergey Levine
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for LLMs, together with an open-source research framework for getting started on multi-turn RL with offline value-based and online policy-based RL methods. Our benchmark consists of 3 Interactive Dialogue tasks and 5 RL Capability tests for a total of 8 tasks, which require multiple rounds of language interaction and cover tasks in open-ended dialogue and text games. ... In Table 2 we present the results for each method on each of our text-game and interactive dialogue tasks. |
| Researcher Affiliation | Collaboration | 1University of California, Berkeley 2Google. Correspondence to: Marwa Abdulhai <marwa EMAIL>. |
| Pseudocode | No | The paper describes algorithms conceptually, for example, in Section 3 'Multi-Turn Generation with RL and Language Models', but it does not provide structured pseudocode or algorithm blocks. The methods are explained in prose without formal pseudocode representation. |
| Open Source Code | Yes | Our project page (https://lmrl-gym.github.io/) contains links to our open-sourced datasets (https://rail.eecs.berkeley.edu/datasets/rl-llm-bench-dataset/) and research framework (https://github.com/abdulhaim/LMRL-Gym). |
| Open Datasets | Yes | Our project page (https://lmrl-gym.github.io/) contains links to our open-sourced datasets (https://rail.eecs.berkeley.edu/datasets/rl-llm-bench-dataset/) and research framework (https://github.com/abdulhaim/LMRL-Gym). |
| Dataset Splits | Yes | For Wordle we define the environment to use a subset of 400 words from the official wordle vocabulary list. ... We generate 1 million trajectories for training and 100k trajectories for evaluation, using our suboptimal policy. ... The dataset we collect consists of 100K full conversations between the guesser and the oracle. |
| Hardware Specification | No | The paper mentions 'We choose GPT2 rather than a larger model due to memory and time constraints, though we admit larger models would lead to a performance boost.' However, it does not specify any particular GPU models, CPU types, or other hardware used for running the experiments. |
| Software Dependencies | Yes | We collect our data for the chess task using Stockfish 15.1 simulating the agent of various strengths play against another environment Stockfish engine with elo 1200 simulating the environment. |
| Experiment Setup | Yes | We report the hyperparameters that we used for each task in Appendix E. Table 5. Hyperparameters for baseline experiments. |