Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
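To make the validation step concrete, here is a minimal sketch of how agreement between automated LLM labels and a manually labeled validation set could be measured. All function names, labels, and data below are illustrative assumptions, not the actual pipeline or metrics from [1].

```python
# Hypothetical sketch: measuring agreement between an automated LLM
# classifier and a manually labeled validation set. Data is illustrative.
from collections import Counter


def classification_accuracy(llm_labels, manual_labels):
    """Fraction of variables where the LLM label matches the manual label."""
    assert len(llm_labels) == len(manual_labels)
    matches = sum(a == m for a, m in zip(llm_labels, manual_labels))
    return matches / len(manual_labels)


def per_label_errors(llm_labels, manual_labels):
    """Count disagreements per manual (ground-truth) label,
    to surface systematic bias toward particular answers."""
    errors = Counter()
    for a, m in zip(llm_labels, manual_labels):
        if a != m:
            errors[m] += 1
    return errors


# Toy example with made-up labels for five variables:
llm = ["Yes", "No", "Yes", "Yes", "No"]
manual = ["Yes", "No", "No", "Yes", "No"]
print(classification_accuracy(llm, manual))  # 0.8
print(per_label_errors(llm, manual))         # Counter({'No': 1})
```

A per-label error breakdown like this is one simple way to check whether the classifier over-predicts "Yes" on ambiguous variables, which is the kind of bias the notice warns about.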

LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models

Authors: Marwa Abdulhai, Isadora White, Charlie Victor Snell, Charles Sun, Joey Hong, Yuexiang Zhai, Kelvin Xu, Sergey Levine

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for LLMs, together with an open-source research framework for getting started on multi-turn RL with offline value-based and online policy-based RL methods. Our benchmark consists of 3 Interactive Dialogue tasks and 5 RL Capability tests for a total of 8 tasks, which require multiple rounds of language interaction and cover tasks in open-ended dialogue and text games. ... In Table 2 we present the results for each method on each of our text-game and interactive dialogue tasks.
Researcher Affiliation | Collaboration | 1University of California, Berkeley 2Google. Correspondence to: Marwa Abdulhai <marwa EMAIL>.
Pseudocode | No | The paper describes algorithms conceptually, for example, in Section 3 'Multi-Turn Generation with RL and Language Models', but it does not provide structured pseudocode or algorithm blocks. The methods are explained in prose without formal pseudocode representation.
Open Source Code | Yes | Our project page (https://lmrl-gym.github.io/) contains links to our open-sourced datasets (https://rail.eecs.berkeley.edu/datasets/rl-llm-bench-dataset/) and research framework (https://github.com/abdulhaim/LMRL-Gym).
Open Datasets | Yes | Our project page (https://lmrl-gym.github.io/) contains links to our open-sourced datasets (https://rail.eecs.berkeley.edu/datasets/rl-llm-bench-dataset/) and research framework (https://github.com/abdulhaim/LMRL-Gym).
Dataset Splits | Yes | For Wordle we define the environment to use a subset of 400 words from the official wordle vocabulary list. ... We generate 1 million trajectories for training and 100k trajectories for evaluation, using our suboptimal policy. ... The dataset we collect consists of 100K full conversations between the guesser and the oracle.
Hardware Specification | No | The paper mentions 'We choose GPT2 rather than a larger model due to memory and time constraints, though we admit larger models would lead to a performance boost.' However, it does not specify any particular GPU models, CPU types, or other hardware used for running the experiments.
Software Dependencies | Yes | We collect our data for the chess task using Stockfish 15.1 simulating the agent of various strengths play against another environment Stockfish engine with elo 1200 simulating the environment.
Experiment Setup | Yes | We report the hyperparameters that we used for each task in Appendix E. Table 5. Hyperparameters for baseline experiments.