Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
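To make the validation step concrete, here is a minimal sketch of how agreement between automated LLM labels and a manually labeled validation set could be measured. All function names, labels, and data below are illustrative assumptions, not the actual pipeline or metrics from [1].

```python
# Hypothetical sketch: measuring agreement between an automated LLM
# classifier and a manually labeled validation set. Data is illustrative.
from collections import Counter


def classification_accuracy(llm_labels, manual_labels):
    """Fraction of variables where the LLM label matches the manual label."""
    assert len(llm_labels) == len(manual_labels)
    matches = sum(a == m for a, m in zip(llm_labels, manual_labels))
    return matches / len(manual_labels)


def per_label_errors(llm_labels, manual_labels):
    """Count disagreements per manual (ground-truth) label,
    to surface systematic bias toward particular answers."""
    errors = Counter()
    for a, m in zip(llm_labels, manual_labels):
        if a != m:
            errors[m] += 1
    return errors


# Toy example with made-up labels for five variables:
llm = ["Yes", "No", "Yes", "Yes", "No"]
manual = ["Yes", "No", "No", "Yes", "No"]
print(classification_accuracy(llm, manual))  # 0.8
print(per_label_errors(llm, manual))         # Counter({'No': 1})
```

A per-label error breakdown like this is one simple way to check whether the classifier over-predicts "Yes" on ambiguous variables, which is the kind of bias the notice warns about.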

LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models

Authors: Marwa Abdulhai, Isadora White, Charlie Victor Snell, Charles Sun, Joey Hong, Yuexiang Zhai, Kelvin Xu, Sergey Levine

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for LLMs, together with an open-source research framework for getting started on multi-turn RL with offline value-based and online policy-based RL methods. Our benchmark consists of 3 Interactive Dialogue tasks and 5 RL Capability tests for a total of 8 tasks, which require multiple rounds of language interaction and cover tasks in open-ended dialogue and text games. ... In Table 2 we present the results for each method on each of our text-game and interactive dialogue tasks.
Researcher Affiliation | Collaboration | 1University of California, Berkeley 2Google. Correspondence to: Marwa Abdulhai <marwa EMAIL>.
Pseudocode | No | The paper describes algorithms conceptually, for example, in Section 3 'Multi-Turn Generation with RL and Language Models', but it does not provide structured pseudocode or algorithm blocks. The methods are explained in prose without formal pseudocode representation.
Open Source Code | Yes | Our project page (https://lmrl-gym.github.io/) contains links to our open-sourced datasets (https://rail.eecs.berkeley.edu/datasets/rl-llm-bench-dataset/) and research framework (https://github.com/abdulhaim/LMRL-Gym).
Open Datasets | Yes | Our project page (https://lmrl-gym.github.io/) contains links to our open-sourced datasets (https://rail.eecs.berkeley.edu/datasets/rl-llm-bench-dataset/) and research framework (https://github.com/abdulhaim/LMRL-Gym).
Dataset Splits | Yes | For Wordle we define the environment to use a subset of 400 words from the official wordle vocabulary list. ... We generate 1 million trajectories for training and 100k trajectories for evaluation, using our suboptimal policy. ... The dataset we collect consists of 100K full conversations between the guesser and the oracle.
Hardware Specification | No | The paper mentions 'We choose GPT2 rather than a larger model due to memory and time constraints, though we admit larger models would lead to a performance boost.' However, it does not specify any particular GPU models, CPU types, or other hardware used for running the experiments.
Software Dependencies | Yes | We collect our data for the chess task using Stockfish 15.1 simulating the agent of various strengths play against another environment Stockfish engine with elo 1200 simulating the environment.
Experiment Setup | Yes | We report the hyperparameters that we used for each task in Appendix E. Table 5. Hyperparameters for baseline experiments.