Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Minimax Optimal Reinforcement Learning with Quasi-Optimism
Authors: Harin Lee, Min-hwan Oh
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations demonstrate that EQO consistently outperforms existing algorithms in both regret performance and computational efficiency, providing the best of both theoretical soundness and practical effectiveness. We perform numerical experiments to compare the empirical performance of algorithms for tabular reinforcement learning. |
| Researcher Affiliation | Academia | Harin Lee (Seoul National University), Min-hwan Oh (Seoul National University) |
| Pseudocode | Yes | Algorithm 1: EQO (Exploration via Quasi-Optimism) |
| Open Source Code | Yes | We also guarantee the reproducibility of the numerical experiments in Section 5 and Appendices G and H.2 by providing the source code with specific seeds as supplementary material. |
| Open Datasets | Yes | We consider the standard MDP named River Swim (Strehl & Littman, 2008; Osband et al., 2013)... We conduct additional experiments in two more complex environments: Atari Freeway (Bellemare et al., 2013) and MiniGrid (MiniGrid-KeyCorridorS3R1-v0; Chevalier-Boisvert et al., 2023). We have obtained their tabularized versions from the BRIDGE dataset (Laidlaw et al., 2023). |
| Dataset Splits | No | The paper discusses performing experiments over a certain number of 'episodes' and 'runs' (e.g., '10 runs of 100,000 episodes') which is typical for reinforcement learning environments. However, it does not describe traditional training, validation, or test splits for a static dataset, as the environments are interactive rather than pre-split datasets. |
| Hardware Specification | No | The paper describes numerical experiments and execution times but does not specify any particular hardware used for these experiments, such as GPU or CPU models, or other computer specifications. |
| Software Dependencies | No | The paper mentions various algorithms and environments (e.g., UCRL2, UCBVI-BF, River Swim, Atari) but does not provide specific version numbers for any software dependencies, programming languages, or libraries used in their implementation. |
| Experiment Setup | No | All parameters are set according to the algorithms' theoretical values as described in their respective papers. For EQO, the parameters are set as described in Theorem 2, where the algorithm is unaware of the number of episodes. For the tuned version, the paper states that 'a multiplicative factor for the whole bonus term is set as a tuning parameter', but the specific values used in the experiments are not explicitly listed. |
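The Pseudocode row above refers to the paper's Algorithm 1 (EQO, Exploration via Quasi-Optimism). The paper itself contains the exact algorithm and bonus constants; as a purely illustrative aid, the sketch below shows the general bonus-based optimistic value-iteration pattern that tabular algorithms of this family follow. The `1/n`-style bonus form, the clipping rule, and all constants here are assumptions for illustration, not the paper's specification.

```python
import numpy as np

def optimistic_q_iteration(P, R, counts, H, bonus_coef=1.0):
    """Finite-horizon backward induction with an additive per-(s, a)
    exploration bonus (illustrative bonus_coef / n(s, a) form).

    P: (S, A, S) transition probabilities
    R: (S, A) mean rewards in [0, 1]
    counts: (S, A) visit counts (assumed >= 1)
    H: horizon (number of steps per episode)
    """
    S, A = R.shape
    Q = np.zeros((H + 1, S, A))
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        bonus = bonus_coef / counts                 # illustrative 1/n-style bonus
        # Optimistic backup, clipped at the maximum achievable return H - h
        Q[h] = np.minimum(R + bonus + P @ V[h + 1], H - h)
        V[h] = Q[h].max(axis=1)                     # greedy value at step h
    return Q, V
```

In practice such an algorithm would recompute `Q` each episode from updated empirical `P`, `R`, and `counts`, then act greedily with respect to `Q[h]`; the bonus shrinks as visit counts grow, so exploration fades on well-visited state-action pairs.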