Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Minimax Optimal Reinforcement Learning with Quasi-Optimism

Authors: Harin Lee, Min-hwan Oh

ICLR 2025

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental. "Empirical evaluations demonstrate that EQO consistently outperforms existing algorithms in both regret performance and computational efficiency, providing the best of both theoretical soundness and practical effectiveness." "We perform numerical experiments to compare the empirical performance of algorithms for tabular reinforcement learning."
Researcher Affiliation: Academia. "Harin Lee, Seoul National University; Min-hwan Oh, Seoul National University" (email addresses omitted).
Pseudocode: Yes. "Algorithm 1: EQO (Exploration via Quasi-Optimism)"
Open Source Code: Yes. "We also guarantee the reproducibility of the numerical experiments in Section 5 and Appendices G and H.2 by providing the source code with specific seeds as supplementary material."
Open Datasets: Yes. "We consider the standard MDP named River Swim (Strehl & Littman, 2008; Osband et al., 2013)... We conduct additional experiments in two more complex environments: Atari freeway_10_fs30 (Bellemare et al., 2013) and MiniGrid KeyCorridorS3R1-v0 (Chevalier-Boisvert et al., 2023). We have obtained their tabularized versions from the BRIDGE dataset (Laidlaw et al., 2023)."
Dataset Splits: No. The paper reports experiments over a number of episodes and runs (e.g., "10 runs of 100,000 episodes"), as is standard in reinforcement learning. It does not describe traditional training/validation/test splits, because the environments are interactive rather than pre-split static datasets.
Hardware Specification: No. The paper reports execution times for its numerical experiments but does not specify the hardware used, such as CPU or GPU models or other machine specifications.
Software Dependencies: No. The paper names algorithms and environments (e.g., UCRL2, UCBVI-BF, River Swim, Atari) but provides no version numbers for the programming language, libraries, or other software dependencies used in the implementation.
Experiment Setup: No. "All parameters are set according to the algorithms' theoretical values as described in their respective papers. For EQO, the parameters are set as described in Theorem 2, where the algorithm is unaware of the number of episodes." For the tuned version, the paper states that "a multiplicative factor for the whole bonus term is set as a tuning parameter", but the specific values used in the experiments are not listed.
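The pseudocode and seeded-runs rows above can be illustrated with a toy sketch. This is not the authors' EQO implementation: the chain environment, the bonus scale `C_BONUS`, and the Q-learning update are all illustrative assumptions introduced here; only the general shape of a count-based exploration bonus and the fixed-seed, multiple-run protocol echo what the report describes.

```python
import random

# Illustrative sketch (NOT the paper's EQO algorithm): tabular Q-learning on a
# tiny chain MDP with a count-based exploration bonus of the form c / n(s, a),
# averaged over several fixed-seed runs. All constants below are assumptions.

N_STATES, N_ACTIONS = 5, 2   # toy chain environment
C_BONUS = 1.0                # bonus scale (illustrative, not from the paper)

def step(state, action, rng):
    """Toy dynamics: action 1 usually moves right; reward sits at the far end."""
    if action == 1 and rng.random() < 0.8:
        state = min(state + 1, N_STATES - 1)
    else:
        state = max(state - 1, 0)
    reward = 1.0 if state == N_STATES - 1 else 0.0
    return state, reward

def run(seed, episodes=200, horizon=20, alpha=0.1, gamma=0.95):
    rng = random.Random(seed)  # fixed seed => a reproducible run
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    counts = [[0] * N_ACTIONS for _ in range(N_STATES)]
    total = 0.0
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            # act greedily on Q plus a count-based bonus c / n(s, a);
            # unvisited pairs get an infinite bonus, forcing one try each
            def score(a):
                n = counts[s][a]
                return q[s][a] + (C_BONUS / n if n > 0 else float("inf"))
            a = max(range(N_ACTIONS), key=score)
            counts[s][a] += 1
            s2, r = step(s, a, rng)
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            total += r
            s = s2
    return total

# Averaging over several seeded runs mirrors (at a small scale) the paper's
# "10 runs of 100,000 episodes" protocol mentioned in the report above.
returns = [run(seed) for seed in range(5)]
print(sum(returns) / len(returns))
```

Because each run is driven by its own seeded `random.Random`, rerunning with the same seed reproduces the same return exactly, which is the property the "source code with specific seeds" row refers to.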