Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Minimax Optimal Reinforcement Learning with Quasi-Optimism
Authors: Harin Lee, Min-hwan Oh
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations demonstrate that EQO consistently outperforms existing algorithms in both regret performance and computational efficiency, providing the best of both theoretical soundness and practical effectiveness. We perform numerical experiments to compare the empirical performance of algorithms for tabular reinforcement learning. |
| Researcher Affiliation | Academia | Harin Lee (Seoul National University), Min-hwan Oh (Seoul National University) |
| Pseudocode | Yes | Algorithm 1: EQO (Exploration via Quasi-Optimism) |
| Open Source Code | Yes | We also guarantee the reproducibility of the numerical experiments in Section 5 and Appendices G and H.2 by providing the source code with specific seeds as supplementary material. |
| Open Datasets | Yes | We consider the standard MDP named River Swim (Strehl & Littman, 2008; Osband et al., 2013)... We conduct additional experiments in two more complex environments: Atari Freeway (Bellemare et al., 2013) and MiniGrid (MiniGrid-KeyCorridorS3R1-v0; Chevalier-Boisvert et al., 2023). We have obtained their tabularized versions from the BRIDGE dataset (Laidlaw et al., 2023). |
| Dataset Splits | No | The paper discusses performing experiments over a certain number of 'episodes' and 'runs' (e.g., '10 runs of 100,000 episodes') which is typical for reinforcement learning environments. However, it does not describe traditional training, validation, or test splits for a static dataset, as the environments are interactive rather than pre-split datasets. |
| Hardware Specification | No | The paper describes numerical experiments and execution times but does not specify any particular hardware used for these experiments, such as GPU or CPU models, or other computer specifications. |
| Software Dependencies | No | The paper mentions various algorithms and environments (e.g., UCRL2, UCBVI-BF, River Swim, Atari) but does not provide specific version numbers for any software dependencies, programming languages, or libraries used in their implementation. |
| Experiment Setup | No | All parameters are set according to the algorithms' theoretical values as described in their respective papers. For EQO, the parameters are set as described in Theorem 2, where the algorithm is unaware of the number of episodes. For the tuned version, the paper states that 'a multiplicative factor for the whole bonus term is set as a tuning parameter', but the specific values used in the experiments are not explicitly listed. |
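The Pseudocode row above refers to the paper's Algorithm 1 (EQO, Exploration via Quasi-Optimism). The paper itself contains the exact algorithm and bonus constants; as a purely illustrative aid, the sketch below shows the general bonus-based optimistic value-iteration pattern that tabular algorithms of this family follow. The `1/n`-style bonus form, the clipping rule, and all constants here are assumptions for illustration, not the paper's specification.

```python
import numpy as np

def optimistic_q_iteration(P, R, counts, H, bonus_coef=1.0):
    """Finite-horizon backward induction with an additive per-(s, a)
    exploration bonus (illustrative bonus_coef / n(s, a) form).

    P: (S, A, S) transition probabilities
    R: (S, A) mean rewards in [0, 1]
    counts: (S, A) visit counts (assumed >= 1)
    H: horizon (number of steps per episode)
    """
    S, A = R.shape
    Q = np.zeros((H + 1, S, A))
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        bonus = bonus_coef / counts                 # illustrative 1/n-style bonus
        # Optimistic backup, clipped at the maximum achievable return H - h
        Q[h] = np.minimum(R + bonus + P @ V[h + 1], H - h)
        V[h] = Q[h].max(axis=1)                     # greedy value at step h
    return Q, V
```

In practice such an algorithm would recompute `Q` each episode from updated empirical `P`, `R`, and `counts`, then act greedily with respect to `Q[h]`; the bonus shrinks as visit counts grow, so exploration fades on well-visited state-action pairs.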