Position: Benchmarking is Limited in Reinforcement Learning Research
Authors: Scott M. Jordan, Adam White, Bruno Castro da Silva, Martha White, Philip S. Thomas
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work investigates the sources of increased computation costs in rigorous experiment designs. We show that the computational costs of conducting rigorous performance benchmarks are often prohibitive. |
| Researcher Affiliation | Academia | (1) University of Alberta, (2) Canada CIFAR AI Chair, (3) Alberta Machine Intelligence Institute, (4) University of Massachusetts. |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | No explicit statements or links indicating the release of open-source code for the methodology described in this paper were found. The paper refers to algorithms and specifications from Jordan et al. (2020) but does not provide its own code. |
| Open Datasets | Yes | We use a variant of the Four Rooms MDP (Sutton et al., 1999) where there are two goal states: one ten steps away from the start state yielding a reward of 5, and one seventeen steps away from the start state yielding a reward of 10. |
| Dataset Splits | No | The paper refers to "sample sizes" for statistical evaluation and bootstrapping (e.g., "1,000 datasets for different sample sizes, (10, 25, 50, 100, 500, 1000), by sampling with replacement from the empirical distribution.") but does not provide explicit training, validation, or test dataset splits in the conventional machine learning sense for model training. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory specifications) used for running experiments were provided in the paper. |
| Software Dependencies | No | No specific ancillary software details with version numbers (e.g., library or solver names with versions) needed to replicate the experiment were provided. The paper mentions hyperparameters (ϵ = 0.02, λ = 0.9) but not software versions. |
| Experiment Setup | Yes | To measure the impact of each configuration, we will use eight standard algorithms and fourteen classic benchmark environments (listed in Appendix B). To measure the reliability of the confidence intervals, we need to have ground truth information about each algorithm's performance. Since we do not know the performance distribution for each algorithm, we will approximate it and treat the approximate distribution as the ground truth. We create the approximate distribution for each X_{i,j} from approximately 334,000 executions of each algorithm-environment pair. We then create 1,000 datasets for different sample sizes (10, 25, 50, 100, 500, 1000) by sampling with replacement from the empirical distribution. We treat each dataset as a single trial of the evaluation procedure. To compute 95% confidence intervals, we use the percentile bootstrap technique with 10,000 bootstrap samples... |
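
The experiment setup above amounts to a coverage study of percentile-bootstrap confidence intervals. As a rough illustration, the sketch below (with hypothetical names, a synthetic stand-in for the ~334,000 recorded executions, and a reduced trial count so it runs quickly) shows how such a coverage check can be implemented; it is an assumption-laden sketch, not the authors' code.

```python
# Sketch of the coverage check described in the "Experiment Setup" row.
# The distribution `empirical_returns` and all names here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the ~334,000 recorded performances of one algorithm-environment pair.
empirical_returns = rng.normal(loc=1.0, scale=0.5, size=334_000)
true_mean = empirical_returns.mean()  # treated as ground truth, as in the paper


def percentile_bootstrap_ci(sample, n_boot=10_000, alpha=0.05):
    """95% percentile-bootstrap confidence interval for the mean of `sample`."""
    # Resample the dataset with replacement n_boot times and take empirical
    # percentiles of the resulting means (the percentile bootstrap).
    resamples = rng.choice(sample, size=(n_boot, sample.size), replace=True)
    boot_means = resamples.mean(axis=1)
    return np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])


# The paper draws 1,000 datasets per sample size; fewer trials are used here
# so the sketch finishes in a reasonable time.
n_trials = 100
for n in (10, 25, 50, 100, 500, 1000):  # sample sizes listed in the paper
    covered = 0
    for _ in range(n_trials):
        dataset = rng.choice(empirical_returns, size=n, replace=True)
        lo, hi = percentile_bootstrap_ci(dataset)
        covered += int(lo <= true_mean <= hi)
    print(f"n={n:4d}  empirical coverage of the 95% CI: {covered / n_trials:.2f}")
```

In the paper's setting the empirical distribution comes from recorded algorithm executions rather than a Gaussian, and the procedure is repeated for every algorithm-environment pair, which is what drives the computational cost the paper highlights.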