Position: Benchmarking is Limited in Reinforcement Learning Research
Authors: Scott M. Jordan, Adam White, Bruno Castro da Silva, Martha White, Philip S. Thomas
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work investigates the sources of increased computation costs in rigorous experiment designs. We show that the computational costs of conducting rigorous performance benchmarks are often prohibitive. |
| Researcher Affiliation | Academia | (1) University of Alberta, (2) Canada CIFAR AI Chair, (3) Alberta Machine Intelligence Institute, (4) University of Massachusetts. |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | No explicit statements or links indicating the release of open-source code for the methodology described in this paper were found. The paper refers to algorithms and specifications from Jordan et al. (2020) but does not provide its own code. |
| Open Datasets | Yes | We use a variant of the Four Rooms MDP (Sutton et al., 1999) where there are two goal states: one ten steps away from the start state yielding a reward of 5, and one seventeen steps away from the start state yielding a reward of 10. |
| Dataset Splits | No | The paper refers to "sample sizes" for statistical evaluation and bootstrapping (e.g., "1,000 datasets for different sample sizes, (10, 25, 50, 100, 500, 1000), by sampling with replacement from the empirical distribution.") but does not provide explicit training, validation, or test dataset splits in the conventional machine learning sense for model training. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory specifications) used for running experiments were provided in the paper. |
| Software Dependencies | No | No specific ancillary software details with version numbers (e.g., library or solver names with versions) needed to replicate the experiment were provided. The paper mentions hyperparameters (ϵ = 0.02, λ = 0.9) but not software versions. |
| Experiment Setup | Yes | To measure the impact of each configuration, we will use eight standard algorithms and fourteen classic benchmark environments (listed in Appendix B). To measure the reliability of the confidence intervals, we need to have ground truth information about each algorithm's performance. Since we do not know the performance distribution for each algorithm, we will approximate it and treat the approximate distribution as the ground truth. We create the approximate distribution for each X_{i,j} from approximately 334,000 executions of each algorithm-environment pair. We then create 1,000 datasets for different sample sizes (10, 25, 50, 100, 500, 1000) by sampling with replacement from the empirical distribution. We treat each dataset as a single trial of the evaluation procedure. To compute 95% confidence intervals, we use the percentile bootstrap technique with 10,000 bootstrap samples... |
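
The experiment setup above amounts to a coverage study of percentile-bootstrap confidence intervals. As a rough illustration, the sketch below (with hypothetical names, a synthetic stand-in for the ~334,000 recorded executions, and a reduced trial count so it runs quickly) shows how such a coverage check can be implemented; it is an assumption-laden sketch, not the authors' code.

```python
# Sketch of the coverage check described in the "Experiment Setup" row.
# The distribution `empirical_returns` and all names here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the ~334,000 recorded performances of one algorithm-environment pair.
empirical_returns = rng.normal(loc=1.0, scale=0.5, size=334_000)
true_mean = empirical_returns.mean()  # treated as ground truth, as in the paper


def percentile_bootstrap_ci(sample, n_boot=10_000, alpha=0.05):
    """95% percentile-bootstrap confidence interval for the mean of `sample`."""
    # Resample the dataset with replacement n_boot times and take empirical
    # percentiles of the resulting means (the percentile bootstrap).
    resamples = rng.choice(sample, size=(n_boot, sample.size), replace=True)
    boot_means = resamples.mean(axis=1)
    return np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])


# The paper draws 1,000 datasets per sample size; fewer trials are used here
# so the sketch finishes in a reasonable time.
n_trials = 100
for n in (10, 25, 50, 100, 500, 1000):  # sample sizes listed in the paper
    covered = 0
    for _ in range(n_trials):
        dataset = rng.choice(empirical_returns, size=n, replace=True)
        lo, hi = percentile_bootstrap_ci(dataset)
        covered += int(lo <= true_mean <= hi)
    print(f"n={n:4d}  empirical coverage of the 95% CI: {covered / n_trials:.2f}")
```

In the paper's setting the empirical distribution comes from recorded algorithm executions rather than a Gaussian, and the procedure is repeated for every algorithm-environment pair, which is what drives the computational cost the paper highlights.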