Towards Hyperparameter-free Policy Selection for Offline Reinforcement Learning
Authors: Siyuan Zhang, Nan Jiang
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform empirical evaluation on OpenAI Gym [Bro+16], Atari games [BNVB13], and MuJoCo [TET12]. ... For each algorithm, we consider different neural architectures, learning rates, and learning steps as hyperparameters to produce multiple candidate policies (and value functions) for selection; see Table 1 in Appendix C for details. |
| Researcher Affiliation | Academia | Siyuan Zhang, Computer Science, University of Illinois at Urbana-Champaign (siyuan3@illinois.edu); Nan Jiang, Computer Science, University of Illinois at Urbana-Champaign (nanjiang@illinois.edu) |
| Pseudocode | Yes | Based on this novel observation, we propose to search over a grid of discretization errors in BVFT and pick the resolution that minimizes the loss (Eq. (2)); see pseudocode in Appendix A. (A hedged sketch of this resolution search appears after the table.) |
| Open Source Code | No | The paper does not provide an explicit statement that its code is released, nor a link to a repository implementing the described methodology. |
| Open Datasets | Yes | We use standard offline datasets when available (RL Unplugged [Gul+21] for Atari, and D4RL [FKNTL21] for MuJoCo)... |
| Dataset Splits | No | The paper mentions re-sampling a subset of the dataset (usually of size 50,000) for policy selection, but it does not explicitly describe train/validation/test splits, either for training the candidate policies or for the data used in evaluating its method, in a reproducible manner. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x) needed to replicate the experiment. |
| Experiment Setup | Yes | For each algorithm, we consider different neural architectures, learning rates, and learning steps as hyperparameters to produce multiple candidate policies (and value functions) for selection; see Table 1 in Appendix C for details. ... we propose to search over a grid of discretization errors in BVFT and pick the resolution that minimizes the loss (Eq. (2)); see pseudocode in Appendix A. ... Strategy 1 (using BVFT-PE-Q) slightly outperforms Strategy 2, but comes with an additional hyperparameter λ; we tuned it on Hopper and use the same constant in all experiments. |
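
The resolution search quoted in the Pseudocode and Experiment Setup rows can be summarized in a short sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' released code: `bvft_pairwise_loss` is a hypothetical stand-in for the BVFT pairwise loss, and aggregating pairwise losses by a max-over-opponents rule is an assumption about how candidates are ranked; only the min-over-resolutions step mirrors the search described in the paper (Eq. (2), pseudocode in Appendix A).

```python
import numpy as np


def bvft_pairwise_loss(q1, q2, dataset, resolution):
    """Hypothetical stand-in for the BVFT pairwise loss of candidate q1
    against candidate q2 at a given discretization resolution.
    A faithful implementation would follow BVFT; this stub only returns
    an arbitrary deterministic-per-run value so the loop below runs."""
    rng = np.random.default_rng(abs(hash((id(q1), id(q2), resolution))) % (2**32))
    return float(rng.uniform())


def select_q_by_bvft(candidate_qs, dataset, resolutions):
    """Sketch of the hyperparameter-free selection idea: for each pair of
    candidate Q-functions, evaluate the BVFT loss over a grid of
    discretization resolutions and keep the smallest value, then pick the
    candidate whose worst-case pairwise loss is lowest (assumed rule)."""
    n = len(candidate_qs)
    loss = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Search over the resolution grid and keep the minimum loss,
            # as the paper proposes for removing BVFT's resolution knob.
            loss[i, j] = min(
                bvft_pairwise_loss(candidate_qs[i], candidate_qs[j], dataset, eps)
                for eps in resolutions
            )
    scores = loss.max(axis=1)          # worst-case loss per candidate (assumption)
    return int(np.argmin(scores)), scores


if __name__ == "__main__":
    # Toy usage with placeholder candidates and no real dataset.
    qs = ["q_small_net", "q_large_net", "q_long_training"]
    best, scores = select_q_by_bvft(qs, dataset=None,
                                    resolutions=[0.1, 0.5, 1.0, 2.0, 5.0])
    print(best, scores)
```

With the stub replaced by a real BVFT loss computed on the re-sampled selection dataset, the same outer loop would reproduce the paper's "pick the resolution that minimizes the loss" step without asking the user to choose a single resolution in advance.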