Batch Reinforcement Learning with Hyperparameter Gradients

Authors: Byungjun Lee, Jongmin Lee, Peter Vrancx, Dongho Kim, Kee-Eung Kim

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that BOPAH outperforms other batch reinforcement learning algorithms in tabular and continuous control tasks, by finding a good balance to the trade-off between adhering to the data collection policy and pursuing the possible policy improvement.
Researcher Affiliation | Collaboration | (1) School of Computing, KAIST, Daejeon, South Korea; (2) PROWLER.io; (3) Graduate School of AI, KAIST, Daejeon, South Korea.
Pseudocode | No | The paper describes the algorithms and procedures in paragraph text without a dedicated pseudocode or algorithm block.
Open Source Code | No | The paper states "We used their published code and hyperparameters (Φ = 0.05 for BCQ and ϵ = 0.05 for BEAR-QL) therein for obtaining experimental results," referring to third-party code, but does not provide concrete access to their own source code for BOPAH/AC-BOPAH.
Open Datasets | Yes | In this experiment, we evaluate the effectiveness of AC-BOPAH on continuous control tasks, using the MuJoCo environments in the Open AI gym (Todorov et al., 2012; Brockman et al., 2016).
Dataset Splits | Yes | BOPAH starts by dividing the entire batch data D = {(s_i, a_i, s'_i, r_i)}_{i=1}^{N} into two mutually exclusive sets D_train and D_valid (see the split sketch after this table).
Hardware Specification | No | No specific hardware details (e.g., CPU, GPU models, memory) used for experiments are provided in the paper.
Software Dependencies | No | The paper mentions software like "Open AI gym" and algorithms such as SAC, BCQ, and BEAR-QL, but does not provide specific version numbers for these or other software dependencies like deep learning frameworks.
Experiment Setup | Yes | We used their published code and hyperparameters (Φ = 0.05 for BCQ and ϵ = 0.05 for BEAR-QL) therein for obtaining experimental results (see the configuration sketch after this table).
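
To make the Dataset Splits row concrete, here is a minimal Python sketch of dividing a batch of transitions into two mutually exclusive sets D_train and D_valid. The function name, the random 80/20 split, and the fixed seed are assumptions for illustration; the paper only states that the batch data D is divided into D_train and D_valid.

```python
import random

def split_batch(transitions, valid_fraction=0.2, seed=0):
    """Split a batch of (s, a, s_next, r) transitions into two
    mutually exclusive sets, D_train and D_valid.

    The 80/20 ratio and the random shuffle are illustrative
    assumptions, not details taken from the paper.
    """
    rng = random.Random(seed)
    indices = list(range(len(transitions)))
    rng.shuffle(indices)
    n_valid = int(len(transitions) * valid_fraction)
    valid_idx = set(indices[:n_valid])
    d_train = [t for i, t in enumerate(transitions) if i not in valid_idx]
    d_valid = [t for i, t in enumerate(transitions) if i in valid_idx]
    return d_train, d_valid
```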
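For the Experiment Setup row, the following sketch shows one way the quoted baseline hyperparameters could be organized, and how a MuJoCo continuous control task from the OpenAI Gym is typically instantiated with the Gym API of that era. Only the values Φ = 0.05 (BCQ) and ϵ = 0.05 (BEAR-QL) come from the paper; the dictionary layout and the specific environment ID are assumptions.

```python
import gym

# Baseline hyperparameters quoted in the paper; the dictionary
# structure itself is only illustrative.
BASELINE_HPARAMS = {
    "BCQ": {"phi": 0.05},      # perturbation scale Φ for BCQ
    "BEAR-QL": {"eps": 0.05},  # constraint threshold ϵ for BEAR-QL
}

# The paper evaluates on MuJoCo continuous control tasks in the
# OpenAI Gym; the exact environment ID below is an assumption.
env = gym.make("HalfCheetah-v2")
obs = env.reset()
action = env.action_space.sample()
obs, reward, done, info = env.step(action)
env.close()
```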