Offline RL Policies Should Be Trained to be Adaptive
Authors: Dibya Ghosh, Anurag Ajay, Pulkit Agrawal, Sergey Levine
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The primary aim of our experiments is to ascertain whether adaptability leads to improved performance in offline RL. Thus, we provide an evaluation on standard D4RL benchmark tasks (Fu et al., 2020) and two offline RL tasks that require handling ambiguity and generalization, Locked Doors and Procgen Mazes. |
| Researcher Affiliation | Academia | 1UC Berkeley 2MIT. Correspondence to: Dibya Ghosh <dibya@berkeley.edu>. |
| Pseudocode | Yes | Algorithm 1: Adaptive Policies with Ensembles of Value Functions (APE-V); Algorithm 2: APE-V Test-Time Adaptation (see the hedged sketch after this table). |
| Open Source Code | No | The paper does not explicitly state that its own source code for the described methodology is released or provide a link to it. It only references a GitHub link for a third-party baseline they used. |
| Open Datasets | Yes | embedding CIFAR-10 into an offline RL navigation problem, Procgen benchmark (Cobbe et al., 2020), and D4RL benchmark (Fu et al., 2020). |
| Dataset Splits | No | The paper mentions training and testing phases but does not explicitly provide specific details about train/validation/test dataset splits (e.g., percentages or sample counts) needed for reproduction. |
| Hardware Specification | No | We thank MIT Supercloud and the Lincoln Laboratory Supercomputing Center for providing compute resources. This mentions general computing environments but lacks specific hardware details such as GPU/CPU models or memory. |
| Software Dependencies | No | The paper mentions algorithms and architectures (e.g., "Adam", "C51", "Impala encoder") but does not specify software library names with version numbers (e.g., "PyTorch 1.9", "Python 3.8") that would be needed for replication. |
| Experiment Setup | Yes | Table 3 (Q-learning agents, Locked Doors domain): γ = 0.98, batch size = 256, learning rate = 1e-3, optimizer = Adam (Kingma & Ba, 2014), training steps = 250k, number of ensembles = 5, p(b) = symmetric Dirichlet(0.1). Table 4 (Q-learning agents, Procgen Mazes domain): γ = 0.99, reward shift = -1.0, distributional support = linspace(-31, 9, 81), batch size = 256, learning rate = 6.25e-5, optimizer = Adam (Kingma & Ba, 2014), training steps = 1e6, number of ensembles = 2, p(b) = symmetric Dirichlet(1.0). |
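
For convenience, the hyperparameters quoted from Tables 3 and 4 can be collected into plain configuration dictionaries. The values below are transcribed from the row above; the key names are illustrative and are not the authors' actual configuration fields.

```python
# Hyperparameters transcribed from the report's Tables 3 and 4.
# Key names are illustrative, not the authors' config schema.
LOCKED_DOORS_CONFIG = {
    "gamma": 0.98,
    "batch_size": 256,
    "learning_rate": 1e-3,
    "optimizer": "Adam",                                   # Kingma & Ba, 2014
    "training_steps": 250_000,
    "num_ensemble_members": 5,
    "belief_prior": ("symmetric_dirichlet", 0.1),          # p(b)
}

PROCGEN_MAZES_CONFIG = {
    "gamma": 0.99,
    "reward_shift": -1.0,
    "distributional_support": ("linspace", -31, 9, 81),    # C51-style atoms
    "batch_size": 256,
    "learning_rate": 6.25e-5,
    "optimizer": "Adam",                                   # Kingma & Ba, 2014
    "training_steps": 1_000_000,
    "num_ensemble_members": 2,
    "belief_prior": ("symmetric_dirichlet", 1.0),          # p(b)
}
```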
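The pseudocode referenced in the table (Algorithm 1, APE-V; Algorithm 2, APE-V Test-Time Adaptation) together with the Dirichlet belief prior p(b) suggests a policy that maintains a belief over an ensemble of value functions and updates it at test time. The sketch below is a minimal illustration of that idea, not the authors' implementation: the tabular Q-ensemble, the greedy belief-weighted action selection, and the TD-error-based likelihood used for the belief update are all assumptions introduced here for concreteness.

```python
import numpy as np

# Hedged sketch of ensemble-based test-time adaptation.
# Assumptions (not taken from the paper): K tabular Q-functions,
# a softmax-of-negative-TD-error likelihood for the belief update,
# and a symmetric Dirichlet(alpha) initial belief, mirroring the
# symmetric Dirichlet prior p(b) listed in the hyperparameter tables.

class EnsembleAdaptivePolicy:
    def __init__(self, q_ensemble, alpha=0.1, temperature=1.0):
        # q_ensemble: array of shape (K, n_states, n_actions)
        self.q = np.asarray(q_ensemble)
        self.k = self.q.shape[0]
        # Initial belief over ensemble members drawn from Dirichlet(alpha).
        self.belief = np.random.dirichlet([alpha] * self.k)
        self.temperature = temperature

    def act(self, state):
        # Greedy action under the belief-weighted mixture of Q-values.
        q_mix = np.tensordot(self.belief, self.q[:, state, :], axes=1)
        return int(np.argmax(q_mix))

    def update_belief(self, state, action, reward, next_state, gamma=0.98):
        # Stand-in likelihood: members whose Bellman prediction matches the
        # observed transition better receive more posterior mass.
        td_error = (reward
                    + gamma * self.q[:, next_state, :].max(axis=1)
                    - self.q[:, state, action])
        log_post = np.log(self.belief + 1e-12) - np.abs(td_error) / self.temperature
        log_post -= log_post.max()
        self.belief = np.exp(log_post)
        self.belief /= self.belief.sum()
```

Whether the actual APE-V update uses TD errors, return estimates, or another consistency measure is not recoverable from this report; the sketch is only meant to convey the loop structure, namely acting with the belief-weighted value and then re-weighting the ensemble from observed outcomes.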