Offline RL Policies Should Be Trained to be Adaptive

Authors: Dibya Ghosh, Anurag Ajay, Pulkit Agrawal, Sergey Levine

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility variables, results, and LLM responses:

Research Type: Experimental
LLM Response: The primary aim of our experiments is to ascertain whether adaptability leads to improved performance in offline RL. Thus, we provide an evaluation on standard D4RL benchmark tasks (Fu et al., 2020) and two offline RL tasks that require handling ambiguity and generalization, Locked Doors and Procgen Mazes.

Researcher Affiliation: Academia
LLM Response: 1 UC Berkeley, 2 MIT. Correspondence to: Dibya Ghosh <dibya@berkeley.edu>.

Pseudocode: Yes
LLM Response: Algorithm 1 (Adaptive Policies with Ensembles of Value Functions, APE-V) and Algorithm 2 (APE-V Test-Time Adaptation).

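To make the cited pseudocode concrete, here is a minimal Python sketch of the test-time adaptation idea: keep a belief over an ensemble of Q-functions and act on the belief-weighted values. This is not the authors' code; the ensemble interface, the TD-consistency likelihood used to reweight the belief, and all function names are assumptions made for illustration.

```python
import numpy as np

def select_action(q_ensemble, state, belief):
    """Pick the action that maximizes the belief-weighted ensemble Q-value."""
    # q_ensemble(state) is assumed to return an array of shape (n_members, n_actions).
    q_values = q_ensemble(state)
    return int(np.argmax(belief @ q_values))

def update_belief(q_ensemble, belief, s, a, r, s_next, gamma=0.99, temp=1.0):
    """Reweight ensemble members by how consistent each one is with the
    observed transition (smaller TD error -> larger posterior weight).
    The squared-TD-error likelihood is an illustrative assumption, not
    necessarily the update used in the paper."""
    q_sa = q_ensemble(s)[:, a]                 # (n_members,)
    v_next = q_ensemble(s_next).max(axis=1)    # (n_members,)
    td_error = r + gamma * v_next - q_sa
    log_post = np.log(belief + 1e-8) - temp * td_error ** 2
    log_post -= log_post.max()                 # numerical stability
    post = np.exp(log_post)
    return post / post.sum()
```

At evaluation time one would alternate select_action and update_belief along a single episode, starting from a uniform (or Dirichlet-sampled) belief over the ensemble members.
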
Open Source Code: No
LLM Response: The paper does not explicitly state that its own source code for the described methodology is released or provide a link to it. It only references a GitHub link for a third-party baseline they used.

Open Datasets: Yes
LLM Response: The paper embeds CIFAR-10 into an offline RL navigation problem and evaluates on the Procgen benchmark (Cobbe et al., 2020) and the D4RL benchmark (Fu et al., 2020).

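Since these datasets are public, a brief sketch of how one of them is typically loaded is shown below. It assumes the third-party gym and d4rl packages are installed; the specific task name is illustrative and not necessarily one the paper evaluates on.

```python
import gym
import d4rl  # importing d4rl registers the D4RL environments with gym

# Load an offline dataset from the D4RL benchmark (Fu et al., 2020).
env = gym.make("halfcheetah-medium-v2")
dataset = d4rl.qlearning_dataset(env)  # dict with observations, actions, rewards, ...

print(dataset["observations"].shape, dataset["actions"].shape)
```
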
Dataset Splits: No
LLM Response: The paper mentions training and testing phases but does not explicitly provide specific details about train/validation/test dataset splits (e.g., percentages or sample counts) needed for reproduction.

Hardware Specification: No
LLM Response: "We thank MIT Supercloud and the Lincoln Laboratory Supercomputing Center for providing compute resources." This mentions general computing environments but lacks specific hardware details such as GPU/CPU models or memory.

Software Dependencies: No
LLM Response: The paper mentions algorithms and architectures (e.g., "Adam", "C51", "Impala encoder") but does not specify software library names with version numbers (e.g., "PyTorch 1.9", "Python 3.8") that would be needed for replication.

Experiment Setup: Yes
LLM Response:
Table 3. Hyperparameters used for training Q-learning based agents in the Locked Doors domain:
  γ: 0.98
  Batch size: 256
  Learning rate: 1e-3
  Optimizer: Adam (Kingma & Ba, 2014)
  Training steps: 250k
  Number of ensembles: 5
  p(b): Symmetric Dirichlet(0.1)

Table 4. Hyperparameters used for training Q-learning based agents in the Procgen Mazes domain:
  γ: 0.99
  Reward shift: -1.0
  Distributional support: LINSPACE(-31, 9, 81)
  Batch size: 256
  Learning rate: 6.25e-5
  Optimizer: Adam (Kingma & Ba, 2014)
  Training steps: 10^6
  Number of ensembles: 2
  p(b): Symmetric Dirichlet(1.0)

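For convenience, the reported hyperparameters can be collected into plain configuration dictionaries, as in the sketch below. The field names are our own; the values are taken verbatim from Tables 3 and 4.

```python
import numpy as np

# Table 3: Q-learning agents, Locked Doors domain.
LOCKED_DOORS_CONFIG = {
    "gamma": 0.98,
    "batch_size": 256,
    "learning_rate": 1e-3,
    "optimizer": "Adam",                      # Kingma & Ba, 2014
    "training_steps": 250_000,
    "num_ensemble_members": 5,
    "belief_prior": "Symmetric Dirichlet(0.1)",
}

# Table 4: Q-learning agents, Procgen Mazes domain.
PROCGEN_MAZES_CONFIG = {
    "gamma": 0.99,
    "reward_shift": -1.0,
    "distributional_support": np.linspace(-31, 9, 81),  # 81 atoms, C51-style
    "batch_size": 256,
    "learning_rate": 6.25e-5,
    "optimizer": "Adam",                      # Kingma & Ba, 2014
    "training_steps": 1_000_000,
    "num_ensemble_members": 2,
    "belief_prior": "Symmetric Dirichlet(1.0)",
}
```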