Offline RL Policies Should Be Trained to be Adaptive
Authors: Dibya Ghosh, Anurag Ajay, Pulkit Agrawal, Sergey Levine
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The primary aim of our experiments is to ascertain whether adaptability leads to improved performance in offline RL. Thus, we provide an evaluation on standard D4RL benchmark tasks (Fu et al., 2020) and two offline RL tasks that require handling ambiguity and generalization, Locked Doors and Procgen Mazes. |
| Researcher Affiliation | Academia | 1UC Berkeley 2MIT. Correspondence to: Dibya Ghosh <dibya@berkeley.edu>. |
| Pseudocode | Yes | Algorithm 1: Adaptive Policies with Ensembles of Value Functions (APE-V); Algorithm 2: APE-V Test-Time Adaptation (see the hedged sketch after this table). |
| Open Source Code | No | The paper does not explicitly state that its own source code for the described methodology is released or provide a link to it. It only references a GitHub link for a third-party baseline they used. |
| Open Datasets | Yes | embedding CIFAR-10 into an offline RL navigation problem, Procgen benchmark (Cobbe et al., 2020), and D4RL benchmark (Fu et al., 2020). |
| Dataset Splits | No | The paper mentions training and testing phases but does not explicitly provide specific details about train/validation/test dataset splits (e.g., percentages or sample counts) needed for reproduction. |
| Hardware Specification | No | We thank MIT Supercloud and the Lincoln Laboratory Supercomputing Center for providing compute resources. This mentions general computing environments but lacks specific hardware details such as GPU/CPU models or memory. |
| Software Dependencies | No | The paper mentions algorithms and architectures (e.g., "Adam", "C51", "Impala encoder") but does not specify software library names with version numbers (e.g., "PyTorch 1.9", "Python 3.8") that would be needed for replication. |
| Experiment Setup | Yes | Table 3 (Q-learning agents, Locked Doors domain): γ = 0.98, batch size = 256, learning rate = 1e-3, optimizer = Adam (Kingma & Ba, 2014), training steps = 250k, number of ensembles = 5, p(b) = symmetric Dirichlet(0.1). Table 4 (Q-learning agents, Procgen Mazes domain): γ = 0.99, reward shift = -1.0, distributional support = linspace(-31, 9, 81), batch size = 256, learning rate = 6.25e-5, optimizer = Adam (Kingma & Ba, 2014), training steps = 1e6, number of ensembles = 2, p(b) = symmetric Dirichlet(1.0). |
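
For convenience, the hyperparameters quoted from Tables 3 and 4 can be collected into plain configuration dictionaries. The values below are transcribed from the row above; the key names are illustrative and are not the authors' actual configuration fields.

```python
# Hyperparameters transcribed from the report's Tables 3 and 4.
# Key names are illustrative, not the authors' config schema.
LOCKED_DOORS_CONFIG = {
    "gamma": 0.98,
    "batch_size": 256,
    "learning_rate": 1e-3,
    "optimizer": "Adam",                                   # Kingma & Ba, 2014
    "training_steps": 250_000,
    "num_ensemble_members": 5,
    "belief_prior": ("symmetric_dirichlet", 0.1),          # p(b)
}

PROCGEN_MAZES_CONFIG = {
    "gamma": 0.99,
    "reward_shift": -1.0,
    "distributional_support": ("linspace", -31, 9, 81),    # C51-style atoms
    "batch_size": 256,
    "learning_rate": 6.25e-5,
    "optimizer": "Adam",                                   # Kingma & Ba, 2014
    "training_steps": 1_000_000,
    "num_ensemble_members": 2,
    "belief_prior": ("symmetric_dirichlet", 1.0),          # p(b)
}
```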
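The pseudocode referenced in the table (Algorithm 1, APE-V; Algorithm 2, APE-V Test-Time Adaptation) together with the Dirichlet belief prior p(b) suggests a policy that maintains a belief over an ensemble of value functions and updates it at test time. The sketch below is a minimal illustration of that idea, not the authors' implementation: the tabular Q-ensemble, the greedy belief-weighted action selection, and the TD-error-based likelihood used for the belief update are all assumptions introduced here for concreteness.

```python
import numpy as np

# Hedged sketch of ensemble-based test-time adaptation.
# Assumptions (not taken from the paper): K tabular Q-functions,
# a softmax-of-negative-TD-error likelihood for the belief update,
# and a symmetric Dirichlet(alpha) initial belief, mirroring the
# symmetric Dirichlet prior p(b) listed in the hyperparameter tables.

class EnsembleAdaptivePolicy:
    def __init__(self, q_ensemble, alpha=0.1, temperature=1.0):
        # q_ensemble: array of shape (K, n_states, n_actions)
        self.q = np.asarray(q_ensemble)
        self.k = self.q.shape[0]
        # Initial belief over ensemble members drawn from Dirichlet(alpha).
        self.belief = np.random.dirichlet([alpha] * self.k)
        self.temperature = temperature

    def act(self, state):
        # Greedy action under the belief-weighted mixture of Q-values.
        q_mix = np.tensordot(self.belief, self.q[:, state, :], axes=1)
        return int(np.argmax(q_mix))

    def update_belief(self, state, action, reward, next_state, gamma=0.98):
        # Stand-in likelihood: members whose Bellman prediction matches the
        # observed transition better receive more posterior mass.
        td_error = (reward
                    + gamma * self.q[:, next_state, :].max(axis=1)
                    - self.q[:, state, action])
        log_post = np.log(self.belief + 1e-12) - np.abs(td_error) / self.temperature
        log_post -= log_post.max()
        self.belief = np.exp(log_post)
        self.belief /= self.belief.sum()
```

Whether the actual APE-V update uses TD errors, return estimates, or another consistency measure is not recoverable from this report; the sketch is only meant to convey the loop structure, namely acting with the belief-weighted value and then re-weighting the ensemble from observed outcomes.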