Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Offline RL Policies Should Be Trained to be Adaptive
Authors: Dibya Ghosh, Anurag Ajay, Pulkit Agrawal, Sergey Levine
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The primary aim of our experiments is to ascertain whether adaptability leads to improved performance in offline RL. Thus, we provide an evaluation on standard D4RL benchmark tasks (Fu et al., 2020) and two offline RL tasks that require handling ambiguity and generalization, Locked Doors and Procgen Mazes. |
| Researcher Affiliation | Academia | 1UC Berkeley 2MIT. Correspondence to: Dibya Ghosh <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Adaptive Policies with Ensembles of Value Functions (APE-V); Algorithm 2: APE-V Test-Time Adaptation. |
| Open Source Code | No | The paper does not explicitly state that its own source code for the described methodology is released or provide a link to it. It only references a GitHub link for a third-party baseline they used. |
| Open Datasets | Yes | embedding CIFAR-10 into an offline RL navigation problem, Procgen benchmark (Cobbe et al., 2020), and D4RL benchmark (Fu et al., 2020). |
| Dataset Splits | No | The paper mentions training and testing phases but does not explicitly provide specific details about train/validation/test dataset splits (e.g., percentages or sample counts) needed for reproduction. |
| Hardware Specification | No | We thank MIT Supercloud and the Lincoln Laboratory Supercomputing Center for providing compute resources. This mentions general computing environments but lacks specific hardware details such as GPU/CPU models or memory. |
| Software Dependencies | No | The paper mentions algorithms and architectures (e.g., "Adam", "C51", "Impala encoder") but does not specify software library names with version numbers (e.g., "PyTorch 1.9", "Python 3.8") that would be needed for replication. |
| Experiment Setup | Yes | Table 3 (hyperparameters for Q-learning agents, Locked Doors domain): γ = 0.98; batch size 256; learning rate 1e-3; optimizer Adam (Kingma & Ba, 2014); training steps 250k; number of ensembles 5; p(b) Symmetric Dirichlet(0.1). Table 4 (hyperparameters for Q-learning agents, Procgen Mazes domain): γ = 0.99; reward shift -1.0; distributional support LINSPACE(-31, 9, 81); batch size 256; learning rate 6.25e-5; optimizer Adam (Kingma & Ba, 2014); training steps 10⁶; number of ensembles 2; p(b) Symmetric Dirichlet(1.0). |
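The reported hyperparameters from Tables 3 and 4 can be collected into plain configuration objects, which makes a reimplementation attempt easier to check against the paper. This is a minimal sketch: the values come from the tables above, but the key names and structure are our own choices, not from the paper or any released code.

```python
# Hyperparameters as reported in the paper's Tables 3 and 4.
# Key names are illustrative; values are taken from the tables.
LOCKED_DOORS_CONFIG = {
    "gamma": 0.98,
    "batch_size": 256,
    "learning_rate": 1e-3,
    "optimizer": "Adam",  # Kingma & Ba, 2014
    "training_steps": 250_000,
    "num_ensembles": 5,
    "belief_prior": ("symmetric_dirichlet", 0.1),  # p(b)
}

PROCGEN_MAZES_CONFIG = {
    "gamma": 0.99,
    "reward_shift": -1.0,
    # Distributional support LINSPACE(-31, 9, 81): 81 atoms from -31 to 9.
    "distributional_support": {"v_min": -31.0, "v_max": 9.0, "num_atoms": 81},
    "batch_size": 256,
    "learning_rate": 6.25e-5,
    "optimizer": "Adam",  # Kingma & Ba, 2014
    "training_steps": 1_000_000,
    "num_ensembles": 2,
    "belief_prior": ("symmetric_dirichlet", 1.0),  # p(b)
}
```

Keeping the two domains as separate dicts mirrors the paper's presentation, where each domain has its own table rather than a shared default config.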