Harnessing Mixed Offline Reinforcement Learning Datasets via Trajectory Weighting
Authors: Zhang-Wei Hong, Pulkit Agrawal, Remi Tachet des Combes, Romain Laroche
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that while CQL, IQL, and TD3+BC achieve only a part of this potential policy improvement, these same algorithms combined with our reweighted sampling strategy fully exploit the dataset. Furthermore, we empirically demonstrate that, despite its theoretical limitation, the approach may still be efficient in stochastic environments. The code is available at https://github.com/Improbable-AI/harness-offline-rl. (A sketch of this reweighted sampling appears after the table.) |
| Researcher Affiliation | Collaboration | Zhang-Wei Hong & Pulkit Agrawal, Massachusetts Institute of Technology, USA ({zwhong,pulkitag}@mit.edu); Remi Tachet des Combes & Romain Laroche (remi.tachet@gmail.com, romain.laroche@gmail.com), work done while at Microsoft Research Montreal. |
| Pseudocode | No | The paper describes methods and equations but does not include any explicitly labeled pseudocode blocks or algorithms. |
| Open Source Code | Yes | The code is available at https://github.com/Improbable-AI/harness-offline-rl. We have included the implementation details in Appendix A.7 and the source code in the supplementary material. |
| Open Datasets | Yes | We evaluate our sampling strategies on three state-of-the-art offline RL algorithms, CQL, IQL, and TD3+BC (Kumar et al., 2020b; Kostrikov et al., 2022; Fujimoto & Gu, 2021), as well as behavior cloning, over 62 datasets in D4RL benchmarks (Fu et al., 2020). (A dataset-loading sketch follows the table.) |
| Dataset Splits | No | The paper mentions training and testing but does not explicitly specify a validation dataset split or how validation data was used for hyperparameter tuning or early stopping. |
| Hardware Specification | Yes | We are grateful to MIT Supercloud and the Lincoln Laboratory Supercomputing Center for providing HPC resources. |
| Software Dependencies | No | The paper mentions software like 'd3rlpy (Takuma Seno, 2021)' and 'Open AI gym (Brockman et al., 2016)' but does not provide specific version numbers for these or other relevant libraries like Python, PyTorch, or TensorFlow, which are necessary for full reproducibility. |
| Experiment Setup | Yes | An algorithm+sampler variant is trained for one million batches of updates with five random seeds for each dataset and environment. For RW and AW, we use α = 0.1 for IQL and TD3+BC, and α = 0.2 for CQL. (These values are collected in the configuration sketch below the table.) |
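
To make the reweighted sampling strategy referenced in the Research Type row concrete, here is a minimal sketch. It is not the authors' released implementation (see https://github.com/Improbable-AI/harness-offline-rl for that); it assumes the return-weighted (RW) variant draws trajectories with softmax probabilities over their episodic returns at temperature α and then picks transitions uniformly within the chosen trajectory. The function names `return_weighted_probs` and `sample_batch` are illustrative.

```python
import numpy as np

def return_weighted_probs(returns, alpha=0.1):
    """Softmax over normalized episodic returns with temperature alpha.

    Assumption: the RW sampler weights trajectory i proportionally to
    exp(G_i / alpha), with returns min-max normalized for stability.
    """
    g = np.asarray(returns, dtype=np.float64)
    g = (g - g.min()) / max(g.max() - g.min(), 1e-8)  # map returns to [0, 1]
    logits = g / alpha
    logits -= logits.max()  # subtract the max before exp to avoid overflow
    w = np.exp(logits)
    return w / w.sum()

def sample_batch(trajectories, probs, batch_size, rng=None):
    """Pick trajectories by weight, then transitions uniformly within each."""
    if rng is None:
        rng = np.random.default_rng(0)
    picked = rng.choice(len(trajectories), size=batch_size, p=probs)
    return [trajectories[i][rng.integers(len(trajectories[i]))] for i in picked]
```

An offline RL learner such as CQL, IQL, or TD3+BC would consume these batches in place of uniformly sampled ones; the advantage-weighted (AW) variant replaces the raw return with an advantage estimate.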
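
For reference, loading one of the D4RL datasets quoted in the Open Datasets row looks roughly like the following. This assumes the `d4rl` package from Fu et al. (2020); the task name `hopper-medium-replay-v2` is an illustrative choice, not one singled out by the paper.

```python
import gym
import d4rl  # noqa: F401 -- importing registers the D4RL tasks with gym

env = gym.make('hopper-medium-replay-v2')
dataset = d4rl.qlearning_dataset(env)  # dict of transition arrays

print(dataset['observations'].shape)  # (N, obs_dim)
print(dataset['actions'].shape)       # (N, act_dim)
print(dataset['rewards'].shape)       # (N,)
```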
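
Finally, the quoted experiment setup can be summarized as a small configuration sketch; the key names below are hypothetical, and only the values (one million updates, five seeds, and the α temperatures) come from the paper.

```python
# Hypothetical config mirroring the quoted setup; key names are illustrative.
EXPERIMENT_CONFIG = {
    "n_updates": 1_000_000,   # one million batches of updates per variant
    "n_seeds": 5,             # five random seeds per dataset and environment
    "alpha": {                # sampling temperature for RW and AW
        "IQL": 0.1,
        "TD3+BC": 0.1,
        "CQL": 0.2,
    },
}
```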