Harnessing Mixed Offline Reinforcement Learning Datasets via Trajectory Weighting

Authors: Zhang-Wei Hong, Pulkit Agrawal, Remi Tachet des Combes, Romain Laroche

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically show that while CQL, IQL, and TD3+BC achieve only a part of this potential policy improvement, these same algorithms combined with our reweighted sampling strategy fully exploit the dataset. Furthermore, we empirically demonstrate that, despite its theoretical limitation, the approach may still be efficient in stochastic environments. The code is available at https://github.com/Improbable-AI/harness-offline-rl.
Researcher Affiliation | Collaboration | Zhang-Wei Hong & Pulkit Agrawal, Massachusetts Institute of Technology, USA ({zwhong,pulkitag}@mit.edu); Remi Tachet des Combes & Romain Laroche (remi.tachet@gmail.com, romain.laroche@gmail.com), work done while at Microsoft Research Montreal.
Pseudocode | No | The paper describes methods and equations but does not include any explicitly labeled pseudocode blocks or algorithms.
Open Source Code | Yes | The code is available at https://github.com/Improbable-AI/harness-offline-rl. We have included the implementation details in Appendix A.7 and the source code in the supplementary material.
Open Datasets | Yes | We evaluate our sampling strategies on three state-of-the-art offline RL algorithms, CQL, IQL, and TD3+BC (Kumar et al., 2020b; Kostrikov et al., 2022; Fujimoto & Gu, 2021), as well as behavior cloning, over 62 datasets in D4RL benchmarks (Fu et al., 2020).
Dataset Splits | No | The paper mentions training and testing but does not explicitly specify a validation dataset split or how validation data was used for hyperparameter tuning or early stopping.
Hardware Specification | Yes | We are grateful to MIT Supercloud and the Lincoln Laboratory Supercomputing Center for providing HPC resources.
Software Dependencies | No | The paper mentions software like 'd3rlpy (Takuma Seno, 2021)' and 'Open AI gym (Brockman et al., 2016)' but does not provide specific version numbers for these or other relevant libraries like Python, PyTorch, or TensorFlow, which are necessary for full reproducibility.
Experiment Setup | Yes | An algorithm+sampler variant is trained for one million batches of updates in five random seeds for each dataset and environment. For RW and AW, we use α = 0.1 for IQL and TD3+BC, and α = 0.2 for CQL.
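
For context on the "reweighted sampling strategy" and the α values quoted in the Experiment Setup row, the following is a minimal sketch of return-weighted (RW) trajectory sampling, assuming trajectories are drawn with probability given by a softmax over their returns at temperature α. The function names, the trajectory dictionary layout, and the exact role of α here are illustrative assumptions, not taken from the paper or from the released code at https://github.com/Improbable-AI/harness-offline-rl.

```python
import numpy as np

def return_weighted_probs(trajectory_returns, alpha=0.1):
    """Softmax over trajectory returns with temperature alpha.

    Smaller alpha concentrates sampling on high-return trajectories;
    larger alpha approaches uniform sampling over trajectories.
    """
    returns = np.asarray(trajectory_returns, dtype=np.float64)
    logits = (returns - returns.max()) / alpha  # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def sample_transition_batch(trajectories, batch_size=256, alpha=0.1, seed=0):
    """Sample trajectories by their return weight, then pick a transition
    uniformly at random within each sampled trajectory."""
    rng = np.random.default_rng(seed)
    probs = return_weighted_probs([t["rewards"].sum() for t in trajectories], alpha)
    traj_ids = rng.choice(len(trajectories), size=batch_size, p=probs)
    batch = []
    for i in traj_ids:
        t = rng.integers(len(trajectories[i]["rewards"]))
        batch.append({key: arr[t] for key, arr in trajectories[i].items()})
    return batch

# Toy usage: two low-return trajectories and one high-return trajectory.
trajectories = [
    {"observations": np.zeros((10, 3)), "actions": np.zeros((10, 1)), "rewards": np.full(10, 0.1)},
    {"observations": np.zeros((10, 3)), "actions": np.zeros((10, 1)), "rewards": np.full(10, 0.2)},
    {"observations": np.zeros((10, 3)), "actions": np.zeros((10, 1)), "rewards": np.full(10, 1.0)},
]
print(return_weighted_probs([t["rewards"].sum() for t in trajectories], alpha=0.1))
```

Under this assumption, α = 0.1 (as reported for IQL and TD3+BC) concentrates sampling heavily on the highest-return trajectories, while larger values move the sampler back toward uniform sampling over trajectories.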