Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets
Authors: Zhang-Wei Hong, Aviral Kumar, Sathwik Karnik, Abhishek Bhandwaldar, Akash Srivastava, Joni Pajarinen, Romain Laroche, Abhishek Gupta, Pulkit Agrawal
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation demonstrates significant performance gains across 72 imbalanced datasets constructed from the D4RL benchmark and across three different offline RL algorithms. |
| Researcher Affiliation | Collaboration | Correspondence: zwhong@mit.edu. Affiliations: Improbable AI Lab, Massachusetts Institute of Technology; RAIL Lab, UC Berkeley; MIT-IBM Lab; Aalto University; University of Washington; independent researcher. |
| Pseudocode | Yes | Algorithm 1: Density-ratio weighting with generic offline RL algorithms (details in Appendix A.3). A minimal sketch of this weighting appears after the table. |
| Open Source Code | Yes | Code is available at https://github.com/Improbable-AI/dw-offline-rl. |
| Open Datasets | Yes | Following the protocol in prior offline RL benchmarking [6], we develop representative datasets of each type using the locomotion tasks from the D4RL Gym suite. Our datasets are generated by combining (1 − σ)% of trajectories from the random-v2 dataset (low-performing) and σ% of trajectories from the medium-v2 or expert-v2 dataset (high-performing) for each locomotion environment in the D4RL benchmark. (A construction sketch follows the table.) |
| Dataset Splits | No | The paper does not explicitly specify train/validation/test dataset splits. It mentions training for a certain number of gradient steps and evaluating the policy in the environment, but not a specific dataset split for validation. |
| Hardware Specification | Yes | We ran all the experiments using workstations with two RTX 3090 GPUs, an AMD Ryzen Threadripper PRO 3995WX 64-core CPU, and 256 GB RAM. |
| Software Dependencies | No | The paper mentions "Jax CQL", the "official implementation for implicit Q-learning (IQL)", and the "Adam optimizer", but it does not specify version numbers for any of these software dependencies. |
| Experiment Setup | Yes | To minimize the objective defined in Equation 25, we train ϕ and ψ using the Adam optimizer [16] with a learning rate of 0.0001 and a batch size of 256. For DW-AW and DW-Uniform, we searched λK ∈ {0.2, 1.0} and λF ∈ {0.1, 1.0, 5.0}. We used the best hyperparameters found by the time we started the large-scale experiments: (λK, λF) = (0.2, 0.1) for CQL and (λK, λF) = (1.0, 1.0) for IQL. (A configuration sketch follows the table.) |
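
A minimal sketch of the density-ratio weighting referenced in Algorithm 1, assuming a PyTorch-style interface: `weight_net` stands in for the learned density-ratio model and `agent.loss` for the per-sample loss of any base offline RL algorithm (e.g., CQL or IQL). These names are placeholders, not the authors' implementation.

```python
import torch

def weighted_offline_rl_update(batch, agent, weight_net, optimizer):
    """One gradient step of a generic offline RL loss, reweighted by a
    learned density-ratio estimate (hypothetical interface)."""
    obs, actions, rewards, next_obs, dones = batch

    # w(s, a) approximates the target-to-dataset density ratio; detach it
    # so the weights act as fixed importance coefficients for this step.
    with torch.no_grad():
        w = weight_net(obs, actions)       # shape: (batch_size,)
        w = w / w.mean()                   # normalize within the batch

    # Per-sample loss of the base algorithm (placeholder call).
    per_sample_loss = agent.loss(obs, actions, rewards, next_obs, dones)

    loss = (w * per_sample_loss).mean()    # weighted instead of uniform mean
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```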
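
The (1 − σ)/σ dataset mixture can be approximated with D4RL's standard loaders. The sketch below subsamples at the transition level for brevity, whereas the paper mixes whole trajectories; `make_imbalanced_dataset` and its signature are illustrative, not the authors' released script, and σ is passed as a fraction rather than a percentage.

```python
import gym
import numpy as np
import d4rl  # noqa: F401  (importing registers the D4RL environments)

def make_imbalanced_dataset(env_name, sigma, seed=0):
    """Mix a (1 - sigma) fraction of low-performing transitions with a
    sigma fraction of high-performing ones (transition-level simplification
    of the paper's trajectory-level protocol)."""
    rng = np.random.default_rng(seed)
    low = d4rl.qlearning_dataset(gym.make(f"{env_name}-random-v2"))
    high = d4rl.qlearning_dataset(gym.make(f"{env_name}-expert-v2"))  # or -medium-v2

    def subsample(data, frac):
        n = len(data["observations"])
        idx = rng.choice(n, size=int(n * frac), replace=False)
        return {k: v[idx] for k, v in data.items()}

    # Subsample each source once so all keys stay index-aligned.
    low_part = subsample(low, 1.0 - sigma)
    high_part = subsample(high, sigma)
    return {k: np.concatenate([low_part[k], high_part[k]])
            for k in low_part}

# e.g., a heavily imbalanced hopper dataset: 99% random, 1% expert
dataset = make_imbalanced_dataset("hopper", sigma=0.01)
```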
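
The reported optimizer settings and hyperparameter search translate directly into configuration code. The constants below carry the values quoted in the table; the function and dictionary names are assumptions, not the repository's API.

```python
import itertools
import torch

# Values reported in the paper; the surrounding wiring is hypothetical.
BATCH_SIZE = 256
LEARNING_RATE = 1e-4
SEARCH_GRID = list(itertools.product([0.2, 1.0],        # λK candidates
                                     [0.1, 1.0, 5.0]))  # λF candidates
BEST = {"CQL": (0.2, 0.1), "IQL": (1.0, 1.0)}           # best (λK, λF)

def weight_optimizers(phi, psi, lr=LEARNING_RATE):
    """Adam optimizers for the two density-ratio networks ϕ and ψ."""
    return (torch.optim.Adam(phi.parameters(), lr=lr),
            torch.optim.Adam(psi.parameters(), lr=lr))
```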