Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets

Authors: Zhang-Wei Hong, Aviral Kumar, Sathwik Karnik, Abhishek Bhandwaldar, Akash Srivastava, Joni Pajarinen, Romain Laroche, Abhishek Gupta, Pulkit Agrawal

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluation demonstrates significant performance gains in 72 imbalanced datasets, the D4RL dataset, and across three different offline RL algorithms.
Researcher Affiliation | Collaboration | Correspondence: zwhong@mit.edu. Affiliations: Improbable AI Lab, Massachusetts Institute of Technology; RAIL Lab, UC Berkeley; MIT-IBM Lab; Aalto University; University of Washington; and independent researcher.
Pseudocode | Yes | Algorithm 1: Density-ratio weighting with generic offline RL algorithms (details in Appendix A.3; see the weighting sketch after the table).
Open Source Code | Yes | Code is available at https://github.com/Improbable-AI/dw-offline-rl.
Open Datasets | Yes | Following the protocol in prior offline RL benchmarking [6], we develop representative datasets of each type using the locomotion tasks from the D4RL Gym suite. Our datasets are generated by combining (1 − σ)% of trajectories from the random-v2 dataset (low-performing) and σ% of trajectories from the medium-v2 or expert-v2 dataset (high-performing) for each locomotion environment in the D4RL benchmark (see the dataset-mixing sketch after the table).
Dataset Splits | No | The paper does not explicitly specify train/validation/test dataset splits. It mentions training for a certain number of gradient steps and evaluating the policy in the environment, but it does not describe a held-out validation split.
Hardware Specification | Yes | We ran all the experiments on workstations with two RTX 3090 GPUs, an AMD Ryzen Threadripper PRO 3995WX 64-core CPU, and 256 GB of RAM.
Software Dependencies | No | The paper mentions using "Jax CQL", refers to the "official implementation" of implicit Q-learning (IQL), and mentions the Adam optimizer, but it does not specify version numbers for any of these software dependencies.
Experiment Setup | Yes | To minimize the objective defined in Equation 25, we train ϕ and ψ using the Adam optimizer [16] with a learning rate of 0.0001 and a batch size of 256. For DW-AW and DW-Uniform, we searched λK ∈ {0.2, 1.0} and λF ∈ {0.1, 1.0, 5.0}. We use the best hyperparameters found by the time we started the large-scale experiments: (λK, λF) = (0.2, 0.1) for CQL and (λK, λF) = (1.0, 1.0) for IQL (a training-setup sketch follows the table).
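
The Pseudocode row points to Algorithm 1, which plugs per-transition density-ratio weights into an otherwise unchanged offline RL update. The sketch below shows only that plug-in step for a generic per-sample critic loss; the function name `density_weighted_loss` and the in-batch normalization are assumptions for illustration, not the paper's exact formulation.

```python
import torch


def density_weighted_loss(per_sample_loss: torch.Tensor,
                          log_weights: torch.Tensor) -> torch.Tensor:
    """Apply per-transition density-ratio weights to a generic offline RL loss.

    per_sample_loss: shape (batch,), e.g. a per-transition CQL or IQL critic loss.
    log_weights:     shape (batch,), log of the learned ratio for each (s, a) pair.
    """
    # Self-normalize the weights within the batch so they average to one
    # (an illustrative choice, not necessarily the paper's normalization).
    weights = torch.softmax(log_weights, dim=0) * log_weights.numel()
    # Treat the weights as fixed coefficients for the RL update:
    # gradients flow only through the per-sample loss, not the weight model.
    return (weights.detach() * per_sample_loss).mean()
```

A base learner such as CQL or IQL would replace its uniform `per_sample_loss.mean()` with this weighted average and leave the rest of its update unchanged, which is what makes the weighting scheme compatible with generic offline RL algorithms.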
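The Open Datasets row describes mixing (1 − σ)% of random-v2 trajectories with σ% of medium-v2 or expert-v2 trajectories per locomotion task. Below is a minimal sketch of that mixing under stated assumptions: `make_imbalanced_dataset`, `split_trajectories`, and the choice to take the shares over trajectory counts (rather than transitions) are illustrative, not the authors' released code (that is in the linked repository).

```python
import gym
import numpy as np
import d4rl  # noqa: F401  (importing d4rl registers the *-v2 datasets with gym)


def split_trajectories(dataset):
    """Return per-trajectory index arrays for a flat D4RL dataset dict."""
    ends = np.logical_or(dataset["terminals"] > 0, dataset["timeouts"] > 0)
    bounds = np.where(ends)[0] + 1
    if len(bounds) == 0 or bounds[-1] != len(dataset["rewards"]):
        bounds = np.append(bounds, len(dataset["rewards"]))
    starts = np.concatenate([[0], bounds[:-1]])
    return [np.arange(s, e) for s, e in zip(starts, bounds)]


def make_imbalanced_dataset(env_name, sigma, high_level="expert", seed=0):
    """Mix a (1 - sigma) share of random-v2 trajectories with a sigma share of
    medium-v2 or expert-v2 trajectories for one D4RL locomotion task."""
    rng = np.random.default_rng(seed)
    keys = ["observations", "actions", "rewards", "terminals", "timeouts"]

    low = gym.make(f"{env_name}-random-v2").get_dataset()
    high = gym.make(f"{env_name}-{high_level}-v2").get_dataset()
    low_trajs, high_trajs = split_trajectories(low), split_trajectories(high)

    def take(dataset, trajs, share):
        # Keep at least one trajectory so the concatenation below is well defined.
        n = max(1, int(round(share * len(trajs))))
        chosen = rng.choice(len(trajs), size=n, replace=False)
        idx = np.concatenate([trajs[i] for i in chosen])
        return {k: dataset[k][idx] for k in keys}

    low_part = take(low, low_trajs, 1.0 - sigma)
    high_part = take(high, high_trajs, sigma)
    return {k: np.concatenate([low_part[k], high_part[k]]) for k in keys}
```

For example, `make_imbalanced_dataset("hopper", sigma=0.1)` would approximate one of the heavily imbalanced random-plus-expert mixtures.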
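For completeness, the quoted Experiment Setup hyperparameters can be collected into code. Only the Adam settings, the λ grid, and the best (λK, λF) pairs come from the text; the network shapes for ϕ and ψ and the input dimensions are placeholder assumptions, and the paper's Equation 25 objective is not reproduced here.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the weight networks phi and psi; architectures are assumptions.
OBS_DIM, ACT_DIM = 17, 6  # sized like a D4RL locomotion task, chosen only for the example
phi = nn.Sequential(nn.Linear(OBS_DIM + ACT_DIM, 256), nn.ReLU(), nn.Linear(256, 1))
psi = nn.Sequential(nn.Linear(OBS_DIM, 256), nn.ReLU(), nn.Linear(256, 1))

# Settings quoted in the paper: Adam with learning rate 1e-4 and batch size 256.
optimizer = torch.optim.Adam(list(phi.parameters()) + list(psi.parameters()), lr=1e-4)
BATCH_SIZE = 256

# Hyperparameter grid searched for DW-AW and DW-Uniform, and the best values reported.
LAMBDA_GRID = {"lambda_K": [0.2, 1.0], "lambda_F": [0.1, 1.0, 5.0]}
BEST = {"CQL": (0.2, 0.1), "IQL": (1.0, 1.0)}  # (lambda_K, lambda_F)
```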