Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets

Authors: Zhang-Wei Hong, Aviral Kumar, Sathwik Karnik, Abhishek Bhandwaldar, Akash Srivastava, Joni Pajarinen, Romain Laroche, Abhishek Gupta, Pulkit Agrawal

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluation demonstrates significant performance gains in 72 imbalanced datasets, the D4RL dataset, and across three different offline RL algorithms.
Researcher Affiliation | Collaboration | Correspondence: zwhong@mit.edu. Affiliations: Improbable AI Lab, Massachusetts Institute of Technology; RAIL Lab, UC Berkeley; MIT-IBM Lab; Aalto University; University of Washington; and independent researcher.
Pseudocode | Yes | Algorithm 1: Density-ratio weighting with generic offline RL algorithms (details in Appendix A.3; see the weighting sketch after the table).
Open Source Code | Yes | Code is available at https://github.com/Improbable-AI/dw-offline-rl.
Open Datasets | Yes | Following the protocol in prior offline RL benchmarking [6], we develop representative datasets of each type using the locomotion tasks from the D4RL Gym suite. Our datasets are generated by combining (1 − σ)% of trajectories from the random-v2 dataset (low-performing) and σ% of trajectories from the medium-v2 or expert-v2 dataset (high-performing) for each locomotion environment in the D4RL benchmark (see the dataset-mixing sketch after the table).
Dataset Splits | No | The paper does not explicitly specify train/validation/test dataset splits. It mentions training for a certain number of gradient steps and evaluating the policy in the environment, but it does not describe a held-out validation split.
Hardware Specification | Yes | We ran all the experiments on workstations with two RTX 3090 GPUs, an AMD Ryzen Threadripper PRO 3995WX 64-core CPU, and 256 GB of RAM.
Software Dependencies | No | The paper mentions using "Jax CQL", refers to the "official implementation" of implicit Q-learning (IQL), and mentions the Adam optimizer, but it does not specify version numbers for any of these software dependencies.
Experiment Setup | Yes | To minimize the objective defined in Equation 25, we train ϕ and ψ using the Adam optimizer [16] with a learning rate of 0.0001 and a batch size of 256. For DW-AW and DW-Uniform, we searched λK ∈ {0.2, 1.0} and λF ∈ {0.1, 1.0, 5.0}. We use the best hyperparameters found by the time we started the large-scale experiments: (λK, λF) = (0.2, 0.1) for CQL and (λK, λF) = (1.0, 1.0) for IQL (a training-setup sketch follows the table).
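
The Pseudocode row points to Algorithm 1, which plugs per-transition density-ratio weights into an otherwise unchanged offline RL update. The sketch below shows only that plug-in step for a generic per-sample critic loss; the function name `density_weighted_loss` and the in-batch normalization are assumptions for illustration, not the paper's exact formulation.

```python
import torch


def density_weighted_loss(per_sample_loss: torch.Tensor,
                          log_weights: torch.Tensor) -> torch.Tensor:
    """Apply per-transition density-ratio weights to a generic offline RL loss.

    per_sample_loss: shape (batch,), e.g. a per-transition CQL or IQL critic loss.
    log_weights:     shape (batch,), log of the learned ratio for each (s, a) pair.
    """
    # Self-normalize the weights within the batch so they average to one
    # (an illustrative choice, not necessarily the paper's normalization).
    weights = torch.softmax(log_weights, dim=0) * log_weights.numel()
    # Treat the weights as fixed coefficients for the RL update:
    # gradients flow only through the per-sample loss, not the weight model.
    return (weights.detach() * per_sample_loss).mean()
```

A base learner such as CQL or IQL would replace its uniform `per_sample_loss.mean()` with this weighted average and leave the rest of its update unchanged, which is what makes the weighting scheme compatible with generic offline RL algorithms.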
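The Open Datasets row describes mixing (1 − σ)% of random-v2 trajectories with σ% of medium-v2 or expert-v2 trajectories per locomotion task. Below is a minimal sketch of that mixing under stated assumptions: `make_imbalanced_dataset`, `split_trajectories`, and the choice to take the shares over trajectory counts (rather than transitions) are illustrative, not the authors' released code (that is in the linked repository).

```python
import gym
import numpy as np
import d4rl  # noqa: F401  (importing d4rl registers the *-v2 datasets with gym)


def split_trajectories(dataset):
    """Return per-trajectory index arrays for a flat D4RL dataset dict."""
    ends = np.logical_or(dataset["terminals"] > 0, dataset["timeouts"] > 0)
    bounds = np.where(ends)[0] + 1
    if len(bounds) == 0 or bounds[-1] != len(dataset["rewards"]):
        bounds = np.append(bounds, len(dataset["rewards"]))
    starts = np.concatenate([[0], bounds[:-1]])
    return [np.arange(s, e) for s, e in zip(starts, bounds)]


def make_imbalanced_dataset(env_name, sigma, high_level="expert", seed=0):
    """Mix a (1 - sigma) share of random-v2 trajectories with a sigma share of
    medium-v2 or expert-v2 trajectories for one D4RL locomotion task."""
    rng = np.random.default_rng(seed)
    keys = ["observations", "actions", "rewards", "terminals", "timeouts"]

    low = gym.make(f"{env_name}-random-v2").get_dataset()
    high = gym.make(f"{env_name}-{high_level}-v2").get_dataset()
    low_trajs, high_trajs = split_trajectories(low), split_trajectories(high)

    def take(dataset, trajs, share):
        # Keep at least one trajectory so the concatenation below is well defined.
        n = max(1, int(round(share * len(trajs))))
        chosen = rng.choice(len(trajs), size=n, replace=False)
        idx = np.concatenate([trajs[i] for i in chosen])
        return {k: dataset[k][idx] for k in keys}

    low_part = take(low, low_trajs, 1.0 - sigma)
    high_part = take(high, high_trajs, sigma)
    return {k: np.concatenate([low_part[k], high_part[k]]) for k in keys}
```

For example, `make_imbalanced_dataset("hopper", sigma=0.1)` would approximate one of the heavily imbalanced random-plus-expert mixtures.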
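For completeness, the quoted Experiment Setup hyperparameters can be collected into code. Only the Adam settings, the λ grid, and the best (λK, λF) pairs come from the text; the network shapes for ϕ and ψ and the input dimensions are placeholder assumptions, and the paper's Equation 25 objective is not reproduced here.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the weight networks phi and psi; architectures are assumptions.
OBS_DIM, ACT_DIM = 17, 6  # sized like a D4RL locomotion task, chosen only for the example
phi = nn.Sequential(nn.Linear(OBS_DIM + ACT_DIM, 256), nn.ReLU(), nn.Linear(256, 1))
psi = nn.Sequential(nn.Linear(OBS_DIM, 256), nn.ReLU(), nn.Linear(256, 1))

# Settings quoted in the paper: Adam with learning rate 1e-4 and batch size 256.
optimizer = torch.optim.Adam(list(phi.parameters()) + list(psi.parameters()), lr=1e-4)
BATCH_SIZE = 256

# Hyperparameter grid searched for DW-AW and DW-Uniform, and the best values reported.
LAMBDA_GRID = {"lambda_K": [0.2, 1.0], "lambda_F": [0.1, 1.0, 5.0]}
BEST = {"CQL": (0.2, 0.1), "IQL": (1.0, 1.0)}  # (lambda_K, lambda_F)
```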