Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets
Authors: Zhang-Wei Hong, Aviral Kumar, Sathwik Karnik, Abhishek Bhandwaldar, Akash Srivastava, Joni Pajarinen, Romain Laroche, Abhishek Gupta, Pulkit Agrawal
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation demonstrates significant performance gains in 72 imbalanced datasets, D4RL dataset, and across three different offline RL algorithms. |
| Researcher Affiliation | Collaboration | Correspondence: EMAIL, Improbable AI Lab, Massachusetts Institute of Technology¹, RAIL Lab, UC Berkeley², MIT-IBM Lab³, Aalto University⁴, University of Washington⁵, and independent researcher⁶. |
| Pseudocode | Yes | Algorithm 1 Density-ratio weighting with generic offline RL algorithms (details in Appendix A.3) |
| Open Source Code | Yes | Code is available at https://github.com/Improbable-AI/dw-offline-rl. |
| Open Datasets | Yes | Following the protocol in prior offline RL benchmarking [6], we develop representative datasets of each type using the locomotion tasks from the D4RL Gym suite. Our datasets are generated by combining 1 − σ% of trajectories from the random-v2 dataset (low-performing) and σ% of trajectories from the medium-v2 or expert-v2 dataset (high-performing) for each locomotion environment in the D4RL benchmark. |
| Dataset Splits | No | The paper does not explicitly specify train/validation/test dataset splits. It mentions training for a certain number of gradient steps and evaluating the policy in the environment, but not a specific dataset split for validation. |
| Hardware Specification | Yes | We ran all the experiments using workstations with two RTX 3090 GPUs, AMD Ryzen Threadripper PRO 3995WX 64-Cores CPU, and 256GB RAM. |
| Software Dependencies | No | The paper mentions using "Jax CQL" and refers to "official implementation for implicit Q-learning (IQL)", and mentions the "Adam optimizer", but it does not specify version numbers for any of these software dependencies. |
| Experiment Setup | Yes | To minimize the objective defined in Equation 25, we train ϕ and ψ using the Adam optimizer [16] with a learning rate of 0.0001 and a batch size of 256. For DW-AW and DW-Uniform, we searched λ_K ∈ {0.2, 1.0} and λ_F ∈ {0.1, 1.0, 5.0}. We use the best found hyperparameter by the time we started the large scale experiments: (λ_K, λ_F) = (0.2, 0.1) for CQL and (λ_K, λ_F) = (1.0, 1.0) for IQL. |
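
The Open Datasets row describes mixing a small share of high-performing trajectories into a mostly low-performing dataset. The sketch below illustrates one plausible reading of that recipe in plain Python. It is not the authors' code: the function name `mix_trajectories`, the `seed` parameter, and the choice to take each percentage of its own source's trajectory count are assumptions made for illustration, and the toy tuples stand in for real D4RL trajectories.

```python
import random


def mix_trajectories(low, high, sigma_pct, seed=0):
    """Combine (100 - sigma_pct)% of `low` with sigma_pct% of `high`.

    Assumption (not confirmed by the source): each percentage is taken
    of that source's own trajectory count, sampled without replacement.
    """
    if not 0 <= sigma_pct <= 100:
        raise ValueError("sigma_pct must be in [0, 100]")
    rng = random.Random(seed)  # local RNG so the mix is repeatable
    n_high = round(len(high) * sigma_pct / 100)
    n_low = round(len(low) * (100 - sigma_pct) / 100)
    mixed = rng.sample(low, n_low) + rng.sample(high, n_high)
    rng.shuffle(mixed)  # avoid any ordering by source
    return mixed


# Toy stand-ins for trajectories: (source label, index) pairs.
low_trajs = [("random", i) for i in range(200)]
high_trajs = [("expert", i) for i in range(100)]
mixed = mix_trajectories(low_trajs, high_trajs, sigma_pct=5)
n_expert = sum(1 for label, _ in mixed if label == "expert")
```

With these toy inputs and `sigma_pct=5`, the mixed set holds 190 low-performing and 5 high-performing trajectories, which shows how heavily such a dataset is skewed toward poor behavior.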