Learning from Sparse Offline Datasets via Conservative Density Estimation

Authors: Zhepeng Cen, Zuxin Liu, Zitong Wang, Yihang Yao, Henry Lam, Ding Zhao

ICLR 2024

Reproducibility Variable Result LLM Response
Research Type: Experimental. In this section, we aim to study whether CDE can truly combine the advantages of both pessimism-based methods and DICE-based approaches. We are particularly interested in two main questions: (1) Does CDE incorporate the strengths of the stationary-distribution-correction training framework when handling sparse-reward settings? (2) Can CDE's explicit density constraint effectively manage out-of-distribution (OOD) extrapolation issues when datasets are insufficient? Tasks. To answer these questions, we adopt 3 Maze2D datasets, 8 Adroit datasets, and 6 MuJoCo (medium, medium-expert) datasets from the D4RL benchmark (Fu et al., 2020).
Researcher Affiliation: Academia. 1 Carnegie Mellon University, 2 Columbia University
Pseudocode: Yes. Algorithm 1: Conservative Density Estimation
Open Source Code: Yes. Code is available at https://github.com/czp16/cde-offline-rl.
Open Datasets: Yes. We adopt 3 Maze2D datasets, 8 Adroit datasets, and 6 MuJoCo (medium, medium-expert) datasets from the D4RL benchmark (Fu et al., 2020).
Dataset Splits: No. The paper describes using sub-datasets for comparative experiments and evaluation, but it does not explicitly detail standard train/validation/test splits or a held-out validation set.
Hardware Specification: Yes. We use a server with an AMD EPYC 7542 32-core CPU and an A5000 GPU.
Software Dependencies: No. The paper mentions the Adam optimizer and the d3rlpy library for baselines, but it does not specify version numbers for these or for other components such as Python or PyTorch, which would be needed for reproducibility.
Experiment Setup: Yes. Before training the NN, we standardize the observations and rewards and scale the reward by a factor of 0.1. [...] Table 5 (shared hyperparameters): hidden layers of policy πθ: [256, 256]; NN learning rate: 3e-4; discount factor γ: 0.99; batch size: 512; mixture coefficient ζ: 0.9; max OOD IS ratio ϵ: 0.3; number of OOD action samples: 5.
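The stated preprocessing (standardize observations and rewards, then multiply rewards by 0.1) and the Table 5 hyperparameters can be sketched as below. This is a minimal illustration, not the authors' code; the function name `preprocess` and the `CONFIG` dict are assumptions for clarity.

```python
import numpy as np

def preprocess(observations: np.ndarray, rewards: np.ndarray, eps: float = 1e-8):
    """Standardize observations and rewards, then scale rewards by 0.1.

    Hypothetical sketch of the paper's stated preprocessing; `eps` guards
    against division by zero and is an assumption, not a paper detail.
    """
    obs_std = (observations - observations.mean(axis=0)) / (observations.std(axis=0) + eps)
    rew_std = (rewards - rewards.mean()) / (rewards.std() + eps)
    return obs_std, 0.1 * rew_std

# Shared hyperparameters as reported in Table 5, collected into a config dict.
CONFIG = {
    "policy_hidden_layers": [256, 256],
    "nn_learning_rate": 3e-4,
    "discount_factor": 0.99,        # gamma
    "batch_size": 512,
    "mixture_coefficient": 0.9,     # zeta
    "max_ood_is_ratio": 0.3,        # epsilon
    "num_ood_action_samples": 5,
}
```

After this transform the rewards have zero mean and standard deviation ≈ 0.1, which matches the reported reward scaling.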