Learning from Sparse Offline Datasets via Conservative Density Estimation
Authors: Zhepeng Cen, Zuxin Liu, Zitong Wang, Yihang Yao, Henry Lam, Ding Zhao
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we aim to study if CDE can truly combine the advantages of both pessimism-based methods and the DICE-based approaches. We are particularly interested in two main questions: (1) Does CDE incorporate the strengths of the stationary-distribution correction training framework when handling sparse reward settings? (2) Can CDE's explicit density constraint effectively manage out-of-distribution (OOD) extrapolation issues in situations with insufficient datasets? Tasks. To answer these questions, we adopt 3 Maze2D datasets, 8 Adroit datasets, and 6 MuJoCo (medium, medium-expert) datasets from the D4RL benchmark (Fu et al., 2020). |
| Researcher Affiliation | Academia | 1 Carnegie Mellon University, 2 Columbia University |
| Pseudocode | Yes | Algorithm 1 Conservative Density Estimation |
| Open Source Code | Yes | Code is available at https://github.com/czp16/cde-offline-rl. |
| Open Datasets | Yes | We adopt 3 Maze2D datasets, 8 Adroit datasets, and 6 MuJoCo (medium, medium-expert) datasets from the D4RL benchmark (Fu et al., 2020). |
| Dataset Splits | No | The paper describes using sub-datasets for comparative experiments and evaluation processes, but it does not explicitly provide details about standard train/validation/test dataset splits or a validation set. |
| Hardware Specification | Yes | We use the server with AMD EPYC 7542 32-Core CPU and A5000 GPU. |
| Software Dependencies | No | The paper mentions using 'Adam' optimizer and the 'd3rlpy library' for baselines, but it does not specify version numbers for these or other software components like Python or PyTorch, which would be needed for reproducibility. |
| Experiment Setup | Yes | Before training NN, we standardize the observation and reward and scale the reward by multiplying 0.1. [...] Table 5: The shared hyperparameters — hidden layers of policy πθ: [256, 256]; [...] NN learning rate: 3e-4; discount factor γ: 0.99; batch size: 512; mixture coefficient ζ: 0.9; max OOD IS ratio ϵ: 0.3; number of OOD action samples: 5. |
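The experiment-setup row above can be sketched in code. This is a hypothetical illustration, not the authors' implementation: the dictionary keys and the `preprocess` function are assumed names, and only the values (hidden layers, learning rate, discount factor, batch size, ζ, ϵ, OOD sample count, the 0.1 reward scale) come from the paper's reported setup.

```python
import numpy as np

# Shared hyperparameters as reported in the paper's Table 5.
# Key names are illustrative, not taken from the authors' code.
SHARED_HYPERPARAMS = {
    "policy_hidden_layers": [256, 256],
    "nn_learning_rate": 3e-4,
    "discount_gamma": 0.99,
    "batch_size": 512,
    "mixture_coefficient_zeta": 0.9,
    "max_ood_is_ratio_eps": 0.3,
    "num_ood_action_samples": 5,
}

def preprocess(observations, rewards, reward_scale=0.1):
    """Standardize observations (per dimension) and rewards,
    then scale rewards by 0.1, as described in the setup."""
    obs = (observations - observations.mean(axis=0)) / (observations.std(axis=0) + 1e-8)
    rew = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return obs, rew * reward_scale
```

A usage sketch: `obs, rew = preprocess(dataset_obs, dataset_rew)` before any network updates, so the standardized rewards end up with standard deviation ≈ 0.1.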