Conservative State Value Estimation for Offline Reinforcement Learning
Authors: Liting Chen, Jie Yan, Zhengdao Shao, Lu Wang, Qingwei Lin, Saravanakumar Rajmohan, Thomas Moscibroda, Dongmei Zhang
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate on classic continuous control tasks of D4RL, showing that our method performs better than conservative Q-function learning methods and is strongly competitive among recent SOTA methods. ... Experimental evaluation on continuous control tasks of Gym [7] and Adroit [8] in D4RL [9] benchmarks, showing that CSVE performs better than prior methods based on conservative Q-value estimation, and is strongly competitive among main SOTA algorithms. |
| Researcher Affiliation | Collaboration | Liting Chen, McGill University, Montreal, Canada (98chenliting@gmail.com); Jie Yan, Microsoft, Beijing, China (dasistyanjie@gmail.com); Zhengdao Shao, University of Sci. and Tech. of China, Hefei, China (zhengdaoshao@mail.ustc.edu.cn); Lu Wang, Microsoft, Beijing, China (wlu@microsoft.com); Qingwei Lin, Microsoft, Beijing, China (qlin@microsoft.com); Saravan Rajmohan, Microsoft 365, Seattle, USA (saravar@microsoft.com); Thomas Moscibroda, Microsoft, Redmond, USA (moscitho@microsoft.com); Dongmei Zhang, Microsoft, Beijing, China (dongmeiz@microsoft.com) |
| Pseudocode | Yes | Algorithm 1 CSVE based Offline RL Algorithm |
| Open Source Code | Yes | We implement our method based on an offline deep reinforcement learning library d3rlpy [34]. The code is available at: https://github.com/2023AnnonymousAuthor/csve . |
| Open Datasets | Yes | We conduct experimental evaluations on a variety of classic continuous control tasks of Gym [7] and Adroit [8] in the D4RL [9] benchmark. ... D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020. |
| Dataset Splits | No | The paper mentions 'train' and 'test' in the context of experiments but does not explicitly describe a validation dataset split or a methodology for it (e.g., percentages, sample counts, or cross-validation setup). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper states 'We implement our method based on an offline deep reinforcement learning library d3rlpy [34]' but does not provide a specific version number for this library or any other software dependencies used in the experiments. |
| Experiment Setup | Yes | Table 3: Hyper-parameters of CSVE evaluation. B = 5 (number of ensembles in the dynamics model); α = 10 (controls the penalty of OOD states); τ = 10 (budget parameter in Eq. 8); β = 3 for random and medium tasks and 0.1 for the other tasks in the Gym domain, and 30 for human and cloned tasks and 0.01 for expert tasks in the Adroit domain; γ = 0.99 (discount factor); H = 1 million steps for Mujoco and 0.1 million steps for Adroit tasks; w = 0.005 (target network smoothing coefficient); actor learning rate = 3e-4; critic learning rate = 1e-4. |
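
The Open Datasets and Open Source Code rows above indicate that the experiments rely on the public D4RL benchmark and the d3rlpy library. For readers reproducing the data pipeline, the following is a minimal sketch of loading one Gym locomotion dataset through the standard D4RL API; the dataset name and version are illustrative assumptions, not details confirmed by the paper.

```python
# Minimal sketch of loading a D4RL dataset (assumption: the standard Gym/D4RL
# loading path is used; the exact dataset names/versions are not stated in the paper).
import gym
import d4rl  # importing d4rl registers the offline datasets with gym

# 'hopper-medium-v2' is an illustrative choice, not necessarily the one used.
env = gym.make("hopper-medium-v2")

# qlearning_dataset() returns transitions as arrays of observations, actions,
# rewards, next_observations and terminals, the usual input for offline RL training.
dataset = d4rl.qlearning_dataset(env)
print(dataset["observations"].shape, dataset["actions"].shape)
```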
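The Experiment Setup row reproduces the hyper-parameters of Table 3. Below is a hedged sketch of how such a configuration might be written down for a reimplementation; the dictionary layout, field names, and the `beta_for` helper are our own illustrative choices, while the numeric values come from the row above.

```python
# Hypothetical configuration mirroring Table 3 of the paper.
# Only the numeric values come from the paper; names and structure are illustrative.
CSVE_CONFIG = {
    "dynamics_ensembles": 5,       # B: number of ensembles in the dynamics model
    "ood_penalty_alpha": 10.0,     # α: controls the penalty of OOD states
    "budget_tau": 10.0,            # τ: budget parameter in Eq. 8
    "discount_gamma": 0.99,        # γ: discount factor
    "target_smoothing_w": 0.005,   # w: target network smoothing coefficient
    "actor_lr": 3e-4,              # policy learning rate
    "critic_lr": 1e-4,             # critic learning rate
    # β depends on domain and task type; "other" covers the remaining tasks.
    "beta": {
        "gym": {"random": 3.0, "medium": 3.0, "other": 0.1},
        "adroit": {"human": 30.0, "cloned": 30.0, "other": 0.01},
    },
    # H: number of training steps per domain.
    "train_steps": {"mujoco": 1_000_000, "adroit": 100_000},
}


def beta_for(domain: str, task: str) -> float:
    """Look up β for a (domain, task) pair; unlisted tasks fall back to 'other'."""
    table = CSVE_CONFIG["beta"][domain]
    return table.get(task, table["other"])
```

For example, `beta_for("gym", "medium")` would return 3.0 and `beta_for("adroit", "expert")` would return 0.01, matching the per-task values listed in the table row.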