Conservative State Value Estimation for Offline Reinforcement Learning

Authors: Liting Chen, Jie Yan, Zhengdao Shao, Lu Wang, Qingwei Lin, Saravanakumar Rajmohan, Thomas Moscibroda, Dongmei Zhang

NeurIPS 2023

Reproducibility assessment (each item lists the variable, the result, and the supporting LLM response):
Research Type: Experimental. "We evaluate in classic continual control tasks of D4RL, showing that our method performs better than the conservative Q-function learning methods and is strongly competitive among recent SOTA methods." ... "Experimental evaluation on continuous control tasks of Gym [7] and Adroit [8] in D4RL [9] benchmarks, showing that CSVE performs better than prior methods based on conservative Q-value estimation, and is strongly competitive among main SOTA algorithms."
Researcher Affiliation: Collaboration. Liting Chen (McGill University, Montreal, Canada; 98chenliting@gmail.com); Jie Yan (Microsoft, Beijing, China; dasistyanjie@gmail.com); Zhengdao Shao (University of Science and Technology of China, Hefei, China; zhengdaoshao@mail.ustc.edu.cn); Lu Wang (Microsoft, Beijing, China; wlu@microsoft.com); Qingwei Lin (Microsoft, Beijing, China; qlin@microsoft.com); Saravan Rajmohan (Microsoft 365, Seattle, USA; saravar@microsoft.com); Thomas Moscibroda (Microsoft, Redmond, USA; moscitho@microsoft.com); Dongmei Zhang (Microsoft, Beijing, China; dongmeiz@microsoft.com).
Pseudocode: Yes. "Algorithm 1: CSVE based Offline RL Algorithm".
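The paper's Algorithm 1 is not reproduced here. Below is a minimal, hedged PyTorch sketch of what a single conservative state-value update in the spirit of CSVE could look like, assuming a learned dynamics model is used to generate potentially out-of-distribution (OOD) states whose values are penalized while dataset states are not; all names (v_net, v_target, model.sample_next, policy) and the exact penalty form are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def conservative_v_update(v_net, v_target, model, policy, batch, alpha, gamma, optimizer):
        # batch holds tensors sampled from the offline dataset, each of shape (B, ...).
        s, _, r, s_next, done = batch

        # Standard TD target for the state-value function.
        with torch.no_grad():
            td_target = r + gamma * (1.0 - done) * v_target(s_next)
        bellman_loss = F.mse_loss(v_net(s), td_target)

        # Conservative penalty: push V down on model-generated (potentially OOD) states
        # and up on dataset states, so the learned value lower-bounds the true value.
        with torch.no_grad():
            a_pi = policy(s)                     # actions from the current policy
            s_ood = model.sample_next(s, a_pi)   # hypothetical helper: imagined next states
        penalty = v_net(s_ood).mean() - v_net(s).mean()

        loss = bellman_loss + alpha * penalty
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()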
Open Source Code: Yes. "We implement our method based on an offline deep reinforcement learning library d3rlpy [34]. The code is available at: https://github.com/2023AnnonymousAuthor/csve."
Open Datasets: Yes. "We conduct experimental evaluations on a variety of classic continuous control tasks of Gym [7] and Adroit [8] in the D4RL [9] benchmark." ... "D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020."
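For reference, loading one of the cited D4RL datasets typically follows the pattern below; the environment name 'hopper-medium-v2' is an illustrative choice and not necessarily the exact dataset version used in the paper's experiments.

    import gym
    import d4rl  # importing d4rl registers the offline environments with gym

    env = gym.make("hopper-medium-v2")
    dataset = env.get_dataset()  # dict of numpy arrays: 'observations', 'actions', 'rewards', 'terminals', ...
    print(dataset["observations"].shape, dataset["actions"].shape, dataset["rewards"].shape)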
Dataset Splits: No. The paper mentions 'train' and 'test' in the context of experiments but does not explicitly describe a validation dataset split or a methodology for it (e.g., percentages, sample counts, or cross-validation setup).
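Since no validation split is described, the following is only a hypothetical sketch of how transitions from a loaded D4RL dataset could be held out, for example to validate a learned dynamics model; the 10% fraction and the selected keys are assumptions.

    import numpy as np

    def split_transitions(dataset, val_fraction=0.1, seed=0):
        # Randomly hold out a fraction of transitions at the index level.
        n = len(dataset["rewards"])
        idx = np.random.default_rng(seed).permutation(n)
        n_val = int(n * val_fraction)
        val_idx, train_idx = idx[:n_val], idx[n_val:]
        keys = ["observations", "actions", "rewards", "terminals"]
        train = {k: dataset[k][train_idx] for k in keys}
        val = {k: dataset[k][val_idx] for k in keys}
        return train, val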
Hardware Specification: No. The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies: No. The paper states "We implement our method based on an offline deep reinforcement learning library d3rlpy [34]" but does not provide a specific version number for this library or any other software dependencies used in the experiments.
Experiment Setup: Yes. "Table 3: Hyper-parameters of CSVE evaluation." B = 5 (number of ensembles in the dynamics model); α = 10 (controls the penalty on OOD states); τ = 10 (budget parameter in Eq. 8); β = 3 for random and medium tasks and 0.1 for the other tasks in the Gym domain, 30 for human and cloned tasks and 0.01 for expert tasks in the Adroit domain; γ = 0.99 (discount factor); H = 1 million steps for MuJoCo tasks and 0.1 million for Adroit tasks; w = 0.005 (target network smoothing coefficient); actor (policy) learning rate = 3e-4; critic learning rate = 1e-4.
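For convenience, the hyper-parameters of Table 3 can be transcribed into a plain configuration dictionary as below; the key names are illustrative and do not correspond to the authors' code.

    # Hedged transcription of Table 3; key names are assumptions, values are from the paper.
    csve_config = {
        "num_dynamics_ensembles": 5,        # B
        "ood_penalty_alpha": 10,            # alpha, controls the penalty on OOD states
        "budget_tau": 10,                   # tau, budget parameter in Eq. 8
        "beta": {                           # beta, domain- and task-dependent
            "gym_random_medium": 3,
            "gym_other": 0.1,
            "adroit_human_cloned": 30,
            "adroit_expert": 0.01,
        },
        "discount_gamma": 0.99,
        "train_steps": {"mujoco": 1_000_000, "adroit": 100_000},  # H
        "target_smoothing_w": 0.005,
        "actor_lr": 3e-4,                   # policy learning rate
        "critic_lr": 1e-4,                  # critic learning rate
    }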