Offline Reinforcement Learning with Closed-Form Policy Improvement Operators
Authors: Jiachen Li, Edwin Zhang, Ming Yin, Qinxun Bai, Yu-Xiang Wang, William Yang Wang
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We instantiate both one-step and iterative offline RL algorithms with our novel policy improvement operators and empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark. |
| Researcher Affiliation | Collaboration | 1) Department of Computer Science, University of California, Santa Barbara, Santa Barbara, CA 93106, USA; 2) Harvard University, USA; 3) Horizon Robotics Inc., Cupertino, CA 95014, USA. |
| Pseudocode | Yes | Algorithm 1 Offline RL with CFPI operators |
| Open Source Code | Yes | Our code is available at https://cfpi-icml23.github.io/. |
| Open Datasets | Yes | We evaluate the effectiveness of our one-step algorithm on the D4RL benchmark focusing on the Gym-MuJoCo domain... standard D4RL benchmark (Fu et al., 2020). |
| Dataset Splits | Yes | Next, we randomly split the dataset with the ratio 95/5 to create the training set Dtrain and the validation set Dval. (A minimal split sketch follows the table.) |
| Hardware Specification | No | The paper states, "Our experiments are conducted on various types of 8GPUs machines. Different machines may have different GPU types, such as NVIDIA GA100 and TU102." This general description does not provide specific, consistent model numbers or detailed configurations for the entire experimental setup. |
| Software Dependencies | Yes | We use the Adam (Kingma & Ba, 2014) optimizer for all learning algorithms... We use the PyTorch (Paszke et al., 2019) implementation of IQL from RLkit (Berkeley)... |
| Experiment Setup | Yes | Table 8 includes the hyperparameters of methods evaluated on the Gym-MuJoCo domain. MG-BC. We train the policy for 500K gradient steps. SARSA. We parameterize the value function with the IQN (Dabney et al., 2018a) architecture and train it to model the distribution Zβ... (A hedged training-setup sketch follows the table.) |
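
As a rough illustration of the 95/5 split described in the Dataset Splits row, here is a minimal sketch; the `split_dataset` helper, its arguments, and the transition-list representation are assumptions for illustration, not the authors' code.

```python
import numpy as np

def split_dataset(dataset, val_ratio=0.05, seed=0):
    """Randomly split a list of transitions into D_train / D_val (95/5 by default)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(dataset))
    n_val = int(len(dataset) * val_ratio)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return [dataset[i] for i in train_idx], [dataset[i] for i in val_idx]
```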
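
Similarly, a hedged sketch of the reported training setup (Adam optimizer, 500K gradient steps): the network sizes, learning rate, batch size, synthetic data, and the plain MSE behavior-cloning loss below are placeholders standing in for the paper's actual MG-BC objective and the hyperparameters listed in its Table 8.

```python
import torch

# Placeholder dimensions and synthetic data; not taken from the paper or D4RL.
obs_dim, act_dim, batch_size = 17, 6, 256
obs = torch.randn(10_000, obs_dim)   # stand-in for dataset observations
act = torch.randn(10_000, act_dim)   # stand-in for dataset actions

policy = torch.nn.Sequential(
    torch.nn.Linear(obs_dim, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)  # Adam as stated; lr is an assumption

for step in range(500_000):  # 500K gradient steps, as reported for MG-BC
    idx = torch.randint(len(obs), (batch_size,))
    loss = ((policy(obs[idx]) - act[idx]) ** 2).mean()  # simple BC regression loss (placeholder objective)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```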