Offline Reinforcement Learning via High-Fidelity Generative Behavior Modeling
Authors: Huayu Chen, Cheng Lu, Chengyang Ying, Hang Su, Jun Zhu
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on D4RL datasets show that our proposed method achieves competitive or superior performance compared with state-of-the-art offline RL methods, especially in complex tasks such as AntMaze. We also empirically demonstrate that our method can successfully learn from a heterogeneous dataset containing multiple distinctive but similarly successful strategies, whereas previous unimodal policies fail. The source code is provided at https://github.com/ChenDRAG/SfBC. |
| Researcher Affiliation | Collaboration | Huayu Chen1, Cheng Lu1, Chengyang Ying1, Hang Su1,2, Jun Zhu1,2. 1Department of Computer Science & Technology, Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University; 2Pazhou Lab, Guangzhou, 510330, China. chenhuay21@mails.tsinghua.edu.cn; {lucheng.lc15,yingcy17}@gmail.com; {suhangss,dcszj}@tsinghua.edu.cn |
| Pseudocode | Yes | A ALGORITHM OVERVIEW. Algorithm 1 (Selecting from Behavior Candidates, Training): Initialize the score-based model s_θ and the action evaluation model Q_φ; calculate vanilla discounted returns R_n^(0) for every state-action pair in dataset D_µ. // Training the behavior model: for each gradient step, sample B data points (s, a) from D_µ, B Gaussian noises ε ~ N(0, I), and B times t ~ U(0, T); perturb a according to a_t := α_t·a + σ_t·ε; update θ ← θ - λ_s·∇_θ Σ ‖σ_t·s_θ(a_t, s, t) + ε‖²_2. // Training the action evaluation model iteratively: for iteration k = 1 to K, initialize the parameters φ of the action evaluation model Q_φ; for each gradient step, sample B data points (s, a, R^(k-1)) from D_µ and update φ ← φ - λ_Q·∇_φ Σ ‖Q_φ(s, a) - R^(k-1)‖²_2; then update the Q-training targets as in Algorithm 2: R^(k) = Planning(D_µ, µ_θ, Q_φ). Algorithm 2 (Implicit In-sample Planning): Input a behavior dataset D_µ (sequentially ordered), a learned behavior policy µ_θ, and a critic model Q_φ. // Evaluate every state in the dataset according to Equation 13 with M Monte Carlo samples (parallelized): for each minibatch {s_n} split from D_µ, sample M actions â_n^(1:M) from µ_θ(·\|s_n), calculate Q-values R̂_n^(1:M) = Q_φ(s_n, â_n^(1:M)), and compute the state value V_n = Σ_m exp(α·R̂_n^m)·R̂_n^m / Σ_m exp(α·R̂_n^m). // Perform implicit in-sample planning recursively: for timestep n = \|D_µ\| down to 0, set R_n = r_n + γ·max(R_{n+1}, V_{n+1}) if n is not the last episode step, else R_n = r_n. Output the new Q-training targets {R_n}. *(An illustrative Python sketch of both algorithms is given after the table.)* |
| Open Source Code | Yes | The source code is provided at https://github.com/ChenDRAG/SfBC. |
| Open Datasets | Yes | Experimental results on D4RL datasets show that our proposed method achieves competitive or superior performance compared with state-of-the-art offline RL methods, especially in complex tasks such as AntMaze. We also empirically demonstrate that our method can successfully learn from a heterogeneous dataset containing multiple distinctive but similarly successful strategies, whereas previous unimodal policies fail. The source code is provided at https://github.com/ChenDRAG/SfBC. ... In Table 1, we compare the performance of SfBC to multiple offline RL methods in several D4RL (Fu et al., 2020) tasks. |
| Dataset Splits | No | The paper does not explicitly state the specific train/validation/test splits used for the D4RL datasets. It mentions evaluation procedures but not the partitioning of the data for training and validation. |
| Hardware Specification | Yes | We test the runtime of our algorithm on an RTX 2080Ti GPU. |
| Software Dependencies | No | The paper mentions software components like 'Adam optimizer' and 'PyTorch reimplementation' for baselines, and references specific diffusion model implementations, but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | The conditional score-based model is trained for 500 data epochs with a learning rate of 1e-4. ... The action evaluation model is trained for 100 data epochs with a learning rate of 1e-3 for each value iteration. We use K = 2 value iterations for all MuJoCo tasks, K = 4 for Antmaze-umaze tasks, and K = 5 for other Antmaze tasks. ... with the inverse temperature α set to 20 and the Monte Carlo sample number set to 16 in all tasks. ... we use the Adam optimizer and a batch size of 4096. *(A hypothetical configuration summary of these values appears after the table.)* |
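
The Pseudocode row above quotes the paper's Algorithms 1 and 2. To make the recovered steps easier to follow, here is a minimal PyTorch-style sketch of the two core updates. It assumes hypothetical `score_model`, `alphas`, `sigmas`, `behavior_policy.sample`, and `critic` interfaces and flat, time-ordered reward/terminal arrays; it illustrates the quoted pseudocode and is not the authors' implementation.

```python
import torch


def score_matching_loss(score_model, states, actions, alphas, sigmas):
    """Sketch of the behavior-model update in Algorithm 1 (denoising score matching).

    Assumed interfaces (not from the released code):
      score_model(a_t, s, t) -> predicted score, same shape as actions
      alphas(t), sigmas(t)   -> noise-schedule coefficients alpha_t, sigma_t, shape (B,)
    """
    B = actions.shape[0]
    t = torch.rand(B)                                # t ~ U(0, T), with T = 1 here
    eps = torch.randn_like(actions)                  # Gaussian noise epsilon ~ N(0, I)
    a_t = alphas(t)[:, None] * actions + sigmas(t)[:, None] * eps   # perturbed actions
    pred = score_model(a_t, states, t)
    # || sigma_t * s_theta(a_t, s, t) + eps ||_2^2, averaged over the batch
    return ((sigmas(t)[:, None] * pred + eps) ** 2).sum(dim=1).mean()


def implicit_insample_planning(states, rewards, terminals, behavior_policy, critic,
                               gamma=0.99, alpha=20.0, num_samples=16):
    """Sketch of Algorithm 2: compute new Q-training targets {R_n}.

    states:    (N, state_dim) tensor, time-ordered within episodes
    rewards:   (N,) tensor of r_n
    terminals: (N,) bool tensor, True at the last step of each episode
    Assumed interfaces: behavior_policy.sample(s, M) -> (N, M, action_dim) actions,
                        critic(s, a) -> (N, M) Q-value estimates.
    """
    N = states.shape[0]

    # Soft state values V_n from M Monte Carlo action samples (Equation 13 in the paper).
    with torch.no_grad():
        sampled_actions = behavior_policy.sample(states, num_samples)   # (N, M, A)
        q_values = critic(states, sampled_actions)                      # (N, M)
        weights = torch.softmax(alpha * q_values, dim=1)                # softmax weights
        values = (weights * q_values).sum(dim=1)                        # (N,)

    # Recursive in-sample planning, sweeping backwards through the dataset:
    # R_n = r_n + gamma * max(R_{n+1}, V_{n+1}), or r_n at episode ends.
    targets = torch.zeros(N)
    for n in reversed(range(N)):
        if terminals[n] or n == N - 1:
            targets[n] = rewards[n]
        else:
            targets[n] = rewards[n] + gamma * torch.maximum(targets[n + 1], values[n + 1])
    return targets
```

In the backward sweep, each target bootstraps from the larger of the next computed target and the next estimated state value, matching the max(R_{n+1}, V_{n+1}) recursion quoted above.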
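
The Experiment Setup row reports the main hyperparameters quoted from the paper. A hypothetical configuration dictionary collecting those reported values in one place could look like the sketch below; the key names are illustrative and are not taken from the released code.

```python
# Hypothetical hyperparameter summary assembled from the Experiment Setup row above;
# key names are illustrative, not taken from the authors' released SfBC code.
SFBC_HYPERPARAMS = {
    "behavior_model": {
        "epochs": 500,
        "learning_rate": 1e-4,
    },
    "action_evaluation_model": {
        "epochs_per_value_iteration": 100,
        "learning_rate": 1e-3,
    },
    "value_iterations_K": {        # K in Algorithm 1
        "mujoco": 2,
        "antmaze_umaze": 4,
        "antmaze_other": 5,
    },
    "planning": {
        "inverse_temperature_alpha": 20,
        "monte_carlo_samples": 16,
    },
    "optimizer": "Adam",
    "batch_size": 4096,
}
```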