Offline Reinforcement Learning via High-Fidelity Generative Behavior Modeling

Authors: Huayu Chen, Cheng Lu, Chengyang Ying, Hang Su, Jun Zhu

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results on D4RL datasets show that our proposed method achieves competitive or superior performance compared with state-of-the-art offline RL methods, especially in complex tasks such as AntMaze. We also empirically demonstrate that our method can successfully learn from a heterogeneous dataset containing multiple distinctive but similarly successful strategies, whereas previous unimodal policies fail. The source code is provided at https://github.com/ChenDRAG/SfBC.
Researcher Affiliation Collaboration Huayu Chen (1), Cheng Lu (1), Chengyang Ying (1), Hang Su (1,2), Jun Zhu (1,2). (1) Department of Computer Science & Technology, Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University; (2) Pazhou Lab, Guangzhou, 510330, China. Contact: chenhuay21@mails.tsinghua.edu.cn, {lucheng.lc15,yingcy17}@gmail.com, {suhangss,dcszj}@tsinghua.edu.cn
Pseudocode Yes A ALGORITHM OVERVIEW

Algorithm 1: Selecting from Behavior Candidates (Training)
  Initialize the score-based model s_θ and the action evaluation model Q_φ
  Calculate vanilla discounted returns R_n^(0) for every state-action pair in dataset D_µ
  // Training the behavior model
  for each gradient step do
      Sample B data points (s, a) from D_µ, B Gaussian noises ε ~ N(0, I), and B times t ~ U(0, T)
      Perturb a according to a_t := α_t a + σ_t ε
      Update θ ← θ − λ_s ∇_θ Σ ||σ_t s_θ(a_t, s, t) + ε||_2^2
  end for
  // Training the action evaluation model iteratively
  for iteration k = 1 to K do
      Initialize training parameters φ of the action evaluation model Q_φ
      for each gradient step do
          Sample B data points (s, a, R^(k−1)) from D_µ
          Update φ ← φ − λ_Q ∇_φ Σ ||Q_φ(s, a) − R^(k−1)||_2^2
      end for
      // Update the Q-training targets as in Algorithm 2
      R^(k) = Planning(D_µ, µ_θ, Q_φ)
  end for

Algorithm 2: Implicit In-sample Planning
  Input: a behavior dataset D_µ (sequentially ordered), a learned behavior policy µ_θ, a critic model Q_φ
  // Evaluate every state in the dataset according to Equation 13 with M Monte Carlo samples (parallelized)
  for each minibatch {s_n} split from D_µ do
      Sample M actions â_n^{1:M} from µ_θ(·|s_n) and calculate Q-values R̂_n^{1:M} = Q_φ(s_n, â_n^{1:M})
      Calculate the state value V_n = Σ_m [exp(α R̂_n^m) / Σ_{m'} exp(α R̂_n^{m'})] · R̂_n^m
  end for
  // Perform implicit in-sample planning recursively
  for timestep n = |D_µ| down to 0 do
      R_n = r_n if n is the last step of an episode, else r_n + γ max(R_{n+1}, V_{n+1})
  end for
  Output: the new Q-training targets {R_n}
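As a rough illustration only (not the authors' released code), the sketch below shows the two core computations in the pseudocode in PyTorch-style Python: the denoising score-matching loss for the behavior model in Algorithm 1, and the softmax-weighted state value V_n from Algorithm 2. The names score_model, alpha, sigma, and alpha_temp are placeholders for s_θ, the noise schedule, and the inverse temperature α.

import torch

def behavior_model_loss(score_model, states, actions, alpha, sigma, T=1.0):
    """Sketch of the behavior-model update in Algorithm 1.
    alpha(t), sigma(t) are assumed to return per-sample schedule values."""
    B = actions.shape[0]
    t = torch.rand(B, device=actions.device) * T                   # t ~ U(0, T)
    eps = torch.randn_like(actions)                                 # eps ~ N(0, I)
    a_t = alpha(t)[:, None] * actions + sigma(t)[:, None] * eps     # perturbed actions
    pred = score_model(a_t, states, t)                              # s_theta(a_t, s, t)
    # || sigma_t * s_theta(a_t, s, t) + eps ||_2^2, averaged over the batch
    return ((sigma(t)[:, None] * pred + eps) ** 2).sum(dim=-1).mean()

def state_value(q_values, alpha_temp):
    """Sketch of V_n in Algorithm 2: a softmax-weighted average over
    M Monte Carlo Q-estimates, with alpha_temp the inverse temperature."""
    w = torch.softmax(alpha_temp * q_values, dim=-1)                # weights over M samples
    return (w * q_values).sum(dim=-1)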
Open Source Code Yes The source code is provided at https://github.com/ChenDRAG/SfBC.
Open Datasets Yes Experimental results on D4RL datasets show that our proposed method achieves competitive or superior performance compared with state-of-the-art offline RL methods, especially in complex tasks such as AntMaze. We also empirically demonstrate that our method can successfully learn from a heterogeneous dataset containing multiple distinctive but similarly successful strategies, whereas previous unimodal policies fail. The source code is provided at https://github.com/ChenDRAG/SfBC. ... In Table 1, we compare the performance of Sf BC to multiple offline RL methods in several D4RL (Fu et al., 2020) tasks.
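For reference, D4RL datasets are loaded through the d4rl package on top of Gym; a minimal sketch is shown below. The specific task string is illustrative and may differ from the dataset versions used in the paper.

import gym
import d4rl  # importing d4rl registers the offline datasets with gym

# Illustrative AntMaze task; the paper also evaluates on the MuJoCo locomotion suites.
env = gym.make("antmaze-medium-play-v2")
dataset = d4rl.qlearning_dataset(env)  # dict with observations, actions, rewards, terminals, ...
print(dataset["observations"].shape, dataset["actions"].shape)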
Dataset Splits No The paper does not explicitly state the specific train/validation/test splits used for the D4RL datasets. It mentions evaluation procedures but not the partitioning of the data for training and validation.
Hardware Specification Yes We test the runtime of our algorithm on an RTX 2080 Ti GPU.
Software Dependencies No The paper mentions software components like 'Adam optimizer' and 'PyTorch reimplementation' for baselines, and references specific diffusion model implementations, but does not provide specific version numbers for these software dependencies.
Experiment Setup Yes The conditional scored-based model is trained for 500 data epochs with a learning rate of 1e-4. ... The action evaluation model is trained for 100 data epochs with a learning rate of 1e-3 for each value iteration. We use K = 2 value iterations for all Mu Jo Co tasks, K = 4 for Antmaze-umaze tasks, and K = 5 for other Antmaze tasks. ... with the inverse temperature α set to 20 and the Monte Carlo sample number set to 16 in all tasks. ... we use the Adam optimizer and a batch size of 4096.
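The reported hyperparameters can be collected into a single configuration for quick reference; the field names below are illustrative, not taken from the authors' code.

# Hypothetical config gathering the hyperparameters quoted above.
SFBC_CONFIG = {
    "behavior_model": {"epochs": 500, "lr": 1e-4},
    "action_evaluation_model": {"epochs_per_iteration": 100, "lr": 1e-3},
    "value_iterations": {"mujoco": 2, "antmaze_umaze": 4, "antmaze_other": 5},
    "inverse_temperature_alpha": 20,
    "monte_carlo_samples": 16,
    "optimizer": "Adam",
    "batch_size": 4096,
}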