Supported Trust Region Optimization for Offline Reinforcement Learning

Authors: Yixiu Mao, Hongchang Zhang, Chen Chen, Yi Xu, Xiangyang Ji

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results validate the theory of STR and demonstrate its state-of-the-art performance on MuJoCo locomotion domains and the much more challenging AntMaze domains.
Researcher Affiliation | Academia | 1. Department of Automation, Tsinghua University; 2. School of Artificial Intelligence, Dalian University of Technology.
Pseudocode | Yes | Algorithm 1 STR (Tabular) and Algorithm 2 STR (Practical) are provided in Section 4.3.
Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the described methodology.
Open Datasets | Yes | We test the effectiveness of STR (Algorithm 2) in terms of performance, safe policy improvement, and hyperparameter robustness using the D4RL benchmark (Fu et al., 2020). (A loading sketch follows the table.)
Dataset Splits | No | The paper mentions evaluating performance on trajectories but does not specify train/validation/test dataset splits (e.g., percentages or sample counts for each split).
Hardware Specification | Yes | We test the runtime of STR on halfcheetah-medium-replay on a GeForce RTX 3090.
Software Dependencies | No | The paper mentions optimizers and algorithms but does not list specific software dependencies or version numbers.
Experiment Setup | Yes | Table 3 (Hyperparameters of policy training in STR) includes: critic learning rate 3e-4, actor learning rate 3e-4 with a cosine schedule, batch size 256, discount factor 0.99, number of iterations 1e6, target update rate τ 0.005, policy update frequency 2, number of critics 4, temperature λ ∈ {0.5, 2} for Gym-MuJoCo and {0.1} for AntMaze, and Gaussian policy variance 0.1. (A configuration sketch follows the table.)
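
For reference, below is a minimal sketch of loading one of the D4RL datasets the paper evaluates on. It assumes the d4rl Python package and Gym are installed; the "-v2" dataset suffix and the qlearning_dataset helper are standard D4RL usage, not details taken from the paper.

import gym
import d4rl  # importing d4rl registers the offline datasets as Gym environments

# halfcheetah-medium-replay is the dataset named in the runtime test above;
# the '-v2' version suffix is an assumption, not stated in the paper.
env = gym.make("halfcheetah-medium-replay-v2")
dataset = d4rl.qlearning_dataset(env)

# The returned dict holds transition arrays: observations, actions,
# next_observations, rewards, terminals.
print(dataset["observations"].shape, dataset["actions"].shape)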
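
The Table 3 hyperparameters can be collected into a single configuration block. The sketch below uses illustrative key names; they are not taken from the authors' code.

# Policy-training hyperparameters reported in Table 3 of the paper
# (key names are illustrative, not the authors' identifiers).
STR_HPARAMS = {
    "critic_lr": 3e-4,
    "actor_lr": 3e-4,              # decayed with a cosine schedule
    "batch_size": 256,
    "discount": 0.99,
    "num_iterations": int(1e6),
    "target_update_rate": 0.005,   # tau
    "policy_update_freq": 2,
    "num_critics": 4,
    # temperature lambda: 0.5 or 2 on Gym-MuJoCo, 0.1 on AntMaze
    "temperature_lambda": {"gym_mujoco": (0.5, 2.0), "antmaze": (0.1,)},
    "gaussian_policy_variance": 0.1,
}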