Supported Trust Region Optimization for Offline Reinforcement Learning
Authors: Yixiu Mao, Hongchang Zhang, Chen Chen, Yi Xu, Xiangyang Ji
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results validate the theory of STR and demonstrate its state-of-the-art performance on MuJoCo locomotion domains and the much more challenging AntMaze domains. |
| Researcher Affiliation | Academia | Department of Automation, Tsinghua University; School of Artificial Intelligence, Dalian University of Technology. |
| Pseudocode | Yes | Algorithm 1 STR (Tabular) and Algorithm 2 STR (Practical) are provided in Section 4.3. |
| Open Source Code | No | The paper does not provide an explicit statement or link for open-source code for the methodology described. |
| Open Datasets | Yes | We test the effectiveness of STR (Algorithm 2) in terms of performance, safe policy improvement, and hyperparameter robustness using the D4RL benchmark (Fu et al., 2020); see the dataset-loading sketch below. |
| Dataset Splits | No | The paper mentions evaluating performance on trajectories but does not specify the train/validation/test dataset splits (e.g., percentages or sample counts for each split). |
| Hardware Specification | Yes | We test the runtime of STR on halfcheetah-medium-replay on a GeForce RTX 3090. |
| Software Dependencies | No | The paper mentions optimizers and algorithms but does not specify the software libraries, frameworks, or version numbers required to reproduce the experiments. |
| Experiment Setup | Yes | Table 3, "Hyperparameters of policy training in STR", includes: critic learning rate 3e-4; actor learning rate 3e-4 with cosine schedule; batch size 256; discount factor 0.99; number of iterations 1e6; target update rate τ 0.005; policy update frequency 2; number of critics 4; temperature λ {0.5, 2} for Gym-MuJoCo and {0.1} for AntMaze; variance of the Gaussian policy 0.1. These are transcribed into the sketch below. |
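The paper releases no code, so the following is a minimal sketch only: the hyperparameters reported in Table 3, collected into a Python config dict, together with D4RL dataset loading via the public `d4rl` package. The dict key names, the config structure, and the `halfcheetah-medium-replay-v2` task name are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: STR training hyperparameters as reported in the paper's
# Table 3, plus D4RL dataset loading. Key names are assumptions; the paper
# provides no open-source code.
import gym
import d4rl  # registers the D4RL offline-RL environments with gym (Fu et al., 2020)

STR_HYPERPARAMS = {
    "critic_lr": 3e-4,
    "actor_lr": 3e-4,              # decayed with a cosine schedule
    "batch_size": 256,
    "discount": 0.99,
    "num_iterations": int(1e6),
    "target_update_rate": 0.005,   # tau
    "policy_update_freq": 2,
    "num_critics": 4,
    # Temperature lambda: swept over {0.5, 2} for Gym-MuJoCo, {0.1} for AntMaze
    "temperature": 0.5,
    "policy_variance": 0.1,        # variance of the Gaussian policy
}

# Load an offline dataset; the task name is an assumed example.
env = gym.make("halfcheetah-medium-replay-v2")
dataset = d4rl.qlearning_dataset(env)  # dict of observations, actions, rewards, ...
print({k: v.shape for k, v in dataset.items()})
```

`d4rl.qlearning_dataset` returns transition-level arrays (`observations`, `actions`, `next_observations`, `rewards`, `terminals`), which is the standard entry point for off-policy training loops like the one Algorithm 2 describes.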