In-sample Actor Critic for Offline Reinforcement Learning
Authors: Hongchang Zhang, Yixiu Mao, Boyuan Wang, Shuncheng He, Yi Xu, Xiangyang Ji
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results show that IAC obtains competitive performance compared to the state-of-the-art methods on Gym-MuJoCo locomotion domains and much more challenging AntMaze domains. |
| Researcher Affiliation | Academia | 1 Tsinghua University, 2 Dalian University of Technology; {hc-zhang19,myx21,wangby22,hesc16}@mails.tsinghua.edu.cn, yxu@dlut.edu.cn, xyji@tsinghua.edu.cn |
| Pseudocode | Yes | Algorithm 1 IAC |
| Open Source Code | No | The paper does not provide an explicit statement or link to the open-source code for the methodology described. |
| Open Datasets | Yes | We test IAC on D4RL benchmark (Fu et al., 2020), including Gym-MuJoCo locomotion domains and much more challenging AntMaze domains. |
| Dataset Splits | No | The paper mentions using the D4RL benchmark, but it does not explicitly provide details about the training, validation, or test dataset splits (e.g., percentages, sample counts, or specific citations for splits). |
| Hardware Specification | Yes | We test the runtime of IAC on halfcheetah-medium-replay on a GeForce RTX 3090. |
| Software Dependencies | No | The paper mentions 'Optimizer Adam' but does not name the deep learning framework or specify library versions (e.g., a PyTorch or TensorFlow version). |
| Experiment Setup | Yes | Table 3: Hyperparameters of policy training in IAC — Optimizer: Adam (Kingma & Ba, 2014); Critic learning rate: 3×10⁻⁴; Actor learning rate: 3×10⁻⁴ with cosine schedule; Batch size: 256; Discount factor: 0.99; Number of iterations: 10⁶; Target update rate τ: 0.005; Policy update frequency: 2; Inverse temperature of AWR β: {0.25, 5} for Gym-MuJoCo, {10} for AntMaze; Variance of Gaussian policy: 0.1; Architecture: Actor input-256-256-output, Critic input-256-256-1. |
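
The hyperparameters quoted above fully specify the network sizes and optimizer settings but not the framework. The following is a minimal sketch of how that setup could be instantiated, assuming PyTorch (the paper does not state its framework); all class and variable names, the observation/action dimensions, and the cosine-schedule wiring are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=256):
    # Two hidden layers of width 256, per the "input-256-256-output" architecture in Table 3.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


class GaussianActor(nn.Module):
    # Gaussian policy with fixed variance 0.1 (Table 3), i.e. std = sqrt(0.1).
    def __init__(self, obs_dim, act_dim, std=0.1 ** 0.5):
        super().__init__()
        self.net = mlp(obs_dim, act_dim)
        self.std = std

    def forward(self, obs):
        mean = self.net(obs)
        return torch.distributions.Normal(mean, self.std)


obs_dim, act_dim = 17, 6                 # hypothetical dimensions (e.g. a HalfCheetah-like task)
actor = GaussianActor(obs_dim, act_dim)
critic = mlp(obs_dim + act_dim, 1)       # Q(s, a) -> scalar, "input-256-256-1"

# Both learning rates are 3e-4; the actor additionally uses a cosine schedule
# over the 1e6 training iterations reported in Table 3.
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)
actor_sched = torch.optim.lr_scheduler.CosineAnnealingLR(actor_opt, T_max=int(1e6))

gamma, tau, batch_size = 0.99, 0.005, 256   # discount, target update rate, batch size
awr_beta = 0.25                             # inverse temperature of AWR; {0.25, 5} for Gym-MuJoCo, 10 for AntMaze
```

This sketch covers only the components listed in Table 3 (networks, optimizers, schedule, and scalar hyperparameters); the in-sample actor-critic update itself follows Algorithm 1 in the paper and is not reproduced here.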