In-sample Actor Critic for Offline Reinforcement Learning

Authors: Hongchang Zhang, Yixiu Mao, Boyuan Wang, Shuncheng He, Yi Xu, Xiangyang Ji

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results show that IAC obtains competitive performance compared to the state-of-the-art methods on Gym-MuJoCo locomotion domains and much more challenging AntMaze domains.
Researcher Affiliation | Academia | ¹Tsinghua University, ²Dalian University of Technology; {hc-zhang19, myx21, wangby22, hesc16}@mails.tsinghua.edu.cn, yxu@dlut.edu.cn, xyji@tsinghua.edu.cn
Pseudocode | Yes | Algorithm 1: IAC
Open Source Code | No | The paper does not provide an explicit statement or link to the open-source code for the methodology described.
Open Datasets | Yes | We test IAC on the D4RL benchmark (Fu et al., 2020), including Gym-MuJoCo locomotion domains and much more challenging AntMaze domains.
Dataset Splits | No | The paper mentions using the D4RL benchmark, but it does not explicitly provide details about the training, validation, or test dataset splits (e.g., percentages, sample counts, or specific citations for splits).
Hardware Specification | Yes | We test the runtime of IAC on halfcheetah-medium-replay on a GeForce RTX 3090.
Software Dependencies | No | The paper mentions 'Optimizer Adam' but does not specify a version number for the software library or framework used (e.g., PyTorch or TensorFlow version).
Experiment Setup | Yes | Table 3: Hyperparameters of policy training in IAC. Optimizer: Adam (Kingma & Ba, 2014); critic learning rate: 3×10⁻⁴; actor learning rate: 3×10⁻⁴ with cosine schedule; batch size: 256; discount factor: 0.99; number of iterations: 10⁶; target update rate τ: 0.005; policy update frequency: 2; inverse temperature of AWR β: {0.25, 5} for Gym-MuJoCo, {10} for AntMaze; variance of Gaussian policy: 0.1; architecture: actor input-256-256-output, critic input-256-256-1.
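
Since the authors do not release code, the following is a minimal PyTorch sketch of how the Table 3 configuration could be instantiated. The helper names (mlp, soft_update, awr_actor_loss), the example observation/action dimensions, the weight clipping constant, and the exponentiated-advantage (AWR-style) actor loss are illustrative assumptions, not the paper's verified implementation.

```python
# Minimal sketch (assumptions, not the authors' released code): instantiating
# the Table 3 configuration of IAC in PyTorch.
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=256):
    # "input-256-256-output" architecture from Table 3.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


# Example dimensions (task-dependent; HalfCheetah shown here).
obs_dim, act_dim = 17, 6
gamma, tau = 0.99, 0.005   # discount factor, target update rate
beta = 0.25                # AWR inverse temperature ({0.25, 5} Gym-MuJoCo, 10 AntMaze)
policy_std = 0.1 ** 0.5    # Gaussian policy with variance 0.1

actor = mlp(obs_dim, act_dim)             # outputs the Gaussian mean
critic = mlp(obs_dim + act_dim, 1)        # Q(s, a)
target_critic = mlp(obs_dim + act_dim, 1)
target_critic.load_state_dict(critic.state_dict())

actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)
# Cosine schedule on the actor learning rate over the 10^6 training iterations.
actor_sched = torch.optim.lr_scheduler.CosineAnnealingLR(actor_opt, T_max=1_000_000)


def soft_update(target, source, tau):
    # Polyak averaging of target-network parameters (rate tau = 0.005).
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * sp.data)


def awr_actor_loss(obs, act, advantage, beta):
    # AWR-style actor objective: log-likelihood of dataset actions weighted by
    # exp(beta * advantage); the clipping constant is an assumed stabilizer.
    weight = torch.clamp(torch.exp(beta * advantage), max=100.0)
    dist = torch.distributions.Normal(actor(obs), policy_std)
    log_prob = dist.log_prob(act).sum(dim=-1)
    return -(weight.detach() * log_prob).mean()
```

Under the Table 3 settings, a training loop built from these pieces would draw batches of 256 transitions, update the critic every iteration and the actor every second iteration (policy update frequency 2), and call soft_update and actor_sched.step() each step.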