Adversarially Trained Actor Critic for Offline Reinforcement Learning
Authors: Ching-An Cheng, Tengyang Xie, Nan Jiang, Alekh Agarwal
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test ATAC on the D4RL benchmark (Fu et al., 2020), and ATAC consistently outperforms state-of-the-art baselines across multiple continuous-control problems. These empirical results also validate the robust policy improvement property of ATAC (Fig. 1), which makes ATAC suitable for risk sensitive applications. |
| Researcher Affiliation | Collaboration | 1Microsoft Research 2University of Illinois at Urbana-Champaign 3Google Research |
| Pseudocode | Yes | Algorithm 1 ATAC (Theoretical Version) and Algorithm 2 ATAC (Practical Version) are provided. |
| Open Source Code | Yes | The code is available at https://github.com/microsoft/ATAC. |
| Open Datasets | Yes | We test ATAC on the D4RL benchmark (Fu et al., 2020) |
| Dataset Splits | No | The paper refers to 'train' and 'test' in the context of the D4RL benchmark, but does not explicitly specify the train/validation/test splits, percentages, or absolute sample counts needed to reproduce the data partitioning. It does mention '10 random seeds' and '100 epochs of BC for warm start, followed by 900 epochs of ATAC' for training; however, it does not specify how the data itself is split. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running the experiments. It only mentions 'D4RL benchmark' environments. |
| Software Dependencies | No | The paper mentions 'ADAM (Kingma & Ba, 2015)' for optimization and 'neural networks' for models, but does not provide specific version numbers for software dependencies like Python, PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | In Algorithm 2, we use ηfast = 0.0005 and ηslow = 10⁻³ ηfast based on an offline tuning heuristic, τ = 0.005 from the work of Haarnoja et al. (2018), and w = 0.5, across all domains. ... We run 100 epochs of BC for warm start, followed by 900 epochs of ATAC, where 1 epoch denotes 2K gradient updates. ... minibatch size \|Dmini\| = 256, and the two-timescale stepsizes are set as ηfast = 0.0005 and ηslow = 10⁻³ ηfast. These stepsizes ηfast and ηslow were selected offline with a heuristic: since ATAC with β = 0 is IPM-IL, we did a grid search (over ηfast ∈ {5e-4, 5e-5, 5e-6} and ηslow ∈ {5e-5, 5e-6, 5e-7}, on the hopper-medium and hopper-expert datasets) and selected the combination that attains the lowest ℓ2 IL error after 100 epochs. We set w = 0.5 in Eq. (7)... We use τ = 0.005 for target network update... The discount is set to the common γ = 0.99. The regularization coefficient β is our only hyperparameter that varies across datasets... We consider β in {0, 4⁻⁴, 4⁻³, 4⁻², 4⁻¹, 1, 4, 4², 4³, 4⁴}. |
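The hyperparameters quoted above are scattered through the paper's text; a minimal sketch collecting them into one place may help a reproduction attempt. The dictionary keys and variable names below are illustrative assumptions, not identifiers from the released ATAC code; the values are taken from the quoted setup.

```python
# Hedged sketch of the ATAC experiment configuration as reported in the paper.
# Key names are hypothetical; values follow the quoted setup text.
ETA_FAST = 5e-4  # critic stepsize (selected by the offline IPM-IL heuristic)

config = {
    "eta_fast": ETA_FAST,
    "eta_slow": 1e-3 * ETA_FAST,  # two-timescale: actor stepsize = 10^-3 * eta_fast
    "tau": 0.005,                 # target-network update rate (Haarnoja et al., 2018)
    "w": 0.5,                     # mixing weight in Eq. (7)
    "gamma": 0.99,                # discount factor
    "minibatch_size": 256,        # |Dmini|
    "bc_warmstart_epochs": 100,   # behavior-cloning warm start
    "atac_epochs": 900,           # 1 epoch = 2K gradient updates
    "seeds": 10,                  # random seeds per dataset
}

# beta is the only per-dataset hyperparameter: {0} plus powers 4^k for k = -4..4.
beta_grid = [0.0] + [4.0 ** k for k in range(-4, 5)]
```

This reconstructs the ten-value β grid {0, 4⁻⁴, ..., 4⁴} programmatically rather than hard-coding it, which makes the geometric spacing of the search explicit.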