Adversarially Trained Actor Critic for Offline Reinforcement Learning

Authors: Ching-An Cheng, Tengyang Xie, Nan Jiang, Alekh Agarwal

ICML 2022

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We test ATAC on the D4RL benchmark (Fu et al., 2020), and ATAC consistently outperforms state-of-the-art baselines across multiple continuous-control problems. These empirical results also validate the robust policy improvement property of ATAC (Fig. 1), which makes ATAC suitable for risk-sensitive applications. |
| Researcher Affiliation | Collaboration | ¹Microsoft Research, ²University of Illinois at Urbana-Champaign, ³Google Research |
| Pseudocode | Yes | Algorithm 1 ATAC (Theoretical Version) and Algorithm 2 ATAC (Practical Version) are provided (a hedged sketch of the practical update follows this table). |
| Open Source Code | Yes | The code is available at https://github.com/microsoft/ATAC. |
| Open Datasets | Yes | We test ATAC on the D4RL benchmark (Fu et al., 2020). |
| Dataset Splits | No | The paper mentions 'train' and 'test' for the D4RL benchmark, but does not explicitly specify train/validation/test splits, percentages, or absolute sample counts for data partitioning required for reproduction. It does mention '10 random seeds' and '100 epochs of BC for warm start, followed by 900 epochs of ATAC' for training, but it does not specify how the data itself is split. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running the experiments. It only mentions the D4RL benchmark environments. |
| Software Dependencies | No | The paper mentions ADAM (Kingma & Ba, 2015) for optimization and neural networks for the models, but does not provide version numbers for software dependencies such as Python, PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | In Algorithm 2, we use ηfast = 0.0005 and ηslow = 10⁻³ ηfast based on an offline tuning heuristic, τ = 0.005 from the work of Haarnoja et al. (2018), and w = 0.5, across all domains. ... We run 100 epochs of BC for warm start, followed by 900 epochs of ATAC, where 1 epoch denotes 2K gradient updates. ... minibatch size \|Dmini\| = 256, and the two-timescale stepsizes are set as ηfast = 0.0005 and ηslow = 10⁻³ ηfast. These stepsizes ηfast and ηslow were selected offline with a heuristic: since ATAC with β = 0 is IPM-IL, we did a grid search (over ηfast ∈ {5e-4, 5e-5, 5e-6} and ηslow ∈ {5e-5, 5e-6, 5e-7}, on the hopper-medium and hopper-expert datasets) and selected the combination that attains the lowest ℓ₂ IL error after 100 epochs. We set w = 0.5 in Eq. (7)... We use τ = 0.005 for target network update... The discount is set to the common γ = 0.99. The regularization coefficient β is our only hyperparameter that varies across datasets... We consider β in {0, 4⁻⁴, 4⁻³, 4⁻², 4⁻¹, 1, 4, 4², 4³, 4⁴}. (These settings are summarized in the sketches below.) |
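
The Pseudocode and Experiment Setup rows above describe a two-timescale actor-critic update with a Polyak-averaged target network. Below is a minimal, hedged PyTorch sketch of one such update step, assuming the general ATAC structure (the critic minimizes a pessimistic advantage term plus β times a Bellman-consistency surrogate, while the actor maximizes the critic's value of its own actions). The network sizes, the deterministic tanh policy, and the exact form of the w-mixed Bellman surrogate are illustrative assumptions, not the authors' implementation; the hyperparameter values are the ones reported in the table.

```python
# Minimal sketch of an ATAC-style two-timescale update (not the authors' code).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

GAMMA = 0.99                 # discount factor
TAU = 0.005                  # Polyak coefficient for the target critic
ETA_FAST = 5e-4              # fast (critic) stepsize
ETA_SLOW = 1e-3 * ETA_FAST   # slow (actor) stepsize
BETA = 4.0                   # regularization coefficient (the per-dataset hyperparameter)
W = 0.5                      # mixing weight in the Bellman surrogate, Eq. (7) (form assumed)

obs_dim, act_dim = 17, 6     # illustrative dimensions

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, out_dim))

critic = mlp(obs_dim + act_dim, 1)
target_critic = copy.deepcopy(critic)
actor = mlp(obs_dim, act_dim)        # deterministic tanh policy as a stand-in

critic_opt = torch.optim.Adam(critic.parameters(), lr=ETA_FAST)
actor_opt = torch.optim.Adam(actor.parameters(), lr=ETA_SLOW)

def q(net, s, a):
    return net(torch.cat([s, a], dim=-1)).squeeze(-1)

def atac_update(batch):
    """One gradient step on a minibatch dict with keys s, a, r, s2, done."""
    s, a, r, s2, done = (batch[k] for k in ("s", "a", "r", "s2", "done"))
    with torch.no_grad():
        pi_a = torch.tanh(actor(s))    # policy actions, detached for the critic step
        pi_a2 = torch.tanh(actor(s2))

    # Critic (fast timescale): the pessimistic advantage pushes the policy's value
    # below the data's, while the Bellman surrogate keeps the critic consistent.
    adv_loss = (q(critic, s, pi_a) - q(critic, s, a)).mean()
    with torch.no_grad():
        td_target = r + GAMMA * (1.0 - done) * q(target_critic, s2, pi_a2)
    td_resid = q(critic, s, a) - (r + GAMMA * (1.0 - done) * q(critic, s2, pi_a2))
    bellman_loss = (W * F.mse_loss(q(critic, s, a), td_target)
                    + (1.0 - W) * (td_resid ** 2).mean())
    critic_opt.zero_grad()
    (adv_loss + BETA * bellman_loss).backward()
    critic_opt.step()

    # Actor (slow timescale): maximize the critic's value of the policy's actions.
    actor_loss = -q(critic, s, torch.tanh(actor(s))).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Polyak averaging of the target critic with coefficient tau.
    with torch.no_grad():
        for p, tp in zip(critic.parameters(), target_critic.parameters()):
            tp.mul_(1.0 - TAU).add_(TAU * p)
```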
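
The training budget and tuning protocol quoted in the Dataset Splits and Experiment Setup rows can also be collected into a small configuration sketch. The dictionary keys below are hypothetical convenience names, not the authors' configuration fields; the values are taken directly from the table, and the β grid expression reproduces the reported set {0, 4⁻⁴, ..., 4⁴}.

```python
# Training schedule reported in the paper (values from the table above);
# the key names are hypothetical, not the authors' configuration names.
SCHEDULE = {
    "bc_warmstart_epochs": 100,      # behavior cloning warm start
    "atac_epochs": 900,              # adversarial training phase
    "grad_updates_per_epoch": 2_000, # 1 epoch = 2K gradient updates
    "minibatch_size": 256,
    "random_seeds": 10,
}

# beta is the only hyperparameter varied across datasets: {0} plus powers 4^-4 .. 4^4.
BETA_GRID = [0.0] + [4.0 ** k for k in range(-4, 5)]
```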