Adversarial Model for Offline Reinforcement Learning

Authors: Mohak Bhardwaj, Tengyang Xie, Byron Boots, Nan Jiang, Ching-An Cheng

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate these properties in practice, we design a scalable implementation of ARMOR, which, by adversarial training, can optimize policies without using model ensembles, in contrast to typical model-based methods. We show that ARMOR achieves competent performance with both state-of-the-art offline model-free and model-based RL algorithms and can robustly improve the reference policy over various hyperparameter choices.
Researcher Affiliation | Collaboration | Mohak Bhardwaj (University of Washington, mohakb@cs.washington.edu); Tengyang Xie (Microsoft Research & UW-Madison, tx@cs.wisc.edu); Byron Boots (University of Washington, bboots@cs.washington.edu); Nan Jiang (UIUC, nanjiang@illinois.edu); Ching-An Cheng (Microsoft Research, Redmond, chinganc@microsoft.com)
Pseudocode | Yes | Algorithm 1: ARMOR (Adversarial Model for Offline Reinforcement Learning). (See the objective sketch below the table.)
Open Source Code | Yes | Open source code is available at: https://sites.google.com/view/armorofflinerl/.
Open Datasets | Yes | We use the D4RL (Fu et al., 2020) continuous control benchmark datasets for all our experiments and the code will be made public. (See the data-loading sketch below the table.)
Dataset Splits | No | The paper mentions using D4RL datasets and refers to training steps (e.g., 'ARMOR is then trained for 1M steps on each dataset'), but it does not specify explicit training, validation, or test dataset splits (e.g., percentages, sample counts, or citations to predefined splits) within the text for reproducibility.
Hardware Specification | Yes | Each run of ARMOR has access to 4 CPUs with 28 GB RAM and a single Nvidia T4 GPU with 16 GB memory.
Software Dependencies | No | The paper mentions using the 'Adam optimizer' (Kingma and Ba, 2015) and 'feedforward neural networks' but does not specify software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1).
Experiment Setup | Yes | We parameterize π, f1, f2, and M using feedforward neural networks, and set η_fast = 5e-4, η_slow = 5e-7, w = 0.5, similar to Cheng et al. (2022). In all our experiments, we vary only the β and λ parameters, which control the amount of pessimism; others are fixed. Importantly, we set the rollout horizon to be the max episode horizon defined in the environment. The dynamics model is pre-trained for 100k steps using a model-fitting loss on the offline dataset. ARMOR is then trained for 1M steps on each dataset. Refer to Appendix F for more details. (See the configuration sketch below the table.)
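
As context for the Pseudocode row above, the following is a hedged sketch of the relative-pessimism objective that Algorithm 1 is organized around, reconstructed from the paper's setup rather than copied from it; the notation M_α (a version space of dynamics models whose data-fitting loss is within α of the best fit on the offline data) and π_ref (the reference policy) is assumed here.

    % Hedged sketch, assuming the relative-pessimism setup described above:
    %   \mathcal{M}_{\alpha}  -- models statistically consistent with the offline data
    %   \pi_{\mathrm{ref}}    -- reference policy to be robustly improved upon
    \hat{\pi} \;\in\; \operatorname*{argmax}_{\pi \in \Pi} \;
        \min_{M \in \mathcal{M}_{\alpha}} \left( V^{\pi}_{M} - V^{\pi_{\mathrm{ref}}}_{M} \right)

Here V^π_M denotes the return of policy π under model M; optimizing this worst-case difference is what underlies the robust-improvement claim quoted in the Research Type row.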
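
For the Open Datasets row, a minimal data-loading sketch assuming the standard d4rl Python package; the dataset name 'hopper-medium-v2' is an illustrative choice, and nothing below is taken from the authors' code.

    import gym
    import d4rl  # importing d4rl registers the offline-RL datasets with gym

    # Illustrative dataset name; the paper's exact dataset list is in its experiments section.
    env = gym.make("hopper-medium-v2")
    data = d4rl.qlearning_dataset(env)  # dict of numpy arrays

    # Typical keys: observations, actions, next_observations, rewards, terminals
    print(data["observations"].shape, data["actions"].shape)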
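
For the Experiment Setup row, a configuration sketch that gathers the quoted hyperparameters in one place; the class and field names are hypothetical, and any default not quoted in the row (e.g., the β and λ values, which the paper varies per dataset) is an assumption rather than the authors' setting.

    from dataclasses import dataclass

    @dataclass
    class ArmorConfig:
        # Values quoted in the Experiment Setup row:
        eta_fast: float = 5e-4               # fast learning rate
        eta_slow: float = 5e-7               # slow learning rate
        w: float = 0.5                       # mixing weight, as in Cheng et al. (2022)
        model_pretrain_steps: int = 100_000  # dynamics-model pre-training steps
        train_steps: int = 1_000_000         # ARMOR training steps per dataset
        rollout_horizon: str = "max_episode_horizon"  # rollouts use the env's max episode horizon
        # Varied per dataset to control the amount of pessimism; defaults here are assumptions:
        beta: float = 1.0
        lam: float = 1.0

    config = ArmorConfig()
    print(config)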