Adversarial Model for Offline Reinforcement Learning

Authors: Mohak Bhardwaj, Tengyang Xie, Byron Boots, Nan Jiang, Ching-An Cheng

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate these properties in practice, we design a scalable implementation of ARMOR, which, by adversarial training, can optimize policies without using model ensembles, in contrast to typical model-based methods. We show that ARMOR achieves competent performance with both state-of-the-art offline model-free and model-based RL algorithms and can robustly improve the reference policy over various hyperparameter choices.
Researcher Affiliation | Collaboration | Mohak Bhardwaj (University of Washington, mohakb@cs.washington.edu); Tengyang Xie (Microsoft Research & UW-Madison, tx@cs.wisc.edu); Byron Boots (University of Washington, bboots@cs.washington.edu); Nan Jiang (UIUC, nanjiang@illinois.edu); Ching-An Cheng (Microsoft Research, Redmond, chinganc@microsoft.com)
Pseudocode | Yes | Algorithm 1: ARMOR (Adversarial Model for Offline Reinforcement Learning). (See the objective sketch below the table.)
Open Source Code | Yes | Open source code is available at: https://sites.google.com/view/armorofflinerl/.
Open Datasets | Yes | We use the D4RL (Fu et al., 2020) continuous control benchmark datasets for all our experiments and the code will be made public. (See the data-loading sketch below the table.)
Dataset Splits | No | The paper mentions using D4RL datasets and refers to training steps (e.g., 'ARMOR is then trained for 1M steps on each dataset'), but it does not specify explicit training, validation, or test dataset splits (e.g., percentages, sample counts, or citations to predefined splits) within the text for reproducibility.
Hardware Specification | Yes | Each run of ARMOR has access to 4 CPUs with 28 GB RAM and a single Nvidia T4 GPU with 16 GB memory.
Software Dependencies | No | The paper mentions using the 'Adam optimizer' (Kingma and Ba, 2015) and 'feedforward neural networks' but does not specify software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1).
Experiment Setup | Yes | We parameterize π, f1, f2, and M using feedforward neural networks, and set η_fast = 5e-4, η_slow = 5e-7, w = 0.5, similar to Cheng et al. (2022). In all our experiments, we vary only the β and λ parameters, which control the amount of pessimism; others are fixed. Importantly, we set the rollout horizon to be the max episode horizon defined in the environment. The dynamics model is pre-trained for 100k steps using a model-fitting loss on the offline dataset. ARMOR is then trained for 1M steps on each dataset. Refer to Appendix F for more details. (See the configuration sketch below the table.)
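
As context for the Pseudocode row above, the following is a hedged sketch of the relative-pessimism objective that Algorithm 1 is organized around, reconstructed from the paper's setup rather than copied from it; the notation M_α (a version space of dynamics models whose data-fitting loss is within α of the best fit on the offline data) and π_ref (the reference policy) is assumed here.

    % Hedged sketch, assuming the relative-pessimism setup described above:
    %   \mathcal{M}_{\alpha}  -- models statistically consistent with the offline data
    %   \pi_{\mathrm{ref}}    -- reference policy to be robustly improved upon
    \hat{\pi} \;\in\; \operatorname*{argmax}_{\pi \in \Pi} \;
        \min_{M \in \mathcal{M}_{\alpha}} \left( V^{\pi}_{M} - V^{\pi_{\mathrm{ref}}}_{M} \right)

Here V^π_M denotes the return of policy π under model M; optimizing this worst-case difference is what underlies the robust-improvement claim quoted in the Research Type row.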
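
For the Open Datasets row, a minimal data-loading sketch assuming the standard d4rl Python package; the dataset name 'hopper-medium-v2' is an illustrative choice, and nothing below is taken from the authors' code.

    import gym
    import d4rl  # importing d4rl registers the offline-RL datasets with gym

    # Illustrative dataset name; the paper's exact dataset list is in its experiments section.
    env = gym.make("hopper-medium-v2")
    data = d4rl.qlearning_dataset(env)  # dict of numpy arrays

    # Typical keys: observations, actions, next_observations, rewards, terminals
    print(data["observations"].shape, data["actions"].shape)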
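
For the Experiment Setup row, a configuration sketch that gathers the quoted hyperparameters in one place; the class and field names are hypothetical, and any default not quoted in the row (e.g., the β and λ values, which the paper varies per dataset) is an assumption rather than the authors' setting.

    from dataclasses import dataclass

    @dataclass
    class ArmorConfig:
        # Values quoted in the Experiment Setup row:
        eta_fast: float = 5e-4               # fast learning rate
        eta_slow: float = 5e-7               # slow learning rate
        w: float = 0.5                       # mixing weight, as in Cheng et al. (2022)
        model_pretrain_steps: int = 100_000  # dynamics-model pre-training steps
        train_steps: int = 1_000_000         # ARMOR training steps per dataset
        rollout_horizon: str = "max_episode_horizon"  # rollouts use the env's max episode horizon
        # Varied per dataset to control the amount of pessimism; defaults here are assumptions:
        beta: float = 1.0
        lam: float = 1.0

    config = ArmorConfig()
    print(config)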