DeepAveragers: Offline Reinforcement Learning By Solving Derived Non-Parametric MDPs

Authors: Aayam Kumar Shrestha, Stefan Lee, Prasad Tadepalli, Alan Fern

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In theory, we show conditions that allow for lower-bounding the performance of DAC-MDP solutions. We also investigate the empirical behavior in a number of environments, including those with image-based observations. Overall, the experiments demonstrate that the framework can work in practice and scale to large complex offline RL problems.
Researcher Affiliation | Academia | Aayam Shrestha, Stefan Lee, Prasad Tadepalli, Alan Fern; Oregon State University, Corvallis, OR 97330, USA; {shrestaa, leestef, tadepall, alan.fern}@oregonstate.edu
Pseudocode | Yes | Pseudocode 1: GPU Value Iteration Kernel ... Pseudocode 2: GPU Value Iteration Function. (An illustrative value-iteration sketch appears after this table.)
Open Source Code | Yes | As an additional engineering contribution, this implementation will be made public.
Open Datasets | No | The paper states, 'Following recent work (Fujimoto et al., 2019), we generate datasets by first training a DQN agent for each game.' and 'We generate three datasets of size 100k each:'. This indicates the datasets were generated by the authors; no link or formal citation for public access is provided.
Dataset Splits | No | The paper mentions training and testing, but does not explicitly specify a separate validation set or its split percentages/counts for model selection.
Hardware Specification | Yes | We consider 3 GPUs, namely, GTX 1080ti, RTX 8000, and Tesla V100, each with a CUDA core count of 3584, 4608 and 6912, respectively. The serial implementation is run on an Intel Xeon processor.
Software Dependencies | No | The paper mentions Adam (Kingma & Ba, 2015) as the network optimizer and discusses CUDA, but does not provide specific version numbers for these or other key software libraries such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | Table 3 (All Hyperparameters for DQN and BCQ [Atari]) includes: learning rate 0.0000625, discount γ 0.99, mini-batch size 32, target network update frequency 8k training updates, evaluation ϵ 0.001, threshold τ (BCQ) 0.3. (These values are restated as a configuration snippet after this table.)
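
The pseudocode row above refers to the paper's GPU value-iteration routines (Pseudocode 1 and 2), which are not reproduced here. The following is a minimal, hypothetical sketch of batched value iteration over a tabular MDP using PyTorch tensors on a GPU, illustrating the kind of computation such a kernel performs; the function name, tensor shapes, iteration budget, and stopping tolerance are assumptions for illustration, not details taken from the paper.

import torch

def gpu_value_iteration(P, R, gamma=0.99, n_iters=1000, tol=1e-4):
    """Illustrative batched value iteration on a tabular MDP.

    P: (S, A, S) transition tensor on the GPU (rows sum to 1).
    R: (S, A) reward tensor on the GPU.
    Returns the value vector V of shape (S,) and Q-values of shape (S, A).
    NOTE: a sketch only; the paper's GPU kernel may differ substantially.
    """
    S, A, _ = P.shape
    V = torch.zeros(S, device=P.device)
    Q = R.clone()
    for _ in range(n_iters):
        # Batched Bellman backup: Q(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') V(s')
        Q = R + gamma * torch.einsum('sat,t->sa', P, V)
        V_new = Q.max(dim=1).values  # greedy value update
        if torch.max(torch.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q

# Example usage on a small random MDP (illustrative only):
device = 'cuda' if torch.cuda.is_available() else 'cpu'
P = torch.rand(100, 4, 100, device=device)
P = P / P.sum(dim=-1, keepdim=True)  # normalize into a valid transition tensor
R = torch.rand(100, 4, device=device)
V, Q = gpu_value_iteration(P, R)

The whole backup is expressed as dense tensor operations so that a single iteration runs as a batched GPU computation rather than a state-by-state loop, which is the general idea behind a GPU value-iteration kernel.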
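For readers scripting a reproduction, the Table 3 values quoted in the experiment-setup row can be restated as a configuration dictionary. Only the numeric values come from the paper; the dictionary itself and its key names are illustrative assumptions.

# Hyperparameters reported in Table 3 of the paper (DQN and BCQ, Atari).
# Key names are illustrative; only the values are taken from the paper.
ATARI_HYPERPARAMS = {
    "learning_rate": 6.25e-5,       # reported as 0.0000625
    "discount_gamma": 0.99,
    "mini_batch_size": 32,
    "target_update_freq": 8_000,    # training updates between target-network syncs
    "evaluation_epsilon": 0.001,
    "bcq_threshold_tau": 0.3,       # BCQ action-filtering threshold
}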