DeepAveragers: Offline Reinforcement Learning By Solving Derived Non-Parametric MDPs
Authors: Aayam Kumar Shrestha, Stefan Lee, Prasad Tadepalli, Alan Fern
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In theory, we show conditions that allow for lower-bounding the performance of DAC-MDP solutions. We also investigate the empirical behavior in a number of environments, including those with image-based observations. Overall, the experiments demonstrate that the framework can work in practice and scale to large complex offline RL problems. |
| Researcher Affiliation | Academia | Aayam Shrestha, Stefan Lee, Prasad Tadepalli, Alan Fern Oregon State University Corvallis, OR 97330, USA {shrestaa, leestef, tadepall, alan.fern}@oregonstate.edu |
| Pseudocode | Yes | Pseudocode 1 GPU Value Iteration Kernel ... Pseudocode 2 GPU Value Iteration Function (a hedged sketch of such a value iteration kernel appears after this table) |
| Open Source Code | Yes | As an additional engineering contribution, this implementation will be made public. |
| Open Datasets | No | The paper states, 'Following recent work (Fujimoto et al., 2019), we generate datasets by first training a DQN agent for each game.' and 'We generate three datasets of size 100k each:'. These statements indicate that the datasets were generated by the authors rather than released publicly with a link or formal citation for access. |
| Dataset Splits | No | The paper mentions training and testing, but does not explicitly specify a separate validation dataset or its split percentages/counts for model selection. |
| Hardware Specification | Yes | We consider 3 GPUs, namely, GTX 1080ti, RTX 8000, and Tesla V100, each with a CUDA core count of 3584, 4608 and 6912, respectively. The serial implementation is run on an Intel Xeon processor. |
| Software Dependencies | No | The paper mentions 'Adam Kingma & Ba (2015)' as the network optimizer and discusses CUDA, but does not provide specific version numbers for these or other key software libraries like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | Table 3: All Hyperparameters for DQN and BCQ [Atari] includes: Learning rate 0.0000625, Discount γ 0.99, Mini-batch size 32, Target network update frequency 8k training updates, Evaluation ϵ 0.001, Threshold τ (BCQ) 0.3 (these values are collected into a configuration sketch after the table). |
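
The reported pseudocode ("GPU Value Iteration Kernel" / "GPU Value Iteration Function") describes batched value iteration over the derived DAC-MDP on a GPU. The sketch below is a minimal, hypothetical PyTorch rendering of that idea, assuming each state-action pair stores k candidate successor states with normalized weights; the tensor names, shapes, and stopping threshold are assumptions, not the authors' released implementation.

```python
# A minimal, hypothetical sketch of batched value iteration on a finite MDP
# with sparse k-neighbor transitions, in the spirit of the paper's
# "GPU Value Iteration" pseudocode. All names and shapes are assumptions.
import torch

def gpu_value_iteration(P_idx, P_val, R, gamma=0.99, eps=1e-4,
                        max_iters=5000, device="cuda"):
    """Batched value iteration.

    P_idx: (S, A, K) long tensor of successor-state indices
    P_val: (S, A, K) float tensor of transition probabilities (rows sum to 1)
    R:     (S, A)    float tensor of expected immediate rewards
    Returns the value function V of shape (S,) and a greedy policy of shape (S,).
    """
    P_idx, P_val, R = P_idx.to(device), P_val.to(device), R.to(device)
    S = P_idx.shape[0]
    V = torch.zeros(S, device=device)
    for _ in range(max_iters):
        # Gather successor values for every (state, action, neighbor) triple.
        V_next = V[P_idx]                          # (S, A, K)
        Q = R + gamma * (P_val * V_next).sum(-1)   # (S, A) Bellman backup
        V_new, pi = Q.max(dim=1)
        if (V_new - V).abs().max() < eps:          # sup-norm convergence check
            V = V_new
            break
        V = V_new
    return V, pi
```

In the paper's framework the successor candidates would come from k-nearest-neighbor lookups over the dataset, with distance-based cost penalties folded into the rewards; here those inputs are simply taken as given.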
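
For quick reference, the Table 3 hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object. The dictionary below is purely illustrative; only the numeric values come from the paper, and the key names are assumptions.

```python
# Hypothetical configuration dictionary collecting the DQN/BCQ [Atari]
# hyperparameters reported in the paper's Table 3; key names are assumptions.
DQN_BCQ_ATARI_CONFIG = {
    "optimizer": "Adam",               # Kingma & Ba (2015), as cited in the paper
    "learning_rate": 6.25e-5,          # 0.0000625
    "discount_gamma": 0.99,
    "mini_batch_size": 32,
    "target_update_frequency": 8_000,  # training updates
    "evaluation_epsilon": 0.001,
    "bcq_threshold_tau": 0.3,
}
```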