Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Online Robust Reinforcement Learning with Model Uncertainty
Authors: Yue Wang, Shaofeng Zou
NeurIPS 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our numerical experiments further demonstrate the robustness of our algorithms. |
| Researcher Affiliation | Academia | Yue Wang University at Buffalo Buffalo, NY 14228 EMAIL Shaofeng Zou University at Buffalo Buffalo, NY 14228 EMAIL |
| Pseudocode | Yes | Algorithm 1 Robust Q-Learning; Algorithm 2 Robust TDC with Linear Function Approximation |
| Open Source Code | No | The paper does not provide any links to open-source code or explicitly state that code is made available. |
| Open Datasets | Yes | We use Open AI gym framework [Brockman et al., 2016], and consider two different problems: Frozen lake and Cart-Pole. |
| Dataset Splits | No | The paper describes training on a 'perturbed MDP' and testing on an 'unperturbed MDP' but does not specify a separate validation split or its methodology. |
| Hardware Specification | No | The paper does not specify any hardware used for the experiments (e.g., CPU, GPU models). |
| Software Dependencies | No | The paper mentions 'Open AI gym framework' but does not provide version numbers for this or any other software components. |
| Experiment Setup | Yes | The behavior policy for all the experiments below is set to be a uniform distribution over the action space given any state, i.e., $\pi_b(a \mid s) = 1/\lvert A \rvert$ for any $s \in S$ and $a \in A$. We take the average over 30 trajectories. We set α = 0.2 and γ = 0.9. |
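
The experiment setup quoted in the last row can be illustrated with a minimal Python sketch. It is not the authors' code: the function names and the `run_trajectory` callback are hypothetical, and only the constants (α = 0.2, γ = 0.9, 30 trajectories) and the uniform behavior policy come from the quoted text. The excerpt does not say what role α plays, so the sketch only records its value.

```python
import random

# Values quoted in the "Experiment Setup" row; how ALPHA is used is not stated there.
ALPHA = 0.2
GAMMA = 0.9
NUM_TRAJECTORIES = 30


def uniform_behavior_policy(num_actions):
    """Return pi_b(. | s) as a list: probability 1/|A| per action, for any state s."""
    return [1.0 / num_actions] * num_actions


def sample_action(num_actions, rng):
    """Draw an action uniformly at random, independent of the current state."""
    return rng.randrange(num_actions)


def average_over_trajectories(run_trajectory, num_trajectories=NUM_TRAJECTORIES, seed=0):
    """Average a scalar outcome over independent runs.

    `run_trajectory` is a hypothetical callback that takes a random.Random
    instance and returns one number (for example, a return or an error).
    """
    rng = random.Random(seed)
    results = [run_trajectory(rng) for _ in range(num_trajectories)]
    return sum(results) / len(results)
```

For example, `uniform_behavior_policy(4)` assigns probability 0.25 to each of four actions, and `average_over_trajectories` could wrap any single-run evaluation routine to reproduce the 30-trajectory averaging described in the setup.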