Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning
Authors: Ali Mousavi, Lihong Li, Qiang Liu, Denny Zhou
ICLR 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on benchmarks verify the effectiveness of our approach. |
| Researcher Affiliation | Collaboration | Google Research; University of Texas, Austin |
| Pseudocode | Yes | E PSEUDO-CODE OF ALGORITHM This section includes the pseudo-code of our algorithm that we described in Section 4. Algorithm 1 Black-box Off-policy Estimator based on MMD |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | We now focus on four classic control problems... Pendulum... Mountain Car... Cartpole... Acrobot. These are well-known, publicly available reinforcement learning environments. |
| Dataset Splits | No | The paper describes how samples are generated ("trajectories") and used for estimation, and mentions Monte-Carlo samples for reporting results, but does not specify explicit training, validation, and test dataset splits with percentages or counts. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., GPU/CPU models, memory specifications) used to run the experiments. |
| Software Dependencies | No | The paper mentions using a "3-layer (...) feed-forward neural network with the sigmoid activation function" but does not specify any software libraries, frameworks, or their version numbers that were used for implementation. |
| Experiment Setup | Yes | For each environment, we train a near-optimal policy $\pi^+$ using the Neural Fitted Q Iteration algorithm (Riedmiller, 2005). We then set the behavior and target policies as $\pi_{\mathrm{BEH}} = \alpha_1 \pi^+ + (1 - \alpha_1)\pi^-$ and $\pi = \alpha_2 \pi^+ + (1 - \alpha_2)\pi^-$, where $\pi^-$ denotes a random policy, and $0 \le \alpha_1, \alpha_2 \le 1$ are two constant values making the behavior policy distinct from the target policy. In our experiments, we set $\alpha_1 = 0.7$ and $\alpha_2 = 0.9$. In all the cases, we use a 3-layer (having 30, 20, and 10 hidden neurons) feed-forward neural network with the sigmoid activation function as our parametric model in equation 8. |
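
The Experiment Setup row describes two ingredients that are easy to misread in prose: policies built as convex mixtures of a near-optimal and a random policy, and a small feed-forward network with hidden layers of 30, 20, and 10 sigmoid units. The following is a minimal NumPy sketch of both, not the authors' implementation (no code was released). The function names, the uniform random policy, the example action probabilities, the weight initialization, and the linear output layer are all assumptions made for illustration.

```python
import numpy as np


def mixture_action_probs(probs_near_optimal, probs_random, alpha):
    """Action distribution of alpha * pi_plus + (1 - alpha) * pi_minus.

    Hypothetical helper: both inputs are probability vectors over the same
    discrete action set, and alpha is the mixing weight in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    p_plus = np.asarray(probs_near_optimal, dtype=float)
    p_minus = np.asarray(probs_random, dtype=float)
    return alpha * p_plus + (1.0 - alpha) * p_minus


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_mlp(input_dim, hidden=(30, 20, 10), output_dim=1, seed=0):
    """Create (weight, bias) pairs for hidden layers of 30, 20, and 10 units.

    The small Gaussian initialization and the single output unit are
    assumptions; the source only states the hidden sizes and the activation.
    """
    rng = np.random.default_rng(seed)
    sizes = [input_dim, *hidden, output_dim]
    return [
        (rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def mlp_forward(params, x):
    """Forward pass: sigmoid on every hidden layer, linear output (assumed)."""
    h = np.asarray(x, dtype=float)
    for weights, bias in params[:-1]:
        h = sigmoid(h @ weights + bias)
    weights, bias = params[-1]
    return h @ weights + bias


# Illustrative three-action example; the probabilities below are made up.
pi_plus = [0.9, 0.1, 0.0]          # stand-in for a near-optimal policy
pi_minus = [1 / 3, 1 / 3, 1 / 3]   # uniform random policy (assumption)

behavior = mixture_action_probs(pi_plus, pi_minus, alpha=0.7)  # alpha_1 = 0.7
target = mixture_action_probs(pi_plus, pi_minus, alpha=0.9)    # alpha_2 = 0.9

params = init_mlp(input_dim=3)
outputs = mlp_forward(params, np.zeros((5, 3)))  # batch of 5 dummy states
```

Because the target policy puts weight 0.9 on the near-optimal policy and the behavior policy only 0.7, the two distributions differ on every state where the near-optimal policy is not itself uniform, which is what makes the estimation problem off-policy.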