Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Provably Good Batch Off-Policy Reinforcement Learning Without Great Exploration
Authors: Yao Liu, Adith Swaminathan, Alekh Agarwal, Emma Brunskill
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then validate a practical implementation of our algorithm in a discrete task and some continuous control benchmarks. It achieves better and more robust performance with how exploratory the data distribution is, compared with baseline algorithms. This work makes a concrete step forward on providing guarantees on the quality of batch RL with function approximation. |
| Researcher Affiliation | Collaboration | Yao Liu Stanford University EMAIL Adith Swaminathan Microsoft Research EMAIL Alekh Agarwal Microsoft Research EMAIL Emma Brunskill Stanford University EMAIL |
| Pseudocode | Yes | Algorithm 1 Pessimistic Policy Iteration (PPI) Algorithm 2 Pessimistic Q Iteration (PQI) |
| Open Source Code | No | The paper does not provide an explicit statement about open-sourcing their code or a link to a code repository for their methodology. |
| Open Datasets | Yes | We compare PQL with several state-of-the-art batch RL algorithms as well as several baselines, in a subset of tasks in the D4RL batch RL benchmark [11] (halfcheetah-medium, hopper-medium, and walker2d-medium). |
| Dataset Splits | No | The paper mentions collecting data ("10^4 transitions", "1M steps") and evaluating on D4RL tasks, but does not provide specific train/validation/test split percentages, sample counts, or explicit splitting methodology. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | Our algorithms need the hyperparameter b to trade off conservatism of a large b (where the algorithm stays at its initialization in the limit) and unfounded optimism of b = 0 (classical FQI/FPI). In discrete spaces, we can set b = n0/n where n0 is our prior for the number of samples we need for reliable distribution estimates and n is the total sample size. In continuous spaces, we can set the threshold to be a percentile of μ̂, so as to filter out updates from rare outliers in the dataset. We can also run post-hoc diagnostics on the choice of b by computing the average of ζ(s, π(s)) for the resulting policy π over the batch dataset. |
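
The three ways of choosing and checking the threshold b quoted in the Experiment Setup row can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The function names are hypothetical, the density estimates are assumed to be precomputed and passed in as plain lists, and ζ is assumed here to be an indicator that the estimated density of a state-action pair is at least b (the table itself does not define ζ).

```python
def discrete_threshold(n0, n):
    """Discrete case from the quoted text: b = n0 / n.

    n0 is the prior on how many samples are needed for a reliable
    distribution estimate; n is the total sample size.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return n0 / n


def percentile_threshold(mu_hat_values, q):
    """Continuous case: take b as the q-th percentile (0..100) of the
    estimated densities, using linear interpolation between order
    statistics (an illustrative choice, not taken from the paper)."""
    values = sorted(mu_hat_values)
    if not values:
        raise ValueError("mu_hat_values must be non-empty")
    if not 0 <= q <= 100:
        raise ValueError("q must be between 0 and 100")
    pos = (len(values) - 1) * q / 100.0
    lo = int(pos)                       # floor, since pos >= 0
    hi = min(lo + 1, len(values) - 1)
    return values[lo] + (pos - lo) * (values[hi] - values[lo])


def average_zeta(mu_hat_at_policy_actions, b):
    """Post-hoc diagnostic: average of zeta(s, pi(s)) over the batch.

    ASSUMPTION: zeta(s, a) = 1 if mu_hat(s, a) >= b, else 0.
    The input holds mu_hat(s, pi(s)) for each state s in the batch.
    """
    values = list(mu_hat_at_policy_actions)
    if not values:
        raise ValueError("need at least one state")
    return sum(1 for m in values if m >= b) / len(values)
```

Under the indicator assumption, an average near 1 would suggest the learned policy mostly picks actions the batch supports at the chosen b, while a value near 0 would suggest b is filtering out most of what the policy does.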