Provably Good Batch Off-Policy Reinforcement Learning Without Great Exploration
Authors: Yao Liu, Adith Swaminathan, Alekh Agarwal, Emma Brunskill
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then validate a practical implementation of our algorithm on a discrete task and on continuous control benchmarks. It achieves better and more robust performance as the exploratoriness of the data distribution varies, compared with baseline algorithms. This work makes a concrete step forward toward providing guarantees on the quality of batch RL with function approximation. |
| Researcher Affiliation | Collaboration | Yao Liu (Stanford University, yaoliu@stanford.edu); Adith Swaminathan (Microsoft Research, adswamin@microsoft.com); Alekh Agarwal (Microsoft Research, alekha@microsoft.com); Emma Brunskill (Stanford University, ebrun@cs.stanford.edu) |
| Pseudocode | Yes | Algorithm 1: Pessimistic Policy Iteration (PPI); Algorithm 2: Pessimistic Q Iteration (PQI). An illustrative PQI-style sketch follows the table. |
| Open Source Code | No | The paper does not provide an explicit statement about open-sourcing its code, nor a link to a code repository for its methodology. |
| Open Datasets | Yes | We compare PQL with several state-of-the-art batch RL algorithms as well as several baselines, in a subset of tasks in the D4RL batch RL benchmark [11] (halfcheetah-medium, hopper-medium, and walker2d-medium). |
| Dataset Splits | No | The paper mentions collecting data ("10^4 transitions", "1M steps") and evaluating on D4RL tasks, but does not provide specific train/validation/test split percentages, sample counts, or explicit splitting methodology. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | Our algorithms need the hyperparameter b to trade off the conservatism of a large b (where the algorithm stays at its initialization in the limit) and the unfounded optimism of b = 0 (classical FQI/FPI). In discrete spaces, we can set b = n0/n, where n0 is our prior for the number of samples we need for reliable distribution estimates and n is the total sample size. In continuous spaces, we can set the threshold to be a percentile of μ̂, so as to filter out updates from rare outliers in the dataset. We can also run post-hoc diagnostics on the choice of b by computing the average of ζ(s, π(s)) for the resulting policy π over the batch dataset. A sketch of these choices follows the table. |
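
For reference, below is a minimal tabular sketch of a pessimism-filtered Q update in the spirit of the paper's Pessimistic Q Iteration (PQI), where Bellman backups are restricted to state-action pairs whose estimated behavior density μ̂ clears the threshold b. The tabular setting, the array form of `mu_hat`, the function name, and the choice of 0 as the pessimistic fallback value are illustrative assumptions, not the paper's implementation.

```python
# Minimal tabular sketch of a PQI-style, pessimism-filtered Q update.
# Assumptions (not the paper's code): tabular state/action spaces, mu_hat
# given as an (n_states, n_actions) array of estimated behavior densities,
# and a pessimistic value of 0 where the filter zeta fails.
import numpy as np

def pqi_sketch(transitions, n_states, n_actions, mu_hat, b,
               gamma=0.99, n_iters=100):
    """transitions: iterable of (s, a, r, s_next) index tuples from the batch."""
    zeta = (mu_hat >= b)                     # zeta(s, a) = 1{mu_hat(s, a) >= b}
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # Pessimistic state value: maximize only over filtered actions;
        # states with no well-supported action fall back to value 0.
        masked_Q = np.where(zeta, Q, -np.inf)
        V = masked_Q.max(axis=1)
        V = np.where(np.isfinite(V), V, 0.0)
        # Fitted update: average the sampled backups for each (s, a) pair.
        targets = np.zeros((n_states, n_actions))
        counts = np.zeros((n_states, n_actions))
        for s, a, r, s_next in transitions:
            targets[s, a] += r + gamma * V[s_next]
            counts[s, a] += 1
        Q = np.divide(targets, counts,
                      out=np.zeros_like(targets), where=counts > 0)
    # Greedy policy restricted to well-supported actions (arbitrary where none).
    pi = np.where(zeta, Q, -np.inf).argmax(axis=1)
    return Q, pi
```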
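
The hyperparameter guidance in the Experiment Setup row can be made concrete as follows. This is a hedged sketch: the helper names (`threshold_discrete`, `threshold_continuous`, `support_diagnostic`) and the default values of `n0` and `percentile` are assumptions for illustration, not taken from the paper.

```python
# Sketch of the threshold choices for b and the post-hoc diagnostic described
# in the Experiment Setup row. Helper names and defaults are assumptions.
import numpy as np

def threshold_discrete(n, n0=10):
    """Discrete spaces: b = n0 / n, with n0 a prior on the number of samples
    needed for a reliable density estimate and n the total batch size."""
    return n0 / n

def threshold_continuous(mu_hat_values, percentile=2.0):
    """Continuous spaces: set b to a low percentile of the estimated behavior
    densities over the batch, filtering out updates from rare outliers."""
    return np.percentile(mu_hat_values, percentile)

def support_diagnostic(mu_hat_at_policy_actions, b):
    """Post-hoc check on b: the average of zeta(s, pi(s)) over the batch,
    where zeta(s, a) = 1{mu_hat(s, a) >= b} and mu_hat_at_policy_actions is
    the estimated density evaluated at (s, pi(s)) for each batch state."""
    zeta = (np.asarray(mu_hat_at_policy_actions) >= b).astype(float)
    return float(zeta.mean())
```

A value of the diagnostic close to 1 indicates the learned policy mostly selects actions that are well-supported by the batch under the chosen b, which is the intended post-hoc sanity check.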