Provably Good Batch Off-Policy Reinforcement Learning Without Great Exploration

Authors: Yao Liu, Adith Swaminathan, Alekh Agarwal, Emma Brunskill

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We then validate a practical implementation of our algorithm in a discrete task and some continuous control benchmarks. It achieves better and more robust performance than baseline algorithms across varying degrees of exploration in the data distribution. This work makes a concrete step forward toward providing guarantees on the quality of batch RL with function approximation.
Researcher Affiliation | Collaboration | Yao Liu (Stanford University, yaoliu@stanford.edu); Adith Swaminathan (Microsoft Research, adswamin@microsoft.com); Alekh Agarwal (Microsoft Research, alekha@microsoft.com); Emma Brunskill (Stanford University, ebrun@cs.stanford.edu)
Pseudocode | Yes | Algorithm 1: Pessimistic Policy Iteration (PPI); Algorithm 2: Pessimistic Q Iteration (PQI). (A hedged backup sketch appears after the table.)
Open Source Code | No | The paper does not explicitly state that its code is open-sourced, nor does it provide a link to a code repository for the methodology.
Open Datasets | Yes | We compare PQL with several state-of-the-art batch RL algorithms as well as several baselines, on a subset of tasks in the D4RL batch RL benchmark [11] (halfcheetah-medium, hopper-medium, and walker2d-medium). (A data-loading sketch appears after the table.)
Dataset Splits | No | The paper mentions collecting data ("10^4 transitions", "1M steps") and evaluating on D4RL tasks, but does not provide specific train/validation/test split percentages, sample counts, or an explicit splitting methodology.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup | Yes | Our algorithms need the hyperparameter b to trade off the conservatism of a large b (where the algorithm stays at its initialization in the limit) against the unfounded optimism of b = 0 (classical FQI/FPI). In discrete spaces, we can set b = n0/n, where n0 is our prior on the number of samples needed for a reliable distribution estimate and n is the total sample size. In continuous spaces, we can set the threshold to a percentile of the estimated behavior density μ̂, so as to filter out updates from rare outliers in the dataset. We can also run post-hoc diagnostics on the choice of b by computing the average of ζ(s, π(s)) for the resulting policy π over the batch dataset. (Sketches of these choices appear after the table.)
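To make the pseudocode row concrete, below is a minimal tabular sketch of a support-filtered ("pessimistic") Q backup in the spirit of PQI, using only the quantities named in the setup row: the support filter ζ(s, a) = 1{μ̂(s, a) ≥ b} and a pessimistic fallback value for poorly covered state-actions. This is an illustration of the filtering idea under those assumptions, not the authors' Algorithm 2; all names (pessimistic_q_backup, v_min, etc.) are hypothetical.

```python
import numpy as np

def pessimistic_q_backup(Q, counts, transitions, rewards,
                         gamma=0.99, b=0.01, v_min=0.0):
    """One sweep of a support-filtered Q backup (sketch, not the paper's code).

    counts[s, a]         -- empirical visitation counts from the batch
    transitions[(s, a)]  -- dict {s_next: empirical transition probability}
    rewards[s, a]        -- empirical mean reward
    """
    S, A = Q.shape
    mu_hat = counts / max(counts.sum(), 1)      # empirical state-action density
    zeta = (mu_hat >= b).astype(float)          # filter: 1 if well supported

    # Pessimistic next-state value: poorly covered actions fall back to v_min.
    V = np.max(zeta * Q + (1.0 - zeta) * v_min, axis=1)

    Q_new = np.copy(Q)
    for s in range(S):
        for a in range(A):
            if zeta[s, a] == 0.0:
                continue                        # no update through unsupported pairs
            expected_next = sum(p * V[s2] for s2, p in transitions[(s, a)].items())
            Q_new[s, a] = rewards[s, a] + gamma * expected_next
    return Q_new
```

Iterating this backup and then acting greedily with respect to the filtered Q mirrors a fitted Q iteration loop; a continuous-control variant would replace the count-based μ̂ with an estimated behavior density, as the setup row's percentile heuristic suggests.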
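The D4RL tasks listed in the datasets row can be loaded with the public d4rl package; a minimal sketch follows. The "-v0" version suffix is an assumption, since the paper does not state which dataset version was used.

```python
import gym
import d4rl  # registers the D4RL environments with gym

# Task names from the datasets row; the "-v0" suffix is assumed.
for task in ["halfcheetah-medium-v0", "hopper-medium-v0", "walker2d-medium-v0"]:
    env = gym.make(task)
    data = d4rl.qlearning_dataset(env)  # observations, actions, rewards, next_observations, terminals
    print(task, data["observations"].shape, data["rewards"].shape)
```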
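Finally, the two ways of choosing b described in the setup row, and the post-hoc ζ diagnostic, can be written out as simple helpers. This is a sketch under the assumption that μ̂ values for the batch (and for the learned policy's actions) are already available; the helper names are illustrative.

```python
import numpy as np

def choose_threshold(mu_hat_values=None, n0=None, n=None, percentile=None):
    """Pick the support threshold b.

    Discrete spaces:   b = n0 / n  (n0: prior sample count for a reliable
                                    density estimate; n: total batch size).
    Continuous spaces: b = a low percentile of the estimated behavior
                       density mu_hat over the batch, filtering rare outliers.
    """
    if percentile is not None:
        return np.percentile(mu_hat_values, percentile)
    return n0 / n

def support_diagnostic(mu_hat_under_pi, b):
    """Average of zeta(s, pi(s)) = 1{mu_hat(s, pi(s)) >= b} over the batch.
    Values near 1 suggest the learned policy mostly stays on well-supported
    state-actions; values near 0 suggest b is too aggressive or coverage is poor."""
    zeta = (np.asarray(mu_hat_under_pi) >= b).astype(float)
    return zeta.mean()
```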