Adaptive Temporal-Difference Learning for Policy Evaluation with Per-State Uncertainty Estimates

Authors: Carlos Riquelme, Hugo Penedones, Damien Vincent, Hartmut Maennel, Sylvain Gelly, Timothy A. Mann, Andre Barreto, Gergely Neu

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section we test the performance of Adaptive TD in a number of scenarios that we describe below. The scenarios (for which we fix a specific policy) capture a diverse set of aspects that are relevant to policy evaluation: low and high-dimensional state spaces, sharp value jumps or smoother epsilon-greedy behaviors, near-optimal and uniformly random policies. We present here the results for Labyrinth-2D and Atari environments, and Mountain Car is presented in the appendix, Section C. We compare Adaptive TD with a few baselines: a single MC network, raw TD, and TD(λ)." (see the baseline-target sketch after this table)
Researcher Affiliation | Collaboration | Hugo Penedones, Carlos Riquelme (Google Brain), Damien Vincent (Google Brain), Hartmut Maennel (Google Brain), Timothy Mann, André Barreto, Sylvain Gelly (Google Brain), Gergely Neu (Universitat Pompeu Fabra)
Pseudocode | Yes | "Algorithm 1: Adaptive TD" (a hedged sketch of the per-state rule follows this table)
Open Source Code | No | No explicit statement or link regarding the release of open-source code for the described methodology was found in the paper.
Open Datasets | Yes | "In this section we evaluate all the methods in a few Atari environments [3]: namely, Breakout, Space Invaders, Pong, and Ms Pacman."
Dataset Splits | No | The paper mentions 'validation sets' in a general discussion about data limitations, but it does not provide specific details on how data was split into training, validation, and test sets (e.g., percentages, sample counts, or explicit splits) for its experiments.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running experiments were mentioned in the paper.
Software Dependencies | No | No specific ancillary software details, such as library names with version numbers (e.g., Python, TensorFlow, PyTorch versions), were found in the paper.
Experiment Setup | Yes | "Accordingly, for Adaptive TD, we use an ensemble of 3 networks trained with the MC target, and confidence intervals at the 95% level. The data for each network in the ensemble is bootstrapped at the rollout level (i.e., we randomly pick rollouts with replacement)." (a sketch of this rollout-level bootstrap appears below)
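
The baselines quoted in the Research Type row (a single MC network, raw TD, and TD(λ)) differ only in the regression target each one fits. The following is a minimal, illustrative sketch under standard target definitions for an episodic rollout; the function name and hyperparameter values are ours, not the paper's:

```python
def baseline_targets(rewards, values, gamma=0.99, lam=0.9):
    """Per-step regression targets for the three baselines: Monte Carlo
    returns, one-step TD targets, and TD(lambda) lambda-returns.

    rewards[t] is the reward after step t and values[t] is the current
    estimate V(s_t); the episode is assumed to terminate after the last
    step (terminal value 0).
    """
    T = len(rewards)
    mc, td, lam_ret = [0.0] * T, [0.0] * T, [0.0] * T
    g = 0.0      # running MC return G_t
    g_lam = 0.0  # running lambda-return G_t^lambda
    for t in reversed(range(T)):
        v_next = values[t + 1] if t + 1 < T else 0.0
        g = rewards[t] + gamma * g                # MC: full return
        td[t] = rewards[t] + gamma * v_next       # raw TD: one-step bootstrap
        g_lam = rewards[t] + gamma * ((1 - lam) * v_next + lam * g_lam)
        mc[t], lam_ret[t] = g, g_lam              # TD(lambda): lambda-return
    return mc, td, lam_ret
```

Setting `lam=0` recovers the raw TD target and `lam=1` recovers the MC return; TD(λ) interpolates between the two globally, whereas the paper's adaptive scheme makes a per-state choice.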
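For the Pseudocode row, here is a minimal Python sketch of one reading of the per-state rule in Algorithm 1 (Adaptive TD): an ensemble of MC-trained value networks induces a confidence interval for V(s), and the one-step TD target is used only when it is consistent with that interval. The function and argument names are ours, and the Gaussian interval and projection fallback are assumptions; the paper's Algorithm 1 specifies the exact rule.

```python
import numpy as np

def adaptive_td_target(reward, gamma, v_next, mc_preds_s, z=1.96):
    """Sketch of a per-state Adaptive TD target for one transition (s, r, s').

    mc_preds_s : value predictions for the *current* state s from the
                 ensemble of MC-trained networks (3 in the paper's setup).
    v_next     : current value estimate V(s') used by the TD bootstrap.
    z          : 1.96 approximates the 95% confidence level from the paper.
    """
    mu = float(np.mean(mc_preds_s))
    sd = float(np.std(mc_preds_s))
    lower, upper = mu - z * sd, mu + z * sd

    td_target = reward + gamma * v_next  # ordinary one-step TD target

    # Trust the TD target when the MC ensemble deems it plausible;
    # otherwise fall back to the nearest interval endpoint (our reading).
    return float(np.clip(td_target, lower, upper))
```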
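Finally, the rollout-level bootstrap from the Experiment Setup row can be sketched as follows: each of the 3 ensemble members trains on its own resampled dataset, where whole rollouts (not individual transitions) are drawn with replacement. The `Rollout` representation and function name below are illustrative assumptions:

```python
import random
from typing import List, Sequence, Tuple

# A rollout here is a list of (state, MC return) training pairs; this
# concrete representation is an assumption for illustration.
Rollout = List[Tuple[tuple, float]]

def bootstrap_rollouts(rollouts: Sequence[Rollout],
                       num_members: int = 3,
                       seed: int = 0) -> List[List[Tuple[tuple, float]]]:
    """One training set per ensemble member, sampling whole rollouts
    with replacement, matching the paper's description."""
    rng = random.Random(seed)
    datasets = []
    for _ in range(num_members):
        picked = [rollouts[rng.randrange(len(rollouts))]
                  for _ in range(len(rollouts))]
        # Flatten each member's picked rollouts into (state, target) pairs.
        datasets.append([pair for rollout in picked for pair in rollout])
    return datasets
```

Keeping all transitions of a picked rollout together preserves the strong correlation between states within a trajectory, which is what distinguishes this scheme from a per-transition bootstrap.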