POLITEX: Regret Bounds for Policy Iteration using Expert Prediction

Authors: Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, Gellert Weisz

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on a queuing problem confirm that POLITEX is competitive with some of its alternatives, while preliminary results on Ms Pacman (one of the standard Atari benchmark problems) confirm the viability of POLITEX beyond linear function approximation.
Researcher Affiliation | Collaboration | Adobe Research, UC Berkeley, Google Brain, DeepMind. Correspondence to: Nevena Lazic <nevena@google.com>.
Pseudocode | Yes | Algorithm 1 POLITEX: POLicy ITeration using EXperts (a hedged sketch of the update follows the table).
Open Source Code | No | The paper does not provide any explicit statement about releasing the source code for the described methodology, nor a link to a code repository.
Open Datasets | Yes | We first study the performance of POLITEX with linear function approximation on the 4-dimensional and 8-dimensional queueing network problems described in de Farias & Van Roy (2003) (Figures 6 and 7). ... We compare a version of POLITEX to DQN (Mnih et al., 2013) on a standard Atari environment running Ms Pacman.
Dataset Splits | No | The paper describes the experimental setup and duration (e.g., '2000 phases of length τ = E') but does not specify traditional dataset splits (e.g., percentages or counts for training, validation, and test sets) in the context of continuous reinforcement learning.
Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU models, CPU types) used for running the experiments. It mentions a 'standard Atari environment', which implies a simulator, but gives no hardware specifics.
Software Dependencies | No | The paper mentions algorithms like LSPE, TD(0), SOLO FTRL, and DQN, but it does not specify software packages or libraries with version numbers (e.g., TensorFlow 2.x, PyTorch 1.x, scikit-learn 0.x) used for the implementation.
Experiment Setup | Yes | For all policies, we bias the covariance of the value functions with β = 0.1. For LSPI and POLITEX, we experiment with η = k/T, for k ∈ {1, 5, 10, 20, 100, 500, 1000, 2000, 4000}; the value k = 1 was best. ... We initialize to empty queues and run policies for E = 2000 phases of length τ = E. (The learning-rate sweep is sketched after the table.)
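
For reference, Algorithm 1 (POLITEX) can be summarized in a few lines: in each phase the agent acts with a Boltzmann (softmax) policy over the sum of all previously estimated action-value functions, runs that policy for τ steps, and fits a new action-value estimate from the collected data. The following is a minimal sketch assuming linear action-value estimates Q_π(s, a) ≈ φ(s, a)ᵀw; the helper names (phi, fit_q_weights, run_phase) are hypothetical placeholders, not the authors' code.

```python
import numpy as np

# Minimal sketch of Algorithm 1 (POLITEX), assuming linear action-value
# estimates Q_pi(s, a) ~= phi(s, a) @ w. The helpers `phi`, `fit_q_weights`
# and `run_phase` are assumed placeholders, not taken from the paper.

def politex(num_phases, tau, eta, num_actions, dim, phi, fit_q_weights, run_phase):
    w_sum = np.zeros(dim)                 # running sum of fitted Q-weights

    def policy(state):
        # Boltzmann policy over the cumulative Q-estimate of all past phases;
        # with w_sum = 0 (first phase) this reduces to the uniform policy.
        q = np.array([phi(state, a) @ w_sum for a in range(num_actions)])
        z = eta * (q - q.max())           # subtract max for numerical stability
        p = np.exp(z)
        return p / p.sum()

    for _ in range(num_phases):
        data = run_phase(policy, tau)     # run the current policy for tau steps
        w_sum += fit_q_weights(data)      # e.g. an LSPE / TD(0) fit of Q for this phase
    return policy
```

The distinguishing feature compared to standard (greedy) policy iteration is that the new policy is a softmax of the sum of all past value estimates rather than the greedy policy with respect to the latest one, which is what connects the analysis to expert prediction.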
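
The learning-rate grid quoted in the Experiment Setup row can be written compactly as below. This is a hedged sketch only: `T` (the run length used to scale η) and `run_experiment` are assumed names; the grid of k values and β = 0.1 are taken from the quoted setup, and k = 1 is reported as the best choice.

```python
# Hedged sketch of the reported hyperparameter sweep: eta = k / T for the
# listed values of k, with covariance bias beta = 0.1. `T` and
# `run_experiment` are assumed names, not from the paper.

def eta_grid(T):
    return [k / T for k in (1, 5, 10, 20, 100, 500, 1000, 2000, 4000)]

def sweep(T, run_experiment, beta=0.1):
    return {eta: run_experiment(eta=eta, beta=beta) for eta in eta_grid(T)}
```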