Non-delusional Q-learning and value-iteration

Authors: Tyler Lu, Dale Schuurmans, Craig Boutilier

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Figure 2: Planning and learning in a grid world with random feature representations (left: 4×4 grid using 4 features; right: 5×5 grid using 5 features). Here an iteration means a full sweep over state-action pairs, except for Q-learning and PCQL, where an iteration is an episode of length 3/(1 − γ) = 60 using ε-greedy exploration with ε = 0.7. Dark lines: estimated maximum achievable expected value. Light lines: actual expected value achieved by the greedy policy. (A minimal sketch of this setup appears after the table.)
Researcher Affiliation | Industry | Tyler Lu (Google AI, tylerlu@google.com); Dale Schuurmans (Google AI, schuurmans@google.com); Craig Boutilier (Google AI, cboutilier@google.com)
Pseudocode | Yes | Algorithm 1: Policy-Class Value Iteration (PCVI); Algorithm 2: Policy-Class Q-learning (PCQL). (A sketch of the consistency test underlying both algorithms follows the table.)
Open Source Code | No | The paper contains no statement or link indicating that source code for the proposed methods (PCVI, PCQL) is publicly available.
Open Datasets | No | The experiments use a simple deterministic grid world with random feature representations, apparently custom-generated for the paper; no access information (link, DOI, or citation) for a publicly available dataset is given.
Dataset Splits | No | The paper mentions ε-greedy exploration with ε = 0.7 for training and full sweeps over state-action pairs, but specifies no dataset splits (e.g., percentages or counts) for training, validation, or testing.
Hardware Specification | No | The paper gives no details about the hardware (e.g., CPU or GPU models, memory, cloud instances) used to run the experiments.
Software Dependencies | No | The paper provides no version numbers for the software dependencies or libraries used to implement the algorithms and experiments.
Experiment Setup | Yes | Figure 2 states that an iteration is an episode of length 3/(1 − γ) = 60 using ε-greedy exploration with ε = 0.7. The paper also states that a linear approximator and random feature representations were used for the grid world experiments.
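
For concreteness, the episode length in Figure 2 follows from γ = 0.95: 3/(1 − γ) = 3/0.05 = 60 steps. The sketch below is a minimal reconstruction of the described learning loop, assuming a 4×4 grid flattened to 16 states, 4 actions, a random 4-dimensional feature map φ(s, a), a linear Q-approximator, ε-greedy exploration at ε = 0.7, and a standard semi-gradient Q-learning update. The transition dynamics, rewards, learning rate, and feature distribution here are placeholders, not taken from the paper.

    import numpy as np

    # Figure 2 constants: gamma = 0.95 gives episode length 3 / (1 - gamma) = 60;
    # exploration is epsilon-greedy with epsilon = 0.7.
    GAMMA, EPSILON = 0.95, 0.7
    EPISODE_LEN = round(3 / (1 - GAMMA))  # = 60

    rng = np.random.default_rng(0)

    # Assumed shapes: 4x4 grid -> 16 states, 4 actions, random 4-dim features.
    N_STATES, N_ACTIONS, N_FEATURES = 16, 4, 4
    phi = rng.normal(size=(N_STATES, N_ACTIONS, N_FEATURES))

    def epsilon_greedy(theta, s, eps=EPSILON):
        # Random action with probability eps, else greedy w.r.t. phi(s, .) . theta.
        if rng.random() < eps:
            return int(rng.integers(N_ACTIONS))
        return int(np.argmax(phi[s] @ theta))

    def q_learning_step(theta, s, a, r, s_next, alpha=0.1):
        # Semi-gradient Q-learning update for the linear approximator.
        td_target = r + GAMMA * np.max(phi[s_next] @ theta)
        return theta + alpha * (td_target - phi[s, a] @ theta) * phi[s, a]

    theta = np.zeros(N_FEATURES)
    for _ in range(EPISODE_LEN):         # one 60-step episode
        s = int(rng.integers(N_STATES))  # placeholder transitions; the paper's
        a = epsilon_greedy(theta, s)     # deterministic grid dynamics differ
        r, s_next = 0.0, int(rng.integers(N_STATES))
        theta = q_learning_step(theta, s, a, r, s_next)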
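
Algorithms 1 and 2 avoid delusion by only combining Q-value choices that some single policy in the greedy linear class can realize jointly. What follows is a hedged sketch of that consistency (witness) test alone, not the authors' full PCVI/PCQL: it uses a small linear program to ask whether one weight vector θ can make the greedy policy select every (state, action) pair in a candidate information set. The margin formulation, the box bounds on θ, and the scipy-based feasibility check are implementation assumptions.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    N_STATES, N_ACTIONS, N_FEATURES = 16, 4, 4
    phi = rng.normal(size=(N_STATES, N_ACTIONS, N_FEATURES))  # same assumed features as above

    def consistent(assignments, phi, n_actions):
        # Is there one theta with phi(s, a) . theta > phi(s, a') . theta
        # for every (s, a) in `assignments` and every a' != a?
        # Maximize a shared margin t under box bounds on theta; consistent iff t > 0.
        d = phi.shape[-1]
        rows, rhs = [], []
        for s, a in assignments:
            for a2 in range(n_actions):
                if a2 == a:
                    continue
                diff = phi[s, a] - phi[s, a2]
                rows.append(np.concatenate([-diff, [1.0]]))  # -diff.theta + t <= 0
                rhs.append(0.0)
        res = linprog(
            c=np.concatenate([np.zeros(d), [-1.0]]),   # minimize -t, i.e. maximize t
            A_ub=np.array(rows), b_ub=np.array(rhs),
            bounds=[(-1.0, 1.0)] * d + [(None, 1.0)],  # keep the LP bounded
        )
        return bool(res.success and res.x[-1] > 1e-9)

    # Can one greedy linear policy pick action 0 in state 0 AND action 1 in state 1?
    print(consistent([(0, 0), (1, 1)], phi, N_ACTIONS))

A positive optimal margin certifies that the listed greedy choices share a witness θ; in the paper's algorithms, backups are partitioned over such consistent information sets instead of taking an unconstrained per-state max.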