Non-delusional Q-learning and value-iteration
Authors: Tyler Lu, Dale Schuurmans, Craig Boutilier
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Figure 2: Planning and learning in a grid world with random feature representations. (Left: 4×4 grid using 4 features; Right: 5×5 grid using 5 features.) Here an iteration means a full sweep over state-action pairs, except for Q-learning and PCQL, where an iteration is an episode of length 3/(1 − γ) = 60 using ε-greedy exploration with ε = 0.7. Dark lines: estimated maximum achievable expected value. Light lines: actual expected value achieved by the greedy policy. |
| Researcher Affiliation | Industry | Tyler Lu Google AI tylerlu@google.com Dale Schuurmans Google AI schuurmans@google.com Craig Boutilier Google AI cboutilier@google.com |
| Pseudocode | Yes | Algorithm 1 Policy-Class Value Iteration (PCVI); Algorithm 2 Policy-Class Q-learning (PCQL) |
| Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for their proposed methods (PCVI, PCQL) is open-source or publicly available. |
| Open Datasets | No | The paper describes experiments on a 'simple deterministic grid world' and 'random feature representations' which appear to be custom-generated for the experiments. It does not provide access information (link, DOI, citation) to a publicly available or open dataset. |
| Dataset Splits | No | The paper mentions 'ε-greedy exploration with ε = 0.7' for training and 'a full sweep over state-action pairs' for iterations, but does not specify dataset splits (e.g., percentages or counts) for training, validation, or testing. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the implementation of their algorithms or experiments. |
| Experiment Setup | Yes | Figure 2 states that 'an iteration is an episode of length 3/(1 − γ) = 60 using ε-greedy exploration with ε = 0.7'. The paper also states that a 'linear approximator' and 'random feature representations' were used for the grid world experiments. |
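The quoted episode length implies γ = 0.95, since 3/(1 − 0.95) = 60. As the paper's code is not public, the exploration setup from the Figure 2 caption can only be sketched; the snippet below is a minimal, hypothetical illustration of those two reported hyperparameters (the ε-greedy rule and the derived episode length), not the authors' PCQL implementation.

```python
import random

# Hyperparameters quoted from the Figure 2 caption of Lu et al. (2018).
GAMMA = 0.95                            # implied by episode length 3 / (1 - gamma) = 60
EPSILON = 0.7                           # epsilon-greedy exploration rate
EPISODE_LENGTH = round(3 / (1 - GAMMA)) # = 60 steps per episode

def epsilon_greedy(q_values, epsilon=EPSILON, rng=random):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the action with the highest Q-value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Note that with ε = 0.7 the agent explores on most steps, which matches the caption's emphasis on exploration-heavy episodes rather than near-greedy rollouts.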