A Definition of Continual Reinforcement Learning
Authors: David Abel, Andre Barreto, Benjamin Van Roy, Doina Precup, Hado P. van Hasselt, Satinder Singh
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a visual of this domain in Figure 2(a), and conduct a simple experiment contrasting the performance of ε-greedy continual Q-learning (blue) that uses a constant step-size parameter of α = 0.1, with a convergent Q-learning (green) that anneals its step-size parameter over time to zero. Both use ε = 0.15, and we set the number of underlying MDPs to n = 10. We present the average reward with 95% confidence intervals, averaged over 250 runs, in Figure 2(b). |
| Researcher Affiliation | Industry | David Abel (dmabel@google.com), Andre Barreto (andrebarreto@google.com), Benjamin Van Roy (benvanroy@google.com), Doina Precup (doinap@google.com), Hado van Hasselt (hado@google.com), and Satinder Singh (baveja@google.com), all Google DeepMind. |
| Pseudocode | No | The paper primarily presents theoretical definitions, theorems, and proofs, and does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not include any explicit statements about releasing source code or links to a code repository. |
| Open Datasets | No | The paper describes using the 'switching MDP environment from Luketina et al. [33]' for its example experiment. Although this references prior work, it specifies a synthetic environment setup rather than naming, or providing access information for, a public benchmark dataset that could be downloaded or formally cited. |
| Dataset Splits | No | The paper describes an experiment where average reward is calculated over 250 runs, but it does not specify explicit training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper describes algorithms like Q-learning, but does not specify any software dependencies with version numbers, such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We... conduct a simple experiment contrasting the performance of ε-greedy continual Q-learning (blue) that uses a constant step-size parameter of α = 0.1, with a convergent Q-learning (green) that anneals its step-size parameter over time to zero. Both use ε = 0.15, and we set the number of underlying MDPs to n = 10. (A hedged code sketch of this setup follows the table.) |
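
To make the Experiment Setup entry concrete, here is a minimal Python sketch of the reported contrast: ε-greedy tabular Q-learning with a constant step size (α = 0.1) versus one whose step size is annealed toward zero, both with ε = 0.15, on a switching environment built from n = 10 underlying MDPs. The paper defers the environment specification to Luketina et al. [33], so the details below (state/action counts, switching period, reward structure, discount, annealing schedule, and run counts) are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

N_MDPS = 10          # number of underlying MDPs (from the paper)
N_STATES = 5         # assumed
N_ACTIONS = 2        # assumed
EPSILON = 0.15       # exploration rate (from the paper)
ALPHA_CONST = 0.1    # constant step size (from the paper)
GAMMA = 0.9          # assumed discount in the Q-learning update
SWITCH_EVERY = 1000  # assumed steps between MDP switches
TOTAL_STEPS = 10_000
N_RUNS = 20          # the paper averages over 250 runs

def random_mdp():
    """Sample a random tabular MDP: transition probabilities and rewards."""
    P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))
    R = rng.uniform(0.0, 1.0, size=(N_STATES, N_ACTIONS))
    return P, R

def run_agent(anneal):
    """One run of epsilon-greedy Q-learning on the switching MDP.

    If `anneal` is True, the step size decays toward zero (the 'convergent'
    agent); otherwise it stays at ALPHA_CONST (the 'continual' agent).
    """
    mdps = [random_mdp() for _ in range(N_MDPS)]
    Q = np.zeros((N_STATES, N_ACTIONS))
    s = 0
    rewards = np.empty(TOTAL_STEPS)
    for t in range(TOTAL_STEPS):
        P, R = mdps[(t // SWITCH_EVERY) % N_MDPS]   # current underlying MDP
        # epsilon-greedy action selection
        if rng.random() < EPSILON:
            a = int(rng.integers(N_ACTIONS))
        else:
            a = int(np.argmax(Q[s]))
        r = R[s, a]
        s_next = rng.choice(N_STATES, p=P[s, a])
        # step size: constant vs. annealed toward zero (assumed 1/(1+ct) schedule)
        alpha = ALPHA_CONST / (1.0 + 1e-3 * t) if anneal else ALPHA_CONST
        Q[s, a] += alpha * (r + GAMMA * Q[s_next].max() - Q[s, a])
        rewards[t] = r
        s = s_next
    return rewards

for label, anneal in [("continual (constant alpha)", False),
                      ("convergent (annealed alpha)", True)]:
    avg = np.mean([run_agent(anneal).mean() for _ in range(N_RUNS)])
    print(f"{label}: average reward = {avg:.3f}")
```

The intended contrast is the paper's qualitative point: the constant-step-size agent keeps adapting each time the underlying MDP switches, while the annealed agent effectively freezes its value estimates and cannot track later changes. Figure quality and exact numbers would depend on the true environment from Luketina et al. [33], which this sketch does not reproduce.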