Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
A Definition of Continual Reinforcement Learning
Authors: David Abel, Andre Barreto, Benjamin Van Roy, Doina Precup, Hado P. van Hasselt, Satinder Singh
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a visual of this domain in Figure 2(a), and conduct a simple experiment contrasting the performance of ε-greedy continual Q-learning (blue) that uses a constant step-size parameter of α = 0.1, with a convergent Q-learning (green) that anneals its step-size parameter over time to zero. Both use ε = 0.15, and we set the number of underlying MDPs to 10. We present the average reward with 95% confidence intervals, averaged over 250 runs, in Figure 2(b). |
| Researcher Affiliation | Industry | David Abel EMAIL Google DeepMind Andre Barreto EMAIL Google DeepMind Benjamin Van Roy EMAIL Google DeepMind Doina Precup EMAIL Google DeepMind Hado van Hasselt EMAIL Google DeepMind Satinder Singh EMAIL Google DeepMind |
| Pseudocode | No | The paper primarily presents theoretical definitions, theorems, and proofs, and does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not include any explicit statements about releasing source code or links to a code repository. |
| Open Datasets | No | The paper describes using a 'switching MDP environment from Luketina et al. [33]' for its example experiment. Although this cites prior work, it specifies an environment setup rather than naming, or providing access information for, a public benchmark dataset that can be downloaded or formally cited. |
| Dataset Splits | No | The paper describes an experiment where average reward is calculated over 250 runs, but it does not specify explicit training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper describes algorithms like Q-learning, but does not specify any software dependencies with version numbers, such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We... conduct a simple experiment contrasting the performance of ε-greedy continual Q-learning (blue) that uses a constant step-size parameter of α = 0.1, with a convergent Q-learning (green) that anneals its step-size parameter over time to zero. Both use ε = 0.15, and we set the number of underlying MDPs to 10. A minimal reproduction sketch, under stated assumptions, follows the table. |
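The quoted setup is nearly enough to re-run the comparison. Below is a minimal Python sketch of a switching-MDP experiment; only ε = 0.15, the constant step size α = 0.1, and the 10 underlying MDPs come from the quoted text. The state/action space sizes, switching interval, discount factor, random tabular MDPs, and the 1/n(s,a) annealing schedule for the convergent agent are all assumptions, and the sketch reports a single run rather than the paper's average over 250 runs.

```python
import numpy as np

# Hypothetical reproduction sketch. Parameters marked "assumed" are
# not stated in the report; only EPSILON, ALPHA, and N_MDPS come
# from the paper's quoted experiment description.

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 5, 2   # assumed sizes
N_MDPS = 10                  # number of underlying MDPs (from the paper)
SWITCH_EVERY = 1_000         # assumed switching interval
STEPS = 20_000               # assumed run length
EPSILON = 0.15               # exploration rate (from the paper)
ALPHA = 0.1                  # constant step size, continual agent (from the paper)
GAMMA = 0.9                  # assumed discount factor

def random_mdp():
    """Sample a random tabular MDP: transition tensor and reward matrix."""
    P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))
    R = rng.random((N_STATES, N_ACTIONS))
    return P, R

def run(anneal):
    """Epsilon-greedy Q-learning on a cyclically switching MDP; mean reward."""
    mdps = [random_mdp() for _ in range(N_MDPS)]
    Q = np.zeros((N_STATES, N_ACTIONS))
    counts = np.zeros((N_STATES, N_ACTIONS))  # visit counts for annealing
    s = 0
    rewards = np.empty(STEPS)
    for t in range(STEPS):
        P, R = mdps[(t // SWITCH_EVERY) % N_MDPS]  # active MDP this step
        if rng.random() < EPSILON:
            a = int(rng.integers(N_ACTIONS))       # explore
        else:
            a = int(np.argmax(Q[s]))               # exploit
        r = R[s, a]
        s_next = rng.choice(N_STATES, p=P[s, a])
        counts[s, a] += 1
        # Convergent agent: 1/n(s, a) step size decays to zero (assumed
        # schedule); continual agent keeps the constant step size ALPHA.
        alpha = 1.0 / counts[s, a] if anneal else ALPHA
        Q[s, a] += alpha * (r + GAMMA * Q[s_next].max() - Q[s, a])
        rewards[t] = r
        s = s_next
    return rewards.mean()

print("continual (alpha = 0.1):", run(anneal=False))
print("convergent (annealed):  ", run(anneal=True))
```

Under these assumptions, plotting the running average of per-step rewards for both agents should show the qualitative contrast the paper reports in Figure 2(b): the annealed agent's step size vanishes, so it stops tracking the switching MDPs, while the constant-step-size agent keeps adapting.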