On the Theory of Reinforcement Learning with Once-per-Episode Feedback
Authors: Niladri Chatterji, Aldo Pacchiano, Peter Bartlett, Michael Jordan
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We study a theory of reinforcement learning (RL) in which the learner receives binary feedback only once at the end of an episode. While this is an extreme test case for theory, it is also arguably more representative of real-world applications than the traditional requirement in RL practice that the learner receive feedback at every time step. Indeed, in many real-world applications of reinforcement learning, such as self-driving cars and robotics, it is easier to evaluate whether a learner's complete trajectory was either good or bad, but harder to provide a reward signal at each step. To show that learning is possible in this more challenging setting, we study the case where trajectory labels are generated by an unknown parametric model, and provide a statistically and computationally efficient algorithm that achieves sublinear regret. (A sketch of this feedback protocol appears after the table.) |
| Researcher Affiliation | Collaboration | Niladri S. Chatterji (Stanford University, niladri@cs.stanford.edu); Aldo Pacchiano (Microsoft Research, apacchiano@microsoft.com); Peter L. Bartlett (UC Berkeley, peter@berkeley.edu); Michael I. Jordan (UC Berkeley, jordan@cs.berkeley.edu) |
| Pseudocode | Yes | Algorithm 1: UCBVI with trajectory labels. ... Algorithm 2: Find exploration mixture. ... Algorithm 3: UCBVI with trajectory labels and added exploration. (See the optimistic-planning sketch after the table.) |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See Section F. |
| Open Datasets | No | The provided text focuses on theoretical aspects of reinforcement learning and defines a general Markov Decision Process (MDP) framework. It does not mention any specific named datasets (e.g., CIFAR-10, ImageNet) or provide links, DOIs, or citations to publicly available datasets used for training or evaluation. While it mentions experiments in an appendix not provided in the prompt, the text itself does not describe any specific dataset or its accessibility. |
| Dataset Splits | No | The provided text does not contain any specific details regarding training, validation, or test dataset splits (e.g., percentages, sample counts, or references to predefined splits). The paper is primarily theoretical, focusing on algorithms and regret bounds, rather than empirical data partitioning. |
| Hardware Specification | No | The provided text does not include any specific details about the hardware used for experiments, such as GPU models, CPU types, or memory specifications. There are no mentions of specific computing environments or machines. |
| Software Dependencies | No | The provided text does not list any specific software dependencies with version numbers (e.g., Python, PyTorch, specific solvers, or libraries). The focus is on the mathematical algorithms and their theoretical properties. |
| Experiment Setup | No | The provided text does not contain specific experimental setup details, such as hyperparameter values (e.g., learning rate, batch size, number of epochs) or training configurations. While algorithms have parameters, these are theoretical rather than concrete values for an empirical setup. |
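
The abstract quoted in the Research Type row describes the paper's feedback protocol: the learner observes a full trajectory but receives only one binary label at the end of the episode, generated by an unknown parametric model. Below is a minimal sketch of that protocol, assuming a logistic label model over a trajectory feature map (one instantiation of the paper's parametric setting); the toy MDP, the feature map `phi`, and the names `theta_star` and `run_episode` are hypothetical illustrations, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy tabular MDP: S states, A actions, horizon H, feature dim d.
S, A, H, d = 5, 2, 4, 3
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # transition kernels, shape (H, S, A, S)
theta_star = rng.normal(size=d)                  # unknown label-model parameter
feat = rng.normal(size=(S, A, d))                # per-step features

def phi(traj):
    """Trajectory feature map: here, the sum of per-step (state, action) features."""
    return sum(feat[s, a] for (s, a) in traj)

def run_episode(policy):
    """Roll out one episode. The learner sees states and actions, but the only
    reward signal is a single binary label at the end of the episode, drawn
    from a logistic model of the trajectory features."""
    traj, s = [], 0
    for h in range(H):
        a = policy(s, h)
        traj.append((s, a))
        s = rng.choice(S, p=P[h, s, a])
    # Once-per-episode feedback: P(y = 1 | trajectory) = sigmoid(<theta*, phi(traj)>)
    y = rng.binomial(1, sigmoid(theta_star @ phi(traj)))
    return traj, y

traj, label = run_episode(lambda s, h: rng.integers(A))  # uniform-random policy
```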
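The Pseudocode row lists UCBVI-style algorithms. The sketch below shows only the generic optimistic backward-induction core shared by UCBVI variants, assuming tabular transition estimates, per-step reward estimates, and exploration bonuses are already computed; in the paper, the reward estimates come from the learned trajectory-label model rather than from observed per-step rewards, and the function and array names here are assumptions for illustration.

```python
import numpy as np

def ucbvi_plan(P_hat, r_hat, bonus, H):
    """Optimistic backward induction, the planning core of UCBVI variants.

    P_hat : (H, S, A, S) estimated transition kernels
    r_hat : (H, S, A) estimated per-step rewards; in the paper these would be
            derived from the learned label model, since no per-step reward
            is ever observed
    bonus : (H, S, A) exploration bonuses
    Returns a greedy optimistic policy and its optimistic value function.
    """
    _, S, A, _ = P_hat.shape
    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        # Optimistic Q-values, clipped at the maximum achievable value H
        Q = np.minimum(r_hat[h] + bonus[h] + P_hat[h] @ V[h + 1], H)
        pi[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return pi, V
```

Executing the greedy policy `pi`, relabeling the collected trajectories via the label model, and refitting the estimates closes the loop; the added-exploration variant (Algorithm 3) additionally mixes in an exploration distribution, which this sketch omits.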