Reinforcement Learning with Feedback Graphs

Authors: Christoph Dann, Yishay Mansour, Mehryar Mohri, Ayush Sekhari, Karthik Sridharan

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | "We study RL in the tabular MDP setting where the agent receives additional observations per step in the form of transition samples. ... We formalize this setting using a feedback graph over state-action pairs and show that model-based algorithms can incorporate additional observations for more sample-efficient learning. We give a regret bound that predominantly depends on the size of the maximum acyclic subgraph of the feedback graph... Our main contributions, summarized in Table 1, are: We prove that, by incorporating the additional observations into the model estimation step, existing model-based RL algorithms [11, 12] can achieve regret and sample-complexity bounds that scale with the mas-number µ of the feedback graph... We give a lower bound on the regret (Appendix B)... We present an algorithm that overcomes the above challenges for the MDP setting and achieves a sample complexity bound that scales with the more favorable domination number γ in the leading term (Section 5)." (A toy computation of the mas-number µ and domination number γ appears below the table.)
Researcher Affiliation | Collaboration | Christoph Dann (Google Research, cdann@cdann.net); Yishay Mansour (Tel Aviv University and Google Research, mansour.yishay@gmail.com); Mehryar Mohri (Google Research and Courant Institute of Mathematical Sciences, mohri@google.com); Ayush Sekhari (Cornell University, as3663@cornell.edu); Karthik Sridharan (Cornell University, ks999@cornell.edu)
Pseudocode | Yes | Algorithm 1: Optimistic model-based RL; Algorithm 2: Sample Episode(·, s1, D). (A hedged sketch of the optimistic model-based update with feedback-graph observations appears below the table.)
Open Source Code | No | The paper does not provide any explicit statements or links indicating that source code for the methodology described is publicly available.
Open Datasets | No | The paper is theoretical and focuses on mathematical proofs and algorithms; it does not mention training on specific public datasets for empirical evaluation.
Dataset Splits | No | The paper is theoretical and does not describe empirical experiments; therefore no dataset splits (training, validation, test) are mentioned.
Hardware Specification | No | The paper is theoretical and does not describe any experimental setup or the specific hardware used to run experiments.
Software Dependencies | No | The paper is theoretical and does not describe any experimental setup; hence no specific software dependencies with version numbers are mentioned.
Experiment Setup | No | The paper is theoretical and focuses on algorithms and proofs; it does not include details about an experimental setup, hyperparameters, or training configurations.
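
For intuition about the two graph quantities the reported bounds depend on, the following is a minimal, self-contained sketch (not from the paper): it builds a tiny hypothetical feedback graph over four state-action pairs and computes its mas-number µ (size of the largest induced acyclic subgraph) and its domination number γ by brute force. The graph itself and the brute-force approach are assumptions for illustration; both quantities are NP-hard to compute on general graphs.

```python
# Toy illustration (not from the paper) of the feedback-graph quantities
# in the reported bounds: the mas-number mu and the domination number gamma.
# The graph below is hypothetical, and brute force is used for clarity.
from itertools import combinations

# Vertices are state-action pairs; a directed edge (u, v) means that
# taking pair u also reveals a transition sample for pair v.
V = range(4)
E = {(0, 1), (0, 2), (0, 3), (1, 2), (2, 0)}

def is_acyclic(nodes):
    """True iff the subgraph induced by `nodes` has no directed cycle."""
    nodes = set(nodes)
    edges = {(u, v) for (u, v) in E if u in nodes and v in nodes}
    while nodes:  # Kahn's algorithm: peel off vertices with no in-edges
        sources = {n for n in nodes if all(v != n for (_, v) in edges)}
        if not sources:
            return False  # every remaining vertex lies on a cycle
        nodes -= sources
        edges = {(u, v) for (u, v) in edges if u in nodes and v in nodes}
    return True

# mas-number mu: size of the largest induced acyclic subgraph.
mu = max(k for k in range(len(V) + 1)
         for S in combinations(V, k) if is_acyclic(S))

# Domination number gamma: smallest S whose out-neighborhoods (plus S
# itself, since playing a pair observes it) cover every vertex.
gamma = next(k for k in range(1, len(V) + 1)
             for S in combinations(V, k)
             if set(S) | {v for (u, v) in E if u in S} == set(V))

print(mu, gamma)  # 3 1
```

On this toy graph µ = 3 while γ = 1, illustrating the paper's point that a bound in the domination number can be far more favorable than one in the mas-number.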
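
The pseudocode row above reports Algorithm 1 as optimistic model-based RL. As a rough sketch of the idea, not the paper's exact algorithm, the snippet below runs optimistic value iteration on an empirical model and, crucially, updates that model with every side observation the feedback graph provides, not only with the pair actually taken. The environment interface (`env.reset`, `env.step_with_feedback`), the bonus term, and all constants are hypothetical placeholders.

```python
# Hedged sketch of optimistic model-based RL with feedback-graph
# observations. Interfaces and constants are assumptions, not the paper's.
import numpy as np

def optimistic_q(nS, nA, H, counts, p_sum, r_sum):
    """Finite-horizon optimistic value iteration on the empirical model."""
    n = np.maximum(counts, 1)               # visit counts per (s, a)
    P = p_sum / n[..., None]                # empirical transition model
    R = r_sum / n                           # empirical mean rewards
    bonus = np.sqrt(1.0 / n)                # crude exploration bonus
    Q = np.zeros((H + 1, nS, nA))
    for h in reversed(range(H)):
        V_next = Q[h + 1].max(axis=-1)
        Q[h] = np.clip(R + bonus + P @ V_next, 0.0, H)  # optimism via bonus
    return Q

def run(env, G, nS, nA, H, episodes):
    """G[(s, a)] lists the state-action pairs observed when (s, a) is taken."""
    counts = np.zeros((nS, nA))
    p_sum = np.zeros((nS, nA, nS))
    r_sum = np.zeros((nS, nA))
    for _ in range(episodes):
        Q = optimistic_q(nS, nA, H, counts, p_sum, r_sum)
        s = env.reset()
        for h in range(H):
            a = int(Q[h, s].argmax())       # act greedily w.r.t. optimistic Q
            # Besides its own transition, the step reveals one sample
            # (r, s') for every neighbor of (s, a) in the feedback graph.
            s_next, observations = env.step_with_feedback(a)
            for (so, ao), (r, sn) in observations.items():
                assert (so, ao) in G[(s, a)]  # neighbors only
                counts[so, ao] += 1           # every observation updates
                p_sum[so, ao, sn] += 1        # the model, not just the
                r_sum[so, ao] += r            # pair actually taken
            s = s_next
```

The design choice this sketch isolates is the one the paper's quoted contribution describes: the model-estimation step consumes all feedback-graph observations, which is what lets the resulting bounds scale with µ rather than with the number of state-action pairs.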