Provable Benefit of Multitask Representation Learning in Reinforcement Learning
Authors: Yuan Cheng, Songtao Feng, Jing Yang, Hong Zhang, Yingbin Liang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | As representation learning becomes a powerful technique to reduce sample complexity in reinforcement learning (RL) in practice, theoretical understanding of its advantage is still limited. In this paper, we theoretically characterize the benefit of representation learning under the low-rank Markov decision process (MDP) model. ... To the best of our knowledge, this is the first theoretical study that characterizes the benefit of representation learning in exploration-based reward-free multitask RL for both upstream and downstream tasks. ... Checklist: If you ran experiments... [N/A]; Did you include the code... [N/A]; Did you specify all the training details... [N/A]; Did you report error bars... [N/A]; Did you include the total amount of compute... [N/A] |
| Researcher Affiliation | Academia | Yuan Cheng, University of Science and Technology of China, cy16@mail.ustc.edu.cn; Songtao Feng, The Ohio State University, feng.1359@osu.edu; Jing Yang, The Pennsylvania State University, yangjing@psu.edu; Hong Zhang, University of Science and Technology of China, zhangh@ustc.edu.cn; Yingbin Liang, The Ohio State University, liang.889@osu.edu |
| Pseudocode | Yes | We first describe our proposed algorithm REFUEL depicted in Algorithm 1. ... Although our algorithm shares similar design principles as traditional algorithms for linear MDPs (Jin et al., 2021), it differs from them significantly as described in the following, due to the misspecification of representation from upstream and the general rather than linear reward function adopted. For ease of exposition, we define the Bellman operator $\mathbb{B}_h$ as $(\mathbb{B}_h f)(s, a) = r_h(s, a) + (\mathbb{P}_h^{(T+1)} f)(s, a)$ for any $f : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$. The main body of the algorithm consists of a backward iteration over steps. In each iteration h, the agent executes the following main steps. ... We present our pessimistic value iteration algorithm for downstream offline RL called DOFRL and defer detailed Algorithm 2 to Appendix C. ... We present our downstream online RL algorithm called DONRL and defer the detailed Algorithm 3 to Appendix D. (An illustrative sketch of this backward, pessimism-based iteration appears after the table.) |
| Open Source Code | No | The paper does not include an unambiguous statement or link indicating that the source code for the described methodology is publicly available. The ethics statement also indicates N/A for code. |
| Open Datasets | No | This is a theoretical paper focusing on mathematical proofs and algorithms without empirical studies. The ethics statement confirms 'N/A' for questions related to running experiments, including training details. |
| Dataset Splits | No | This is a theoretical paper focusing on mathematical proofs and algorithms without empirical studies. The ethics statement confirms 'N/A' for questions related to running experiments, including data splits like validation. |
| Hardware Specification | No | This is a theoretical paper and does not describe any specific hardware used for experiments. The ethics statement confirms 'N/A' for compute resources. |
| Software Dependencies | No | This is a theoretical paper and does not mention any specific software dependencies with version numbers for experimental reproducibility. The ethics statement confirms 'N/A' for running experiments. |
| Experiment Setup | No | This is a theoretical paper that provides algorithms and theoretical guarantees, but it does not detail an experimental setup with hyperparameters or system-level training settings. The ethics statement indicates 'N/A' for specifying training details. |
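
The Pseudocode row describes a backward value iteration built on a Bellman backup, with a pessimistic variant for the downstream offline setting. As a rough illustration of that structure (ridge regression of the backup $r_h + V_{h+1}$ onto learned features, an elliptical pessimism bonus, and a backward pass over steps), here is a minimal sketch. It assumes a discrete action set, a generic feature map `phi` produced by the upstream stage, and an illustrative dataset layout; none of these names or choices come from the paper, which reports no code or experiments.

```python
import numpy as np


def pessimistic_value_iteration(phi, dataset, H, num_actions, beta, lam=1.0):
    """Backward, pessimism-based value iteration on top of a learned representation.

    phi(s, a)  -> np.ndarray of shape (d,), the feature map learned upstream
    dataset    -> list of length H; dataset[h] is a list of (s, a, r, s_next)
                  transitions collected at step h (0-indexed)
    beta       -> pessimism bonus coefficient; lam -> ridge regularizer
    """
    d = phi(dataset[0][0][0], dataset[0][0][1]).shape[0]
    V = [lambda s: 0.0 for _ in range(H + 1)]      # V[H] is identically zero
    weights, covs = [None] * H, [None] * H

    for h in reversed(range(H)):                   # backward over steps H-1, ..., 0
        feats = np.array([phi(s, a) for (s, a, _, _) in dataset[h]])
        # Regression targets: empirical Bellman backup r + V_{h+1}(s')
        targets = np.array([r + V[h + 1](s_next) for (_, _, r, s_next) in dataset[h]])
        Lambda = lam * np.eye(d) + feats.T @ feats          # regularized Gram matrix
        w = np.linalg.solve(Lambda, feats.T @ targets)      # ridge-regression weights
        Lambda_inv = np.linalg.inv(Lambda)
        weights[h], covs[h] = w, Lambda_inv

        def V_h(s, w=w, Lambda_inv=Lambda_inv, h=h):
            # Pessimistic Q: linear estimate minus an elliptical bonus, clipped to [0, H-h]
            q_vals = []
            for a in range(num_actions):
                x = phi(s, a)
                bonus = beta * np.sqrt(x @ Lambda_inv @ x)
                q_vals.append(float(np.clip(x @ w - bonus, 0.0, H - h)))
            return max(q_vals)

        V[h] = V_h

    def policy(s, h):
        """Greedy action w.r.t. the pessimistic Q-estimate at step h."""
        w, Lambda_inv = weights[h], covs[h]
        scores = [phi(s, a) @ w - beta * np.sqrt(phi(s, a) @ Lambda_inv @ phi(s, a))
                  for a in range(num_actions)]
        return int(np.argmax(scores))

    return policy, V
```

The returned `policy(s, h)` picks the action with the highest pessimistic Q-estimate at step h. The coefficient `beta` stands in for the theoretical bonus coefficient, which analyses of this kind typically set from concentration bounds rather than tune empirically.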