A Simple Reward-free Approach to Constrained Reinforcement Learning

Authors: Sobhan Miryoosefi, Chi Jin

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | This paper bridges reward-free RL and constrained RL. In particular, we propose a simple meta-algorithm such that, given any reward-free RL oracle, the approachability and constrained RL problems can be solved directly with negligible overhead in sample complexity. Utilizing existing reward-free RL solvers, our framework provides sharp sample complexity results for constrained RL in the tabular MDP setting, matching the best existing results up to a factor of horizon dependence; our framework directly extends to a setting of tabular two-player Markov games, and gives a new result for constrained RL with linear function approximation.
Researcher Affiliation | Academia | Princeton University. Correspondence to: Sobhan Miryoosefi <miryoosefi@cs.princeton.edu>.
Pseudocode | Yes | Algorithm 1: Meta-algorithm for VMDPs; Algorithm 2: Meta-algorithm for VMGs; Algorithm 3: Solving Constrained RL Using Approachability; Algorithm 4: Online gradient ascent (OGA); Algorithm 5: VI-Zero, Exploration Phase; Algorithm 6: Reward-Free RL for Linear VMDPs, Exploration Phase; Algorithm 7: Reward-Free RL for Linear VMDPs, Planning Phase; Algorithm 8: VI-Zero for VMGs, Exploration Phase.
Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | No | The paper focuses on theoretical analysis of algorithms in different settings (e.g., the tabular MDP setting, linear function approximation, and vector-valued Markov games) but does not mention specific datasets used for training or provide access information for them.
Dataset Splits | No | The paper is theoretical and analyzes sample complexity. It does not mention any dataset splits (training, validation, test) for experimental reproduction.
Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used to run experiments.
Software Dependencies | No | The paper is theoretical and does not list specific software dependencies with version numbers for an implementation or experimental setup.
Experiment Setup | No | The paper is theoretical and focuses on algorithmic design and analysis. It mentions hyperparameters such as the learning rate ηt within algorithm definitions, but these are abstract parameters of the theoretical algorithms, not concrete settings for an empirical experiment. No specific hyperparameter values or training configurations are provided.
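Among the pseudocode listed above, Algorithm 4 is an online gradient ascent (OGA) subroutine. As a rough illustration of what a projected OGA loop looks like in general (the function names, fixed step size, and unit-ball projection below are assumptions for the sketch, not the paper's exact Algorithm 4):

```python
import numpy as np

def project_unit_ball(theta):
    """Project a vector onto the Euclidean unit ball."""
    norm = np.linalg.norm(theta)
    return theta if norm <= 1.0 else theta / norm

def oga(gradients, eta):
    """Projected online gradient ascent (illustrative sketch).

    At each round t, play the current iterate theta_t, observe a
    gradient g_t, take an ascent step, and project back onto the
    unit ball. Returns the sequence of iterates played.
    """
    theta = np.zeros(len(gradients[0]))
    iterates = []
    for g in gradients:
        iterates.append(theta.copy())
        theta = project_unit_ball(theta + eta * np.asarray(g))
    return iterates

# Example: three rounds with hand-picked gradients and eta = 0.5.
its = oga([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], eta=0.5)
```

In approachability-style analyses, iterates like these serve as direction vectors that scalarize a vector-valued reward for a planning oracle; the projection keeps every iterate in a bounded comparator set, which is what the standard OGA regret bound requires.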