Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses
Authors: Haipeng Luo, Chen-Yu Wei, Chung-Wei Lee
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | Policy optimization is a widely-used method in reinforcement learning. Due to its local-search nature, however, theoretical guarantees on global optimality often rely on extra assumptions on the Markov Decision Processes (MDPs) that bypass the challenge of global exploration. To eliminate the need for such assumptions, in this work, we develop a general solution that adds dilated bonuses to the policy update to facilitate global exploration. To showcase the power and generality of this technique, we apply it to several episodic MDP settings with adversarial losses and bandit feedback, improving and generalizing the state-of-the-art. Specifically, in the tabular case, we obtain Õ(√T) regret where T is the number of episodes, improving the Õ(T^{2/3}) regret bound of [27]. When the number of states is infinite, under the assumption that the state-action values are linear in some low-dimensional features, we obtain Õ(T^{2/3}) regret with the help of a simulator, matching the result of [24] while importantly removing the need for an exploratory policy that their algorithm requires. To our knowledge, this is the first algorithm with sublinear regret for linear function approximation with adversarial losses, bandit feedback, and no exploratory assumptions. Finally, we also discuss how to further improve the regret or remove the need for a simulator using dilated bonuses, when an exploratory policy is available. ... Policy optimization methods are among the most widely-used methods in reinforcement learning. Their empirical success has been demonstrated in various domains such as computer games [26] and robotics [21]. However, due to their local-search nature, global optimality guarantees of policy optimization often rely on unrealistic assumptions to ensure global exploration (see e.g., [1, 3, 24, 30]), making them theoretically less appealing compared to other methods. |
| Researcher Affiliation | Academia | Haipeng Luo Chen-Yu Wei Chung-Wei Lee {haipengl,chenyu.wei,leechung}@usc.edu University of Southern California |
| Pseudocode | Yes | Algorithm 1: Policy Optimization with Dilated Bonuses (Tabular Case)... Algorithm 2: Policy Optimization with Dilated Bonuses (Linear-Q Case)... Algorithm 3: BONUS(t, x, a). An illustrative sketch of the dilated-bonus recursion and policy update is given below the table. |
| Open Source Code | No | No explicit statement about providing open-source code, or a link to a code repository for the methodology described, was found. The paper mentions an arXiv link to an improved version of the paper and a link to lecture notes, neither of which is a code repository. |
| Open Datasets | No | No concrete access information (link, DOI, repository name, formal citation with authors/year, or reference to established benchmark datasets) for a publicly available or open dataset was found. The paper is theoretical and does not describe experiments on specific datasets. |
| Dataset Splits | No | No specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce data partitioning was found. The paper is theoretical and does not describe empirical experiments with datasets. |
| Hardware Specification | No | No specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running experiments were found. The paper is theoretical and does not describe empirical experiments. |
| Software Dependencies | No | No specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment were found. The paper is theoretical and does not describe empirical experiments requiring specific software dependencies. |
| Experiment Setup | No | No specific experimental setup details (concrete hyperparameter values, training configurations, or system-level settings) were found in the main text. The paper is theoretical and does not describe empirical experiments with such setups. |
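
Since the paper provides only pseudocode (Algorithms 1–3 above), the following is a minimal illustrative sketch of the two ingredients the abstract describes: the dilated bonus, computed by a backward "dilated Bellman" recursion, and the bonus-adjusted multiplicative-weights policy update. It assumes a layered, tabular, finite-horizon MDP with a known transition kernel and a given immediate bonus, which is a simplification; the paper's algorithms instead work with estimated transitions, upper occupancy bounds, and specific bonus definitions. The function names and array layout here are hypothetical, not from the paper.

```python
import numpy as np

def dilated_bonus(P, pi, b, H):
    """Backward recursion for the dilated bonus on a layered tabular MDP (sketch).

    P[h]  : transition kernel from layer h to layer h+1, shape (S, A, S)
    pi[h] : policy at layer h, shape (S, A)
    b[h]  : immediate per-layer bonus, shape (S, A)  (assumed given here)

    Follows the dilated Bellman equation
        B_h(s, a) = b_h(s, a) + (1 + 1/H) * E_{s'~P, a'~pi}[ B_{h+1}(s', a') ],
    with B_H = 0, and returns the list [B_0, ..., B_{H-1}].
    """
    S, A = b[0].shape
    B_next = np.zeros(S)                      # E_{a'~pi}[B_H(s', a')] = 0 at the last layer
    B = [None] * H
    for h in reversed(range(H)):
        exp_next = P[h] @ B_next              # expected next-layer bonus, shape (S, A)
        B[h] = b[h] + (1.0 + 1.0 / H) * exp_next
        B_next = (pi[h] * B[h]).sum(axis=1)   # marginalize over a' ~ pi at layer h
    return B

def policy_update(pi_h, Q_hat_h, B_h, eta):
    """Multiplicative-weights update with the bonus subtracted from the loss estimate:
    pi'(a|s) ∝ pi(a|s) * exp(-eta * (Q_hat(s, a) - B(s, a)))."""
    logits = np.log(pi_h + 1e-12) - eta * (Q_hat_h - B_h)
    logits -= logits.max(axis=1, keepdims=True)   # stabilize before exponentiating
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)
```

The (1 + 1/H) dilation factor in the recursion is the defining feature of the dilated bonus relative to an ordinary bonus value function: the propagated bonus is slightly inflated at every layer before being added to the immediate bonus.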