Near-Optimal Adversarial Reinforcement Learning with Switching Costs

Authors: Ming Shi, Yingbin Liang, Ness Shroff

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | First, we provide a regret lower bound showing that the regret of any algorithm must be at least Ω((HSA)^{1/3} T^{2/3}), where T, S, A and H are the number of episodes, states, actions and layers in each episode, respectively. Our lower bound indicates that, due to the fundamental challenge posed by switching costs in adversarial RL, the best regret achieved in static RL with switching costs (as well as in adversarial RL without switching costs), whose dependency on T is O(√T), is no longer achievable. Moreover, we propose two novel switching-reduced algorithms whose regrets match our lower bound when the transition function is known, and match it up to a small factor of O(H^{1/3}) when the transition function is unknown. Our regret analysis demonstrates the near-optimal performance of both algorithms. (A numeric illustration of the lower-bound scaling follows the table.)
Researcher Affiliation | Academia | Ming Shi, Yingbin Liang, Ness Shroff; Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH 43210, USA; {shi.1796, liang.889, shroff.11}@osu.edu
Pseudocode | Yes | Algorithm 1: Switching-rEduced EpisoDic relative entropy policy Search (SEEDS); Algorithm 2: SEEDS-Unknown Transition (SEEDS-UT). (A hedged sketch of the general switching-reduction pattern follows the table.)
Open Source Code | No | Insufficient information. The paper does not state that code is released and provides no link to a repository.
Open Datasets | No | Insufficient information. The paper is theoretical and focuses on algorithm design and regret analysis rather than empirical evaluation; it does not mention or use any datasets.
Dataset Splits | No | Insufficient information. As a theoretical paper with no empirical evaluation, it does not mention any training, validation, or test splits.
Hardware Specification | No | Insufficient information. The paper describes algorithms and their regret analysis; it does not mention any hardware (GPUs, CPUs, or cloud resources) used to run experiments.
Software Dependencies | No | Insufficient information. The paper does not name any software components or libraries, with or without version numbers, that would be needed to replicate experiments.
Experiment Setup | No | Insufficient information. The paper defines parameters for its algorithms (e.g., η, τ, γ), but these are theoretical choices made for the analysis; it does not describe an experimental setup with concrete hyperparameters, training configurations, or system-level settings.
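
For intuition about the lower bound quoted in the Research Type row, the following minimal sketch compares the Ω((HSA)^{1/3} T^{2/3}) scaling against the O(√T) rate achievable without switching costs. The instance sizes H, S, A and the episode counts are hypothetical choices for illustration, not values from the paper, and constant factors are ignored.

```python
# Compare the Omega((HSA)^{1/3} T^{2/3}) lower-bound scaling (adversarial RL
# with switching costs) to the O(sqrt(T)) rate achievable without them.
# H, S, A and the values of T below are hypothetical; constants are ignored.

def lower_bound_scaling(H: int, S: int, A: int, T: int) -> float:
    """(HSA)^{1/3} * T^{2/3}, the order of the paper's lower bound."""
    return (H * S * A) ** (1 / 3) * T ** (2 / 3)

def sqrt_t_scaling(T: int) -> float:
    """sqrt(T), the order of regret without switching costs."""
    return T ** 0.5

H, S, A = 10, 20, 5  # layers, states, actions (illustrative only)
for T in (10**3, 10**4, 10**5, 10**6):
    lb, sq = lower_bound_scaling(H, S, A, T), sqrt_t_scaling(T)
    print(f"T={T:>7}  T^(2/3) bound: {lb:12.1f}  sqrt(T): {sq:8.1f}  ratio: {lb / sq:7.1f}")
```

The ratio grows like (HSA)^{1/3} T^{1/6}, which makes concrete the claim above that the O(√T) regret of switching-cost-free adversarial RL is no longer achievable once switching costs are charged.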
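
The SEEDS algorithms themselves are not reproduced here. As a hedged illustration of the switching-reduction pattern such algorithms build on (play a fixed policy within a batch of rounds and update only at batch boundaries, so the number of policy switches stays small), below is a minimal sketch using a batched exponential-weights learner on adversarial loss vectors. The batch size, the Hedge-style update, and the full-information loss model are assumptions made for illustration only; they are not the paper's relative-entropy update.

```python
import math
import random

def batched_hedge(losses, num_actions, batch_size, eta):
    """Exponential-weights learner that only switches its played
    distribution at batch boundaries, so the number of policy switches
    is about len(losses) / batch_size rather than len(losses).

    losses: a list of length-num_actions loss vectors with entries in [0, 1].
    Returns (total incurred loss, number of policy switches).
    """
    weights = [1.0] * num_actions
    frozen = list(weights)            # distribution actually played
    pending = [0.0] * num_actions     # losses accumulated within the batch
    total_loss, switches = 0.0, 0

    for t, loss_vec in enumerate(losses):
        z = sum(frozen)
        probs = [w / z for w in frozen]
        action = random.choices(range(num_actions), weights=probs)[0]
        total_loss += loss_vec[action]
        for a in range(num_actions):
            pending[a] += loss_vec[a]

        if (t + 1) % batch_size == 0:  # update only between batches
            for a in range(num_actions):
                weights[a] *= math.exp(-eta * pending[a])
            total = sum(weights)
            weights = [w / total for w in weights]  # renormalize for stability
            pending = [0.0] * num_actions
            frozen = list(weights)
            switches += 1

    return total_loss, switches
```

Balancing the learning regret (which grows with the batch size) against the total switching cost (which shrinks with it) is what produces T^{2/3}-type rates in this family of problems; the paper's contribution, per the abstract quoted above, is to achieve the (HSA)^{1/3} T^{2/3} order near-optimally in episodic adversarial MDPs.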