Refined Regret for Adversarial MDPs with Linear Function Approximation
Authors: Yan Dai, Haipeng Luo, Chen-Yu Wei, Julian Zimmert
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over K episodes and the state space can be arbitrarily large. We assume that the Q-function of any policy is linear in some known features, that is, a linear function approximation exists. The best existing regret upper bound for this setting (Luo et al., 2021b) is of order Õ(K^{2/3}) (omitting all other dependencies), given access to a simulator. This paper provides two algorithms that improve the regret to Õ(√K) in the same setting. Our first algorithm makes use of a refined analysis of the Follow-the-Regularized-Leader (FTRL) algorithm with the log-barrier regularizer. This analysis allows the loss estimators to be arbitrarily negative and might be of independent interest. Our second algorithm develops a magnitude-reduced loss estimator, further removing the polynomial dependency on the number of actions in the first algorithm and leading to the optimal regret bound (up to logarithmic terms and dependency on the horizon). |
| Researcher Affiliation | Collaboration | Yan Dai (1); affiliations: (1) IIIS, Tsinghua University; (2) University of Southern California; (3) IDSS, MIT; (4) Google Research. |
| Pseudocode | Yes | Algorithm 1: Improved Linear-Q Algorithm (Using Log-Barrier Regularizers). Input: learning rate η, bonus parameter β, MGR parameters γ and ϵ, FTRL regularizer Ψ(p) = Σ_{i=1}^{\|A\|} ln(1/p_i). Algorithm 2: Improved Linear-Q Algorithm (Using Magnitude-Reduced Loss Estimators). Input: learning rate η, bonus parameter β, MGR parameters γ and ϵ, covariance estimation parameter M. Algorithm 3: Matrix Geometric Resampling Algorithm. Input: policy π, regularization parameter γ, desired precision ϵ. Algorithm 4: Dilated Bonus Calculation in Linear-Q Algorithms. Input: episode k ∈ [K], state s ∈ S, action a ∈ A. Algorithm 5: POLICYCOVER for Linear MDPs. Input: configurations M_0, N_0, α, failure probability δ. Algorithm 6: Improved Linear MDP Algorithm. Input: learning rate η, POLICYCOVER parameters α and T_0, bonus parameter β, covariance estimation parameter γ ∈ (0, 1/4), epoch length W, FTRL regularizer Ψ(p) = Σ_{i=1}^{A} ln(1/p_i), exploration probability δ_e. |
| Open Source Code | No | The paper does not provide any explicit statements about open-source code availability or links to code repositories. |
| Open Datasets | No | The paper is theoretical and focuses on regret bounds and algorithm analysis. It does not describe experiments performed on specific datasets or provide access information for any dataset. |
| Dataset Splits | No | The paper is theoretical and does not involve experimental data splits for training, validation, or testing. |
| Hardware Specification | No | The paper is theoretical and does not mention any hardware specifications used for running experiments. |
| Software Dependencies | No | The paper is theoretical and does not list any specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe an experimental setup with specific hyperparameters or training configurations. |
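
The abstract and pseudocode rows above both refer to an FTRL step with the log-barrier regularizer Ψ(p) = Σ_i ln(1/p_i). As a reading aid, here is a minimal Python sketch of that step over a finite action set. The function name `ftrl_log_barrier`, the bisection-based solver, and the toy loss values are our own illustrative choices, not the paper's implementation; note that the cumulative loss estimates may be negative, which the paper's refined analysis explicitly permits.

```python
import numpy as np

def ftrl_log_barrier(cum_loss, eta, tol=1e-12, max_iter=200):
    """One FTRL step with the log-barrier regularizer Psi(p) = sum_i ln(1/p_i).

    Returns the point p on the probability simplex minimizing
        <p, cum_loss> + Psi(p) / eta.
    The first-order conditions give p_i = 1 / (eta * L_i + lam) for a scalar
    lam chosen so the probabilities sum to one; lam is found by bisection.
    cum_loss may contain negative entries (the loss estimators in the paper
    can be arbitrarily negative).
    """
    L = np.asarray(cum_loss, dtype=float)
    n = len(L)
    lo = np.max(-eta * L) + 1e-15   # lam must keep every denominator positive
    hi = np.max(-eta * L) + n       # here the total probability mass is already <= 1
    for _ in range(max_iter):
        lam = 0.5 * (lo + hi)
        if np.sum(1.0 / (eta * L + lam)) > 1.0:
            lo = lam                # total mass too large -> increase lam
        else:
            hi = lam
        if hi - lo < tol:
            break
    p = 1.0 / (eta * L + 0.5 * (lo + hi))
    return p / p.sum()              # renormalize away residual bisection error

# Toy usage: three actions, one with a negative cumulative loss estimate.
print(ftrl_log_barrier(cum_loss=[0.3, -1.2, 0.8], eta=0.5))
```

Because the optimality conditions on the simplex reduce each coordinate to p_i = 1/(ηL_i + λ), the whole update is a one-dimensional root-finding problem in λ, which is why a simple bisection suffices here.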
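Algorithm 3 in the pseudocode row is a Matrix Geometric Resampling (MGR) routine with regularization parameter γ and desired precision ϵ. Below is a hedged sketch of the standard MGR idea (estimating the inverse feature covariance from simulator rollouts via a truncated Neumann series, in the spirit of Neu & Olkhovskaya, 2020). The function signature, the sampler `sample_feature`, and the truncation level M are hypothetical stand-ins; the paper's exact parameterization (e.g., how γ and ϵ determine the number of iterations) is not reproduced here.

```python
import numpy as np

def matrix_geometric_resampling(sample_feature, d, gamma, M, rng=None):
    """Sketch of Matrix Geometric Resampling for estimating Sigma^{-1},
    where Sigma = E[phi phi^T] is the feature covariance under a policy.

    Uses the truncated Neumann series
        Sigma^{-1} = gamma * sum_{j >= 0} (I - gamma * Sigma)^j,
    valid when gamma * ||Sigma|| < 1, replacing Sigma in each factor by an
    independent sampled outer product phi phi^T drawn from the simulator.
    `sample_feature`, `d` (feature dimension), `gamma`, and `M` are
    illustrative stand-ins for the paper's parameters.
    """
    rng = np.random.default_rng(rng)
    estimate = np.eye(d)   # j = 0 term of the series
    product = np.eye(d)    # running product prod_{i <= j} (I - gamma * phi_i phi_i^T)
    for _ in range(M):
        phi = sample_feature(rng)
        product = product @ (np.eye(d) - gamma * np.outer(phi, phi))
        estimate += product
    return gamma * estimate

# Toy usage with a hypothetical feature sampler.
def sample_feature(rng, d=4):
    return rng.normal(size=d) / np.sqrt(d)

Sigma_inv_hat = matrix_geometric_resampling(sample_feature, d=4, gamma=0.1, M=2000, rng=0)
print(Sigma_inv_hat.shape)
```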