Policy Evaluation for Variance in Average Reward Reinforcement Learning

Authors: Shubhada Agrawal, Prashanth L.A., Siva Theja Maguluri

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We design a temporal-difference (TD) type algorithm tailored for policy evaluation in this context. Our algorithm is based on linear stochastic approximation of an equivalent formulation of the asymptotic variance in terms of the solution of the Poisson equation. We consider both the tabular and linear function approximation settings, and establish an O(1/k) finite-time convergence rate, where k is the number of steps of the algorithm. We develop the first finite-sample error bounds for the policy evaluation problem for asymptotic variance in a tabular setting, proving an O(1/k) rate of convergence for the mean-squared error, where k is the time step. Here, the O(·) notation hides log k factors and lower-order dependencies.
Researcher Affiliation | Academia | H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, USA; Department of Computer Science and Engineering, Indian Institute of Technology Madras, India. Correspondence to: Shubhada Agrawal <sagrawal362@gatech.edu>.
Pseudocode | Yes | Algorithm 1 (Policy Evaluation: Tabular Setting); a hedged sketch of a TD-type recursion in this spirit appears after this table.
Open Source Code | No | The paper does not include any statement or link providing concrete access to source code for the described methodology.
Open Datasets | No | The paper is theoretical and does not describe experiments on a specific public dataset, so no information about concrete access to a publicly available dataset is provided.
Dataset Splits | No | The paper is theoretical and does not involve empirical evaluation with datasets, so no training/validation/test splits are reported.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory), as it is a theoretical work without empirical evaluation.
Software Dependencies | No | The paper does not provide ancillary software details with version numbers, as it focuses on theoretical algorithm design and analysis.
Experiment Setup | No | The paper does not provide experimental setup details such as concrete hyperparameter values or training configurations, as it is a theoretical paper without empirical experiments.
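
Since the paper releases no code, the following is a minimal Python sketch of a TD-type recursion of the kind described above, restricted to the tabular setting. It relies on the standard martingale-difference identity sigma^2 = E_pi[(r(s) - rbar + V(s') - V(s))^2], where V solves the Poisson equation and rbar is the average reward. The single 1/k step size, the toy Markov chain, and all function names are illustrative assumptions; this is not the paper's Algorithm 1.

```python
import numpy as np

def td_variance_tabular(P, r, num_steps, seed=0):
    """Hedged sketch: jointly estimate the average reward, the
    differential (bias) value function, and the asymptotic variance
    of the reward in an ergodic Markov reward process.

    Assumes sigma^2 = E_pi[(r(s) - rbar + V(s') - V(s))^2], with V
    solving the Poisson equation. The identity is standard, but the
    recursion below is an illustrative stand-in, not the paper's
    exact Algorithm 1.

    P : (n, n) transition matrix under the fixed policy
    r : (n,) reward vector
    """
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    rbar = 0.0           # running estimate of the average reward
    V = np.zeros(n)      # running estimates of the differential values
    var = 0.0            # running estimate of the asymptotic variance
    s = int(rng.integers(n))
    for k in range(1, num_steps + 1):
        s_next = int(rng.choice(n, p=P[s]))
        alpha = 1.0 / k                           # illustrative 1/k step size
        delta = r[s] - rbar + V[s_next] - V[s]    # average-reward TD error
        rbar += alpha * (r[s] - rbar)             # track E_pi[r]
        V[s] += alpha * delta                     # TD(0) on the Poisson equation
        var += alpha * (delta ** 2 - var)         # track E_pi[delta^2]
        s = s_next
    return rbar, V, var

if __name__ == "__main__":
    # Tiny 3-state chain, purely for illustration.
    P = np.array([[0.1, 0.6, 0.3],
                  [0.4, 0.2, 0.4],
                  [0.3, 0.3, 0.4]])
    r = np.array([1.0, 0.0, 2.0])
    rbar, V, var = td_variance_tabular(P, r, num_steps=500_000)
    print(f"average reward ~ {rbar:.3f}, asymptotic variance ~ {var:.3f}")
```

Because V(s') - V(s) is invariant to an additive shift of V, the variance recursion does not depend on how V is anchored. A more faithful implementation would use separate step-size schedules for the three coupled recursions; controlling their interaction is precisely where a finite-time O(1/k) mean-squared-error analysis of the kind the paper proves becomes delicate.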