Policy Evaluation for Variance in Average Reward Reinforcement Learning
Authors: Shubhada Agrawal, Prashanth L A, Siva Theja Maguluri
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | "We design a temporal-difference (TD) type algorithm tailored for policy evaluation in this context. Our algorithm is based on linear stochastic approximation of an equivalent formulation of the asymptotic variance in terms of the solution of the Poisson equation. We consider both the tabular and linear function approximation settings, and establish O(1/k) finite time convergence rate, where k is the number of steps of the algorithm." Also: "We develop the first finite sample error bounds for the policy evaluation problem for asymptotic variance in a tabular setting, proving O(1/k) rate of convergence for the mean-squared error, where k is the time step. Here, O(·) notation hides log k and lower order dependencies." |
| Researcher Affiliation | Academia | H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, USA; Department of Computer Science and Engineering, Indian Institute of Technology Madras, India. Correspondence to: Shubhada Agrawal <sagrawal362@gatech.edu>. |
| Pseudocode | Yes | Algorithm 1: Policy Evaluation: Tabular Setting |
| Open Source Code | No | The paper does not include any statement or link providing concrete access to the source code for the described methodology. |
| Open Datasets | No | The paper is theoretical and does not describe experiments performed on a specific public dataset. Therefore, no information about concrete access to a publicly available dataset is provided. |
| Dataset Splits | No | The paper is theoretical and does not involve empirical evaluation with datasets, thus no information on training/validation/test splits is provided. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running experiments, as it is a theoretical work without empirical evaluation. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers, as it focuses on theoretical algorithm design and analysis. |
| Experiment Setup | No | The paper does not provide specific experimental setup details such as concrete hyperparameter values or training configurations, as it is a theoretical paper without empirical experiments. |
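The quoted methodology, expressing the asymptotic variance through the solution of the Poisson equation and estimating it with a TD-type linear stochastic approximation, can be sketched in a few lines. The snippet below is a minimal illustration, not a reproduction of the paper's Algorithm 1: it assumes the standard identity Λ = E_π[(r(s) − μ + V(s′) − V(s))²], where V solves the Poisson equation, and the example chain, step sizes, and variable names are hypothetical demo choices (the TD step k^(−1/2) in particular is a pragmatic pick; the paper's analysis concerns Θ(1/k)-type rates).

```python
import numpy as np

rng = np.random.default_rng(0)

# A small 3-state Markov reward chain (hypothetical demo instance):
# transition matrix P and per-state reward vector r.
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
r = np.array([1.0, 0.0, 2.0])
n = len(r)

# Exact reference values: stationary distribution pi (left eigenvector of
# P for eigenvalue 1), average reward mu, Poisson-equation solution V
# (via the fundamental-matrix trick), and the asymptotic variance
#   Lambda = E_pi[(r(s) - mu + V(s') - V(s))^2].
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi /= pi.sum()
mu = pi @ r
V = np.linalg.solve(np.eye(n) - P + np.outer(np.ones(n), pi), r - mu)
lam_exact = sum(pi[s] * P[s, sp] * (r[s] - mu + V[sp] - V[s]) ** 2
                for s in range(n) for sp in range(n))

# TD-type stochastic approximation along a single trajectory: track the
# average reward, a Poisson-equation estimate V_hat, and the variance
# estimate lam_hat as a running average of squared TD errors. The TD
# error is shift-invariant in V_hat, so no anchoring of V_hat is needed
# for the variance estimate.
mu_hat, lam_hat, V_hat = 0.0, 0.0, np.zeros(n)
s = 0
for k in range(1, 500_001):
    sp = rng.choice(n, p=P[s])
    a_avg = 1.0 / k            # running-average step size
    a_td = k ** -0.5           # larger TD step, a demo choice
    td = r[s] - mu_hat + V_hat[sp] - V_hat[s]   # average-reward TD error
    mu_hat += a_avg * (r[s] - mu_hat)
    lam_hat += a_avg * (td ** 2 - lam_hat)
    V_hat[s] += a_td * td
    s = sp

print(f"average reward: exact {mu:.4f}, estimate {mu_hat:.4f}")
print(f"asymptotic variance: exact {lam_exact:.4f}, estimate {lam_hat:.4f}")
```

Solving the Poisson equation in closed form here is only for checking the estimator; the stochastic-approximation loop itself uses just the sampled transitions, which is the point of the TD-type approach.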