Warm-Start Actor-Critic: From Approximation Error to Sub-optimality Gap
Authors: Hang Wang, Sen Lin, Junshan Zhang
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We consider experiments over the Gridworld benchmark task. In particular, we consider the following sizes of the grid to represent different problem complexity, i.e., 10×10, 15×15, and 20×20. |
| Researcher Affiliation | Academia | (1) Department of ECE, University of California, Davis, CA, USA; (2) Department of ECE, The Ohio State University, Columbus, OH, USA. |
| Pseudocode | No | No explicit pseudocode or algorithm block was found. The methods are described using mathematical equations and prose. |
| Open Source Code | No | No statement regarding the release of open-source code or a link to a code repository was found. |
| Open Datasets | No | We consider experiments over the Gridworld benchmark task. In particular, we consider the following sizes of the grid to represent different problem complexity, i.e., 10×10, 15×15, and 20×20. - While the paper mentions a benchmark task, it does not provide specific access information (link, DOI, formal citation with author/year, or specific file names) for the dataset used for training. |
| Dataset Splits | No | No explicit information on training/validation/test splits (e.g., percentages, sample counts, or references to predefined splits) was found. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts, or cloud instance types) used for running experiments were provided. |
| Software Dependencies | No | No specific software dependencies with version numbers were mentioned. |
| Experiment Setup | Yes | The discounting factor is set as γ = 0.9. We consider the grid with 10 rows and 10 columns such that the state space has 100 states. ... we let m be large enough, e.g., m = 1000, in the Critic update Eqn. (28). ... we study the Critic update with finite-time Bellman evaluation, e.g., m = 500, 50, 20, 5. ... we add the uniform noise e(t) in the value function with different bias, e.g., E[e(t)] = 0, 0.5, 1, 1. ... with probability p, the agent will choose the action following the current policy, while with probability 1 − p, the agent will choose a random action. By setting different p, we show in Fig. 7 that the approximation error in the Actor update may significantly degrade the learning performance. |
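
Since no code is released, the experiment setup above can only be approximated. The sketch below is not the authors' implementation: it is a minimal tabular actor-critic loop on a deterministic n×n Gridworld with γ = 0.9, a Critic that applies m finite-time Bellman-evaluation steps per iteration, and an Actor that follows the current greedy policy with probability p and a random action with probability 1 − p. The grid dynamics, reward placement, and warm-start (initial) policy are illustrative assumptions.

```python
import numpy as np

def make_gridworld(n=10, goal_reward=1.0):
    """Deterministic n x n grid, 4 actions (up/down/left/right), goal in the bottom-right corner."""
    n_states = n * n
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    P = np.zeros((n_states, len(moves)), dtype=int)   # next-state table
    R = np.zeros((n_states, len(moves)))              # reward table
    goal = n_states - 1
    for s in range(n_states):
        r, c = divmod(s, n)
        for a, (dr, dc) in enumerate(moves):
            nr = min(max(r + dr, 0), n - 1)
            nc = min(max(c + dc, 0), n - 1)
            P[s, a] = nr * n + nc
            R[s, a] = goal_reward if P[s, a] == goal else 0.0
    return P, R

def critic_evaluate(policy, P, R, gamma=0.9, m=500):
    """Finite-time Bellman evaluation: apply the Bellman operator T^pi m times starting from V = 0."""
    states = np.arange(P.shape[0])
    V = np.zeros(P.shape[0])
    for _ in range(m):
        V = R[states, policy] + gamma * V[P[states, policy]]
    return V

def actor_improve(V, P, R, gamma=0.9, p=0.9, rng=None):
    """Follow the greedy policy w.r.t. V with probability p; pick a random action otherwise."""
    if rng is None:
        rng = np.random.default_rng(0)
    Q = R + gamma * V[P]                        # one-step look-ahead Q-values
    greedy = Q.argmax(axis=1)
    random_actions = rng.integers(0, P.shape[1], size=P.shape[0])
    follow = rng.random(P.shape[0]) < p
    return np.where(follow, greedy, random_actions)

if __name__ == "__main__":
    P, R = make_gridworld(n=10)                 # 10 x 10 grid -> 100 states
    policy = np.zeros(P.shape[0], dtype=int)    # placeholder warm-start policy
    for _ in range(50):                         # actor-critic iterations
        V = critic_evaluate(policy, P, R, gamma=0.9, m=500)
        policy = actor_improve(V, P, R, gamma=0.9, p=0.9)
    print("Value of the top-left start state:", critic_evaluate(policy, P, R, m=1000)[0])
```

Sweeping m over {1000, 500, 50, 20, 5} and varying p mimics the kind of Critic- and Actor-error ablations described in the Experiment Setup row, though the exact reward structure, noise injection e(t), and warm-start policies would need to match the paper to reproduce its figures.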