Warm-Start Actor-Critic: From Approximation Error to Sub-optimality Gap

Authors: Hang Wang, Sen Lin, Junshan Zhang

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We consider experiments over the Gridworld benchmark task. In particular, we consider the following sizes of the grid to represent different problem complexity, i.e., 10×10, 15×15 and 20×20."
Researcher Affiliation | Academia | "¹Department of ECE, University of California, Davis, CA, USA; ²Department of ECE, The Ohio State University, Columbus, OH, USA."
Pseudocode | No | No explicit pseudocode or algorithm block was found. The methods are described using mathematical equations and prose.
Open Source Code | No | No statement regarding the release of open-source code or a link to a code repository was found.
Open Datasets | No | "We consider experiments over the Gridworld benchmark task. In particular, we consider the following sizes of the grid to represent different problem complexity, i.e., 10×10, 15×15 and 20×20." While the paper mentions a benchmark task, it does not provide specific access information (link, DOI, formal citation with author/year, or specific file names) for the dataset used for training.
Dataset Splits | No | No explicit information on training/validation/test splits (e.g., percentages, sample counts, or references to predefined splits) was found.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts, or cloud instance types) used for running experiments were provided.
Software Dependencies | No | No specific software dependencies with version numbers were mentioned.
Experiment Setup | Yes | "The discounting factor is set as γ = 0.9. We consider the grid with 10 rows and 10 columns such that the state space has 100 states. ... we let m be large enough, e.g., m = 1000, in the Critic update Eqn. (28). ... we study the Critic update with finite-time Bellman evaluation, e.g., m = 500, 50, 20, 5. ... we add the uniform noise e(t) in the value function with different bias, e.g., E[e(t)] = 0, 0.5, 1, 1. ... with probability p, the agent will choose the action following the current policy, while with probability 1 − p, the agent will choose a random action. By setting different p, we show in Fig. 7 that the approximation error in the Actor update may significantly degrade the learning performance." (A runnable sketch of this setup follows the table.)
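
The Experiment Setup row is specific enough to outline in code. Below is a minimal Python sketch of that configuration, intended as an assumption-laden illustration rather than the authors' implementation (no code was released): a 10×10 Gridworld with an assumed goal-state reward, a Critic that performs m sweeps of Bellman policy evaluation with optional biased uniform noise e(t) added to the value estimate, and an Actor step that follows the greedy action with probability p and a random action otherwise. The transition dynamics, reward, noise range, and the names step, critic_update, and actor_update are all assumptions; the paper's Actor update is described by equations, which the greedy step here only stands in for.

import numpy as np

GAMMA = 0.9                       # discounting factor from the quoted setup
GRID = 10                         # 10 rows x 10 columns -> 100 states
N_STATES = GRID * GRID
N_ACTIONS = 4                     # up, down, left, right (assumed)
rng = np.random.default_rng(0)

def step(state, action):
    # Deterministic grid transitions; goal in the bottom-right corner with
    # reward 1 (assumed -- the paper does not specify the reward structure).
    r, c = divmod(state, GRID)
    dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][action]
    r = min(max(r + dr, 0), GRID - 1)
    c = min(max(c + dc, 0), GRID - 1)
    next_state = r * GRID + c
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def critic_update(policy, m, noise_bias=0.0):
    # m sweeps of Bellman policy evaluation (the finite-time Critic update).
    # Uniform noise on [-0.5, 0.5], shifted so that E[e(t)] = noise_bias,
    # is added to the value estimate after each sweep (assumed noise range).
    V = np.zeros(N_STATES)
    for _ in range(m):
        V_new = np.empty(N_STATES)
        for s in range(N_STATES):
            s_next, r = step(s, policy[s])
            V_new[s] = r + GAMMA * V[s_next]
        V = V_new + rng.uniform(-0.5, 0.5, N_STATES) + noise_bias
    return V

def actor_update(V, p):
    # Greedy improvement followed with probability p; with probability 1 - p
    # a random action is chosen, modelling the Actor approximation error.
    policy = np.empty(N_STATES, dtype=int)
    for s in range(N_STATES):
        q = np.empty(N_ACTIONS)
        for a in range(N_ACTIONS):
            s_next, r = step(s, a)
            q[a] = r + GAMMA * V[s_next]
        policy[s] = int(np.argmax(q)) if rng.random() < p else rng.integers(N_ACTIONS)
    return policy

# One configuration mirroring the quoted setup: near-exact evaluation
# (m = 1000) with zero noise bias; p = 0.9 is an assumed value chosen
# only to illustrate the role of the Actor error probability.
policy = rng.integers(N_ACTIONS, size=N_STATES)
for _ in range(20):
    V = critic_update(policy, m=1000, noise_bias=0.0)
    policy = actor_update(V, p=0.9)

To reproduce the ablations described in the quote, rerun the loop with smaller m (500, 50, 20, 5), nonzero noise_bias, or smaller p, and compare the resulting value of the learned policy.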