On the Convergence of SARSA with Linear Function Approximation
Authors: Shangtong Zhang, Rémi Tachet des Combes, Romain Laroche
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use a diagnostic MDP from Gordon (1996) (Figure 1) to illustrate the chattering of linear SARSA. Gordon (1996) tested the ϵ-greedy policy (2), which is not continuous. We further test the ϵ-softmax policy (3), whose Lipschitz constant is inversely proportional to the temperature τ. When τ approaches 0, the ϵ-softmax policy approaches the ϵ-greedy policy. We run Algorithm 1 in this MDP with C_Γ = ∞, i.e., there is no projection. Following Gordon (1996), we set ϵ = 0.1, γ = 1.0, and α_t = 0.01/t. As discussed in Gordon (1996), using a smaller discount factor or a decaying learning rate only slows down the chattering, but the chattering always occurs. (See the policy sketch after this table.) |
| Researcher Affiliation | Collaboration | ¹Department of Computer Science, University of Virginia, United States; ²Alpaca ML; ³Unemployed. Correspondence to: Shangtong Zhang <shangtong@virginia.edu>. |
| Pseudocode | Yes | Algorithm 1 SARSA with linear function approximation |
| Open Source Code | No | The paper does not contain any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We use a diagnostic MDP from Gordon (1996) (Figure 1) to illustrate the chattering of linear SARSA. |
| Dataset Splits | No | The paper describes experiments on a diagnostic MDP (reinforcement learning environment) rather than a pre-split dataset. It does not mention any training, validation, or test data splits. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, or other libraries). |
| Experiment Setup | Yes | Following Gordon (1996), we set ϵ = 0.1, γ = 1.0, and α_t = 0.01/t. As discussed in Gordon (1996), using a smaller discount factor or a decaying learning rate only slows down the chattering, but the chattering always occurs. We further fix τ to be 0.01 and test rewards of different magnitudes. (See the Algorithm 1 sketch after this table.) |
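
The two exploration policies quoted in the Research Type row are simple to state concretely. Below is a minimal sketch, assuming action values arrive as a 1-D NumPy array; the function names and interface are illustrative, not the authors' code. It shows why the ϵ-greedy policy (2) is discontinuous in the action values, while the ϵ-softmax policy (3) is Lipschitz with constant on the order of 1/τ and recovers ϵ-greedy as τ → 0.

```python
# Illustrative sketch of the two policies compared in the paper; the
# names epsilon_greedy / epsilon_softmax are assumptions, not the
# authors' code. q is a 1-D numpy array of action values.
import numpy as np

def epsilon_greedy(q, eps=0.1):
    """eps-greedy (policy (2)): uniform with prob eps, else argmax.
    Discontinuous in q, which is what drives the chattering."""
    n = len(q)
    probs = np.full(n, eps / n)
    probs[np.argmax(q)] += 1.0 - eps
    return probs

def epsilon_softmax(q, eps=0.1, tau=0.01):
    """eps-softmax (policy (3)): the greedy part is replaced by a
    softmax with temperature tau. Lipschitz in q with constant
    O(1/tau); approaches eps-greedy as tau -> 0."""
    n = len(q)
    z = (np.asarray(q, dtype=float) - np.max(q)) / tau  # shift for stability
    soft = np.exp(z) / np.exp(z).sum()
    return eps / n + (1.0 - eps) * soft
```

Note how an arbitrarily small change in `q` can flip `np.argmax` in `epsilon_greedy`, whereas `epsilon_softmax` changes smoothly, at a rate that grows as τ shrinks.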
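
For the reported setup (ϵ = 0.1, γ = 1.0, α_t = 0.01/t, no projection), a minimal reconstruction of Algorithm 1 (linear SARSA) might look as follows. The environment interface `env` and the feature map `phi(s, a)` are placeholders standing in for Gordon's (1996) diagnostic MDP; this is a sketch under those assumptions, not the authors' implementation.

```python
# Hypothetical reconstruction of Algorithm 1 (linear SARSA) with the
# reported settings: eps = 0.1, gamma = 1.0, alpha_t = 0.01/t, and no
# projection (C_Gamma = inf). `env` (reset()/step() -> (s, r, done))
# and `phi(s, a)` are assumed interfaces.
import numpy as np

def linear_sarsa(env, phi, n_actions, steps=100_000,
                 eps=0.1, gamma=1.0, tau=0.01, seed=0):
    rng = np.random.default_rng(seed)
    s = env.reset()
    w = np.zeros_like(phi(s, 0))              # linear weights

    def act(state):
        # eps-softmax action selection, as in the sketch above
        q = np.array([w @ phi(state, a) for a in range(n_actions)])
        z = np.exp((q - q.max()) / tau)
        p = eps / n_actions + (1.0 - eps) * z / z.sum()
        return rng.choice(n_actions, p=p)

    a = act(s)
    for t in range(1, steps + 1):
        alpha = 0.01 / t                      # decaying step size alpha_t
        s2, r, done = env.step(a)
        a2 = act(s2)
        target = r if done else r + gamma * (w @ phi(s2, a2))
        w += alpha * (target - w @ phi(s, a)) * phi(s, a)  # TD update on w
        if done:
            s = env.reset()
            a = act(s)
        else:
            s, a = s2, a2
    return w
```

With no projection (C_Γ = ∞) the weights are never clipped to a ball, so the chattering the paper studies can show up directly in the trajectory of `w`.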