TD Convergence: An Optimization Perspective

Authors: Kavosh Asadi, Shoham Sabach, Yao Liu, Omer Gottesman, Rasool Fakoor

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We study the convergence behavior of the celebrated temporal-difference (TD) learning algorithm. By looking at the algorithm through the lens of optimization, we first argue that TD can be viewed as an iterative optimization algorithm where the function to be minimized changes per iteration. By carefully investigating the divergence displayed by TD on a classical counter example, we identify two forces that determine the convergent or divergent behavior of the algorithm. We next formalize our discovery in the linear TD setting with quadratic loss and prove that convergence of TD hinges on the interplay between these two forces. We extend this optimization perspective to prove convergence of TD in a much broader setting than just linear approximation and squared loss. Our results provide a theoretical explanation for the successful application of TD in reinforcement learning. (An illustrative sketch of this per-iteration view of TD appears after the table.)
Researcher Affiliation | Collaboration | Kavosh Asadi (Amazon), Shoham Sabach (Amazon & Technion), Yao Liu (Amazon), Omer Gottesman (Amazon), Rasool Fakoor (Amazon)
Pseudocode | Yes | A pseudo-code of both algorithms is presented in Algorithms 1 and 2.
Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the described methodology.
Open Datasets | No | The paper is theoretical and does not conduct experiments on a specific dataset. It analyzes a classical counter example conceptually rather than through empirical training with data.
Dataset Splits | No | The paper is theoretical and does not involve empirical experiments with datasets, thus no training, validation, or test splits are provided.
Hardware Specification | No | The paper is theoretical and does not describe any experimental setup or the hardware used for computations.
Software Dependencies | No | The paper is theoretical and does not mention any specific software dependencies with version numbers required for reproducibility.
Experiment Setup | No | The paper is theoretical and focuses on mathematical analysis and convergence proofs, not empirical experiments. Therefore, no experimental setup details such as hyperparameters or training configurations are provided.
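The abstract quoted in the Research Type row describes TD as an iterative optimization procedure: each step descends a squared loss whose bootstrapped target is frozen at the current parameters, and that loss changes from one iteration to the next; it also notes that this procedure can diverge on a classical counter example. The sketch below is a minimal, non-authoritative illustration of that reading, not code from the paper. The two-state chain, the features phi(s0)=1 and phi(s1)=2, the discount 0.99, the step size 0.1, and the uniform state-sampling rule are all assumptions chosen in the style of the well-known "theta / 2*theta" counterexample.

```python
import numpy as np

# Minimal sketch (not from the paper): linear TD(0) with squared loss on a
# two-state chain in the style of the classical "theta / 2*theta" divergence
# counterexample.  Features, transitions, discount, step size, and the
# uniform state-sampling scheme are illustrative assumptions.  Each update
# is a gradient step on a surrogate squared loss whose bootstrapped target
# is frozen at the current parameters, so the minimized function changes
# every iteration.

phi = np.array([[1.0], [2.0]])   # linear features: V(s0) = theta, V(s1) = 2*theta
P = np.array([[0.0, 1.0],        # s0 always transitions to s1
              [0.0, 1.0]])       # s1 self-loops
r = np.zeros(2)                  # all rewards are zero (illustrative)
gamma = 0.99                     # discount factor close to 1
alpha = 0.1                      # constant step size
theta = np.array([1.0])          # initial weight


def td0_step(theta, s, s_next, reward):
    """One semi-gradient TD(0) step: descend the squared error to the
    bootstrapped target, treating the target as a constant w.r.t. theta."""
    target = reward + gamma * (phi[s_next] @ theta)  # per-iteration frozen target
    td_error = target - phi[s] @ theta
    return theta + alpha * td_error * phi[s]


rng = np.random.default_rng(0)
for t in range(300):
    s = rng.integers(2)                    # states sampled uniformly, not on-policy
    s_next = rng.choice(2, p=P[s])
    theta = td0_step(theta, s, s_next, r[s])
    if (t + 1) % 100 == 0:
        print(f"step {t + 1:4d}  |theta| = {abs(theta[0]):.3e}")
```

Run as written, the weight magnitude grows across iterations; replacing the uniform sampling with states drawn from the chain's own on-policy visitation instead pulls the weight toward the true value of zero. This loosely illustrates the kind of interplay between opposing forces the abstract refers to, though whether it matches the paper's exact counterexample and analysis is an assumption on our part.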