Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Convergence Theorems for Entropy-Regularized and Distributional Reinforcement Learning
Authors: Yash Jhaveri, Harley Wiltzer, Patrick Shafto, Marc Bellemare, David Meger
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we demonstrate that the policies learned via the temperature decoupling gambit differ from those learned in ERL, even in the presence of stochastic updates. Figure 3.1 shows a given tristate MDP with two actions (blue: a1; green: a2), as well as learned policies ˆπτ, and ˆπτ,σ estimated with soft Q-learning [18]. Here πref x = U(A) for all x ∈ X and γ = 0.9. As this MDP is tabular, Theorem 2.5 implies that the policies πτ, converge as τ → 0. Thus, the temperature decoupling gambit is not necessary to guarantee convergence. Yet we see different limiting behavior. As predicted by Theorem 3.9, the estimates ˆπτ,σ converge to πref, , as τ → 0. With uniform πref, this is the policy that samples all optimal actions, given a state, with equal probability. As τ → 0, the estimates ˆπτ, x0 do converge to a different optimal policy. This difference is in x0, where ˆπτ, x0 collapse to δa1. We take σ = τ², in line with Definition 3.7. |
| Researcher Affiliation | Academia | Yash Jhaveri (Rutgers University Newark); Harley Wiltzer (Mila Québec AI Institute, McGill University); Patrick Shafto (Rutgers University Newark); Marc G. Bellemare (Mila Québec AI Institute, McGill University); David Meger (Mila Québec AI Institute, McGill University). Equal contribution. Correspondence to EMAIL, EMAIL. CIFAR AI Chair. |
| Pseudocode | No | The paper describes algorithms but does not present them in a structured pseudocode or algorithm block format. For example, it mentions "we define the first algorithm for accurately estimating a reference-optimal return distribution" in the introduction, and Theorem 4.7 outlines an algorithm in text: "First, approximate ζσ, via nopt applications of T σ (control). Second, extract the mean: ˆq σ q σ. Finally, apply Tπ τ neval times, with π = Gτ ˆq σ (evaluation)." |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology described, nor does it provide a direct link to a code repository. It mentions JAX [8] as a scientific computing library used for implementations, but this is a third-party tool. |
| Open Datasets | No | The paper describes using |
| Dataset Splits | No | The paper focuses on theoretical analysis and numerical demonstrations on illustrative MDPs. It does not mention any specific training, test, or validation dataset splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments or numerical demonstrations. |
| Software Dependencies | No | The paper mentions "Jax [8]" as a scientific computing library and refers to "64-bit precision and 32-bit precision respectively" in numerical demonstrations, but it does not specify version numbers for these or other software components. |
| Experiment Setup | Yes | In this section, we demonstrate that the policies learned via the temperature decoupling gambit differ from those learned in ERL, even in the presence of stochastic updates. Figure 3.1 shows a given tristate MDP with two actions (blue: a1; green: a2), as well as learned policies ˆπτ, and ˆπτ,σ estimated with soft Q-learning [18]. Here πref x = U(A) for all x ∈ X and γ = 0.9. ... We take σ = τ², in line with Definition 3.7. ... Here γ = 1/2, πref x = U(A) for all x ∈ X, and σ = τ². We consider τ ∈ {10^(-(2m+1)) : m = 0, 1, 2, 3, 4}. Our simulation is a practical implementation of Theorem 4.7. First, we approximate nopt = 1000 iterative applications of our soft Bellman optimality operator at τ (control). Then, we extract ˆq τ, an approximation of q τ, and construct two policies: the BG policy at τ and the BG policy at τ^(1/2), both with potential ˆq τ. Next we approximate neval = 1000 iterative applications of our soft Bellman operator (policy evaluation) at temperature τ with the first policy and at temperature τ^(1/2) with the second policy. |
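
The control-then-evaluation procedure quoted in the Experiment Setup row can be sketched in a few lines. The block below is only an expected-value (non-distributional) illustration, not the authors' code: it assumes the standard KL-regularized soft Bellman operators, reads "BG policy" as the Boltzmann-Gibbs policy proportional to πref · exp(q/τ), and runs on a hypothetical two-state, two-action MDP. All names (`P`, `R`, `soft_value_iteration`, `bg_policy`, `soft_policy_evaluation`) are illustrative, and it uses pure Python rather than JAX.

```python
import math

# Hypothetical 2-state, 2-action MDP (NOT the MDP used in the paper).
GAMMA = 0.5                      # discount, matching "γ = 1/2" in the quote
STATES, ACTIONS = range(2), range(2)
# P[x][a][y]: probability of moving from state x to state y under action a.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[1.0, 1.0], [0.0, 2.0]]     # R[x][a]: expected reward
PI_REF = [[0.5, 0.5], [0.5, 0.5]]  # uniform reference policy


def soft_max_value(q_x, ref_x, tau):
    """tau * log sum_a ref(a) * exp(q(a) / tau), with the max factored out."""
    m = max(q_x)
    s = sum(p * math.exp((qa - m) / tau) for qa, p in zip(q_x, ref_x))
    return m + tau * math.log(s)


def backup(v):
    """One-step lookahead: q(x, a) = R(x, a) + gamma * E[v(next state)]."""
    return [[R[x][a] + GAMMA * sum(P[x][a][y] * v[y] for y in STATES)
             for a in ACTIONS] for x in STATES]


def soft_value_iteration(tau, n_opt):
    """Control: n_opt applications of the assumed soft optimality operator."""
    q = [[0.0 for _ in ACTIONS] for _ in STATES]
    for _ in range(n_opt):
        q = backup([soft_max_value(q[x], PI_REF[x], tau) for x in STATES])
    return q


def bg_policy(q, tau):
    """Boltzmann-Gibbs policy at temperature tau with potential q."""
    pi = []
    for x in STATES:
        m = max(q[x])
        w = [PI_REF[x][a] * math.exp((q[x][a] - m) / tau) for a in ACTIONS]
        z = sum(w)
        pi.append([wa / z for wa in w])
    return pi


def soft_policy_evaluation(pi, tau, n_eval):
    """Evaluation: n_eval applications of the assumed KL-penalized operator."""
    kl = [sum(p * math.log(p / PI_REF[x][a])
              for a, p in enumerate(pi[x]) if p > 0.0) for x in STATES]
    q = [[0.0 for _ in ACTIONS] for _ in STATES]
    for _ in range(n_eval):
        v = [sum(pi[x][a] * q[x][a] for a in ACTIONS) - tau * kl[x]
             for x in STATES]
        q = backup(v)
    return q


tau = 0.1
q_hat = soft_value_iteration(tau, n_opt=1000)            # control at tau
pi_coupled = bg_policy(q_hat, tau)                       # BG policy at tau
pi_decoupled = bg_policy(q_hat, math.sqrt(tau))          # BG policy at tau^(1/2)
q_coupled = soft_policy_evaluation(pi_coupled, tau, n_eval=1000)
q_decoupled = soft_policy_evaluation(pi_decoupled, math.sqrt(tau), n_eval=1000)
print(pi_coupled, pi_decoupled)
```

Under these assumptions, evaluating the first policy at the same temperature used for control should reproduce `q_hat` up to floating-point error, which makes a convenient sanity check. The second policy, built at the higher temperature τ^(1/2), is less concentrated on the greedy action.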