Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Reinventing Policy Iteration under Time Inconsistency
Authors: Nixie S. Lesmana, Huangyuan Su, Chi Seng Pun
TMLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we consider an infinite-horizon TIC RL setting and formally present an alternative type of optimality drawn from game theory, i.e., subgame perfect equilibrium (SPE), that attempts to resolve the aforementioned questions. We first analyze standard PI under the SPE type of optimality, revealing its merits and insufficiencies. Drawing on these observations, we propose backward Q-learning (bwd Q), a new algorithm in the approximate PI family that targets SPE policy under non-exponentially discounted reward functions. Finally, with two TIC gridworld environments, we demonstrate the implications of our theoretical findings on the behavior of bwd Q and other approximate PI variants. In this section, we illustrate the behaviour of bwd Q in two TIC Gridworld environments: (i) Deterministic (D), by reusing our motivating example in Figure 1, which has been shown to exhibit future deviations, and (ii) Stochastic (S), by injecting some random noise into state 9's transition in (D). For our comparative study, we consider as benchmarks two approximate PI variants that also target SPE policy under general-discounting objectives, namely standard PI with Monte Carlo (MC) and sophisticated EU (soph EU) from Evans et al. (2016). |
| Researcher Affiliation | Academia | Nixie S. Lesmana, School of Physical and Mathematical Sciences, Nanyang Technological University; Huangyuan Su, School of Computer Science, Carnegie Mellon University; Chi Seng Pun, School of Physical and Mathematical Sciences, Nanyang Technological University |
| Pseudocode | Yes | We summarize this section with a pseudocode in Algorithm 1, where lines 11, 18–20 capture our backward conditioning and lines 12–17 capture the TIC-adjusted TD evaluation. Algorithm 1: Backward Q-learning (bwd Q); Algorithm 2: On-policy Monte Carlo Control (MC); Algorithm 3: Sophisticated Expected-Utility Agent (soph EU) |
| Open Source Code | No | We refer to the source code at https://github.com/dennybritz/reinforcement-learning prior to our hyperbolic-discounting modification. |
| Open Datasets | No | In this section, we illustrate the behaviour of bwd Q in two TIC Gridworld environments: (i) Deterministic (D), by reusing our motivating example in Figure 1, which has been shown to exhibit future deviations, and (ii) Stochastic (S), by injecting some random noise into state 9's transition in (D). |
| Dataset Splits | No | The paper describes using custom-designed 'Gridworld environments' for its experiments. While it details the setup of these environments and how experiments are conducted (e.g., '10 experiments are conducted and each consists of 50 random seeds'), there is no explicit mention of training, testing, or validation dataset splits, as the environment involves agent-environment interaction rather than static datasets. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments. It mentions 'two TIC Gridworld environments' for demonstration but provides no information about the computational resources used. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. It mentions 'Python 3.8' in a general context in a proof, but not as an explicit dependency for their implementation. No other libraries or frameworks are specified with versions. |
| Experiment Setup | Yes | For each pair of algorithm and environment, hyperparameters are informally selected from the sets α, αQ ∈ {.2, .3, .4, .5}, αr ∈ {.7, .8, .9, 1.0}, ϵ ∈ {.01, .03, .05, .07, .1} with the following criteria in mind: (i) small overtaking-mean i₂₁, (ii) small stdev-shade on the Q-value learning curves at s = 9, 21, and (iii) identifiable i₉, i₂₁ (i.e., reducing the overlapping frequencies between two contending actions' mean Q-value learning curves); see Figure 8(a)-(b) for relatively bad instances. For all environments and algorithms, we set T = 100; larger episode truncation does not affect our experiment results much. We summarize our final choice of hyperparameters in Table 2: Hyperparameters (ϵ, T, αQ/α, αr): MC (D): (.07, 100, -, -); MC (S): (.07, 100, -, -); soph EU (D): (.07, 100, .4, -); soph EU (S): (.07, 100, .4, -); bwd Q (D): (.07, 100, .4, 1.0); bwd Q (S): (.07, 100, .4, .9). |
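The paper provides bwd Q and its baselines only as pseudocode (Algorithms 1–3) and the source repository is not released. As a rough illustration of the kind of non-exponentially discounted objective involved, here is a minimal on-policy Monte Carlo control sketch with hyperbolic discount weights d(k) = 1/(1 + h·k) on a toy 5-state chain. The environment, the function names, and the exact hyperbolic form are illustrative assumptions, not the authors' code or their gridworlds; only ϵ = .07, T = 100, and α = .4 echo the hyperparameters in Table 2.

```python
import random

def hyperbolic_weight(k, h=1.0):
    """Hypothetical non-exponential discount d(k) = 1/(1 + h*k); this kind
    of weighting is what makes the objective time-inconsistent (TIC)."""
    return 1.0 / (1.0 + h * k)

def hyperbolic_return(rewards, h=1.0):
    """Discounted return of one reward sequence under hyperbolic weights."""
    return sum(hyperbolic_weight(k, h) * r for k, r in enumerate(rewards))

# Toy 5-state chain with reward 1 at the right end -- a hypothetical
# stand-in for the paper's TIC gridworlds, not the authors' environment.
N_STATES, GOAL = 5, 4

def step(s, a):
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def eps_greedy(q_s, eps, rng):
    if rng.random() < eps:
        return rng.randrange(len(q_s))
    m = max(q_s)
    return rng.choice([a for a, v in enumerate(q_s) if v == m])

def mc_control(episodes=2000, eps=0.07, T=100, alpha=0.4, h=1.0, seed=0):
    """First-visit on-policy Monte Carlo control with hyperbolic returns,
    loosely in the spirit of the paper's MC baseline (Algorithm 2)."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, traj = 0, []
        for _ in range(T):  # episode truncation, echoing the paper's T = 100
            a = eps_greedy(Q[s], eps, rng)
            s2, r, done = step(s, a)
            traj.append((s, a, r))
            s = s2
            if done:
                break
        seen = set()  # first-visit updates toward the hyperbolic return
        for t, (st, at, _) in enumerate(traj):
            if (st, at) in seen:
                continue
            seen.add((st, at))
            G = hyperbolic_return([r for (_, _, r) in traj[t:]], h)
            Q[st][at] += alpha * (G - Q[st][at])
    return Q
```

Note that evaluating each state-action pair by its hyperbolically discounted suffix return, as MC does here, is exactly the kind of "pre-committed" objective the paper contrasts with SPE: the discounting is re-anchored at every visit, which is why bwd Q instead performs backward conditioning with a TIC-adjusted TD evaluation.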