Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Reinventing Policy Iteration under Time Inconsistency
Authors: Nixie S. Lesmana, Huangyuan Su, Chi Seng Pun
TMLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we consider an infinite-horizon TIC RL setting and formally present an alternative type of optimality drawn from game theory, i.e., subgame perfect equilibrium (SPE), that attempts to resolve the aforementioned questions. We first analyze standard PI under the SPE type of optimality, revealing its merits and insufficiencies. Drawing on these observations, we propose backward Q-learning (bwd Q), a new algorithm in the approximate PI family that targets SPE policy under non-exponentially discounted reward functions. Finally, with two TIC gridworld environments, we demonstrate the implications of our theoretical findings on the behavior of bwd Q and other approximate PI variants. In this section, we illustrate the behaviour of bwd Q in two TIC Gridworld environments: (i) Deterministic (D), by reusing our motivating example in Figure 1, which has been shown to exhibit future deviations, and (ii) Stochastic (S), by injecting some random noise into state 9's transition in (D). For our comparative study, we consider as benchmarks two approximate PI variants that also target SPE policy under general-discounting objectives, namely standard PI with Monte Carlo (MC) and sophisticated EU (soph EU) from Evans et al. (2016). |
| Researcher Affiliation | Academia | Nixie S. Lesmana, School of Physical and Mathematical Sciences, Nanyang Technological University; Huangyuan Su, School of Computer Science, Carnegie Mellon University; Chi Seng Pun, School of Physical and Mathematical Sciences, Nanyang Technological University |
| Pseudocode | Yes | We summarize this section with a pseudocode in Algorithm 1, where lines 11, 18–20 capture our backward conditioning and lines 12–17 capture the TIC-adjusted TD evaluation. Algorithm 1: Backward Q-learning (bwd Q); Algorithm 2: On-policy Monte Carlo Control (MC); Algorithm 3: Sophisticated Expected-Utility Agent (soph EU) |
| Open Source Code | No | We refer to the source code at https://github.com/dennybritz/reinforcement-learning prior to our hyperbolic-discounting modification. |
| Open Datasets | No | In this section, we illustrate the behaviour of bwd Q in two TIC Gridworld environments: (i) Deterministic (D), by reusing our motivating example in Figure 1, which has been shown to exhibit future deviations, and (ii) Stochastic (S), by injecting some random noise into state 9's transition in (D). |
| Dataset Splits | No | The paper describes using custom-designed 'Gridworld environments' for its experiments. While it details the setup of these environments and how experiments are conducted (e.g., '10 experiments are conducted and each consists of 50 random seeds'), there is no explicit mention of training, testing, or validation dataset splits, as the environment involves agent-environment interaction rather than static datasets. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments. It mentions 'two TIC Gridworld environments' for demonstration but provides no information about the computational resources used. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. It mentions 'Python 3.8' in a general context in a proof, but not as an explicit dependency for their implementation. No other libraries or frameworks are specified with versions. |
| Experiment Setup | Yes | For each pair of algorithm and environment, hyperparameters are informally selected from the sets α, αQ ∈ {.2, .3, .4, .5}, αr ∈ {.7, .8, .9, 1.0}, ϵ ∈ {.01, .03, .05, .07, .1} with the following criteria in mind: (i) small overtaking-mean i₂₁, (ii) small stdev-shade on the Q-value learning curves at s = 9, 21, and (iii) identifiable i₉, i₂₁ (i.e., reducing the overlapping frequencies between two contending actions' mean Q-value learning curves); see Figure 8(a)-(b) for relatively bad instances. For all environments and algorithms, we set T = 100; larger episode truncation does not affect our experiment results much. We summarize our final choice of hyperparameters in Table 2: Hyperparameters (ϵ, T, αQ/α, αr): MC (D): (.07, 100, -, -); MC (S): (.07, 100, -, -); soph EU (D): (.07, 100, .4, -); soph EU (S): (.07, 100, .4, -); bwd Q (D): (.07, 100, .4, 1.0); bwd Q (S): (.07, 100, .4, .9). |
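The paper provides bwd Q and its baselines only as pseudocode (Algorithms 1–3) and the source repository is not released. As a rough illustration of the kind of non-exponentially discounted objective involved, here is a minimal on-policy Monte Carlo control sketch with hyperbolic discount weights d(k) = 1/(1 + h·k) on a toy 5-state chain. The environment, the function names, and the exact hyperbolic form are illustrative assumptions, not the authors' code or their gridworlds; only ϵ = .07, T = 100, and α = .4 echo the hyperparameters in Table 2.

```python
import random

def hyperbolic_weight(k, h=1.0):
    """Hypothetical non-exponential discount d(k) = 1/(1 + h*k); this kind
    of weighting is what makes the objective time-inconsistent (TIC)."""
    return 1.0 / (1.0 + h * k)

def hyperbolic_return(rewards, h=1.0):
    """Discounted return of one reward sequence under hyperbolic weights."""
    return sum(hyperbolic_weight(k, h) * r for k, r in enumerate(rewards))

# Toy 5-state chain with reward 1 at the right end -- a hypothetical
# stand-in for the paper's TIC gridworlds, not the authors' environment.
N_STATES, GOAL = 5, 4

def step(s, a):
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def eps_greedy(q_s, eps, rng):
    if rng.random() < eps:
        return rng.randrange(len(q_s))
    m = max(q_s)
    return rng.choice([a for a, v in enumerate(q_s) if v == m])

def mc_control(episodes=2000, eps=0.07, T=100, alpha=0.4, h=1.0, seed=0):
    """First-visit on-policy Monte Carlo control with hyperbolic returns,
    loosely in the spirit of the paper's MC baseline (Algorithm 2)."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, traj = 0, []
        for _ in range(T):  # episode truncation, echoing the paper's T = 100
            a = eps_greedy(Q[s], eps, rng)
            s2, r, done = step(s, a)
            traj.append((s, a, r))
            s = s2
            if done:
                break
        seen = set()  # first-visit updates toward the hyperbolic return
        for t, (st, at, _) in enumerate(traj):
            if (st, at) in seen:
                continue
            seen.add((st, at))
            G = hyperbolic_return([r for (_, _, r) in traj[t:]], h)
            Q[st][at] += alpha * (G - Q[st][at])
    return Q
```

Note that evaluating each state-action pair by its hyperbolically discounted suffix return, as MC does here, is exactly the kind of "pre-committed" objective the paper contrasts with SPE: the discounting is re-anchored at every visit, which is why bwd Q instead performs backward conditioning with a TIC-adjusted TD evaluation.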