Per-Decision Option Discounting

Authors: Anna Harutyunyan, Peter Vrancx, Philippe Hamel, Ann Nowé, Doina Precup

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We verify the shape of the bounds empirically on a classical task. Our results imply that in addition to extending the agent's horizon, time dilation can be a tool for better estimation of value functions." and "We illustrate empirically the key ideas of this paper: 1. the bias-variance tradeoff obtained in Theorem 2; and 2. the ability of time dilation to extend the agent's horizon and preserve far-sighted policies, irrespectively of the size of the environment." Also: "Our approximate planning setting is similar to that described by Jiang et al. (2015). Following that work, and since the reward model is unaffected by our proposal, we do not estimate the reward model in these experiments, and instead use its true value (which can be computed exactly in these tasks). Finally, in Section 7.3, we evaluate the learning performance on an illustrative task with characteristic properties." (A planning sketch appears after the table.)
Researcher Affiliation | Collaboration | 1 DeepMind, London, UK; 2 Vrije Universiteit Brussel, Brussels, Belgium; 3 PROWLER.io, Cambridge, UK.
Pseudocode | No | No structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures) were found.
Open Source Code | No | No concrete access to source code (specific repository link, explicit code release statement, or code in supplementary materials) for the methodology described in this paper was found.
Open Datasets | Yes | "We investigate whether the analytical bias-variance tradeoff can be observed in practice in the control setting on the classical Four Rooms domain (Sutton et al., 1999)."
Dataset Splits | No | No specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning was found.
Hardware Specification | No | No specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments were found.
Software Dependencies | No | No specific ancillary software details (e.g., library or solver names with version numbers like Python 3.8, CPLEX 12.4) needed to replicate the experiment were found.
Experiment Setup | Yes | "To evaluate the effects of varied option duration, we add ϵ-noise to the typically deterministic option policies. That is: an option takes an action recommended by its original π_o w.p. 1 − ϵ, and a random action w.p. ϵ. To obtain a clear picture, we consider a very noisy case of ϵ = 0.5. For each option o, and for each state s ∈ I_o, we sample N trajectories to obtain an estimate P̂^o_Γ of P^o_Γ. We then perform policy iteration w.r.t. the approximate models P̂^o_Γ and the true reward models R^o_{γ_r} to obtain the approximate optimal policy π_Γ̂. We take the value of N to be 2 here. For the estimation to be less trivial, we consider ϵ-soft option policies, as described above, with ϵ = 0.05. We use Q-learning over options with call-and-return execution."
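
A minimal sketch of the estimation step quoted in the Experiment Setup row: execute an ϵ-soft version of an option's policy and Monte-Carlo-average over N trajectories per initiation state. The interface names (env.step, option.pi, option.terminates) are hypothetical, and a single scalar discount gamma stands in for the paper's per-decision discounting Γ; only the ϵ-soft execution and the averaging over N trajectories are taken from the quote.

```python
# Sketch only: eps-soft option execution and Monte Carlo estimation of a
# discounted option transition model from n_traj trajectories per start state.
# env.step, option.pi, option.terminates are assumed names; the scalar `gamma`
# is a simplification of the paper's per-decision discounting scheme.
import random
from collections import defaultdict

def eps_soft_action(option, state, actions, eps):
    # Follow the option's own policy w.p. 1 - eps, act uniformly at random w.p. eps.
    return option.pi(state) if random.random() > eps else random.choice(actions)

def estimate_option_model(env, option, init_states, actions, gamma, eps, n_traj):
    # P_hat[s0][s'] ~= E[gamma^k * 1{option started in s0 terminates in s'}],
    # averaged over n_traj sampled trajectories (n_traj = 2 in the quoted setup).
    P_hat = {s0: defaultdict(float) for s0 in init_states}
    for s0 in init_states:
        for _ in range(n_traj):
            s, discount = s0, 1.0
            while not option.terminates(s):
                s = env.step(s, eps_soft_action(option, s, actions, eps))
                discount *= gamma
            P_hat[s0][s] += discount / n_traj
    return P_hat
```

Per the quote, these estimated transition models, together with the exact reward models, would then feed the planning step sketched next.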
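
A hedged sketch of the certainty-equivalence planning protocol referenced in the Research Type row (after Jiang et al., 2015): the transition model is estimated, the reward model is used exactly, and a policy is computed against the approximate model. It is written for a generic tabular model with illustrative shapes and names, and uses value iteration for brevity where the quoted setup runs policy iteration over option models.

```python
# Sketch only: planning with an estimated transition model and the exact
# reward model, in the spirit of Jiang et al. (2015). Shapes are illustrative.
import numpy as np

def plan_with_estimated_model(P_hat, R_true, gamma, n_iters=1000):
    # P_hat:  (A, S, S) estimated transition probabilities
    # R_true: (S, A)    exact reward model (not estimated, per the quoted setup)
    n_actions, n_states, _ = P_hat.shape
    V = np.zeros(n_states)
    for _ in range(n_iters):
        Q = R_true.T + gamma * np.einsum("ast,t->as", P_hat, V)  # (A, S)
        V = Q.max(axis=0)
    return Q.argmax(axis=0)  # greedy policy w.r.t. the approximate model
```

The returned policy would then typically be evaluated under the true model, which is how planning loss is assessed in the setting of Jiang et al. (2015).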