Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Learning With Options That Terminate Off-Policy
Authors: Anna Harutyunyan, Peter Vrancx, Pierre-Luc Bacon, Doina Precup, Ann Nowé
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our algorithm empirically, and show that it holds up to its motivating claims. |
| Researcher Affiliation | Collaboration | Anna Harutyunyan, Vrije Universiteit Brussel, Brussels, Belgium; Peter Vrancx, PROWLER.io, Cambridge, England; Pierre-Luc Bacon, McGill University, Montreal, Canada; Doina Precup, McGill University, Montreal, Canada; Ann Nowé, Vrije Universiteit Brussel, Brussels, Belgium |
| Pseudocode | Yes | Algorithm 1 presents the forward view of the algorithm underlying this expected operator, for the general case of an evolving sequence of policies (μ_k)_{k ∈ ℕ}. This algorithm is very similar to the recently formalized Q(σ) (Asis et al. 2017; Sutton and Barto 2017), with β being a state-option generalization of 1 − σ. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing its own source code, nor does it provide a link to a code repository for the described methodology. |
| Open Datasets | No | The paper mentions '19-state random walk task', 'Modified Cliffwalk', and 'Pinball domain'. While these are known reinforcement learning tasks, the paper does not provide specific links, DOIs, or formal citations for accessing the exact dataset configurations or environment implementations used in their experiments. |
| Dataset Splits | No | The paper mentions setting 'step-sizes set via a linear search over α ∈ {0.1, 0.2, 0.3, 0.4}' but does not specify any explicit dataset splits (e.g., percentages or sample counts) used for validation during this search or for training/testing. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as CPU models, GPU models, or memory specifications. |
| Software Dependencies | No | The paper does not list any specific software dependencies with their version numbers (e.g., programming languages, libraries, or frameworks like PyTorch, TensorFlow, etc.) that would be necessary for reproducibility. |
| Experiment Setup | Yes | The termination conditions β and ζ are evaluated in the range of {0.1, 0.5, 0.8, 1}, with the first value being positive to ensure adequate state visitation. The step-sizes are set via a linear search over α ∈ {0.1, 0.2, 0.3, 0.4}. The discount factor is γ = 0.99. |
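
The hyperparameter values quoted in the Experiment Setup row can be written out as a small configuration grid. The sketch below is illustrative only: the names `TERMINATION_VALUES`, `STEP_SIZES`, `GAMMA`, and `experiment_grid` are hypothetical, and it assumes that β and ζ are varied independently and that every step-size is tried for each (β, ζ) pair, which the quoted text does not state explicitly.

```python
from itertools import product

# Values quoted in the "Experiment Setup" row.
TERMINATION_VALUES = (0.1, 0.5, 0.8, 1.0)  # candidate values for beta and zeta
STEP_SIZES = (0.1, 0.2, 0.3, 0.4)          # candidate values for alpha
GAMMA = 0.99                               # discount factor


def experiment_grid():
    """Return one config dict per (beta, zeta, alpha) combination.

    Assumption (not confirmed by the source): beta and zeta are swept
    independently, and each step-size is evaluated for every pair.
    """
    return [
        {"beta": beta, "zeta": zeta, "alpha": alpha, "gamma": GAMMA}
        for beta, zeta, alpha in product(
            TERMINATION_VALUES, TERMINATION_VALUES, STEP_SIZES
        )
    ]


configs = experiment_grid()
print(len(configs))  # 4 * 4 * 4 = 64 combinations under the assumption above
```

Under these assumptions, a step-size search for a given (β, ζ) pair would amount to comparing the four entries of `configs` that share that pair and differ only in `alpha`.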