Unifying Task Specification in Reinforcement Learning

Authors: Martha White

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we propose a formalism for reinforcement learning task specification that unifies many of these generalizations. The focus of the formalism is to separate the specification of the dynamics of the environment and the specification of the objective within that environment. We empirically demonstrate the utility of soft termination in the next section. We explore different transition-based discounts in the taxi domain (Dietterich, 2000; Diuk et al., 2008). Figure 2 illustrates three policies for one part of the taxi domain, obtained with three different discount functions.
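The quoted passage summarizes the paper's central idea: the discount becomes a function of the transition, γ(s, a, s'), so that termination (hard or soft) is expressed in the objective rather than in the environment dynamics. As a rough illustration only (a minimal sketch, not the paper's code; the toy MDP, rewards, policy, and tolerance below are made-up values), policy evaluation under a transition-based discount could look like this:

```python
# Minimal sketch (not the paper's code): policy evaluation where the discount
# is a function of the transition, gamma(s, a, s'). All numbers are toy values.
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# P[s, a, s'] : transition probabilities (each row sums to 1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
# R[s, a, s'] : expected reward for each transition
R = rng.normal(size=(n_states, n_actions, n_states))
# Gamma[s, a, s'] : transition-based discount; 0 on a transition gives hard
# (episodic) termination, a value in (0, 1) gives soft termination.
Gamma = np.full((n_states, n_actions, n_states), 0.99)
Gamma[0, 1, :] = 0.1  # e.g. one soft-terminating transition

pi = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform policy

def evaluate(pi, P, R, Gamma, tol=1e-10):
    """Iterative policy evaluation with a transition-based discount:
    V(s) = sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a,s') + Gamma(s,a,s') V(s')]."""
    V = np.zeros(n_states)
    while True:
        V_new = np.einsum('sa,sap,sap->s', pi, P, R + Gamma * V)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

print(evaluate(pi, P, R, Gamma))
```

Setting an entry of Gamma to 0 recovers a standard episodic cut-off on that transition, while a value strictly between 0 and 1 gives the soft termination that the paper's experiments explore.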
Researcher Affiliation | Academia | Martha White, Department of Computer Science, Indiana University. Correspondence to: Martha White <martha@indiana.edu>.
Pseudocode | Yes | We provide these details in the appendix for completeness, with theorem statement and proof in Appendix F and pseudocode in Appendix D.
Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository.
Open Datasets | No | The paper uses the 'taxi domain', a simulated reinforcement learning environment rather than a static dataset, and provides no access information (URL, DOI, or repository) for a publicly available dataset in the conventional sense.
Dataset Splits | No | The paper describes experiments in a simulated environment (the taxi domain) but does not specify dataset splits (e.g., training/validation/test percentages or counts) as would be done for a fixed dataset. It mentions that 'optimal policies and value functions are computed iteratively, with an extensive number of iterations' and '5000 runs' for evaluation, but gives no data splits.
Hardware Specification | No | The paper does not provide any details about the hardware used to run the experiments (e.g., CPU/GPU models, memory, or cloud instance types).
Software Dependencies | No | The paper does not mention any specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup | Yes | We modify the domain to include the orientation of the taxi, with additional cost for not continuing in the current orientation. This encodes that turning right, left or going backwards are more costly than going forwards, with additional negative rewards of -0.05, -0.1 and -0.2 respectively. This additional cost is further multiplied by a factor of 2 when there is a passenger in the vehicle. The optimal policy is learned using a soft-termination, which takes into consideration the importance of approaching the passenger location with the right orientation to minimize turns after picking up the passenger. (b) The optimal strategy, with γ(Car in source, Pickup, P in Car) = 0.1 and a discount 0.99 elsewhere. (c) For γ(Car in source, Pickup, P in Car) = 0. ... over 100 steps, with 5000 runs
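For concreteness, the quoted setup can be read as a reward-shaping term plus a transition-based discount on the pickup transition. The sketch below is an assumption about how one might encode it (the state fields, orientation encoding, and function names are hypothetical and not from the paper):

```python
# Illustrative sketch only: encoding the modified taxi-domain turn costs and
# the transition-based discount used for soft vs. hard termination.
from collections import namedtuple

# Hypothetical state representation; the paper does not specify one.
TaxiState = namedtuple("TaxiState", ["car_at_source", "passenger_in_car", "orientation"])

def turn_penalty(current_orientation, new_orientation, passenger_in_car):
    """Extra negative reward for changing orientation: -0.05 (right), -0.1 (left),
    -0.2 (backwards); doubled when a passenger is in the car."""
    # Orientations encoded 0..3 as N, E, S, W (a hypothetical encoding).
    diff = (new_orientation - current_orientation) % 4
    penalty = {0: 0.0, 1: -0.05, 3: -0.1, 2: -0.2}[diff]  # forward, right, left, back
    return 2 * penalty if passenger_in_car else penalty

def transition_discount(state, action, next_state, soft_termination=0.1):
    """Transition-based discount gamma(s, a, s'): 0.99 everywhere except the
    pickup transition at the passenger's source location, which is discounted
    by `soft_termination` (0.1 for soft termination, 0 for hard termination)."""
    if action == "Pickup" and state.car_at_source and next_state.passenger_in_car:
        return soft_termination
    return 0.99

s  = TaxiState(car_at_source=True, passenger_in_car=False, orientation=0)
s2 = TaxiState(car_at_source=True, passenger_in_car=True,  orientation=0)
print(transition_discount(s, "Pickup", s2))        # 0.1 : soft termination on pickup
print(turn_penalty(0, 3, passenger_in_car=True))   # -0.2 : left turn, doubled with passenger
```

Passing soft_termination=0 reproduces the hard-termination variant reported in panel (c), while 0.1 reproduces the soft-termination variant in panel (b).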