Time Limits in Reinforcement Learning

Authors: Fabio Pardo, Arash Tavakoli, Vitaly Levdik, Petar Kormushev

ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the impact of these considerations on a range of novel and popular benchmark domains using tabular Q-learning and Proximal Policy Optimization (PPO), a modern deep reinforcement learning (Arulkumaran et al., 2017; Henderson et al., 2017) algorithm which has recently been used to achieve state-of-the-art performance in many domains (Schulman et al., 2017; Heess et al., 2017). We empirically show that time-awareness significantly improves the performance of PPO for the time-limited tasks and can sometimes result in interesting behaviors.
Researcher Affiliation | Academia | Robot Intelligence Lab, Imperial College London, UK.
Pseudocode | No | The paper describes algorithmic concepts and modifications but does not include structured pseudocode or an explicitly labeled algorithm block.
Open Source Code | Yes | The source code and videos can be found at: sites.google.com/view/time-limits-in-rl.
Open Datasets | Yes | All novel tasks are implemented using the OpenAI Gym (Brockman et al., 2016) and the standard benchmarks are from the MuJoCo (Todorov et al., 2012) Gym collection.
Dataset Splits | No | The paper discusses training steps and evaluation episodes in environments such as the OpenAI Gym and MuJoCo benchmarks. While it mentions reproducibility through random seeds, it does not report training/validation/test split percentages or sample counts, since the experiments do not rely on a fixed dataset.
Hardware Specification | No | The paper mentions using "computation resources provided by Microsoft via a Microsoft Azure award" but does not give specific hardware details such as GPU/CPU models, memory, or other specifications of the machines used for the experiments.
Software Dependencies | No | The paper mentions using the "OpenAI Baselines implementation of PPO", "OpenAI Gym", and "MuJoCo", but it does not specify version numbers for these software dependencies.
Experiment Setup | Yes | We use the OpenAI Baselines (Hesse et al., 2017) implementation of PPO with the hyperparameters reported by Schulman et al. (2017), unless stated otherwise. For each task involving PPO, to achieve perfect reproducibility, we used the same 10 seeds (0, 1000, ..., 9000) to initialize the pseudo-random number generators for the agents and environments. The time-aware version of PPO concatenates the observations provided by the environment with the remaining time, represented by a scalar normalized from -1 to 1. The partial-episode bootstrapping version of PPO distinguishes environment resets from terminations by using the value of the last state in the evaluation of the advantages if no termination is encountered. For Hopper-v1 and Walker2d-v1, the evaluation episodes are limited to 10^6 time steps and the discounted sum of rewards is reported, while for Infinite Cube Pusher-v0 the evaluations are limited to 1000 time steps and the number of targets reached per episode is reported. An entropy coefficient of 0.01 was used to encourage exploration. We used tabular Q-learning with random actions, trained until convergence with a decaying learning rate and a discount factor of 0.99. (Illustrative sketches of the two PPO modifications follow this table.)
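
The time-aware modification described above amounts to appending the normalized remaining time to the agent's observation. Below is a minimal sketch of how such a wrapper could look, assuming the classic (pre-0.26) Gym API and a Box observation space; the class name `RemainingTimeObservation`, the `time_limit` argument, and the sign convention (1 at the start of an episode, -1 at the time limit) are illustrative assumptions, not the authors' released code.

```python
import numpy as np
import gym
from gym import spaces


class RemainingTimeObservation(gym.Wrapper):
    """Append the normalized remaining time to each observation (sketch).

    Assumes a Box observation space and the classic Gym step/reset API.
    """

    def __init__(self, env, time_limit):
        super().__init__(env)
        assert isinstance(env.observation_space, spaces.Box)
        self.time_limit = time_limit
        self._t = 0
        # Extend the observation space by one dimension for the time scalar.
        low = np.append(env.observation_space.low, -1.0).astype(np.float32)
        high = np.append(env.observation_space.high, 1.0).astype(np.float32)
        self.observation_space = spaces.Box(low=low, high=high, dtype=np.float32)

    def _augment(self, obs):
        # Remaining time as a scalar in [-1, 1], decreasing linearly with t
        # (assumed convention: 1.0 at t = 0, -1.0 at t = time_limit).
        remaining = 1.0 - 2.0 * self._t / self.time_limit
        return np.append(obs, remaining).astype(np.float32)

    def reset(self, **kwargs):
        self._t = 0
        return self._augment(self.env.reset(**kwargs))

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._t += 1
        return self._augment(obs), reward, done, info
```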
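
Similarly, the partial-episode bootstrapping modification can be folded into the generalized advantage estimation that PPO uses: at a time-limit reset the one-step target is bootstrapped from the critic's value of the state in which the limit was hit, while a true termination cuts the return to zero. The sketch below is an assumption about how this could be implemented; the function and argument names (`gae_with_partial_episode_bootstrap`, `timeouts`, `timeout_values`) are hypothetical and not taken from the OpenAI Baselines source.

```python
import numpy as np


def gae_with_partial_episode_bootstrap(rewards, values, dones, timeouts,
                                       timeout_values, last_value,
                                       gamma=0.99, lam=0.95):
    """GAE with partial-episode bootstrapping (illustrative sketch).

    rewards, values, dones, timeouts: 1-D arrays over the T collected steps;
    timeouts[t] marks steps where dones[t] was caused by the time limit.
    timeout_values: critic estimates of the state in which the time limit was
    hit (only read at timeout steps).
    last_value: critic estimate for the state following the final step.
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    gae = 0.0
    for t in reversed(range(T)):
        if dones[t]:
            # Episode boundary: stop accumulating lambda-returns across it.
            if timeouts[t]:
                # Time-limit reset: bootstrap from the last state's value.
                next_value = timeout_values[t]
            else:
                # True termination: no future return beyond this step.
                next_value = 0.0
            gae = rewards[t] + gamma * next_value - values[t]
        else:
            next_value = last_value if t == T - 1 else values[t + 1]
            delta = rewards[t] + gamma * next_value - values[t]
            gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + values
    return advantages, returns
```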