Safe Policy Search for Lifelong Reinforcement Learning with Sublinear Regret

Authors: Haitham Bou Ammar, Rasul Tutunov, Eric Eaton

Venue: ICML 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "6. Experimental Validation"
Researcher Affiliation | Academia | Haitham Bou Ammar (HAITHAMB@SEAS.UPENN.EDU), Rasul Tutunov (TUTUNOV@SEAS.UPENN.EDU), Eric Eaton (EEATON@CIS.UPENN.EDU); University of Pennsylvania, Computer and Information Science Department, Philadelphia, PA 19104, USA
Pseudocode | Yes | Algorithm 1: Safe Online Lifelong Policy Search
Open Source Code | Yes | "The complete approach is given in Algorithm 1 and is available as a software implementation on the authors' websites."
Open Datasets | No | The paper uses standard benchmark dynamical systems (simple mass, cart-pole, quadrotor) and generates tasks by varying system parameters. It does not provide concrete access information (link, DOI, or specific citation with author/year) for any publicly available dataset used for training.
Dataset Splits | No | The paper mentions "cross-validation over 3 tasks" for choosing the latent-space dimensionality and a line search for setting the learning step size, which are validation procedures. However, it does not provide explicit dataset splits (percentages, sample counts, or citations to predefined splits) for training, validation, and testing.
Hardware Specification | No | The paper does not report the hardware used for its experiments (CPU/GPU models, processor types, or memory amounts).
Software Dependencies | No | The paper does not list ancillary software dependencies, such as library or solver names with version numbers.
Experiment Setup | Yes | "We ran each experiment for a total of R rounds, varying from 150 for the simple mass to 10,000 for the quadrotor... At each round j, the learner observed a task t_j through 50 trajectories of 150 steps... The dimensionality k of the latent space was chosen independently for each domain via cross-validation over 3 tasks, and the learning step size for each task domain was determined by a line search after gathering 10 trajectories of length 150. ... We also varied the number of iterations in our alternating optimization from 10 to 100..."
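The Experiment Setup row quotes the paper's round-based protocol; the Python sketch below merely transcribes those quoted numbers into code to make the loop structure explicit. It is a minimal sketch under stated assumptions, not the authors' released implementation: the names env_sampler, learner, task.rollout, update_task_coefficients, and update_shared_basis are hypothetical stand-ins, and the cross-validation over 3 tasks (for k) and the line search (for the step size) are omitted.

```python
# Sketch of the round-based lifelong learning protocol quoted above.
# All helper names are hypothetical; this is not the authors' code.

# Per-domain round counts quoted from the paper's experiment setup.
ROUNDS = {
    "simple_mass": 150,
    "quadrotor": 10_000,
}
TRAJECTORIES_PER_TASK = 50   # trajectories observed per round
TRAJECTORY_LENGTH = 150      # steps per trajectory
ALT_OPT_ITERS = 10           # varied from 10 to 100 in the paper


def run_lifelong_experiment(domain, env_sampler, learner):
    """Run the experiment for one task domain.

    `env_sampler(j)` is assumed to return the task observed at round j,
    and `learner` is assumed to expose the update hooks sketched below.
    """
    for j in range(ROUNDS[domain]):
        task = env_sampler(j)  # task t_j observed at round j
        trajectories = [
            task.rollout(learner.policy(task), TRAJECTORY_LENGTH)
            for _ in range(TRAJECTORIES_PER_TASK)
        ]
        # Alternating optimization between task-specific coefficients and
        # the shared basis (10-100 iterations in the paper's experiments).
        for _ in range(ALT_OPT_ITERS):
            learner.update_task_coefficients(task, trajectories)
            learner.update_shared_basis(task, trajectories)
    return learner
```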