Safe Policy Search for Lifelong Reinforcement Learning with Sublinear Regret
Authors: Haitham Bou Ammar, Rasul Tutunov, Eric Eaton
ICML 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 6. Experimental Validation |
| Researcher Affiliation | Academia | Haitham Bou Ammar HAITHAMB@SEAS.UPENN.EDU Rasul Tutunov TUTUNOV@SEAS.UPENN.EDU Eric Eaton EEATON@CIS.UPENN.EDU University of Pennsylvania, Computer and Information Science Department, Philadelphia, PA 19104 USA |
| Pseudocode | Yes | Algorithm 1 Safe Online Lifelong Policy Search |
| Open Source Code | Yes | The complete approach is given in Algorithm 1 and is available as a software implementation on the authors' websites. |
| Open Datasets | No | The paper uses standard benchmark dynamical systems (simple mass, cart-pole, quadrotor) and generates tasks by varying system parameters. It does not provide concrete access information (link, DOI, specific citation with author/year) for publicly available datasets used for training. |
| Dataset Splits | No | The paper mentions 'cross-validation over 3 tasks' for choosing latent space dimensionality and 'line search' for learning step size, which are validation processes. However, it does not provide specific dataset split information (exact percentages, sample counts, or citations to predefined splits) for training, validation, and testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers. |
| Experiment Setup | Yes | We ran each experiment for a total of R rounds, varying from 150 for the simple mass to 10,000 for the quadrotor... At each round j, the learner observed a task tj through 50 trajectories of 150 steps... The dimensionality k of the latent space was chosen independently for each domain via cross-validation over 3 tasks, and the learning step size for each task domain was determined by a line search after gathering 10 trajectories of length 150. ...We also varied the number of iterations in our alternating optimization from 10 to 100... |
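
The Experiment Setup row pins down a concrete evaluation protocol: per-domain round counts, 50 trajectories of 150 steps per observed task, latent dimensionality chosen by cross-validation over 3 tasks, a line search for the step size, and 10 to 100 alternating-optimization iterations. The sketch below restates that driver loop in code to make the protocol's structure explicit. It is a minimal illustration, not the authors' released implementation: every function and constant name here (`sample_trajectories`, `cross_validate_k`, `line_search_step`, `alternating_update`, the toy list-valued model) is a hypothetical stand-in, and only the numeric protocol constants come from the paper's quoted text.

```python
"""Hypothetical sketch of the quoted evaluation protocol.

All helper names are illustrative stand-ins; only the protocol
constants (50 trajectories, 150 steps, 3 CV tasks, 10 line-search
trajectories, 150-10,000 rounds) are taken from the paper's text.
"""
import random

# Protocol constants quoted in the Experiment Setup row.
N_TRAJECTORIES = 50       # trajectories observed per task at each round
HORIZON = 150             # steps per trajectory
CV_TASKS = 3              # tasks used to cross-validate latent dimension k
LINESEARCH_TRAJS = 10     # trajectories of length 150 gathered for line search
ROUNDS = {"simple_mass": 150, "quadrotor": 10_000}  # per-domain round counts


def sample_trajectories(task, n, horizon):
    """Placeholder rollout collector; a real run simulates the dynamical system."""
    return [[random.gauss(task, 1.0) for _ in range(horizon)] for _ in range(n)]


def cross_validate_k(tasks, candidates=(2, 3, 5)):
    """Stand-in for choosing the latent dimensionality k over 3 tasks;
    a real run would score each candidate k on held-out tasks."""
    return min(candidates)


def line_search_step(task):
    """Stand-in for the per-domain learning-rate line search, run after
    gathering 10 trajectories of length 150."""
    _ = sample_trajectories(task, LINESEARCH_TRAJS, HORIZON)
    return 0.01  # a real run would return the best-performing step size


def alternating_update(model, trajectories, step):
    """Stand-in for one alternating-optimization iteration (the paper
    varies the number of such iterations from 10 to 100)."""
    return [w - step * 0.0 for w in model]  # no-op placeholder update


def run_domain(domain, tasks, n_inner_iters=10):
    """One lifelong run: at each round j the learner observes task t_j
    through 50 trajectories of 150 steps, then refines the shared model."""
    k = cross_validate_k(tasks[:CV_TASKS])
    model = [0.0] * k
    for _j in range(ROUNDS[domain]):
        task = random.choice(tasks)  # task t_j presented at round j
        trajs = sample_trajectories(task, N_TRAJECTORIES, HORIZON)
        step = line_search_step(task)
        for _ in range(n_inner_iters):
            model = alternating_update(model, trajs, step)
    return model


if __name__ == "__main__":
    # Example: 150-round simple-mass run over tasks generated by varying
    # a (hypothetical) system parameter.
    run_domain("simple_mass", tasks=[0.5, 1.0, 1.5])
```

Even with placeholder learning steps, a skeleton like this makes the reproducibility gaps in the table concrete: the protocol constants are fully specified by the paper, while the pieces left as stand-ins (dynamics simulators, the cross-validation scoring, the safe alternating-optimization update) are exactly where the released implementation would have to be consulted.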