Learning Complex Neural Network Policies with Trajectory Optimization

Authors: Sergey Levine, Vladlen Koltun

ICML 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluated our approach on a set of challenging locomotion tasks, including a push recovery task that requires the policy to combine multiple recovery strategies learned in parallel from multiple trajectories. Our approach successfully learned a policy that could not only perform multiple different recoveries, but could also correctly choose the best strategy under new conditions." (Section 4, Experimental Evaluation)
Researcher Affiliation | Collaboration | Sergey Levine (SVLEVINE@CS.STANFORD.EDU), Computer Science Department, Stanford University, Stanford, CA 94305 USA; Vladlen Koltun (VLADLEN@ADOBE.COM), Adobe Research, San Francisco, CA 94103 USA
Pseudocode | Yes | Algorithm 1 (Constrained guided policy search) and Algorithm 2 (Trajectory optimization iteration)
Open Source Code | No | The paper links to a supplementary video but does not state that source code for the method is available, nor does it provide a link to a code repository.
Open Datasets | No | The paper describes the simulated environment and how initial trajectories were generated (e.g., the 'MuJoCo physics simulator' and a 'hand-crafted locomotion system'), but does not mention the use of a specific publicly available dataset or provide access information for any generated data.
Dataset Splits | No | The paper does not provide specific details regarding dataset splits (e.g., percentages, sample counts, or an explicit splitting methodology) for training, validation, or testing.
Hardware Specification | No | The paper mentions running experiments on a simulated robot within the 'MuJoCo physics simulator' but does not provide specific hardware details such as GPU/CPU models, memory, or other computing resources.
Software Dependencies | No | The paper mentions the 'MuJoCo physics simulator' and a 'MATLAB CARE solver' but does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | "The policies consisted of neural networks with one hidden layer, with a soft rectifier a = log(1 + exp(z)) at the first layer and linear connections to the output layer. Gaussian noise with a learned diagonal covariance was added to the output to create a stochastic policy. When evaluating the cost of a policy, the noise was removed, yielding a deterministic controller. While this class of policies is very expressive, it poses a considerable challenge for policy search methods, due to its nonlinearity and high dimensionality. As discussed in Section 3, the stochasticity of the policy depends on the cost magnitude. A low cost will produce broad trajectory distributions, which are good for learning, but will also produce a more stochastic policy, which might perform poorly. To speed up learning and still achieve a good final policy, we found it useful to gradually increase the cost by a factor of 10 over the first 50 iterations."
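
As a minimal sketch only, the Python snippet below illustrates the policy class and cost schedule described in the Experiment Setup row: a one-hidden-layer network with a soft-rectifier (softplus) hidden layer, a linear output layer, additive Gaussian noise with a learned diagonal covariance, and a cost multiplier that grows by a factor of 10 over the first 50 iterations. The layer sizes, weight initialization, the names GaussianMLPPolicy and cost_multiplier, and the log-linear ramp are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


class GaussianMLPPolicy:
    """Sketch of the policy class quoted above: one softplus hidden layer,
    linear output, plus Gaussian noise with a learned diagonal covariance."""

    def __init__(self, state_dim, action_dim, hidden_dim=50, seed=0):
        rng = np.random.default_rng(seed)
        # Small random weights; the paper does not specify initialization.
        self.W1 = 0.1 * rng.standard_normal((hidden_dim, state_dim))
        self.b1 = np.zeros(hidden_dim)
        self.W2 = 0.1 * rng.standard_normal((action_dim, hidden_dim))
        self.b2 = np.zeros(action_dim)
        # Log of the diagonal noise standard deviations (learned with the weights).
        self.log_sigma = np.zeros(action_dim)

    def mean(self, x):
        # Soft rectifier a = log(1 + exp(z)) at the hidden layer,
        # linear connections to the output layer.
        z = self.W1 @ x + self.b1
        a = np.log1p(np.exp(z))
        return self.W2 @ a + self.b2

    def act(self, x, stochastic=True, rng=None):
        # Stochastic during learning; with the noise removed the policy
        # becomes a deterministic controller, as used for evaluation.
        u = self.mean(x)
        if stochastic:
            if rng is None:
                rng = np.random.default_rng()
            u = u + np.exp(self.log_sigma) * rng.standard_normal(u.shape)
        return u


def cost_multiplier(iteration, ramp_iters=50, factor=10.0):
    """Scale the cost by a factor of 10 over the first 50 iterations.
    The log-linear ramp is an assumption; the paper states only the
    overall factor and the number of iterations."""
    frac = min(iteration / ramp_iters, 1.0)
    return factor ** frac
```

In this sketch, evaluating the deterministic controller described in the quote corresponds to calling act(x, stochastic=False), and the trajectory-optimization loop would multiply the task cost by cost_multiplier(i) at iteration i.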