Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Learning Complex Neural Network Policies with Trajectory Optimization
Authors: Sergey Levine, Vladlen Koltun
ICML 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated our approach on a set of challenging locomotion tasks, including a push recovery task that requires the policy to combine multiple recovery strategies learned in parallel from multiple trajectories. Our approach successfully learned a policy that could not only perform multiple different recoveries, but could also correctly choose the best strategy under new conditions. 4. Experimental Evaluation |
| Researcher Affiliation | Collaboration | Sergey Levine EMAIL Computer Science Department, Stanford University, Stanford, CA 94305 USA Vladlen Koltun EMAIL Adobe Research, San Francisco, CA 94103 USA |
| Pseudocode | Yes | Algorithm 1 Constrained guided policy search Algorithm 2 Trajectory optimization iteration |
| Open Source Code | No | The paper provides a link to a supplementary video but does not explicitly state that the source code for the methodology is available, nor does it provide a direct link to a code repository. |
| Open Datasets | No | The paper describes the simulated environment and how initial trajectories were generated (e.g., 'MuJoCo physics simulator', 'hand-crafted locomotion system'), but does not mention the use of a specific publicly available dataset nor provide access information for any generated data. |
| Dataset Splits | No | The paper does not provide specific details regarding dataset splits (e.g., percentages, sample counts, or explicit splitting methodology) for training, validation, or testing. |
| Hardware Specification | No | The paper mentions running experiments on a 'simulated robot' within the 'MuJoCo physics simulator' but does not provide specific hardware details such as GPU/CPU models, memory, or other computing resource specifications. |
| Software Dependencies | No | The paper mentions 'MuJoCo physics simulator' and 'MATLAB CARE solver' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | The policies consisted of neural networks with one hidden layer, with a soft rectifier a = log(1 + exp(z)) at the first layer and linear connections to the output layer. Gaussian noise with a learned diagonal covariance was added to the output to create a stochastic policy. When evaluating the cost of a policy, the noise was removed, yielding a deterministic controller. While this class of policies is very expressive, it poses a considerable challenge for policy search methods, due to its nonlinearity and high dimensionality. As discussed in Section 3, the stochasticity of the policy depends on the cost magnitude. A low cost will produce broad trajectory distributions, which are good for learning, but will also produce a more stochastic policy, which might perform poorly. To speed up learning and still achieve a good final policy, we found it useful to gradually increase the cost by a factor of 10 over the first 50 iterations. |
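
The policy class and cost schedule quoted in the Experiment Setup row can be illustrated with a minimal, standard-library Python sketch. It is not the authors' code, which the paper does not release. The class name `GaussianPolicy`, the helper `cost_scale`, the weight initialization, the log-standard-deviation parameterization of the diagonal covariance, and the geometric shape of the ramp (from one tenth of the final cost up to the full cost) are all assumptions made for illustration; the source states only the architecture, the soft rectifier, the removal of noise at evaluation time, and the factor-of-10 increase over the first 50 iterations.

```python
import math
import random


def softplus(z):
    """Soft rectifier a = log(1 + exp(z)), written to avoid overflow for large |z|."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


class GaussianPolicy:
    """Hypothetical one-hidden-layer policy: softplus hidden units, linear outputs."""

    def __init__(self, obs_dim, hidden_dim, act_dim, seed=0):
        self.rng = random.Random(seed)
        # Small random weights; the initialization scheme is an assumption.
        self.W1 = [[self.rng.gauss(0.0, 0.1) for _ in range(obs_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.W2 = [[self.rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
                   for _ in range(act_dim)]
        self.b2 = [0.0] * act_dim
        # Diagonal covariance stored as log standard deviations (assumed
        # parameterization); these would be learned along with the weights.
        self.log_std = [0.0] * act_dim

    def mean(self, obs):
        """Deterministic network output: linear layer on softplus hidden units."""
        hidden = [softplus(sum(w * x for w, x in zip(row, obs)) + b)
                  for row, b in zip(self.W1, self.b1)]
        return [sum(w * h for w, h in zip(row, hidden)) + b
                for row, b in zip(self.W2, self.b2)]

    def act(self, obs, deterministic=False):
        """Sample a noisy action, or return the mean when evaluating the policy."""
        mu = self.mean(obs)
        if deterministic:
            return mu
        return [m + math.exp(s) * self.rng.gauss(0.0, 1.0)
                for m, s in zip(mu, self.log_std)]


def cost_scale(iteration, ramp_iters=50, factor=10.0):
    """Multiplier on the cost: rises geometrically from 1/factor to 1 (assumed shape)."""
    t = min(max(iteration, 0), ramp_iters) / ramp_iters
    return factor ** (t - 1.0)
```

During training, a caller would sample with `policy.act(obs)` and multiply the task cost by `cost_scale(k)` at iteration `k`; when reporting the cost of a policy, `policy.act(obs, deterministic=True)` removes the noise and yields the deterministic controller described in the quoted text.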