Total stochastic gradient algorithms and applications in reinforcement learning
Authors: Paavo Parmas
NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our methods on model-based policy gradient algorithms, achieve good performance, and present evidence towards demystifying the success of the popular PILCO algorithm [5]. We performed model-based RL simulation experiments from the PILCO papers [5, 4]. |
| Researcher Affiliation | Academia | Paavo Parmas Neural Computation Unit Okinawa Institute of Science and Technology Graduate University Okinawa, Japan paavo.parmas@oist.jp |
| Pseudocode | Yes | Algorithm 1: Gaussian shaping gradient with total propagation (a simplified sketch of the inverse-variance gradient weighting at its core is given after this table). |
| Open Source Code | No | The paper does not provide a link to source code or explicitly state that source code for the methodology is openly available or included in supplementary materials. |
| Open Datasets | No | The paper describes using data generated from 'model-based RL simulation experiments' and notes 'After each episode all of the data is used to learn separate Gaussian process models', but it does not specify a publicly available dataset with concrete access information (link, DOI, or formal citation). |
| Dataset Splits | No | The paper does not provide specific details about training/validation/test dataset splits, such as percentages or sample counts. It describes the simulation setup and how models are learned/evaluated but not formal data splits. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using an 'RMSprop-like learning rule [22]' and 'Gaussian process models [16]' but does not provide specific software versions for libraries, frameworks, or programming languages (e.g., Python 3.x, PyTorch x.x). |
| Experiment Setup | Yes | The experiments consisted of 1 random episode followed by 15 episodes with a learned policy, where the policy is optimized between episodes. Each episode was 3 s long, with a 10 Hz control frequency. Each task was evaluated separately 100 times with different random number seeds to test repeatability... The policy was optimized using an RMSprop-like learning rule [22]... 600 gradient steps using 300 particles... The learning rate and momentum parameters were α = 5 × 10⁻⁴ and γ = 0.9... The policy π was a radial basis function network (a sum of Gaussians) with 50 basis functions and a total of 254 parameters. (An illustrative sketch of this setup follows the table.) |
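Algorithm 1 itself is not reprinted in this report. As a rough illustration of its central step, the sketch below shows an inverse-variance weighted combination of the reparameterization (RP) and likelihood-ratio (LR) gradient estimators, the combination rule that total propagation is built around. The function name `combine_gradients`, the array shapes, and the use of a single scalar weight are assumptions of this sketch; the paper's Algorithm 1 applies the weighting per time step together with the Gaussian shaping of the particle distribution, which is omitted here.

```python
import numpy as np

def combine_gradients(grad_rp, grad_lr, eps=1e-12):
    """Inverse-variance weighted combination of two gradient estimators.

    grad_rp, grad_lr: (num_particles, num_params) arrays of per-particle
    reparameterization (RP) and likelihood-ratio (LR) gradient samples.
    Returns a single combined gradient of shape (num_params,).
    """
    # Empirical variance of each estimator, summed over parameters so that a
    # single scalar weight is used per estimator (a simplification).
    var_rp = grad_rp.var(axis=0).sum()
    var_lr = grad_lr.var(axis=0).sum()
    # Weight each estimator in proportion to the inverse of its variance:
    # the lower-variance estimator dominates the combined gradient.
    k_rp = var_lr / (var_rp + var_lr + eps)
    return k_rp * grad_rp.mean(axis=0) + (1.0 - k_rp) * grad_lr.mean(axis=0)
```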
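To make the quoted hyperparameters concrete, the following minimal sketch puts them together: a radial basis function policy (a sum of Gaussians) and an RMSprop-style parameter update with α = 5 × 10⁻⁴ and γ = 0.9. The function names, the state dimensionality, how γ enters the update, and the omission of action limits are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

# Quantities quoted in the Experiment Setup row; everything else below is assumed.
NUM_PARTICLES = 300   # particles per gradient estimate
NUM_GRAD_STEPS = 600  # gradient steps per policy optimization
ALPHA = 5e-4          # learning rate
GAMMA = 0.9           # "momentum" parameter (assumed here to be the RMS decay)
NUM_BASIS = 50        # RBF policy: 50 Gaussian basis functions

def rbf_policy(state, centers, log_widths, weights):
    """Radial basis function policy: a weighted sum of Gaussians over the state."""
    diffs = (state - centers) / np.exp(log_widths)   # (NUM_BASIS, state_dim)
    sq_dist = np.sum(diffs ** 2, axis=1)             # (NUM_BASIS,)
    return weights @ np.exp(-0.5 * sq_dist)          # scalar action

def rmsprop_like_step(params, grad, mean_sq, alpha=ALPHA, gamma=GAMMA, eps=1e-8):
    """One RMSprop-style update: rescale the gradient by a running
    root-mean-square of past gradients, then take a small step."""
    mean_sq = gamma * mean_sq + (1.0 - gamma) * grad ** 2
    params = params - alpha * grad / (np.sqrt(mean_sq) + eps)
    return params, mean_sq
```

Under this reading, each policy optimization would run `NUM_GRAD_STEPS` such updates, with each gradient estimated from `NUM_PARTICLES` simulated particles, matching the 600 steps and 300 particles quoted above.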