Total stochastic gradient algorithms and applications in reinforcement learning

Authors: Paavo Parmas

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our methods on model-based policy gradient algorithms, achieve good performance, and present evidence towards demystifying the success of the popular PILCO algorithm [5]. We performed model-based RL simulation experiments from the PILCO papers [5, 4].
Researcher Affiliation | Academia | Paavo Parmas, Neural Computation Unit, Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan. paavo.parmas@oist.jp
Pseudocode | Yes | Algorithm 1: Gaussian shaping gradient with total propagation (see the total-propagation sketch below the table)
Open Source Code | No | The paper does not provide a link to source code or explicitly state that source code for the methodology is openly available or included in supplementary materials.
Open Datasets | No | The paper describes using data generated from 'model-based RL simulation experiments' and notes 'After each episode all of the data is used to learn separate Gaussian process models', but it does not specify a publicly available dataset with concrete access information (link, DOI, or formal citation).
Dataset Splits | No | The paper does not provide specific details about training/validation/test dataset splits, such as percentages or sample counts. It describes the simulation setup and how models are learned and evaluated, but not formal data splits.
Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using an 'RMSprop-like learning rule [22]' and 'Gaussian process models [16]' but does not provide specific software versions for libraries, frameworks, or programming languages (e.g., Python 3.x, PyTorch x.x).
Experiment Setup | Yes | The experiments consisted of 1 random episode followed by 15 episodes with a learned policy, where the policy is optimized between episodes. Each episode was 3 s long, with a 10 Hz control frequency. Each task was evaluated separately 100 times with different random number seeds to test repeatability... The policy was optimized using an RMSprop-like learning rule [22]... 600 gradient steps using 300 particles... The learning rate and momentum parameters were α = 5 × 10⁻⁴ and γ = 0.9... The policy π was a radial basis function network (a sum of Gaussians) with 50 basis functions and a total of 254 parameters. (See the optimizer and policy sketches below the table.)
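
The pseudocode row names Algorithm 1, "Gaussian shaping gradient with total propagation", but does not reproduce it. As rough orientation only: total propagation fuses reparameterization (RP) and likelihood-ratio (LR) gradient estimates, down-weighting whichever estimator is noisier. The Python sketch below is a minimal, hypothetical illustration of such inverse-variance weighting under assumed array shapes; it is not the paper's Algorithm 1, and the function name and shapes are invented for illustration.

```python
import numpy as np

def total_propagation_combine(grads_rp, grads_lr):
    """Combine two gradient estimates by inverse-variance weighting.

    grads_rp, grads_lr: arrays of shape (num_particles, num_params) holding
    per-particle reparameterization (RP) and likelihood-ratio (LR) gradient
    estimates of the same objective.
    """
    g_rp = grads_rp.mean(axis=0)  # mean RP gradient estimate
    g_lr = grads_lr.mean(axis=0)  # mean LR gradient estimate
    # Scalar variance proxies: total variance across particles and dimensions.
    var_rp = grads_rp.var(axis=0).sum()
    var_lr = grads_lr.var(axis=0).sum()
    # Inverse-variance weight: the noisier estimator gets less weight.
    k = var_lr / (var_rp + var_lr + 1e-12)
    return k * g_rp + (1.0 - k) * g_lr
```

With the 300 particles and 254 policy parameters reported in the setup row, both inputs would be (300, 254) arrays.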
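The setup row reports only optimizer hyperparameters (α = 5 × 10⁻⁴, γ = 0.9, 600 gradient steps). Below is a sketch of a plain RMSprop update using those values; the paper says only "RMSprop-like" and calls γ a momentum parameter, so the actual rule may differ (in standard RMSprop, γ decays the running squared-gradient average). The gradient here is a random placeholder.

```python
import numpy as np

def rmsprop_step(params, grad, avg_sq, alpha=5e-4, gamma=0.9, eps=1e-8):
    """One plain RMSprop step: scale the step by a running RMS of past gradients."""
    avg_sq = gamma * avg_sq + (1.0 - gamma) * grad**2
    params = params - alpha * grad / (np.sqrt(avg_sq) + eps)
    return params, avg_sq

# 254 policy parameters and 600 gradient steps, as reported in the setup row.
theta = np.zeros(254)
avg_sq = np.zeros_like(theta)
rng = np.random.default_rng(0)
for _ in range(600):
    g = rng.standard_normal(254)  # placeholder for an estimated policy gradient
    theta, avg_sq = rmsprop_step(theta, g, avg_sq)
```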
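The "radial basis function network (a sum of Gaussians) with 50 basis functions and a total of 254 parameters" can also be made concrete: for a 4-dimensional state, 50 output weights + 50 × 4 centers + 4 length scales gives exactly 254 parameters. That decomposition is an inference, not something the paper states; the sketch below is one plausible parameterization.

```python
import numpy as np

class RBFPolicy:
    """RBF policy: a weighted sum of Gaussian basis functions over the state.

    With state_dim=4 and n_basis=50 this has 50 + 50*4 + 4 = 254 parameters,
    matching the reported count (the paper's exact parameterization may differ).
    """
    def __init__(self, state_dim=4, n_basis=50, seed=0):
        rng = np.random.default_rng(seed)
        self.weights = rng.standard_normal(n_basis)               # 50 params
        self.centers = rng.standard_normal((n_basis, state_dim))  # 200 params
        self.log_lengthscales = np.zeros(state_dim)               # 4 params

    def __call__(self, state):
        # Per-dimension scaled distances to each basis center.
        diff = (state - self.centers) / np.exp(self.log_lengthscales)
        phi = np.exp(-0.5 * np.sum(diff**2, axis=1))  # Gaussian activations
        return float(self.weights @ phi)

action = RBFPolicy()(np.zeros(4))  # scalar control output for one state
```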