Adaptive Batch Size for Safe Policy Gradients

Authors: Matteo Papini, Matteo Pirotta, Marcello Restelli

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Besides providing theoretical guarantees, we show numerical simulations to analyse the behaviour of our methods." and "Finally, in Section 5 we empirically analyse the behaviour of the proposed methods on a simple simulated control task."
Researcher Affiliation | Academia | Matteo Papini, DEIB, Politecnico di Milano, Italy; Matteo Pirotta, SequeL Team, Inria Lille, France; Marcello Restelli, DEIB, Politecnico di Milano, Italy
Pseudocode | No | No structured pseudocode or algorithm blocks were found.
Open Source Code | No | The paper does not provide any concrete access information (e.g., repository link, explicit statement of code release, or mention of code in supplementary materials) for the described methodology.
Open Datasets | Yes | "In this section, we test the proposed methods on the linear-quadratic Gaussian regulation (LQG) problem [23]."
Dataset Splits | No | The paper discusses the number of trajectories used for learning, but does not provide specific training, validation, or test dataset splits (e.g., percentages or counts) or mention cross-validation.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments.
Software Dependencies | No | The paper does not provide specific software dependencies, such as library names with version numbers, needed to replicate the experiments.
Experiment Setup | Yes | "The LQG problem is defined by transition model s_{t+1} ~ N(s_t + a_t, σ_0^2), Gaussian policy a_t ~ N(θ·s_t, σ^2) and reward r_t = -0.5(s_t^2 + a_t^2). In all our simulations we use σ_0 = 0... Both action and state variables are bounded to the interval [-2, 2] and the initial state is drawn uniformly at random. We use a discount factor γ = 0.9... starting from θ = -0.55 and stopping after a total of one million trajectories. In the following simulations, we use σ = 1 and start from θ = 0, stopping after a total of 30 million trajectories."
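
For readers who want to rebuild this setup, the sketch below simulates the quoted LQG task with the Gaussian policy and a plain REINFORCE gradient estimate over a batch of trajectories. It is a minimal illustration under stated assumptions, not the authors' code: the horizon, learning rate, and fixed batch size are values we picked for the example (the paper's contribution is precisely an adaptive rule for that batch size, which is not reproduced here), and all function names are our own.

```python
import numpy as np

# Constants mirror the quoted setup; HORIZON and the learning rate are
# assumptions of ours, not values stated in the report.
GAMMA = 0.9    # discount factor (quoted)
SIGMA_0 = 0.0  # transition noise std (quoted as "sigma_0 = 0...")
SIGMA = 1.0    # policy std used in the later simulations (quoted)
BOUND = 2.0    # states and actions are bounded to [-2, 2] (quoted)
HORIZON = 20   # episode length: an assumption

def rollout(theta, rng):
    """One LQG trajectory under the Gaussian policy a_t ~ N(theta * s_t, SIGMA^2)."""
    s = rng.uniform(-BOUND, BOUND)  # initial state drawn uniformly at random
    states, actions, rewards = [], [], []
    for _ in range(HORIZON):
        a = np.clip(rng.normal(theta * s, SIGMA), -BOUND, BOUND)
        states.append(s)
        actions.append(a)
        rewards.append(-0.5 * (s ** 2 + a ** 2))  # r_t = -0.5 (s_t^2 + a_t^2)
        s = np.clip(rng.normal(s + a, SIGMA_0), -BOUND, BOUND)  # s' ~ N(s + a, SIGMA_0^2)
    return np.array(states), np.array(actions), np.array(rewards)

def reinforce_gradient(theta, batch_size, rng):
    """Monte Carlo REINFORCE estimate of the policy gradient from batch_size trajectories."""
    estimates = []
    for _ in range(batch_size):
        s, a, r = rollout(theta, rng)
        ret = np.sum(GAMMA ** np.arange(HORIZON) * r)  # discounted return
        # Score of the Gaussian policy w.r.t. theta:
        # d/dtheta log N(a; theta*s, SIGMA^2) = (a - theta*s) * s / SIGMA^2
        score = np.sum((a - theta * s) * s / SIGMA ** 2)
        estimates.append(score * ret)
    return np.mean(estimates)

rng = np.random.default_rng(0)
theta = 0.0  # the quoted starting point for the sigma = 1 simulations
for step in range(200):
    grad = reinforce_gradient(theta, batch_size=100, rng=rng)  # fixed batch size, for illustration only
    theta += 1e-3 * grad  # fixed step size; the paper derives safe step sizes instead
```

In the paper, the batch size would be chosen at each iteration from a high-probability bound on the gradient estimation error so that each update improves performance; the fixed value above only marks where that choice enters the loop.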