Adaptive Batch Size for Safe Policy Gradients
Authors: Matteo Papini, Matteo Pirotta, Marcello Restelli
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Besides providing theoretical guarantees, we show numerical simulations to analyse the behaviour of our methods." and "Finally, in Section 5 we empirically analyse the behaviour of the proposed methods on a simple simulated control task." |
| Researcher Affiliation | Academia | Matteo Papini, DEIB, Politecnico di Milano, Italy; Matteo Pirotta, SequeL Team, Inria Lille, France; Marcello Restelli, DEIB, Politecnico di Milano, Italy |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found. |
| Open Source Code | No | The paper does not provide any concrete access information (e.g., repository link, explicit statement of code release, or mention of code in supplementary materials) for the described methodology. |
| Open Datasets | Yes | In this section, we test the proposed methods on the linear-quadratic Gaussian regulation (LQG) problem [23]. |
| Dataset Splits | No | The paper discusses the number of trajectories used for learning, but does not provide specific training, validation, or test dataset splits (e.g., percentages or counts) or mention cross-validation. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies, such as library names with version numbers, needed to replicate the experiments. |
| Experiment Setup | Yes | The LQG problem is defined by transition model $s_{t+1} \sim \mathcal{N}(s_t + a_t, \sigma_0^2)$, Gaussian policy $a_t \sim \mathcal{N}(\theta \cdot s_t, \sigma^2)$, and reward $r_t = -0.5(s_t^2 + a_t^2)$. In all our simulations we use $\sigma_0 = 0$... Both action and state variables are bounded to the interval $[-2, 2]$ and the initial state is drawn uniformly at random. We use a discount factor $\gamma = 0.9$... starting from $\theta = -0.55$ and stopping after a total of one million trajectories. In the following simulations, we use $\sigma = 1$ and start from $\theta = 0$, stopping after a total of 30 million trajectories. (A simulation sketch of this setup appears below the table.) |
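
To make the quoted setup concrete, here is a minimal simulation sketch of the described LQG task. It is an illustration only, not the authors' code (none is released): it assumes a one-dimensional state, a deterministic transition (reading the quoted $\sigma_0 = 0$ literally), a horizon of 20 steps, and the $\sigma = 1$, $\theta = 0$ configuration of the second experiment.

```python
import numpy as np

# Illustrative sketch of the quoted LQG setup; the horizon and the choice of
# experiment configuration are assumptions, not values taken from released code.

GAMMA = 0.9          # discount factor from the quoted setup
SIGMA_POLICY = 1.0   # policy standard deviation sigma used in the second experiment
BOUND = 2.0          # states and actions are bounded to [-2, 2]

def rollout(theta, horizon=20, rng=np.random.default_rng(0)):
    """Simulate one trajectory of the 1-D LQG task under a linear-Gaussian policy."""
    s = rng.uniform(-BOUND, BOUND)   # initial state drawn uniformly at random
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        a = np.clip(rng.normal(theta * s, SIGMA_POLICY), -BOUND, BOUND)
        r = -0.5 * (s ** 2 + a ** 2)                 # quadratic cost as negative reward
        ret += discount * r
        discount *= GAMMA
        s = np.clip(s + a, -BOUND, BOUND)            # deterministic transition (sigma_0 = 0)
    return ret

# Example: Monte Carlo estimate of the discounted return at the starting parameter theta = 0.
print(np.mean([rollout(theta=0.0) for _ in range(100)]))
```

In the paper, trajectories of this kind are the unit of data collection: the adaptive-batch-size methods choose how many such rollouts to gather per gradient update, and the experiments above are reported in terms of total trajectories consumed (one million and 30 million, respectively).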