Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Policy Optimization via Importance Sampling
Authors: Alberto Maria Metelli, Matteo Papini, Francesco Faccio, Marcello Restelli
NeurIPS 2018 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, the algorithm is tested on a selection of continuous control tasks, with both linear and deep policies, and compared with state-of-the-art policy optimization methods. |
| Researcher Affiliation | Academia | Alberto Maria Metelli Politecnico di Milano, Milan, Italy EMAIL; Matteo Papini Politecnico di Milano, Milan, Italy EMAIL; Francesco Faccio Politecnico di Milano, Milan, Italy IDSIA, USI-SUPSI, Lugano, Switzerland EMAIL; Marcello Restelli Politecnico di Milano, Milan, Italy EMAIL |
| Pseudocode | Yes | The pseudo-code of POIS is reported in Algorithm 1. (Also Algorithm 2) |
| Open Source Code | Yes | The implementation of POIS can be found at https://github.com/T3p/pois. |
| Open Datasets | Yes | ...on classical control tasks [12, 57]. (Reference [12] is "Benchmarking deep reinforcement learning for continuous control" which uses standard environments.) |
| Dataset Splits | No | The paper describes using a "current policy" to collect trajectories for optimization, and performing "offline optimization". It does not explicitly mention fixed training, validation, or test dataset splits with percentages or counts, as is common in supervised learning contexts. |
| Hardware Specification | Yes | We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40c, Titan Xp and Tesla V100 used for this research. |
| Software Dependencies | No | The paper does not specify versions for any software dependencies, such as programming languages, libraries, or frameworks used for implementation. |
| Experiment Setup | Yes | All experimental details are provided in Appendix F. (Appendix F.1 mentions: "For linear policies we used a learning rate α = 0.001 and a batch size N = 20 trajectories." Appendix F.2 mentions: "We adopted the same network architecture for all environments: 3 layers: 100, 50, 25 neurons each.") |
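
For context on the technique being assessed: POIS performs offline policy optimization by reweighting trajectories collected under a behavioral policy with importance sampling. The sketch below illustrates that core reweighting step only; it is not the authors' implementation (see the linked repository https://github.com/T3p/pois for that). The linear-Gaussian policy, the function names (`trajectory_weight`, `is_return_estimate`), and the fixed standard deviation are our illustrative assumptions, reusing the batch size N = 20 reported in Appendix F.1.

```python
import numpy as np

# Hypothetical sketch of per-trajectory importance weighting, the core
# quantity in importance-sampling-based policy optimization. NOT the
# authors' code (see https://github.com/T3p/pois); the linear-Gaussian
# policy and all names here are illustrative assumptions.

def gaussian_log_prob(actions, means, std):
    """Log-density of actions under a diagonal Gaussian policy."""
    return -0.5 * np.sum(((actions - means) / std) ** 2
                         + np.log(2 * np.pi * std ** 2), axis=-1)

def trajectory_weight(states, actions, theta_target, theta_behavior, std=1.0):
    """Importance weight w = p_target(tau) / p_behavior(tau).

    For a fixed environment the transition dynamics cancel, so the ratio
    reduces to the product of per-step policy likelihood ratios.
    """
    logp_target = gaussian_log_prob(actions, states @ theta_target, std)
    logp_behavior = gaussian_log_prob(actions, states @ theta_behavior, std)
    return np.exp(np.sum(logp_target - logp_behavior))

def is_return_estimate(batch, theta_target, theta_behavior):
    """Offline estimate of the target policy's expected return from a
    behavioral batch of N = 20 (state, action, return) trajectories."""
    weights = np.array([trajectory_weight(s, a, theta_target, theta_behavior)
                        for s, a, _ in batch])
    returns = np.array([r for _, _, r in batch])
    return np.mean(weights * returns)
```

Note that POIS does not optimize this raw estimate directly: the paper's surrogate objective penalizes it with a variance term based on the Rényi divergence between target and behavioral policies, which is omitted here for brevity.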