Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
On the Generalization Gap in Reparameterizable Reinforcement Learning
Authors: Huan Wang, Stephan Zheng, Caiming Xiong, Richard Socher
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We now present empirical measurements in simulations to verify some claims made in section 10 and 11. |
| Researcher Affiliation | Industry | Salesforce Research, Palo Alto, CA, USA. Correspondence to: Huan Wang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Reparameterized MDP and Algorithm 2 Reparameterizable RL |
| Open Source Code | No | The paper does not contain any explicit statement about providing open-source code for the described methodology or a link to a code repository. |
| Open Datasets | No | The paper describes generating synthetic data for its simulations (“randomly sample ξ0, ξ1, . . . , ξT for n = 128 training and testing episodes”) and does not refer to or provide access information for any publicly available or open datasets. |
| Dataset Splits | No | The paper states “randomly sample ξ0, ξ1, . . . , ξT for n = 128 training and testing episodes” but does not provide specific details on dataset splits for training, validation, and testing, such as percentages, absolute counts, or references to predefined splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments or simulations. |
| Software Dependencies | No | The paper mentions using “Adam (Kingma & Ba, 2015) to optimize” but does not provide specific version numbers for any software components, libraries, or programming languages used in the experiments. |
| Experiment Setup | Yes | We set the length of the episode T = 128, and randomly sample ξ0, ξ1, . . . , ξT for n = 128 training and testing episodes. Then we use the same random noise to evaluate a series of policy classes with different temperatures τ ∈ {0.001, 0.01, 0.1, 1, 10, 100, 1000}. ... We use Adam (Kingma & Ba, 2015) to optimize with initial learning rates 10⁻² and 10⁻³. When the reward stops increasing we halved the learning rate. ... for each trial we ran the training for 1024 epochs with learning rate of 1e-2 and 1e-3... |
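
The Experiment Setup row can be read as a small configuration plus a learning-rate rule. The following is a minimal sketch of that reading, not the authors' code (the paper releases none). The names `ExperimentSetup`, `sample_noise`, and `halve_on_plateau` are hypothetical, the standard normal noise distribution is a placeholder assumption because the quoted text does not name the distribution of ξ, and "stops increasing" is interpreted here as "did not exceed the best reward seen so far".

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSetup:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    episode_length: int = 128                      # T
    num_episodes: int = 128                        # n
    temperatures: tuple = (0.001, 0.01, 0.1, 1, 10, 100, 1000)  # tau
    initial_learning_rates: tuple = (1e-2, 1e-3)   # used with Adam
    epochs: int = 1024


def sample_noise(setup, seed):
    """Draw xi_0, ..., xi_T for each of the n episodes.

    ASSUMPTION: standard normal noise is a placeholder; the quoted
    text does not state which distribution the paper samples from.
    """
    rng = random.Random(seed)
    return [
        [rng.gauss(0.0, 1.0) for _ in range(setup.episode_length + 1)]
        for _ in range(setup.num_episodes)
    ]


def halve_on_plateau(lr, best_reward, reward):
    """Halve the learning rate when the reward did not improve.

    ASSUMPTION: "stops increasing" is taken to mean the new reward
    does not exceed the best reward observed so far.
    Returns the (possibly halved) learning rate and the updated best reward.
    """
    if reward > best_reward:
        return lr, reward
    return lr / 2.0, best_reward


setup = ExperimentSetup()
# Fixing the seed lets the same noise be reused across all temperatures.
train_noise = sample_noise(setup, seed=0)
test_noise = sample_noise(setup, seed=1)
```

Sampling the noise once and reusing it for every temperature mirrors the row's statement that "the same random noise" is used to evaluate each policy class. The use of separate seeds for training and testing episodes is likewise an assumption of this sketch.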