On the Generalization Gap in Reparameterizable Reinforcement Learning
Authors: Huan Wang, Stephan Zheng, Caiming Xiong, Richard Socher
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We now present empirical measurements in simulations to verify some claims made in section 10 and 11. |
| Researcher Affiliation | Industry | 1Salesforce Research, Palo Alto CA, USA. Correspondence to: Huan Wang <huan.wang@salesforce.com>. |
| Pseudocode | Yes | Algorithm 1 (Reparameterized MDP) and Algorithm 2 (Reparameterizable RL) |
| Open Source Code | No | The paper does not contain any explicit statement about providing open-source code for the described methodology or a link to a code repository. |
| Open Datasets | No | The paper describes generating synthetic data for its simulations (“randomly sample ξ0, ξ1, ..., ξT for n = 128 training and testing episodes”) and does not refer to or provide access information for any publicly available or open datasets. |
| Dataset Splits | No | The paper states “randomly sample ξ0, ξ1, ..., ξT for n = 128 training and testing episodes” but does not provide specific details on dataset splits for training, validation, and testing, such as percentages, absolute counts, or references to predefined splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments or simulations. |
| Software Dependencies | No | The paper mentions using “Adam (Kingma & Ba, 2015) to optimize” but does not provide specific version numbers for any software components, libraries, or programming languages used in the experiments. |
| Experiment Setup | Yes | We set the length of the episode T = 128, and randomly sample ξ0, ξ1, ..., ξT for n = 128 training and testing episodes. Then we use the same random noise to evaluate a series of policy classes with different temperatures τ ∈ {0.001, 0.01, 0.1, 1, 10, 100, 1000}. ... We use Adam (Kingma & Ba, 2015) to optimize with initial learning rates 10^-2 and 10^-3. When the reward stops increasing, we halve the learning rate. ... for each trial we ran the training for 1024 epochs with learning rates of 1e-2 and 1e-3... (A minimal sketch of this setup follows the table.) |
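
The sketch below is not the authors' code; it only illustrates the two elements quoted above: the reparameterized rollout of Algorithms 1 and 2 (trajectories become deterministic, differentiable functions of the policy weights once the noise ξ0, ..., ξT is pre-sampled) and the quoted training configuration (T = 128, n = 128 episodes, the temperature grid, Adam with initial learning rates 1e-2 and 1e-3, halving on plateau, 1024 epochs). The toy linear dynamics, quadratic reward, linear policy, and the use of PyTorch with `ReduceLROnPlateau` are illustrative assumptions, not details from the paper.

```python
# Hedged sketch of the quoted experiment setup with a toy reparameterizable MDP.
import torch
import torch.nn as nn

T, N, STATE_DIM, ACTION_DIM = 128, 128, 4, 2          # T = 128, n = 128 (from the paper)
TAUS = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]   # temperature grid (from the paper)

torch.manual_seed(0)
# Pre-sample the exogenous noise xi_0, ..., xi_T once and reuse it for every
# policy class, matching "use the same random noise to evaluate a series of
# policy classes with different temperatures".
xi_train = torch.randn(N, T + 1, STATE_DIM)
xi_test = torch.randn(N, T + 1, STATE_DIM)

# Assumed toy dynamics and reward (NOT from the paper): linear transition with
# additive noise and a quadratic state cost.
A = 0.9 * torch.eye(STATE_DIM)
B = 0.1 * torch.randn(STATE_DIM, ACTION_DIM)

def rollout(policy, xi, tau):
    """Reparameterized rollout: with xi fixed, the trajectory and the average
    reward are deterministic, differentiable functions of the policy weights."""
    s = xi[:, 0]                                      # initial state from xi_0
    total = 0.0
    for t in range(T):
        a = torch.softmax(policy(s) / tau, dim=-1)    # temperature-relaxed action
        total = total - (s ** 2).sum(dim=-1).mean()   # reward r(s) = -||s||^2
        s = s @ A.T + a @ B.T + 0.1 * xi[:, t + 1]    # s_{t+1} = g(s_t, a_t, xi_t)
    return total / T

for tau in TAUS:
    for lr in (1e-2, 1e-3):                           # initial learning rates from the paper
        policy = nn.Linear(STATE_DIM, ACTION_DIM)
        opt = torch.optim.Adam(policy.parameters(), lr=lr)
        # Halve the learning rate when the training reward stops increasing.
        sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
            opt, mode="max", factor=0.5, patience=10)
        for epoch in range(1024):                     # 1024 epochs per trial
            opt.zero_grad()
            reward = rollout(policy, xi_train, tau)
            (-reward).backward()                      # ascend the pathwise gradient
            opt.step()
            sched.step(reward.item())
        with torch.no_grad():
            test_reward = rollout(policy, xi_test, tau).item()
        print(f"tau={tau:g} lr={lr:g} train={reward.item():.3f} "
              f"test={test_reward:.3f} gap={reward.item() - test_reward:.3f}")
```

The printed train/test difference plays the role of the empirical generalization gap measured per temperature in the paper's simulations; the plateau-based halving stands in for "when the reward stops increasing, we halve the learning rate", with the patience value chosen arbitrarily here.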