On the Generalization Gap in Reparameterizable Reinforcement Learning

Authors: Huan Wang, Stephan Zheng, Caiming Xiong, Richard Socher

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We now present empirical measurements in simulations to verify some claims made in section 10 and 11.
Researcher Affiliation | Industry | Salesforce Research, Palo Alto, CA, USA. Correspondence to: Huan Wang <huan.wang@salesforce.com>.
Pseudocode | Yes | Algorithm 1 (Reparameterized MDP) and Algorithm 2 (Reparameterizable RL); a hedged rollout sketch follows the table.
Open Source Code | No | The paper does not contain any explicit statement about providing open-source code for the described methodology or a link to a code repository.
Open Datasets | No | The paper describes generating synthetic data for its simulations ("randomly sample ξ_0, ξ_1, ..., ξ_T for n = 128 training and testing episodes") and does not refer to or provide access information for any publicly available or open datasets.
Dataset Splits | No | The paper states "randomly sample ξ_0, ξ_1, ..., ξ_T for n = 128 training and testing episodes" but does not provide specific details on dataset splits for training, validation, and testing, such as percentages, absolute counts, or references to predefined splits.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments or simulations.
Software Dependencies | No | The paper mentions using "Adam (Kingma & Ba, 2015) to optimize" but does not provide specific version numbers for any software components, libraries, or programming languages used in the experiments.
Experiment Setup | Yes | We set the length of the episode T = 128, and randomly sample ξ_0, ξ_1, ..., ξ_T for n = 128 training and testing episodes. Then we use the same random noise to evaluate a series of policy classes with different temperatures τ ∈ {0.001, 0.01, 0.1, 1, 10, 100, 1000}. ... We use Adam (Kingma & Ba, 2015) to optimize with initial learning rates 10^-2 and 10^-3. When the reward stops increasing we halved the learning rate. ... for each trial we ran the training for 1024 epochs with learning rates of 1e-2 and 1e-3 ... (A hedged training-loop sketch also follows the table.)
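The Pseudocode row refers to the paper's Algorithm 1 (Reparameterized MDP) and Algorithm 2 (Reparameterizable RL). Below is a minimal Python sketch of the underlying idea: the rollout noise ξ_0, ..., ξ_T is sampled once and then held fixed, so the episode return becomes a deterministic function of the policy parameters. The function names, the 1-D linear dynamics, and the quadratic reward are illustrative assumptions, not the paper's definitions.

```python
# Minimal sketch of a reparameterized rollout, assuming the environment and
# policy noise have been "pushed out" into pre-sampled variables xi_t.
import numpy as np

def reparameterized_rollout(theta, xi, s0, transition, policy, reward, T=128):
    """Roll out T steps with the noise sequence xi held fixed, so the total
    return is a deterministic function of the policy parameters theta."""
    s, total = s0, 0.0
    for t in range(T):
        a = policy(theta, s, xi[t])        # noise enters the policy explicitly
        total += reward(s, a)
        s = transition(s, a, xi[t + 1])    # and the transition, via xi[t + 1]
    return total

# Illustrative 1-D example; the dynamics and reward below are assumptions.
rng = np.random.default_rng(0)
T = 128
xi = rng.standard_normal(T + 1)            # sampled once, then reused every epoch
policy = lambda theta, s, eps: theta * s + 0.1 * eps
transition = lambda s, a, eps: 0.9 * s + a + 0.01 * eps
reward = lambda s, a: -(s ** 2) - 0.01 * (a ** 2)
print(reparameterized_rollout(0.5, xi, s0=1.0, transition=transition,
                              policy=policy, reward=reward, T=T))
```

In the quoted setup, the training and testing episodes are simply two such fixed noise sets, which is what makes the train/test generalization gap well defined for a deterministic return.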
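The Experiment Setup row quotes a protocol with T = 128, n = 128 fixed training and testing noise episodes, a temperature sweep over τ, Adam with initial learning rates of 1e-2 and 1e-3, learning-rate halving when the reward stops increasing, and 1024 epochs per trial. The sketch below wires those quoted settings into a PyTorch-style loop; the toy differentiable return, the two-action softmax policy, and the plateau patience are assumptions made only for illustration.

```python
# Hedged sketch of the quoted training protocol; only the constants come from
# the paper, the model and return function are stand-ins.
import torch

T, n_episodes, epochs = 128, 128, 1024
temperatures = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
xi_train = torch.randn(n_episodes, T + 1)      # noise sampled once, then reused
xi_test = torch.randn(n_episodes, T + 1)

def episode_return(theta, xi, tau):
    """Toy differentiable surrogate for the reparameterized return (assumption):
    a two-action softmax policy whose logits are scaled by the temperature tau."""
    logits = torch.outer(xi[:, :T].reshape(-1), theta) / tau
    probs = torch.softmax(logits, dim=-1)
    return (probs[:, 0] * xi[:, 1:].reshape(-1)).mean()

for tau in temperatures:
    for lr in (1e-2, 1e-3):                    # the two quoted initial learning rates
        theta = torch.zeros(2, requires_grad=True)
        opt = torch.optim.Adam([theta], lr=lr)
        # Stand-in for "halve the learning rate when the reward stops increasing";
        # the patience value is an assumption.
        sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
            opt, mode="max", factor=0.5, patience=10)
        for epoch in range(epochs):            # 1024 epochs per trial
            opt.zero_grad()
            reward = episode_return(theta, xi_train, tau)
            (-reward).backward()               # gradient ascent on the return
            opt.step()
            sched.step(reward.item())
        test_reward = episode_return(theta, xi_test, tau).item()
```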