On the Generalization Gap in Reparameterizable Reinforcement Learning
Authors: Huan Wang, Stephan Zheng, Caiming Xiong, Richard Socher
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We now present empirical measurements in simulations to verify some claims made in section 10 and 11. |
| Researcher Affiliation | Industry | 1Salesforce Research, Palo Alto CA, USA. Correspondence to: Huan Wang <huan.wang@salesforce.com>. |
| Pseudocode | Yes | Algorithm 1 (Reparameterized MDP) and Algorithm 2 (Reparameterizable RL) |
| Open Source Code | No | The paper does not contain any explicit statement about providing open-source code for the described methodology or a link to a code repository. |
| Open Datasets | No | The paper describes generating synthetic data for its simulations (“randomly sample ξ0, ξ1, ..., ξT for n = 128 training and testing episodes”) and does not refer to or provide access information for any publicly available or open datasets. |
| Dataset Splits | No | The paper states “randomly sample ξ0, ξ1, ..., ξT for n = 128 training and testing episodes” but does not provide specific details on dataset splits for training, validation, and testing, such as percentages, absolute counts, or references to predefined splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments or simulations. |
| Software Dependencies | No | The paper mentions using “Adam (Kingma & Ba, 2015) to optimize” but does not provide specific version numbers for any software components, libraries, or programming languages used in the experiments. |
| Experiment Setup | Yes | We set the length of the episode T = 128, and randomly sample ξ0, ξ1, ..., ξT for n = 128 training and testing episodes. Then we use the same random noise to evaluate a series of policy classes with different temperatures τ ∈ {0.001, 0.01, 0.1, 1, 10, 100, 1000}. ... We use Adam (Kingma & Ba, 2015) to optimize with initial learning rates 10^-2 and 10^-3. When the reward stops increasing, we halve the learning rate. ... for each trial we ran the training for 1024 epochs with learning rates of 1e-2 and 1e-3... (A minimal sketch of this setup follows the table.) |
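
The sketch below is not the authors' code; it only illustrates the two elements quoted above: the reparameterized rollout of Algorithms 1 and 2 (trajectories become deterministic, differentiable functions of the policy weights once the noise ξ0, ..., ξT is pre-sampled) and the quoted training configuration (T = 128, n = 128 episodes, the temperature grid, Adam with initial learning rates 1e-2 and 1e-3, halving on plateau, 1024 epochs). The toy linear dynamics, quadratic reward, linear policy, and the use of PyTorch with `ReduceLROnPlateau` are illustrative assumptions, not details from the paper.

```python
# Hedged sketch of the quoted experiment setup with a toy reparameterizable MDP.
import torch
import torch.nn as nn

T, N, STATE_DIM, ACTION_DIM = 128, 128, 4, 2          # T = 128, n = 128 (from the paper)
TAUS = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]   # temperature grid (from the paper)

torch.manual_seed(0)
# Pre-sample the exogenous noise xi_0, ..., xi_T once and reuse it for every
# policy class, matching "use the same random noise to evaluate a series of
# policy classes with different temperatures".
xi_train = torch.randn(N, T + 1, STATE_DIM)
xi_test = torch.randn(N, T + 1, STATE_DIM)

# Assumed toy dynamics and reward (NOT from the paper): linear transition with
# additive noise and a quadratic state cost.
A = 0.9 * torch.eye(STATE_DIM)
B = 0.1 * torch.randn(STATE_DIM, ACTION_DIM)

def rollout(policy, xi, tau):
    """Reparameterized rollout: with xi fixed, the trajectory and the average
    reward are deterministic, differentiable functions of the policy weights."""
    s = xi[:, 0]                                      # initial state from xi_0
    total = 0.0
    for t in range(T):
        a = torch.softmax(policy(s) / tau, dim=-1)    # temperature-relaxed action
        total = total - (s ** 2).sum(dim=-1).mean()   # reward r(s) = -||s||^2
        s = s @ A.T + a @ B.T + 0.1 * xi[:, t + 1]    # s_{t+1} = g(s_t, a_t, xi_t)
    return total / T

for tau in TAUS:
    for lr in (1e-2, 1e-3):                           # initial learning rates from the paper
        policy = nn.Linear(STATE_DIM, ACTION_DIM)
        opt = torch.optim.Adam(policy.parameters(), lr=lr)
        # Halve the learning rate when the training reward stops increasing.
        sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
            opt, mode="max", factor=0.5, patience=10)
        for epoch in range(1024):                     # 1024 epochs per trial
            opt.zero_grad()
            reward = rollout(policy, xi_train, tau)
            (-reward).backward()                      # ascend the pathwise gradient
            opt.step()
            sched.step(reward.item())
        with torch.no_grad():
            test_reward = rollout(policy, xi_test, tau).item()
        print(f"tau={tau:g} lr={lr:g} train={reward.item():.3f} "
              f"test={test_reward:.3f} gap={reward.item() - test_reward:.3f}")
```

The printed train/test difference plays the role of the empirical generalization gap measured per temperature in the paper's simulations; the plateau-based halving stands in for "when the reward stops increasing, we halve the learning rate", with the patience value chosen arbitrarily here.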