Single Episode Policy Transfer in Reinforcement Learning

Authors: Jiachen Yang, Brenden Petersen, Hongyuan Zha, Daniel Faissol

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted experiments on three benchmark domains with diverse challenges to evaluate the performance, speed of reward attainment, and computational time of SEPT versus five baselines in the single test episode. We evaluated four ablations and variants of SEPT to investigate the necessity of all algorithmic design choices.
Researcher Affiliation | Collaboration | Jiachen Yang, Georgia Institute of Technology, USA (jiachen.yang@gatech.edu); Brenden Petersen, Lawrence Livermore National Laboratory, USA (petersen33@llnl.gov); Hongyuan Zha, Georgia Institute of Technology, USA (zha@cc.gatech.edu); Daniel Faissol, Lawrence Livermore National Laboratory, USA (faissol1@llnl.gov)
Pseudocode | Yes | Algorithm 1: Single Episode Policy Transfer, training phase; Algorithm 2: Single Episode Policy Transfer, testing phase
Open Source Code | Yes | Code for all experiments is available at https://github.com/011235813/SEPT.
Open Datasets | Yes | We use the same continuous state, discrete action HiP-MDPs proposed by Killian et al. (2017) for benchmarking.
Dataset Splits | Yes | There are 2, 8, and 5 unique training instances, and 2, 5, and 5 validation instances, respectively. Hyperparameters were adjusted using a coarse coordinate search on validation performance.
Hardware Specification | No | The paper discusses computation time but does not provide specific details regarding the hardware (e.g., CPU, GPU models, memory) used for the experiments.
Software Dependencies | No | The paper mentions using 'DDQN with prioritized replay' and provides hyperparameters for it, but does not list the specific software frameworks (e.g., PyTorch, TensorFlow) or library versions used for implementation.
Experiment Setup | Yes | VAE learning rate was 1e-4 for all experiments. Size of the dataset D of probe trajectories was limited to 1000, with earliest trajectories discarded. 10 minibatches from D were used for each VAE training step. We used β = 1 for the VAE. Probe policy learning rate was 1e-3 for all experiments. DDQN minibatch size was 32, one training step was done for every 10 environment steps, ε_end = 0.15, learning rate was 1e-3, gradient clip was 2.5, γ = 0.99, and target network update rate was 5e-3.
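For readability, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object. The sketch below is illustrative only: the field names (e.g. vae_lr, probe_buffer_size) are hypothetical and need not match the variable names used in the released SEPT code; only the numeric values come from the paper.

```python
# Hedged sketch: hyperparameters reported in the paper's experiment setup,
# collected into one config dict. Field names are hypothetical; only the
# numeric values are taken from the paper.
SEPT_HYPERPARAMS = {
    # VAE over probe trajectories
    "vae_lr": 1e-4,                  # VAE learning rate (all experiments)
    "vae_beta": 1.0,                 # beta weight on the VAE's KL term
    "probe_buffer_size": 1000,       # max size of probe-trajectory dataset D (earliest discarded)
    "vae_minibatches_per_step": 10,  # minibatches drawn from D per VAE training step

    # Probe policy
    "probe_policy_lr": 1e-3,

    # DDQN control policy with prioritized replay
    "ddqn_minibatch_size": 32,
    "env_steps_per_train_step": 10,  # one DDQN training step every 10 environment steps
    "epsilon_end": 0.15,             # final epsilon for epsilon-greedy exploration
    "ddqn_lr": 1e-3,
    "grad_clip": 2.5,
    "gamma": 0.99,                   # discount factor
    "target_update_rate": 5e-3,      # soft target-network update rate
}
```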
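The Dataset Splits row notes that hyperparameters were tuned by a coarse coordinate search on validation performance. The snippet below is a generic illustration of that procedure, not the authors' tuning script; the candidate grids and the evaluate_on_validation function are hypothetical placeholders.

```python
# Hedged sketch of a coarse coordinate search: sweep one hyperparameter at a
# time over a small grid, keep the value that maximizes validation performance,
# then move on to the next hyperparameter. Grids and the evaluation function
# are placeholders, not values from the paper.
def coordinate_search(base_config, grids, evaluate_on_validation):
    """grids: dict mapping hyperparameter name -> list of candidate values."""
    config = dict(base_config)
    for name, candidates in grids.items():
        best_value, best_score = config[name], float("-inf")
        for value in candidates:
            trial = dict(config, **{name: value})
            score = evaluate_on_validation(trial)  # e.g. mean return on validation instances
            if score > best_score:
                best_value, best_score = value, score
        config[name] = best_value  # fix this coordinate before tuning the next one
    return config
```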