Single Episode Policy Transfer in Reinforcement Learning

Authors: Jiachen Yang, Brenden Petersen, Hongyuan Zha, Daniel Faissol

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted experiments on three benchmark domains with diverse challenges to evaluate the performance, speed of reward attainment, and computational time of SEPT versus five baselines in the single test episode. We evaluated four ablations and variants of SEPT to investigate the necessity of all algorithmic design choices.
Researcher Affiliation | Collaboration | Jiachen Yang, Georgia Institute of Technology, USA (jiachen.yang@gatech.edu); Brenden Petersen, Lawrence Livermore National Laboratory, USA (petersen33@llnl.gov); Hongyuan Zha, Georgia Institute of Technology, USA (zha@cc.gatech.edu); Daniel Faissol, Lawrence Livermore National Laboratory, USA (faissol1@llnl.gov)
Pseudocode | Yes | Algorithm 1: Single Episode Policy Transfer, training phase; Algorithm 2: Single Episode Policy Transfer, testing phase
Open Source Code | Yes | Code for all experiments is available at https://github.com/011235813/SEPT.
Open Datasets | Yes | We use the same continuous state, discrete action HiP-MDPs proposed by Killian et al. (2017) for benchmarking.
Dataset Splits | Yes | There are 2, 8, and 5 unique training instances, and 2, 5, and 5 validation instances, respectively. Hyperparameters were adjusted using a coarse coordinate search on validation performance.
Hardware Specification | No | The paper discusses computation time but does not provide specific details regarding the hardware (e.g., CPU, GPU models, memory) used for the experiments.
Software Dependencies | No | The paper mentions using 'DDQN with prioritized replay' and provides hyperparameters for it, but does not list the specific software frameworks (e.g., PyTorch, TensorFlow) or library versions used for implementation.
Experiment Setup | Yes | VAE learning rate was 1e-4 for all experiments. Size of the dataset D of probe trajectories was limited to 1000, with earliest trajectories discarded. 10 minibatches from D were used for each VAE training step. We used β = 1 for the VAE. Probe policy learning rate was 1e-3 for all experiments. DDQN minibatch size was 32, one training step was done for every 10 environment steps, ε_end = 0.15, learning rate was 1e-3, gradient clip was 2.5, γ = 0.99, and target network update rate was 5e-3.
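For readability, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object. The sketch below is illustrative only: the field names (e.g. vae_lr, probe_buffer_size) are hypothetical and need not match the variable names used in the released SEPT code; only the numeric values come from the paper.

```python
# Hedged sketch: hyperparameters reported in the paper's experiment setup,
# collected into one config dict. Field names are hypothetical; only the
# numeric values are taken from the paper.
SEPT_HYPERPARAMS = {
    # VAE over probe trajectories
    "vae_lr": 1e-4,                  # VAE learning rate (all experiments)
    "vae_beta": 1.0,                 # beta weight on the VAE's KL term
    "probe_buffer_size": 1000,       # max size of probe-trajectory dataset D (earliest discarded)
    "vae_minibatches_per_step": 10,  # minibatches drawn from D per VAE training step

    # Probe policy
    "probe_policy_lr": 1e-3,

    # DDQN control policy with prioritized replay
    "ddqn_minibatch_size": 32,
    "env_steps_per_train_step": 10,  # one DDQN training step every 10 environment steps
    "epsilon_end": 0.15,             # final epsilon for epsilon-greedy exploration
    "ddqn_lr": 1e-3,
    "grad_clip": 2.5,
    "gamma": 0.99,                   # discount factor
    "target_update_rate": 5e-3,      # soft target-network update rate
}
```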
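The Dataset Splits row notes that hyperparameters were tuned by a coarse coordinate search on validation performance. The snippet below is a generic illustration of that procedure, not the authors' tuning script; the candidate grids and the evaluate_on_validation function are hypothetical placeholders.

```python
# Hedged sketch of a coarse coordinate search: sweep one hyperparameter at a
# time over a small grid, keep the value that maximizes validation performance,
# then move on to the next hyperparameter. Grids and the evaluation function
# are placeholders, not values from the paper.
def coordinate_search(base_config, grids, evaluate_on_validation):
    """grids: dict mapping hyperparameter name -> list of candidate values."""
    config = dict(base_config)
    for name, candidates in grids.items():
        best_value, best_score = config[name], float("-inf")
        for value in candidates:
            trial = dict(config, **{name: value})
            score = evaluate_on_validation(trial)  # e.g. mean return on validation instances
            if score > best_score:
                best_value, best_score = value, score
        config[name] = best_value  # fix this coordinate before tuning the next one
    return config
```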