Single Episode Policy Transfer in Reinforcement Learning
Authors: Jiachen Yang, Brenden Petersen, Hongyuan Zha, Daniel Faissol
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted experiments on three benchmark domains with diverse challenges to evaluate the performance, speed of reward attainment, and computational time of SEPT versus five baselines in the single test episode. We evaluated four ablations and variants of SEPT to investigate the necessity of all algorithmic design choices. |
| Researcher Affiliation | Collaboration | Jiachen Yang, Georgia Institute of Technology, USA (jiachen.yang@gatech.edu); Brenden Petersen, Lawrence Livermore National Laboratory, USA (petersen33@llnl.gov); Hongyuan Zha, Georgia Institute of Technology, USA (zha@cc.gatech.edu); Daniel Faissol, Lawrence Livermore National Laboratory, USA (faissol1@llnl.gov) |
| Pseudocode | Yes | Algorithm 1 (Single Episode Policy Transfer: training phase) and Algorithm 2 (Single Episode Policy Transfer: testing phase); a hedged sketch of the test-time flow appears after the table. |
| Open Source Code | Yes | Code for all experiments is available at https://github.com/011235813/SEPT. |
| Open Datasets | Yes | We use the same continuous-state, discrete-action HiP-MDPs proposed by Killian et al. (2017) for benchmarking. |
| Dataset Splits | Yes | There are 2, 8, and 5 unique training instances, and 2, 5, and 5 validation instances, respectively. Hyperparameters were adjusted using a coarse coordinate search on validation performance. |
| Hardware Specification | No | The paper discusses computation time but does not provide specific details regarding the hardware (e.g., CPU, GPU models, memory) used for the experiments. |
| Software Dependencies | No | The paper mentions using 'DDQN with prioritized replay' and provides hyperparameters for it, but does not list the specific software frameworks or library versions (e.g., PyTorch or TensorFlow releases) used for the implementation. |
| Experiment Setup | Yes | VAE learning rate was 1e-4 for all experiments. Size of the dataset D of probe trajectories was limited to 1000, with earliest trajectories discarded. 10 minibatches from D were used for each VAE training step. We used β = 1 for the VAE. Probe policy learning rate was 1e-3 for all experiments. DDQN minibatch size was 32, one training step was done for every 10 environment steps, ϵ_end = 0.15, learning rate was 1e-3, gradient clip was 2.5, γ = 0.99, and target network update rate was 5e-3. (These values are collected in the configuration sketch below.) |
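
The hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration block. The sketch below is illustrative only: the dictionary name and key names are ours, not identifiers from the authors' released code, and the values simply restate what is reported above.

```python
# Illustrative collection of the hyperparameters reported in the Experiment
# Setup row; names are ours, values are quoted from the paper.
SEPT_HYPERPARAMS = {
    # VAE (probe-trajectory encoder)
    "vae_learning_rate": 1e-4,       # all experiments
    "vae_beta": 1.0,                 # beta = 1 for the VAE
    "probe_dataset_size": 1000,      # max size of dataset D; earliest trajectories discarded
    "vae_minibatches_per_step": 10,  # minibatches drawn from D per VAE training step

    # Probe policy
    "probe_learning_rate": 1e-3,     # all experiments

    # Control policy: DDQN with prioritized replay
    "ddqn_minibatch_size": 32,
    "env_steps_per_train_step": 10,  # one training step every 10 environment steps
    "epsilon_end": 0.15,             # final epsilon for epsilon-greedy exploration
    "ddqn_learning_rate": 1e-3,
    "gradient_clip": 2.5,
    "gamma": 0.99,                   # discount factor
    "target_update_rate": 5e-3,      # soft target-network update rate
}
```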
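
To give a concrete picture of the testing-phase pseudocode referenced in the Pseudocode row (Algorithm 2), the following is a minimal sketch of the assumed flow: roll out the probe policy for a few steps, encode the resulting probe trajectory with the trained VAE to estimate the latent descriptor z of the hidden MDP parameters, then act with the z-conditioned control policy (DDQN) for the remainder of the single test episode. All function and argument names (`sept_test_episode`, `probe_policy.act`, `vae_encoder.encode`, `control_policy.act`) are hypothetical, and the classic Gym-style step signature is assumed; this is not the authors' implementation.

```python
def sept_test_episode(env, probe_policy, vae_encoder, control_policy,
                      num_probe_steps):
    """Hedged sketch of the single test episode (cf. Algorithm 2)."""
    state = env.reset()
    probe_trajectory = []
    episode_return = 0.0
    done = False

    # 1. Probe phase: gather a short trajectory with the probe policy.
    for _ in range(num_probe_steps):
        action = probe_policy.act(state)
        next_state, reward, done, _ = env.step(action)
        probe_trajectory.append((state, action))
        episode_return += reward
        state = next_state
        if done:
            break

    # 2. Inference: encode the probe trajectory with the trained VAE encoder
    #    to estimate the latent descriptor z of the hidden MDP parameters.
    z = vae_encoder.encode(probe_trajectory)

    # 3. Control phase: act with the z-conditioned control policy (DDQN)
    #    for the remainder of the same episode.
    while not done:
        action = control_policy.act(state, z)
        state, reward, done, _ = env.step(action)
        episode_return += reward

    return episode_return
```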