Low-Switching Policy Gradient with Exploration via Online Sensitivity Sampling
Authors: Yunfan Li, Yiran Wang, Yu Cheng, Lin Yang
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Moreover, we empirically test our theory with deep neural nets to show the benefits of the theoretical inspiration. |
| Researcher Affiliation | Collaboration | 1Department of Electrical and Computer Engineering, University of California, Los Angeles, Los Angeles, CA, USA 2Microsoft Research, Redmond, WA, USA. |
| Pseudocode | Yes | Algorithm 1 LPO; Algorithm 2 LPO (Practical Implementation); Algorithm 3 S-Sampling (Sensitivity-Sampling); Algorithm 4 Policy Update; Algorithm 5 Behaviour Policy Sampling; Algorithm 6 Policy Evaluation Oracle; Algorithm 7 d-sampler |
| Open Source Code | No | The paper states: "We implemented our method based on the open source package (Raffin et al., 2021)", indicating they used an existing open-source framework (Stable-Baselines3) for their implementation, but they do not explicitly state that the source code for their *own* specific methodology (LPO) is publicly available or provide a link to it. |
| Open Datasets | Yes | To further illustrate the effectiveness of our width function and our proposed sensitivity sampling, we compare (Schulman et al., 2017; Feng et al., 2021) with our proposed LPO in sparse reward MuJoCo environments (Todorov et al., 2012). |
| Dataset Splits | No | The paper uses the MuJoCo environments but does not explicitly state specific training, validation, and test dataset splits needed for reproduction. |
| Hardware Specification | No | The paper does not provide specific hardware details (like GPU/CPU models or memory specifications) used for running its experiments. |
| Software Dependencies | No | The paper states: "We implemented our method based on the open source package (Raffin et al., 2021)". While this refers to Stable-Baselines3, it does not specify its version number or any other software dependencies with version numbers, which is necessary for reproducibility. |
| Experiment Setup | Yes | The detailed hyperparameters are shown in Table G. Hyperparameter: Value (LPO, ENIAC) / Value (PPO) — N: 2048 / 2048; T: 2e6 / 2e6; λ: 0.95 / 0.95; γ(int): 0.999 / –; γ(ext): 0.99 / 0.99; α: 2 / –; β: 1 / –; Learning rate: 1e-4 / 1e-4; Batch size: 32, 16 / 32, 16; Number of epochs per iteration: 10 / 10 |
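For concreteness, the hyperparameters reported in the paper's Table G can be collected into plain configuration dictionaries. This is an illustrative sketch only: the dictionary structure and key names (e.g. `rollout_length_N`, `gamma_intrinsic`) are our own labeling of the reported values, not code from the paper or from Stable-Baselines3.

```python
# Hyperparameters from Table G of the paper, gathered into plain dicts.
# Keys are our own descriptive names; values are as reported.
lpo_eniac_config = {
    "rollout_length_N": 2048,        # N: environment steps per iteration
    "total_timesteps_T": int(2e6),   # T: total training steps
    "gae_lambda": 0.95,              # λ
    "gamma_intrinsic": 0.999,        # γ(int): discount for exploration bonus
    "gamma_extrinsic": 0.99,         # γ(ext): discount for environment reward
    "alpha": 2,                      # α
    "beta": 1,                       # β
    "learning_rate": 1e-4,
    "batch_sizes": (32, 16),
    "epochs_per_iteration": 10,
}

ppo_config = {
    "rollout_length_N": 2048,
    "total_timesteps_T": int(2e6),
    "gae_lambda": 0.95,
    "gamma_extrinsic": 0.99,         # PPO baseline has no intrinsic discount
    "learning_rate": 1e-4,
    "batch_sizes": (32, 16),
    "epochs_per_iteration": 10,
}
```

A dictionary like this could be passed to a training script or logged alongside results, which makes the shared settings between the LPO/ENIAC and PPO runs (N, T, λ, γ(ext), learning rate) easy to verify at a glance.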