Exploring Parameter Space with Structured Noise for Meta-Reinforcement Learning
Authors: Hui Xu, Chong Zhang, Jiaxing Wang, Deqiang Ouyang, Yu Zheng, Jie Shao
IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on four groups of tasks: cheetah velocity, cheetah direction, ant velocity and ant direction demonstrate the superiority of ESNPS against a number of competitive baselines. We evaluate the proposed ESNPS on four reinforcement learning tasks with the MuJoCo simulator [Todorov et al., 2012]. |
| Researcher Affiliation | Collaboration | Hui Xu (1), Chong Zhang (2), Jiaxing Wang (3,4), Deqiang Ouyang (1), Yu Zheng (2) and Jie Shao (1,5). Affiliations: (1) University of Electronic Science and Technology of China; (2) Tencent Robotics X; (3) Institute of Automation, Chinese Academy of Sciences; (4) University of Chinese Academy of Sciences; (5) Sichuan Artificial Intelligence Research Institute |
| Pseudocode | Yes | Algorithm 1 ESNPS algorithm |
| Open Source Code | No | The paper does not provide concrete access to source code (specific repository link, explicit code release statement, or code in supplementary materials) for the methodology described in this paper. |
| Open Datasets | No | The paper mentions using MuJoCo simulator for tasks like 'cheetah velocity' and 'ant direction', which are standard reinforcement learning environments, but does not provide explicit access information (link, DOI, formal citation) for a specific dataset used for training or testing. |
| Dataset Splits | No | The paper mentions constructing 'meta-test set' tasks and following a 'protocol proposed in Finn et al. [2017]', but it does not specify explicit data splits (e.g., percentages or counts for train/validation/test) for the individual tasks or the overall data used to train the models within those tasks. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'Trust Region Policy Optimization (TRPO)' and a 'neural network policy with two hidden layers of size 100, and ReLU nonlinearities', but does not provide specific software names with version numbers for reproducibility. |
| Experiment Setup | Yes | For all experiments, we use a neural network policy with two hidden layers of size 100 and ReLU nonlinearities. The horizon is set to H = 200, with 20 rollouts per gradient step for all groups of tasks except the ant direction task, which uses 40 rollouts. The number of gradient steps k for fine-tuning is always set to 4. We set the noise level to σ = 0.01. The scaling factor α then adaptively increases or decreases depending on whether the distance is below or above a certain threshold: α = λα if d(π, π̃) < δ, otherwise α = α/λ (Eq. 9), where λ ∈ ℝ+ is used to rescale α and is set to 1.1 in our experiments, and δ is a threshold controlling the acceptable change of actions due to noise injection. |
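
To make the quoted setup easier to follow, below is a minimal NumPy sketch (not the authors' code, which is not released) of the two-hidden-layer ReLU policy and the adaptive noise-scaling rule of Eq. (9) from the Experiment Setup row above. The threshold value `DELTA` and the distance function `action_distance` are illustrative assumptions; the excerpt only states that δ bounds the acceptable change in actions caused by noise injection.

```python
import numpy as np

# Values quoted in the Experiment Setup row.
SIGMA = 0.01     # noise level sigma
LAMBDA = 1.1     # rescaling factor lambda applied to alpha
HIDDEN = 100     # two hidden layers of size 100, ReLU nonlinearities

# Illustrative assumption: the excerpt does not give a numeric value for delta.
DELTA = 0.05


def mlp_policy(params, obs):
    """Two-hidden-layer ReLU policy network as described in the setup.

    params = (W1, b1, W2, b2, W3, b3); obs is a batch of observations.
    """
    W1, b1, W2, b2, W3, b3 = params
    h = np.maximum(obs @ W1 + b1, 0.0)   # ReLU
    h = np.maximum(h @ W2 + b2, 0.0)     # ReLU
    return h @ W3 + b3                   # action output


def action_distance(a_clean, a_noisy):
    """Assumed distance d(pi, pi~): mean Euclidean gap between the actions of
    the unperturbed and the noise-perturbed policy on the same batch of states.
    """
    return float(np.mean(np.linalg.norm(a_clean - a_noisy, axis=-1)))


def update_alpha(alpha, a_clean, a_noisy, lam=LAMBDA, delta=DELTA):
    """Adaptive scaling of Eq. (9): alpha <- lambda * alpha if
    d(pi, pi~) < delta, otherwise alpha <- alpha / lambda."""
    if action_distance(a_clean, a_noisy) < delta:
        return lam * alpha
    return alpha / lam
```

In a training loop, the product α·σ would scale the parameter-space noise injected before collecting the 20 rollouts per gradient step (40 for ant direction), with k = 4 fine-tuning gradient steps at meta-test time, per the quoted setup.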