Learning Environmental Calibration Actions for Policy Self-Evolution

Authors: Chao Zhang, Yang Yu, Zhi-Hua Zhou

IJCAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Taking three robotic arm controlling tasks as the test beds, we show that the proposed method can learn a fine policy for a new arm with only a few (e.g. five) samples of the target environment." and, from Section 4 (Experiments): "We empirically evaluate POSEC, particularly, answering the following questions: Q1: Can the learned calibration actions effectively extract features for the environment, and be better than random actions? Does the number of the calibration actions affect the performance? Q2: How do the calibration actions act? Q3: Can the self-evolved policy serve as a better initial policy for environment-specific refinement?"
Researcher Affiliation | Academia | Chao Zhang, Yang Yu, Zhi-Hua Zhou; National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China; {zhangc,yuy,zhouzh}@lamda.nju.edu.cn
Pseudocode | Yes | Algorithm 1 (POSEC Training Process) and Algorithm 2 (POSEC Calibration Process)
Open Source Code | Yes | "The experiment codes are at https://github.com/eyounx/POSEC."
Open Datasets | Yes | "We employ three robotic arm controlling tasks that use Mujoco physics simulator from Open AI Gym (https://gym.openai.com)." (see the environment-loading sketch after the table)
Dataset Splits | Yes | "We collect M_1 environments, {MDP_1, MDP_2, ..., MDP_{M_1}}. ... We then draw another set of M_2 environments, {MDP'_1, MDP'_2, ..., MDP'_{M_2}} ... Finally, we generate M_3 = 20 environments to evaluate the regression model and the calibration actions." (see the environment-split sketch after the table)
Hardware Specification | No | No specific hardware details (GPU/CPU models, memory, or machine configurations) used for running the experiments are mentioned.
Software Dependencies | No | The paper mentions the 'Mujoco physics simulator', 'Open AI Gym', 'TRPO', and 'SRACOS' from 'ZOOpt', but does not provide version numbers for these software components.
Experiment Setup | Yes | "For each environment of each task, a base policy is trained, and all these 100 policies are represented as neural networks with the same structure (two hidden layers with 64 nodes). In the TRPO training process, we set the discount factor γ to be 0.99 and the number of iterations to be 250. ... We use the algorithm implementation from https://github.com/eyounx/ZOOpt, with the sample budget 250." (see the configuration sketch after the table)
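
Environment-loading sketch (for the Open Datasets row above). The quoted text names only "three robotic arm controlling tasks" run in MuJoCo through OpenAI Gym; the task ID 'Reacher-v2' and the classic (pre-0.26) Gym API calls below are illustrative assumptions, not details confirmed by the paper.

```python
# Minimal sketch: loading a MuJoCo arm-control task from OpenAI Gym.
# 'Reacher-v2' is a placeholder; the paper's concrete task IDs are not quoted.
import gym

env = gym.make("Reacher-v2")            # hypothetical arm-control task ID
obs = env.reset()                        # classic Gym API: reset() returns obs only
for _ in range(5):                       # "a few (e.g. five) samples" per target environment
    action = env.action_space.sample()   # random probe action as a stand-in
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
env.close()
```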
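
Environment-split sketch (for the Dataset Splits row above). Only M_3 = 20 is given in the quoted text; the values of M_1 and M_2, and the way environment parameters are drawn, are placeholders in this sketch rather than values reported in the paper.

```python
# Sketch of the three environment sets described in the Dataset Splits row.
# Only M3 = 20 comes from the paper; everything else here is assumed.
import numpy as np

rng = np.random.default_rng(0)

def sample_env_params():
    # Hypothetical: each environment is identified by a small vector of
    # physical parameters (e.g. link lengths); the real parameterization
    # comes from the MuJoCo task and is not shown in the quoted text.
    return rng.uniform(low=0.8, high=1.2, size=3)

M1, M2, M3 = 50, 30, 20                                  # only M3 = 20 is stated
train_envs = [sample_env_params() for _ in range(M1)]    # base-policy training environments
regress_envs = [sample_env_params() for _ in range(M2)]  # environments for the regression model
eval_envs = [sample_env_params() for _ in range(M3)]     # held-out evaluation environments
```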
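
Configuration sketch (for the Experiment Setup row above). It restates the reported settings, two hidden layers of 64 units, discount factor 0.99, 250 TRPO iterations, and a ZOOpt sample budget of 250, as code; the deep-learning framework, the tanh activations, the input/output sizes, and the derivative-free objective are all assumptions rather than details from the paper.

```python
# Illustrative configuration only; PyTorch, tanh activations, the placeholder
# sizes, and the toy objective are assumptions, not taken from the paper.
import torch.nn as nn
from zoopt import Dimension, Objective, Parameter, Opt

obs_dim, act_dim = 11, 2                 # placeholder sizes; task-dependent in practice

# Policy network: "two hidden layers with 64 nodes" (quoted above).
policy = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, act_dim),
)

# TRPO hyperparameters quoted above (the training loop itself is omitted).
GAMMA = 0.99
TRPO_ITERATIONS = 250

# Derivative-free search with ZOOpt (https://github.com/eyounx/ZOOpt),
# sample budget 250 as quoted; the objective below is only a stand-in.
def objective(solution):
    x = solution.get_x()                 # candidate parameter vector
    return sum(v * v for v in x)         # placeholder loss, not the paper's objective

dim = Dimension(5, [[-1.0, 1.0]] * 5, [True] * 5)   # 5 continuous variables (placeholder)
solution = Opt.min(Objective(objective, dim), Parameter(budget=250))
print(solution.get_x(), solution.get_value())
```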