Learning Environmental Calibration Actions for Policy Self-Evolution

Authors: Chao Zhang, Yang Yu, Zhi-Hua Zhou

IJCAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Taking three robotic arm controlling tasks as the test beds, we show that the proposed method can learn a fine policy for a new arm with only a few (e.g. five) samples of the target environment." and, from Section 4 (Experiments): "We empirically evaluate POSEC, particularly, answering the following questions: Q1: Can the learned calibration actions effectively extract features for the environment, and be better than random actions? Does the number of the calibration actions affect the performance? Q2: How do the calibration actions act? Q3: Can the self-evolved policy serve as a better initial policy for environment-specific refinement?"
Researcher Affiliation | Academia | Chao Zhang, Yang Yu, Zhi-Hua Zhou; National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China; {zhangc,yuy,zhouzh}@lamda.nju.edu.cn
Pseudocode | Yes | Algorithm 1 (POSEC Training Process) and Algorithm 2 (POSEC Calibration Process)
Open Source Code | Yes | "The experiment codes are at https://github.com/eyounx/POSEC."
Open Datasets | Yes | "We employ three robotic arm controlling tasks that use Mujoco physics simulator from Open AI Gym (https://gym.openai.com)." (see the environment-loading sketch after the table)
Dataset Splits | Yes | "We collect M_1 environments, {MDP_1, MDP_2, ..., MDP_{M_1}}. ... We then draw another set of M_2 environments, {MDP'_1, MDP'_2, ..., MDP'_{M_2}} ... Finally, we generate M_3 = 20 environments to evaluate the regression model and the calibration actions." (see the environment-split sketch after the table)
Hardware Specification | No | No specific hardware details (GPU/CPU models, memory, or machine configurations) used for running the experiments are mentioned.
Software Dependencies | No | The paper mentions the 'Mujoco physics simulator', 'Open AI Gym', 'TRPO', and 'SRACOS' from 'ZOOpt', but does not provide version numbers for these software components.
Experiment Setup | Yes | "For each environment of each task, a base policy is trained, and all these 100 policies are represented as neural networks with the same structure (two hidden layers with 64 nodes). In the TRPO training process, we set the discount factor γ to be 0.99 and the number of iterations to be 250. ... We use the algorithm implementation from https://github.com/eyounx/ZOOpt, with the sample budget 250." (see the configuration sketch after the table)
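
Environment-loading sketch (for the Open Datasets row above). The quoted text names only "three robotic arm controlling tasks" run in MuJoCo through OpenAI Gym; the task ID 'Reacher-v2' and the classic (pre-0.26) Gym API calls below are illustrative assumptions, not details confirmed by the paper.

```python
# Minimal sketch: loading a MuJoCo arm-control task from OpenAI Gym.
# 'Reacher-v2' is a placeholder; the paper's concrete task IDs are not quoted.
import gym

env = gym.make("Reacher-v2")            # hypothetical arm-control task ID
obs = env.reset()                        # classic Gym API: reset() returns obs only
for _ in range(5):                       # "a few (e.g. five) samples" per target environment
    action = env.action_space.sample()   # random probe action as a stand-in
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
env.close()
```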
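
Environment-split sketch (for the Dataset Splits row above). Only M_3 = 20 is given in the quoted text; the values of M_1 and M_2, and the way environment parameters are drawn, are placeholders in this sketch rather than values reported in the paper.

```python
# Sketch of the three environment sets described in the Dataset Splits row.
# Only M3 = 20 comes from the paper; everything else here is assumed.
import numpy as np

rng = np.random.default_rng(0)

def sample_env_params():
    # Hypothetical: each environment is identified by a small vector of
    # physical parameters (e.g. link lengths); the real parameterization
    # comes from the MuJoCo task and is not shown in the quoted text.
    return rng.uniform(low=0.8, high=1.2, size=3)

M1, M2, M3 = 50, 30, 20                                  # only M3 = 20 is stated
train_envs = [sample_env_params() for _ in range(M1)]    # base-policy training environments
regress_envs = [sample_env_params() for _ in range(M2)]  # environments for the regression model
eval_envs = [sample_env_params() for _ in range(M3)]     # held-out evaluation environments
```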
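
Configuration sketch (for the Experiment Setup row above). It restates the reported settings, two hidden layers of 64 units, discount factor 0.99, 250 TRPO iterations, and a ZOOpt sample budget of 250, as code; the deep-learning framework, the tanh activations, the input/output sizes, and the derivative-free objective are all assumptions rather than details from the paper.

```python
# Illustrative configuration only; PyTorch, tanh activations, the placeholder
# sizes, and the toy objective are assumptions, not taken from the paper.
import torch.nn as nn
from zoopt import Dimension, Objective, Parameter, Opt

obs_dim, act_dim = 11, 2                 # placeholder sizes; task-dependent in practice

# Policy network: "two hidden layers with 64 nodes" (quoted above).
policy = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, act_dim),
)

# TRPO hyperparameters quoted above (the training loop itself is omitted).
GAMMA = 0.99
TRPO_ITERATIONS = 250

# Derivative-free search with ZOOpt (https://github.com/eyounx/ZOOpt),
# sample budget 250 as quoted; the objective below is only a stand-in.
def objective(solution):
    x = solution.get_x()                 # candidate parameter vector
    return sum(v * v for v in x)         # placeholder loss, not the paper's objective

dim = Dimension(5, [[-1.0, 1.0]] * 5, [True] * 5)   # 5 continuous variables (placeholder)
solution = Opt.min(Objective(objective, dim), Parameter(budget=250))
print(solution.get_x(), solution.get_value())
```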