Learning Environmental Calibration Actions for Policy Self-Evolution
Authors: Chao Zhang, Yang Yu, Zhi-Hua Zhou
IJCAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Taking three robotic arm controlling tasks as the test beds, we show that the proposed method can learn a fine policy for a new arm with only a few (e.g. five) samples of the target environment. ... We empirically evaluate POSEC, particularly answering the following questions: Q1: Can the learned calibration actions effectively extract features for the environment, and be better than random actions? Does the number of calibration actions affect the performance? Q2: How do the calibration actions act? Q3: Can the self-evolved policy serve as a better initial policy for environment-specific refinement? |
| Researcher Affiliation | Academia | Chao Zhang, Yang Yu, Zhi-Hua Zhou National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China {zhangc,yuy,zhouzh}@lamda.nju.edu.cn |
| Pseudocode | Yes | Algorithm 1 (POSEC Training Process) and Algorithm 2 (POSEC Calibration Process); a hedged sketch of the calibration step appears after the table. |
| Open Source Code | Yes | The experiment codes are at https://github.com/eyounx/POSEC. |
| Open Datasets | Yes | We employ three robotic arm controlling tasks that use the MuJoCo physics simulator from OpenAI Gym (https://gym.openai.com). |
| Dataset Splits | Yes | We collect M1 environments, {MDP_1, MDP_2, ..., MDP_{M1}}. ... We then draw another set of M2 environments {MDP'_1, MDP'_2, ..., MDP'_{M2}} ... Finally, we generate M3 = 20 environments to evaluate the regression model and the calibration actions. |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, memory, or specific computer configurations) used for running experiments were mentioned. |
| Software Dependencies | No | The paper mentions 'Mujoco physics simulator', 'Open AI Gym', 'TRPO', and 'SRACOS' from 'ZOOpt', but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | For each environment of each task, a base policy is trained, and all these 100 policies are represented as neural networks with the same structure (two hidden layers with 64 nodes). In the TRPO training process, we set the discount factor γ to 0.99 and the number of iterations to 250. ... We use the algorithm implementation from https://github.com/eyounx/ZOOpt, with a sample budget of 250. A hedged configuration sketch follows the table. |
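The paper's pseudocode is not reproduced here. The following is a minimal, hypothetical sketch of what the calibration step (Algorithm 2) plausibly does, based only on the flow the excerpts describe: execute a few learned calibration actions in the new environment, collect the observed responses as an environment feature vector, and map that vector through a pre-trained regression model to the parameters of a self-evolved policy. The names `calibrate_policy`, `regressor`, and `build_policy` are illustrative assumptions, not the authors' API.

```python
# Hypothetical sketch of a POSEC-style calibration step (not the authors' code).
# Assumes the pre-0.26 Gym step API (obs, reward, done, info), which matches the
# 2018-era OpenAI Gym / MuJoCo setup used in the paper.

import numpy as np

def calibrate_policy(env, calibration_actions, regressor, build_policy):
    """Return a policy adapted to `env` from a handful of calibration samples."""
    responses = []
    obs = env.reset()
    for action in calibration_actions:           # e.g. five learned calibration actions
        obs, reward, done, _ = env.step(action)  # observe the environment's response
        responses.append(np.asarray(obs, dtype=np.float64))
        if done:
            obs = env.reset()
    env_feature = np.concatenate(responses)      # environment feature vector
    policy_params = regressor.predict(env_feature[None, :])[0]
    return build_policy(policy_params)           # self-evolved policy for `env`
```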
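Below is a minimal configuration sketch of the reported experiment setup, assuming a plain feed-forward policy with two hidden layers of 64 units, a TRPO discount factor of 0.99 with 250 iterations, and a sample budget of 250 for SRACOS from ZOOpt. The code is illustrative, not the authors' implementation, and the tanh activation and Gaussian-mean interpretation are assumptions.

```python
# Hypothetical restatement of the reported hyperparameters and policy structure.

import numpy as np

HYPERPARAMS = {
    "trpo_gamma": 0.99,      # discount factor reported in the paper
    "trpo_iterations": 250,  # TRPO training iterations per base environment
    "sracos_budget": 250,    # sample budget for SRACOS (ZOOpt)
}

def policy_forward(params, obs):
    """Two-hidden-layer (64, 64) policy; returns the action output for `obs`."""
    w1, b1, w2, b2, w3, b3 = params
    h1 = np.tanh(obs @ w1 + b1)   # first hidden layer, 64 units
    h2 = np.tanh(h1 @ w2 + b2)    # second hidden layer, 64 units
    return h2 @ w3 + b3           # linear output layer
```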