Online Meta-Critic Learning for Off-Policy Actor-Critic Methods
Authors: Wei Zhou, Yiying Li, Yongxin Yang, Huaimin Wang, Timothy Hospedales
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that online meta-critic learning benefits to a variety of continuous control tasks when combined with contemporary Off-PAC methods DDPG, TD3 and SAC. (Section 4, Experiments and Evaluation) |
| Researcher Affiliation | Collaboration | Wei Zhou 1, Yiying Li 1, Yongxin Yang 2, Huaimin Wang 1, Timothy M. Hospedales 2,3; 1 College of Computer, National University of Defense Technology; 2 School of Informatics, The University of Edinburgh; 3 Samsung AI Centre, Cambridge |
| Pseudocode | Yes | Algorithm 1 Online Meta-Critic Learning for Off-PAC RL |
| Open Source Code | Yes | Our demo code can be viewed on https://github.com/zwfightzw/Meta-Critic. |
| Open Datasets | Yes | We evaluate the methods on a suite of seven MuJoCo tasks [39] in OpenAI Gym [4], two MuJoCo tasks in rllab [5], and a simulated racing car TORCS [22]. |
| Dataset Splits | No | The paper mentions that 'd_trn and d_val are different transition batches from replay buffer' and details how these batches are sampled for meta-training and meta-testing. However, it does not specify exact percentages or sample counts for train/validation splits of a fixed dataset. |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware details such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'OpenAI Gym' and 'MuJoCo' tasks and refers to open-source implementations of DDPG, TD3, and SAC, but it does not provide specific version numbers for these software components or other libraries needed to replicate the experiments. |
| Experiment Setup | Yes | For our implementation of meta-critic, we use a three-layer neural network with an input dimension of π (300 in DDPG and TD3, 256 in SAC), two hidden feed-forward layers of 100 hidden nodes each, and ReLU non-linearity between layers. In MuJoCo cases we integrate our meta-critic with learning rate 0.001. The details of TORCS hyper-parameters are in the supplementary material. |
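The architecture quoted in the Experiment Setup row (input dimension 300 for DDPG/TD3, two hidden layers of 100 units, ReLU between layers) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the scalar output, He-style initialization, and linear final layer are assumptions not stated in the excerpt.

```python
import numpy as np

def relu(x):
    """Elementwise ReLU non-linearity used between layers."""
    return np.maximum(x, 0.0)

def init_meta_critic(in_dim=300, hidden=100, seed=0):
    """Build weights for a three-layer MLP: in_dim -> 100 -> 100 -> 1.

    in_dim=300 matches the DDPG/TD3 case quoted in the table (256 for SAC).
    The scalar output and He-style init are assumptions for this sketch.
    """
    rng = np.random.default_rng(seed)
    dims = [in_dim, hidden, hidden, 1]
    return [
        (rng.standard_normal((d_in, d_out)) * np.sqrt(2.0 / d_in),
         np.zeros(d_out))
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]

def meta_critic_forward(params, x):
    """Forward pass: ReLU after each hidden layer, linear final layer."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = relu(x)
    return x

# A batch of 4 hypothetical policy-derived feature vectors of dimension 300.
params = init_meta_critic()
out = meta_critic_forward(params, np.ones((4, 300)))
print(out.shape)  # one meta-critic value per batch element
```

The learning rate of 0.001 reported for the MuJoCo experiments would apply to whatever optimizer updates these weights; the excerpt does not name the optimizer, so none is shown here.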