Online Meta-Critic Learning for Off-Policy Actor-Critic Methods

Authors: Wei Zhou, Yiying Li, Yongxin Yang, Huaimin Wang, Timothy Hospedales

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate that online meta-critic learning benefits a variety of continuous control tasks when combined with contemporary off-policy actor-critic (Off-PAC) methods DDPG, TD3, and SAC." (Section 4, Experiments and Evaluation)
Researcher Affiliation | Collaboration | Wei Zhou (1), Yiying Li (1), Yongxin Yang (2), Huaimin Wang (1), Timothy M. Hospedales (2,3). (1) College of Computer, National University of Defense Technology; (2) School of Informatics, The University of Edinburgh; (3) Samsung AI Centre, Cambridge
Pseudocode | Yes | "Algorithm 1: Online Meta-Critic Learning for Off-PAC RL"
Open Source Code | Yes | "Our demo code can be viewed on https://github.com/zwfightzw/Meta-Critic."
Open Datasets | Yes | "We evaluate the methods on a suite of seven MuJoCo tasks [39] in OpenAI Gym [4], two MuJoCo tasks in rllab [5], and a simulated racing car TORCS [22]."
Dataset Splits | No | The paper notes that "dtrn and dval are different transition batches from the replay buffer" and describes how these batches are sampled for meta-training and meta-testing, but it does not specify exact percentages or sample counts for train/validation splits of a fixed dataset.
Hardware Specification | No | The paper does not state the hardware used for the experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions OpenAI Gym and MuJoCo and refers to open-source implementations of DDPG, TD3, and SAC, but it does not give version numbers for these components or for other libraries needed to replicate the experiments.
Experiment Setup | Yes | "For our implementation of meta-critic, we use a three-layer neural network with an input dimension of π (300 in DDPG and TD3, 256 in SAC), two hidden feed-forward layers of 100 hidden nodes each, and ReLU non-linearity between layers. In MuJoCo cases we integrate our meta-critic with learning rate 0.001." The details of the TORCS hyper-parameters are in the supplementary material.
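The architecture quoted above (input of 300 for DDPG/TD3 or 256 for SAC, two hidden layers of 100 units, ReLU between layers) can be sketched as a minimal NumPy forward pass. The scalar output head, the Gaussian weight initialisation, and the class name `MetaCriticNet` are assumptions for illustration; they are not specified in the excerpt, and the authors' actual implementation is in the linked repository.

```python
import numpy as np

def relu(x):
    # Element-wise ReLU non-linearity, as stated in the setup description.
    return np.maximum(x, 0.0)

class MetaCriticNet:
    """Sketch of the meta-critic network described in the paper:
    input dim 300 (DDPG/TD3) or 256 (SAC), two hidden layers of 100
    units with ReLU. Scalar output head and init are assumptions."""

    def __init__(self, in_dim=300, hidden=100, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (hidden, hidden))
        self.b2 = np.zeros(hidden)
        self.W3 = rng.normal(0.0, 0.1, (hidden, 1))
        self.b3 = np.zeros(1)

    def forward(self, x):
        h1 = relu(x @ self.W1 + self.b1)
        h2 = relu(h1 @ self.W2 + self.b2)
        # One auxiliary-loss value per sample in the batch (assumed head).
        return h2 @ self.W3 + self.b3

# Usage: a batch of 4 feature vectors of the DDPG/TD3 input size.
net = MetaCriticNet(in_dim=300)
out = net.forward(np.zeros((4, 300)))
print(out.shape)  # → (4, 1)
```

With SAC the same sketch would use `in_dim=256`; everything downstream is unchanged because only the first weight matrix depends on the input dimension.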