Online Meta-Critic Learning for Off-Policy Actor-Critic Methods
Authors: Wei Zhou, Yiying Li, Yongxin Yang, Huaimin Wang, Timothy Hospedales
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that online meta-critic learning benefits to a variety of continuous control tasks when combined with contemporary Off-PAC methods DDPG, TD3 and SAC. (Section 4, Experiments and Evaluation) |
| Researcher Affiliation | Collaboration | Wei Zhou 1, Yiying Li 1, Yongxin Yang 2, Huaimin Wang 1, Timothy M. Hospedales 2,3; 1 College of Computer, National University of Defense Technology; 2 School of Informatics, The University of Edinburgh; 3 Samsung AI Centre, Cambridge |
| Pseudocode | Yes | Algorithm 1 Online Meta-Critic Learning for Off-PAC RL |
| Open Source Code | Yes | Our demo code can be viewed on https://github.com/zwfightzw/Meta-Critic. |
| Open Datasets | Yes | We evaluate the methods on a suite of seven MuJoCo tasks [39] in OpenAI Gym [4], two MuJoCo tasks in rllab [5], and a simulated racing car TORCS [22]. |
| Dataset Splits | No | The paper mentions that 'd_trn and d_val are different transition batches from replay buffer' and details how these batches are sampled for meta-training and meta-testing. However, it does not specify exact percentages or sample counts for train/validation splits of a fixed dataset. |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware details such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'OpenAI Gym' and 'MuJoCo' tasks and refers to open-source implementations of DDPG, TD3, and SAC, but it does not provide specific version numbers for these software components or other libraries needed to replicate the experiments. |
| Experiment Setup | Yes | For our implementation of meta-critic, we use a three-layer neural network with an input dimension of π (300 in DDPG and TD3, 256 in SAC), two hidden feed-forward layers of 100 hidden nodes each, and ReLU non-linearity between layers. In MuJoCo cases we integrate our meta-critic with learning rate 0.001. The details of TORCS hyper-parameters are in the supplementary material. |
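The architecture quoted in the Experiment Setup row (input dimension 300 for DDPG/TD3, two hidden layers of 100 units, ReLU between layers) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the scalar output, He-style initialization, and linear final layer are assumptions not stated in the excerpt.

```python
import numpy as np

def relu(x):
    """Elementwise ReLU non-linearity used between layers."""
    return np.maximum(x, 0.0)

def init_meta_critic(in_dim=300, hidden=100, seed=0):
    """Build weights for a three-layer MLP: in_dim -> 100 -> 100 -> 1.

    in_dim=300 matches the DDPG/TD3 case quoted in the table (256 for SAC).
    The scalar output and He-style init are assumptions for this sketch.
    """
    rng = np.random.default_rng(seed)
    dims = [in_dim, hidden, hidden, 1]
    return [
        (rng.standard_normal((d_in, d_out)) * np.sqrt(2.0 / d_in),
         np.zeros(d_out))
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]

def meta_critic_forward(params, x):
    """Forward pass: ReLU after each hidden layer, linear final layer."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = relu(x)
    return x

# A batch of 4 hypothetical policy-derived feature vectors of dimension 300.
params = init_meta_critic()
out = meta_critic_forward(params, np.ones((4, 300)))
print(out.shape)  # one meta-critic value per batch element
```

The learning rate of 0.001 reported for the MuJoCo experiments would apply to whatever optimizer updates these weights; the excerpt does not name the optimizer, so none is shown here.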