Taylor TD-learning

Authors: Michele Garibbo, Maxime Robeyns, Laurence Aitchison

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We include theoretical and empirical evidence that Taylor TD updates are indeed lower variance than standard TD updates. Additionally, we show Taylor TD has the same stable learning guarantees as standard TD-learning with linear function approximation under a reasonable assumption. Next, we combine Taylor TD with the TD3 algorithm, forming TaTD3. We show TaTD3 performs as well, if not better, than several state-of-the-art model-free and model-based baseline algorithms on a set of standard benchmark tasks.
Researcher Affiliation | Academia | Michele Garibbo, Department of Engineering Mathematics, University of Bristol, Bristol, United Kingdom, michele.garibbo@bristol.ac.uk; Maxime Robeyns, Department of Engineering Mathematics, University of Bristol, Bristol, United Kingdom, maxime.robeyns.2018@bristol.ac.uk; Laurence Aitchison, Department of Engineering Mathematics, University of Bristol, Bristol, United Kingdom, laurence.aitchison@bristol.ac.uk
Pseudocode | Yes | Algorithm 1: Taylor TD (an illustrative sketch is given after the table below)
Open Source Code | Yes | We combine Taylor TD (i.e. Algorithm 1) with the TD3 algorithm [1] in a model-based off-policy algorithm we call Taylor TD3 (TaTD3) (code available at Appendix I). All the code is available at https://github.com/maximerobeyns/taylortd
Open Datasets | Yes | We employ 6 standard environments for continuous control. The first environment consists of a classic problem in control theory used to evaluate RL algorithms [i.e. Pendulum, 27]. The other 5 environments are standard MuJoCo continuous control tasks [i.e. Hopper, HalfCheetah, Walker2d, Ant and Humanoid, 28].
Dataset Splits | No | The paper uses standard continuous control environments (Pendulum, MuJoCo tasks), for which training and evaluation are done via environment interaction and episodes rather than explicit fixed train/validation/test splits with specified percentages or counts.
Hardware Specification | Yes | All experiments were run on a cluster of GPUs, including NVIDIA GeForce RTX 2080, 3090 and NVIDIA A100.
Software Dependencies | No | The paper mentions using PyTorch for autodiff but does not specify version numbers for PyTorch or any other key software libraries or dependencies used in the experiments.
Experiment Setup | Yes | Below, we report the hyperparameter settings for TaTD3 (and sample-based Expected-TD3):

Hyperparameter | Pendulum-v1 | HalfCheetah-v2 | Walker2d-v2
Steps | 10000 | 150000 | 150000
Model ensemble size | 8 | 8 | 8
Model architecture (MLP) | 4 h-layers of size 512 | 4 h-layers of size 512 | 4 h-layers of size 512
Reward model architecture (MLP) | 3 h-layers of size 256 | 3 h-layers of size 256 | 3 h-layers of size 256
Actor-critic architecture (MLP) | 2 h-layers of size 400 | 2 h-layers of size 400 | 2 h-layers of size 400
Dyna steps per environment step | 10 | 10 | 10
Model horizon | 1 | 1 | 1
λa | 0.25 | 0.25 | 0.25
λs | 1e-5 | 1e-5 | 1e-5

Hyperparameter | Hopper-v2 | Ant-v2 | Humanoid-v2
Steps | 10000 | 150000 | 150000
Model ensemble size | 8 | 8 | 8
Model architecture (MLP) | 4 h-layers of size 512 | 4 h-layers of size 512 | 4 h-layers of size 512
Reward model architecture (MLP) | 3 h-layers of size 256 | 3 h-layers of size 512 | 3 h-layers of size 512
Actor-critic architecture (MLP) | 2 h-layers of size 400 | 4 h-layers of size 400 | 4 h-layers of size 400
Dyna steps per environment step | 10 | 10 | 10
Model horizon | 1 | 1 | 1
λa | 0.06 | 0.06 | 0.25
λs | 1e-5 | 1e-5 | 1e-5
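
For readers who want a concrete picture of the update referenced in the Pseudocode row, the sketch below shows one way a first-order Taylor-expanded TD update of this kind can be written in PyTorch. It is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the callables and argument names (critic, target_value, reward_model, dynamics_model) and the surrogate-objective construction are assumptions made here, while lam_a and lam_s play the role of the λa and λs scales listed in the Experiment Setup table above.

```python
import torch

def taylor_td_critic_loss(critic, target_value, reward_model, dynamics_model,
                          states, actions, gamma=0.99, lam_a=0.25, lam_s=1e-5):
    """Hypothetical Taylor TD-style critic loss (first-order state/action expansion).

    critic(s, a) -> Q(s, a); target_value(s') -> bootstrapped target value;
    reward_model and dynamics_model are learned one-step models, so the TD error
    is differentiable w.r.t. the state and action at the expansion point.
    """
    # Expansion point: the sampled state and the actor's action.
    states = states.detach().clone().requires_grad_(True)
    actions = actions.detach().clone().requires_grad_(True)

    q = critic(states, actions).squeeze(-1)                     # Q_theta(s, a)
    next_states = dynamics_model(states, actions)               # one-step model prediction
    target = reward_model(states, actions).squeeze(-1) + gamma * target_value(next_states).squeeze(-1)
    delta = target - q                                          # TD error at (s, a)

    # Gradients of the TD error w.r.t. action and state (treated as constants below).
    grad_a_delta, grad_s_delta = torch.autograd.grad(
        delta.sum(), (actions, states), retain_graph=True)

    # Gradients of Q w.r.t. action and state, kept differentiable w.r.t. theta.
    grad_a_q, grad_s_q = torch.autograd.grad(
        q.sum(), (actions, states), create_graph=True)

    # Surrogate whose theta-gradient is
    #   delta * grad_theta Q + lam_a * <grad_a delta, grad_a grad_theta Q>
    #                        + lam_s * <grad_s delta, grad_s grad_theta Q>,
    # i.e. a first-order Taylor expansion of the expected TD update around (s, a).
    surrogate = (delta.detach() * q
                 + lam_a * (grad_a_delta.detach() * grad_a_q).sum(-1)
                 + lam_s * (grad_s_delta.detach() * grad_s_q).sum(-1))

    # Gradient ascent on the surrogate gives the Taylor-expanded TD update,
    # so its negation is returned as a loss to minimise.
    return -surrogate.mean()
```

In use, this loss would simply be minimised with an ordinary optimiser step on the critic parameters; the actor and model training steps of a TaTD3-style agent are not shown here.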