Taylor TD-learning
Authors: Michele Garibbo, Maxime Robeyns, Laurence Aitchison
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We include theoretical and empirical evidence that Taylor TD updates are indeed lower variance than standard TD updates. Additionally, we show Taylor TD has the same stable learning guarantees as standard TD-learning with linear function approximation under a reasonable assumption. Next, we combine Taylor TD with the TD3 algorithm, forming TaTD3. We show TaTD3 performs as well as, if not better than, several state-of-the-art model-free and model-based baseline algorithms on a set of standard benchmark tasks. |
| Researcher Affiliation | Academia | Michele Garibbo Department of Engineering Mathematics University of Bristol Bristol, United Kingdom michele.garibbo@bristol.ac.uk Maxime Robeyns Department of Engineering Mathematics University of Bristol Bristol, United Kingdom maxime.robeyns.2018@bristol.ac.uk Laurence Aitchison Department of Engineering Mathematics University of Bristol Bristol, United Kingdom laurence.aitchison@bristol.ac.uk |
| Pseudocode | Yes | Algorithm 1: Taylor TD |
| Open Source Code | Yes | We combine Taylor TD (i.e. Algorithm 1) with the TD3 algorithm [1] in a model-based off-policy algorithm we call Taylor TD3 (TaTD3) (code available in Appendix I). All the code is available at https://github.com/maximerobeyns/taylortd |
| Open Datasets | Yes | We employ 6 standard environments for continuous control. The first environment consists of a classic problem in control theory used to evaluate RL algorithms [i.e. Pendulum, 27]. The other 5 environments are standard MuJoCo continuous control tasks [i.e. Hopper, HalfCheetah, Walker2d, Ant and Humanoid, 28]. |
| Dataset Splits | No | The paper uses standard continuous control environments (Pendulum, MuJoCo tasks) for which training and evaluation are typically done via interaction and episodes rather than explicit fixed train/validation/test dataset splits with specified percentages or counts. |
| Hardware Specification | Yes | All experiments were run on a cluster of GPUs, including NVIDIA GeForce RTX 2080, 3090 and NVIDIA A100. |
| Software Dependencies | No | The paper mentions using PyTorch for autodiff but does not specify version numbers for PyTorch or any other key software libraries or dependencies used in the experiments. |
| Experiment Setup | Yes | Below, we report the hyperparameter settings for TaTD3 (and sample-based Expected-TD3), listed in the order Pendulum-v1 / HalfCheetah-v2 / Walker2d-v2 / Hopper-v2 / Ant-v2 / Humanoid-v2: Steps 10000 / 150000 / 150000 / 10000 / 150000 / 150000; Model ensemble size 8 for all; Model architecture (MLP) 4 hidden layers of size 512 for all; Reward model architecture (MLP) 3 hidden layers of size 256 / 256 / 256 / 256 / 512 / 512; Actor-critic architecture (MLP) 2 / 2 / 2 / 2 / 4 / 4 hidden layers of size 400; Dyna steps per environment step 10 for all; Model horizon 1 for all; λa 0.25 / 0.25 / 0.25 / 0.06 / 0.06 / 0.25; λs 1e-5 for all. |
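
For readability, the sketch below collects the hyperparameters reported in the Experiment Setup row into plain Python dictionaries, as a hypothetical training script might consume them. This is our own illustration, not the authors' released code: the names `SHARED`, `TA_TD3_CONFIGS` and the individual keys are assumptions, while the numeric values are exactly those reported above (the `(layers, width)` tuples mirror the paper's "h-layers of size" notation).

```python
# Hypothetical configuration sketch: the reported TaTD3 hyperparameters as
# Python dicts. All names here are illustrative; only the values come from
# the paper's table. Architecture entries are (hidden layers, layer width).

SHARED = {
    "model_ensemble_size": 8,
    "model_arch": (4, 512),            # dynamics-model MLP
    "dyna_steps_per_env_step": 10,
    "model_horizon": 1,
    "lambda_s": 1e-5,
}

TA_TD3_CONFIGS = {
    "Pendulum-v1":    {**SHARED, "steps": 10_000,  "reward_arch": (3, 256), "actor_critic_arch": (2, 400), "lambda_a": 0.25},
    "HalfCheetah-v2": {**SHARED, "steps": 150_000, "reward_arch": (3, 256), "actor_critic_arch": (2, 400), "lambda_a": 0.25},
    "Walker2d-v2":    {**SHARED, "steps": 150_000, "reward_arch": (3, 256), "actor_critic_arch": (2, 400), "lambda_a": 0.25},
    "Hopper-v2":      {**SHARED, "steps": 10_000,  "reward_arch": (3, 256), "actor_critic_arch": (2, 400), "lambda_a": 0.06},
    "Ant-v2":         {**SHARED, "steps": 150_000, "reward_arch": (3, 512), "actor_critic_arch": (4, 400), "lambda_a": 0.06},
    "Humanoid-v2":    {**SHARED, "steps": 150_000, "reward_arch": (3, 512), "actor_critic_arch": (4, 400), "lambda_a": 0.25},
}

if __name__ == "__main__":
    # Example lookup: settings used for HalfCheetah-v2.
    from pprint import pprint
    pprint(TA_TD3_CONFIGS["HalfCheetah-v2"])
```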
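
The Pseudocode row points to the paper's Algorithm 1 (Taylor TD), which replaces sampled TD updates with an analytic first-order Taylor expansion of the expected update over perturbations around the mean action (and state). As a rough illustration of that general idea only, and not a reproduction of Algorithm 1, the PyTorch sketch below expands the expected TD update over Gaussian action noise and expresses it as a surrogate scalar. Here `q_net`, `q_target`, `dynamics_model`, `reward_model` and `policy` are placeholder callables, λa plays the role of the action-perturbation variance, and the state-expansion term governed by λs is omitted.

```python
import torch

def taylor_td_critic_surrogate(q_net, q_target, dynamics_model, reward_model,
                               policy, s, mu_a, gamma=0.99, lambda_a=0.25):
    """Surrogate scalar whose gradient w.r.t. the critic parameters matches a
    first-order Taylor expansion of the expected TD update over Gaussian
    action perturbations around the mean action mu_a (variance ~ lambda_a).

    This is a sketch of the general technique, not the paper's Algorithm 1.
    """
    a = mu_a.detach().clone().requires_grad_(True)

    # One-step rollout through the learned model to build the TD target.
    s_next = dynamics_model(s, a)
    r = reward_model(s, a)
    a_next = policy(s_next)
    q_next = q_target(s_next, a_next)

    q = q_net(s, a)
    delta = r + gamma * q_next - q          # TD error at the mean action

    # d(delta)/da is used only as a constant weight, so it is detached.
    grad_a_delta = torch.autograd.grad(delta.sum(), a, retain_graph=True)[0].detach()

    # dQ/da, kept differentiable w.r.t. the critic parameters.
    grad_a_q = torch.autograd.grad(q.sum(), a, create_graph=True)[0]

    # Differentiating this scalar w.r.t. the critic parameters reproduces
    #   delta * grad_theta Q  +  lambda_a * grad_a(delta) . grad_{a,theta} Q.
    return (delta.detach() * q).sum() + lambda_a * (grad_a_delta * grad_a_q).sum()
```

A caller would typically minimise the negative of this surrogate with a standard optimiser so the critic parameters move along the expanded update direction; the full TaTD3 agent additionally applies the usual TD3 machinery (twin critics, target-policy smoothing, delayed policy updates), which this sketch omits.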