Dynamics Generalisation in Reinforcement Learning via Adaptive Context-Aware Policies

Authors: Michael Beukman, Devon Jarvis, Richard Klein, Steven James, Benjamin Rosman

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that the Decision Adapter is a useful generalisation of a previously proposed architecture and empirically demonstrate that it results in superior generalisation performance compared to previous approaches in several environments.
Researcher Affiliation | Academia | ¹University of the Witwatersrand, ²University of Oxford, ³University College London
Pseudocode | Yes | Algorithm 1: Decision Adapter (changes to the standard forward pass in blue).

    procedure AdapterForward(s ∈ S, c ∈ C)
        x1 ← s
        for i ∈ {1, 2, ..., n} do
            if Ai ≠ null then
                θAi ← Hi(c)          // Generate weights
                x′i ← Ai(xi | θAi)   // Forward pass
                xi ← xi + x′i        // Skip connection
            end if
            xi+1 ← Li(xi)
        end for
        return xn+1
    end procedure
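The adapter forward pass described above can be sketched in PyTorch. This is a minimal illustrative sketch, not the paper's implementation: the two-layer adapter shape, hidden size, and class/parameter names (`AdapterLayer`, `hidden`, `use_adapter`) are assumptions; the authors' actual architecture is in the linked repository.

```python
import torch
import torch.nn as nn

class AdapterLayer(nn.Module):
    """One main layer L_i with an optional context-conditioned adapter A_i,
    whose weights are generated by a hypernetwork H_i from the context c."""

    def __init__(self, dim, ctx_dim, hidden=32, use_adapter=True):
        super().__init__()
        self.layer = nn.Linear(dim, dim)  # L_i, the standard layer
        self.use_adapter = use_adapter
        if use_adapter:
            # H_i: maps context to all weights/biases of a two-layer adapter
            n_params = dim * hidden + hidden + hidden * dim + dim
            self.hyper = nn.Linear(ctx_dim, n_params)
            self.dim, self.hidden = dim, hidden

    def forward(self, x, c):
        if self.use_adapter:
            theta = self.hyper(c)  # generate adapter weights from context
            d, h = self.dim, self.hidden
            W1, theta = theta[:, : d * h].view(-1, h, d), theta[:, d * h :]
            b1, theta = theta[:, :h], theta[:, h:]
            W2, b2 = theta[:, : h * d].view(-1, d, h), theta[:, h * d :]
            # A_i(x | theta): forward pass through the generated adapter
            z = torch.relu(torch.bmm(W1, x.unsqueeze(-1)).squeeze(-1) + b1)
            x = x + torch.bmm(W2, z.unsqueeze(-1)).squeeze(-1) + b2  # skip connection
        return self.layer(x)  # standard forward pass through L_i
```

Stacking `n` such layers and feeding the state through them, with the same context `c` at each layer, reproduces the loop structure of the algorithm.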
Open Source Code | Yes | We publicly release code at https://github.com/Michael-Beukman/DecisionAdapter.
Open Datasets | No | The paper describes the environments (ODE, CartPole, MuJoCo Ant) and their parameters, but provides no links, DOIs, repository names, or formal citations for publicly available datasets used for training or evaluation, and does not state that the custom ODE environment's data is made public.
Dataset Splits | No | The paper discusses training and evaluation context sets, including interpolation and extrapolation ranges for evaluation, but does not specify a distinct validation split for hyperparameter tuning or early stopping.
Hardware Specification | Yes | For compute, we used an internal cluster consisting of nodes with NVIDIA RTX 3090 GPUs.
Software Dependencies | No | We use the high-performing and standard implementation of SAC from the CleanRL library [98], with neural networks written in PyTorch [99].
Experiment Setup | Yes | We use the default hyperparameters, which are listed in Table 2 ("The default hyperparameters we use in the CleanRL implementation"):

    Buffer Size                      1 000 000
    γ                                0.99
    τ                                0.005
    Batch Size                       256
    Exploration Noise                0.1
    First Learning Timestep          5000
    Policy Learning Rate             0.0003
    Critic Learning Rate             0.001
    Policy Update Frequency          2
    Target Network Update Frequency  1
    Noise Clip                       0.5
    Automatically Tune Entropy       Yes
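The reported defaults map naturally onto a configuration dictionary. This is a sketch for readability only: the key names below are illustrative and are not necessarily CleanRL's actual argument names.

```python
# Default SAC hyperparameters reported in Table 2 (illustrative key names).
sac_config = {
    "buffer_size": 1_000_000,
    "gamma": 0.99,                  # discount factor γ
    "tau": 0.005,                   # target-network soft-update coefficient τ
    "batch_size": 256,
    "exploration_noise": 0.1,
    "learning_starts": 5_000,       # first learning timestep
    "policy_lr": 3e-4,              # policy learning rate
    "critic_lr": 1e-3,              # critic learning rate
    "policy_frequency": 2,          # policy update frequency
    "target_network_frequency": 1,
    "noise_clip": 0.5,
    "autotune_entropy": True,       # automatically tune entropy coefficient
}
```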