Learning to Modulate pre-trained Models in RL

Authors: Thomas Schmied, Markus Hofmarcher, Fabian Paischer, Razvan Pascanu, Sepp Hochreiter

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct an extensive evaluation of fine-tuning, parameter-efficient fine-tuning, and prompting methods for Transformers in RL. Then, we evaluate and compare a variety of fine-tuning methods prevalent in natural language processing, both in terms of performance on new tasks, and how well performance on pre-training tasks is retained.
Researcher Affiliation | Collaboration | 1 ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning; 2 JKU LIT SAL eSPML Lab, Institute for Machine Learning, Johannes Kepler University, Linz, Austria; 3 Google DeepMind; 4 UCL
Pseudocode | No | The paper describes methods using text and mathematical formulas (e.g., Equations 1, 2, 3, 4) but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Source code and datasets are available at: https://github.com/ml-jku/L2M
Open Datasets | Yes | Finally, to aid future research in this area, we release a dataset encompassing 50 Meta-World and 16 DMControl tasks. Source code and datasets are available at: https://github.com/ml-jku/L2M
Dataset Splits | No | The paper splits tasks into pre-training (MT40, DMC10) and fine-tuning (CW10, DMC6) sets and describes data collection per task (e.g., "10K trajectories of length 200"), but does not specify train/validation/test dataset splits with percentages or sample counts for the collected data.
Hardware Specification | Yes | We run all our pre-training experiments on 4 NVIDIA A100 GPUs. For all our fine-tuning experiments, we use single GPU training on NVIDIA A100 or NVIDIA Titan V GPUs.
Software Dependencies | No | The paper mentions software such as PyTorch, stable-baselines3, and the transformers library, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | We train our MDDT for a total of 1M update steps, with a context length of 5 transitions (45 tokens). We use a learning rate of 1e-4 and 4000 linear warm-up steps, followed by a cosine decay to 1e-6. Furthermore, we use a gradient clip of 0.25, weight decay of 0.01, dropout of 0.2, a batch size of 1024 sequences, and train using the AdamW optimizer (Loshchilov and Hutter, 2018).
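
The Experiment Setup row quotes the paper's optimisation hyperparameters. Below is a minimal PyTorch sketch of what such a setup could look like, assuming a LambdaLR-based linear warm-up followed by cosine decay; the toy model, dummy loss, and the exact schedule implementation are assumptions for illustration and are not taken from the authors' released code. Only the numeric values come from the quoted text.

```python
# Hedged sketch of the reported optimisation setup (hyperparameters from the
# paper; model, loss, and schedule implementation are placeholder assumptions).
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

TOTAL_STEPS = 1_000_000   # 1M update steps (paper)
WARMUP_STEPS = 4_000      # linear warm-up steps (paper)
PEAK_LR = 1e-4            # peak learning rate (paper)
MIN_LR = 1e-6             # learning rate floor after cosine decay (paper)
GRAD_CLIP = 0.25          # gradient clipping (paper)
WEIGHT_DECAY = 0.01       # AdamW weight decay (paper)
DROPOUT = 0.2             # dropout (paper)

# Placeholder model; the paper trains a multi-domain Decision Transformer.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 256),
    torch.nn.Dropout(DROPOUT),
    torch.nn.Linear(256, 64),
)

optimizer = AdamW(model.parameters(), lr=PEAK_LR, weight_decay=WEIGHT_DECAY)

def lr_lambda(step: int) -> float:
    """Multiplier on PEAK_LR: linear warm-up, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return (MIN_LR / PEAK_LR) + (1.0 - MIN_LR / PEAK_LR) * cosine

scheduler = LambdaLR(optimizer, lr_lambda)

# One illustrative update step with a dummy batch.
batch = torch.randn(1024, 64)        # batch size of 1024 sequences (paper)
loss = model(batch).pow(2).mean()    # dummy loss standing in for the DT objective
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```

In this sketch the scheduler multiplier reaches 1.0 after 4000 steps and decays to 1e-6 / 1e-4 = 0.01 by step 1M, matching the reported peak and final learning rates.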