Learning to Modulate pre-trained Models in RL
Authors: Thomas Schmied, Markus Hofmarcher, Fabian Paischer, Razvan Pascanu, Sepp Hochreiter
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct an extensive evaluation of fine-tuning, parameter-efficient fine-tuning, and prompting methods for Transformers in RL. Then, we evaluate and compare a variety of fine-tuning methods prevalent in natural language processing, both in terms of performance on new tasks, and how well performance on pre-training tasks is retained. |
| Researcher Affiliation | Collaboration | (1) ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, (2) JKU LIT SAL eSPML Lab, Institute for Machine Learning, Johannes Kepler University, Linz, Austria; (3) Google DeepMind; (4) UCL |
| Pseudocode | No | The paper describes methods using text and mathematical formulas (e.g., Equation 1, 2, 3, 4) but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source code and datasets are available at: https://github.com/ml-jku/L2M |
| Open Datasets | Yes | Finally, to aid future research in this area, we release a dataset encompassing 50 Meta-World and 16 DMControl tasks. Source code and datasets are available at: https://github.com/ml-jku/L2M |
| Dataset Splits | No | The paper splits tasks into pre-training (MT40, DMC10) and fine-tuning (CW10, DMC6) sets, and describes data collection per task (e.g., '10K trajectories of length 200'), but does not specify train/validation/test dataset splits with percentages or sample counts for the collected data. |
| Hardware Specification | Yes | We run all our pre-training experiments on 4 NVIDIA A100 GPUs. For all our fine-tuning experiments, we use single GPU training on NVIDIA A100 or NVIDIA Titan V GPUs. |
| Software Dependencies | No | The paper mentions software like PyTorch, stable-baselines3, and the transformers library, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We train our MDDT for a total of 1M update steps, with a context length of 5 transitions (45 tokens). We use a learning rate of 1e-4 and 4000 linear warm-up steps, followed by a cosine decay to 1e-6. Furthermore, we use gradient clipping of 0.25, weight decay of 0.01, dropout of 0.2, a batch size of 1024 sequences, and train using the AdamW optimizer (Loshchilov and Hutter, 2018). A hedged sketch of this optimizer and schedule setup follows the table. |
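The Experiment Setup row quotes concrete optimization hyperparameters (AdamW, lr 1e-4 with 4000 linear warm-up steps and cosine decay to 1e-6 over 1M updates, weight decay 0.01, gradient clip 0.25). The snippet below is a minimal PyTorch sketch of that schedule, not the authors' released code: the `model` here is a stand-in `nn.Linear` rather than the actual MDDT Transformer, and the `lr_lambda` helper is a hypothetical implementation of the warm-up/cosine curve via `LambdaLR`.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Hyperparameters as quoted in the Experiment Setup row.
TOTAL_STEPS = 1_000_000   # 1M update steps
WARMUP_STEPS = 4_000      # linear warm-up
PEAK_LR = 1e-4
MIN_LR = 1e-6
GRAD_CLIP = 0.25

# Stand-in for the MDDT Transformer (hypothetical placeholder).
model = torch.nn.Linear(8, 8)
optimizer = AdamW(model.parameters(), lr=PEAK_LR, weight_decay=0.01)

def lr_lambda(step: int) -> float:
    """Linear warm-up to PEAK_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return (MIN_LR + (PEAK_LR - MIN_LR) * cosine) / PEAK_LR

scheduler = LambdaLR(optimizer, lr_lambda)

# One dummy update step illustrating gradient clipping at 0.25.
loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=GRAD_CLIP)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```

The paper only names AdamW and the warm-up/cosine schedule; the manual `LambdaLR` construction above is one way to realize it, and built-in schedulers (or the batch size of 1024 and dropout of 0.2 applied at the model level) would be configured separately in the actual training loop.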