Scalarization for Multi-Task and Multi-Domain Learning at Scale
Authors: Amélie Royer, Tijmen Blankevoort, Babak Ehteshami Bejnordi
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we first devise a large-scale unified analysis of multi-domain and multi-task learning to better understand the dynamics of scalarization across varied task/domain combinations and model sizes. Following these insights, we then propose to leverage population-based training to efficiently search for the optimal scalarization weights when dealing with a large number of tasks or domains. We perform a large-scale analysis of scalarization for both multi-task (MTL) and multi-domain learning (MDL). We cover a wide range of model capacities, datasets with varying sizes, and different task/domain combinations. Our key conclusions are as follows: [...] (A minimal sketch of the scalarization objective follows the table.) |
| Researcher Affiliation | Industry | Amélie Royer, Tijmen Blankevoort, Babak Ehteshami Bejnordi; Qualcomm AI Research, Amsterdam, The Netherlands; {aroyer, tijmen, behtesha}@qti.qualcomm.com |
| Pseudocode | No | No structured pseudocode or algorithm block was found within the paper. |
| Open Source Code | No | The paper states, "In practice we use the implementation of [31] for all MTO methods." This indicates the use of existing, third-party implementations, not the release of the authors' own open-source code for the methodology described in this paper. No explicit statement of code release or link to their own repository was found. |
| Open Datasets | Yes | For MDL, we first consider a two-domain example composed of CIFAR10 [29] and STL10 [9]...We then expand the results to the DomainNet benchmark [47]...For MTL, we use CelebA [40]...Finally, we experiment on a larger MTL setting for dense prediction...on the challenging Taskonomy dataset [65]. |
| Dataset Splits | Yes | For PBT results, we first run the search algorithm using the implementation from Raytune [33]. We use 70% of the training set for training, and use the remaining 30% to rank models in the population by measuring their average accuracy on this set. (A sketch of this split-and-rank protocol follows the table.) |
| Hardware Specification | Yes | Unless stated, every experiment is conducted on a single NVIDIA V100 GPU. Finally, we train each model from scratch on a single NVIDIA V100 GPU with a batch size of 256 images for 300 epochs... |
| Software Dependencies | No | The paper mentions specific optimizers (e.g., "AdamW optimizer") and tools (e.g., "Raytune [33]") but does not provide specific version numbers for these or other software libraries (e.g., deep learning frameworks like PyTorch or TensorFlow). |
| Experiment Setup | Yes | We use a vision transformer backbone (ViT-S)...To control model capacity, we vary the depth (number of transformer layers) in {3, 6, 9} and the width (token dimension) in {48, 96, 144, 192}. Finally, we train each model...with a batch size of 256 images for 300 epochs (including 30 epochs of linear learning rate warmup), using a learning rate of 0.001 and weight decay of 0.05 with the AdamW optimizer and cosine learning rate decay. Following these results, we use a learning rate of 0.03 and train for 30 epochs with a batch size of 512 in subsequent experiments. We train with the AdamW optimizer with a weight decay of 1e-4. We also apply linear learning rate warm-up during the first five training epochs and use cosine schedule learning rate decay for the rest of the training. (A sketch of the warmup-plus-cosine schedule follows the table.) |
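The scalarization objective analyzed in the paper is a weighted sum of per-task (or per-domain) losses. The following PyTorch sketch only illustrates that objective under assumed task names and loss values; it is not the authors' implementation.

```python
import torch

def scalarized_loss(task_losses, weights):
    """Scalarization: combine per-task losses into one scalar via a weighted sum.

    task_losses: dict mapping task name -> scalar loss tensor
    weights:     dict mapping task name -> non-negative scalarization weight
    """
    return sum(weights[t] * task_losses[t] for t in task_losses)

# Hypothetical two-task example with uniform scalarization weights.
losses = {"task_a": torch.tensor(0.8), "task_b": torch.tensor(1.3)}
weights = {"task_a": 0.5, "task_b": 0.5}
total_loss = scalarized_loss(losses, weights)  # single scalar to backpropagate
```

With uniform weights this reduces to plain loss averaging; the weights themselves are what the paper proposes to search over with population-based training when many tasks or domains are involved.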
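The PBT protocol quoted in the "Dataset Splits" row trains on 70% of the original training set and ranks population members by average accuracy on the remaining 30%. Below is a hedged sketch of that split-and-rank step; the dataset object, model, and batch size are placeholders, and the ranking metric is simplified to a single classification accuracy rather than an average across domains or tasks.

```python
import torch
from torch.utils.data import DataLoader, random_split

def split_train_for_pbt(train_set, seed=0):
    """Split the original training set 70/30: 70% for PBT training,
    30% held out to rank the models in the population."""
    n_train = int(0.7 * len(train_set))
    n_rank = len(train_set) - n_train
    generator = torch.Generator().manual_seed(seed)
    return random_split(train_set, [n_train, n_rank], generator=generator)

@torch.no_grad()
def ranking_accuracy(model, rank_set, device="cpu", batch_size=256):
    """Accuracy on the held-out 30% split; PBT ranks population members by it."""
    model.eval()
    correct, total = 0, 0
    for images, labels in DataLoader(rank_set, batch_size=batch_size):
        preds = model(images.to(device)).argmax(dim=1)
        correct += (preds == labels.to(device)).sum().item()
        total += labels.numel()
    return correct / total
```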
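The "Experiment Setup" row describes AdamW training with linear learning-rate warmup followed by cosine decay. The sketch below reproduces that schedule with the quoted from-scratch ViT hyperparameters (300 epochs, 30 warmup epochs, learning rate 0.001, weight decay 0.05, batch size 256); the model and the per-epoch training loop body are placeholders.

```python
import math
import torch

# Hyperparameters quoted for the from-scratch ViT runs (epoch-wise schedule assumed).
epochs, warmup_epochs = 300, 30
lr, weight_decay = 1e-3, 0.05

model = torch.nn.Linear(384, 10)  # placeholder for the ViT-S backbone and task heads
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

def lr_lambda(epoch):
    """Linear warmup for `warmup_epochs`, then cosine decay to zero."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(epochs):
    # ... one training epoch over batches of 256 images goes here ...
    scheduler.step()
```

The fine-tuning setting quoted afterwards (learning rate 0.03, 30 epochs, batch size 512, weight decay 1e-4, five warmup epochs) would use the same structure with those constants swapped in.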