Scalarization for Multi-Task and Multi-Domain Learning at Scale

Authors: Amelie Royer, Tijmen Blankevoort, Babak Ehteshami Bejnordi

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type: Experimental
LLM Response: In this work, we first devise a large-scale unified analysis of multi-domain and multi-task learning to better understand the dynamics of scalarization across varied task/domain combinations and model sizes. Following these insights, we then propose to leverage population-based training to efficiently search for the optimal scalarization weights when dealing with a large number of tasks or domains. We perform a large-scale analysis of scalarization for both multi-task (MTL) and multi-domain learning (MDL). We cover a wide range of model capacities, datasets with varying sizes, and different task/domain combinations. Our key conclusions are as follows:

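For context, "scalarization" here means collapsing the per-task losses into a single weighted sum that a standard optimizer can minimize. The snippet below is a minimal PyTorch-style sketch of that idea; the function name `scalarized_loss`, the task names, and the example weights are illustrative assumptions, not the authors' implementation.

```python
import torch

def scalarized_loss(per_task_losses, weights=None):
    """Combine per-task losses into a single scalar objective.

    per_task_losses: dict mapping task name -> scalar loss tensor.
    weights: dict mapping task name -> non-negative scalarization weight
             (defaults to uniform weights summing to 1).
    """
    if weights is None:
        weights = {t: 1.0 / len(per_task_losses) for t in per_task_losses}
    return sum(weights[t] * loss for t, loss in per_task_losses.items())

# Example: a two-task objective with fixed (hypothetical) scalarization weights.
losses = {"segmentation": torch.tensor(0.8), "depth": torch.tensor(1.3)}
total = scalarized_loss(losses, weights={"segmentation": 0.7, "depth": 0.3})
```

Uniform weights recover the plain average of task losses; the paper's focus is on how to choose these weights efficiently when the number of tasks or domains is large.
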
Researcher Affiliation: Industry
LLM Response: Amélie Royer, Tijmen Blankevoort, Babak Ehteshami Bejnordi; Qualcomm AI Research, Amsterdam, The Netherlands; {aroyer, tijmen, behtesha}@qti.qualcomm.com

Pseudocode: No
LLM Response: No structured pseudocode or algorithm block was found within the paper.

Open Source Code: No
LLM Response: The paper states, "In practice we use the implementation of [31] for all MTO methods." This indicates the use of existing third-party implementations rather than a release of the authors' own open-source code for the methodology described in this paper. No explicit statement of code release or link to their own repository was found.

Open Datasets: Yes
LLM Response: For MDL, we first consider a two-domain example composed of CIFAR10 [29] and STL10 [9]...We then expand the results to the DomainNet benchmark [47]...For MTL, we use CelebA [40]...Finally, we experiment on a larger MTL setting for dense prediction...on the challenging Taskonomy dataset [65].

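As a rough illustration of the two-domain setup mentioned above, the torchvision sketch below loads CIFAR10 and STL10 with a shared preprocessing pipeline; the 32x32 resize, the data root, and the dictionary layout are assumptions for illustration, not the paper's exact preprocessing.

```python
from torchvision import datasets, transforms

# Resize both domains to a common resolution so a shared backbone can
# consume batches from either one (assumed preprocessing).
common = transforms.Compose([transforms.Resize(32), transforms.ToTensor()])

cifar10 = datasets.CIFAR10(root="data", train=True, download=True, transform=common)
stl10 = datasets.STL10(root="data", split="train", download=True, transform=common)

domains = {"cifar10": cifar10, "stl10": stl10}
```
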
Dataset Splits: Yes
LLM Response: For PBT results, we first run the search algorithm using the implementation from Raytune [33]. We use 70% of the training set for training, and use the remaining 30% to rank models in the population by measuring their average accuracy on this set.

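The Ray Tune snippet below sketches how a population-based search over scalarization weights could be wired up around such a 70/30 split; the trainable, the two weight names, the population size, and the mock accuracy function are hypothetical, and the authors' exact configuration is not reproduced here.

```python
import random

from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

def run_one_epoch(task_weights):
    # Placeholder for a real epoch on the 70% training split; returns a mock
    # average accuracy standing in for evaluation on the held-out 30%.
    return random.random()

def train_multitask(config):
    # Hypothetical trainable: each reported step lets PBT exploit/explore
    # the scalarization weights across the population.
    for _ in range(30):
        tune.report(mean_accuracy=run_one_epoch(config))

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="mean_accuracy",
    mode="max",
    perturbation_interval=2,
    hyperparam_mutations={
        "w_task_a": tune.uniform(0.0, 1.0),  # hypothetical scalarization weights
        "w_task_b": tune.uniform(0.0, 1.0),
    },
)

tune.run(
    train_multitask,
    scheduler=pbt,
    num_samples=8,  # population size (assumed)
    config={"w_task_a": 0.5, "w_task_b": 0.5},
)
```
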
Hardware Specification: Yes
LLM Response: Unless stated, every experiment is conducted on a single NVIDIA V100 GPU. Finally, we train each model from scratch on a single Nvidia V100 GPU with a batch size of 256 images for 300 epochs...

Software Dependencies: No
LLM Response: The paper mentions specific optimizers (e.g., "AdamW optimizer") and tools (e.g., "Raytune [33]") but does not provide specific version numbers for these or other software libraries (e.g., deep learning frameworks like PyTorch or TensorFlow).

Experiment Setup: Yes
LLM Response: We use a vision transformer backbone (ViT-S)...To control model capacity, we vary the depth (number of transformer layers) in {3, 6, 9} and the width (token dimension) in {48, 96, 144, 192}. Finally, we train each model...with a batch size of 256 images for 300 epochs (including 30 epochs of linear learning rate warmup), using a learning rate of 0.001 and weight decay of 0.05 with the AdamW optimizer and cosine learning rate decay. Following these results, we use a learning rate of 0.03 and train for 30 epochs with a batch size of 512 in subsequent experiments. We train with the AdamW optimizer with a weight decay of 1e-4. We also apply linear learning rate warm-up during the first five training epochs and use cosine schedule learning rate decay for the rest of the training.

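The PyTorch sketch below mirrors the first training recipe quoted above (AdamW, learning rate 0.001, weight decay 0.05, 300 epochs with 30 warmup epochs and cosine decay); the stand-in module and the warmup start factor are assumptions, and the actual models are ViT-S variants rather than this placeholder.

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(384, 10)  # stand-in for a ViT-S-style backbone (assumed)

epochs, warmup_epochs = 300, 30
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

# Linear warmup for the first 30 epochs, then cosine decay for the remainder.
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_epochs),
        CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs),
    ],
    milestones=[warmup_epochs],
)

for epoch in range(epochs):
    # ... one training epoch over multi-task/multi-domain batches ...
    optimizer.step()   # placeholder; a real loop steps per batch after backward()
    scheduler.step()   # per-epoch schedule as described in the quoted setup
```

The second recipe quoted (learning rate 0.03, 30 epochs, batch size 512, weight decay 1e-4, five warmup epochs) would swap those values into the same optimizer/scheduler structure.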