Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Reasonable Effectiveness of Random Weighting: A Litmus Test for Multi-Task Learning
Authors: Baijiong Lin, Feiyang Ye, Yu Zhang, Ivor Tsang
TMLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To show the effectiveness and necessity of RW methods, theoretically, we analyze the convergence of RW and reveal that RW has a higher probability to escape local minima, resulting in better generalization ability. Empirically, we extensively evaluate the proposed RW methods to compare with twelve state-of-the-art methods on five image datasets and two multilingual problems from the XTREME benchmark to show that RW methods can achieve comparable performance with state-of-the-art baselines. |
| Researcher Affiliation | Academia | 1 Department of Computer Science and Engineering, Southern University of Science and Technology 2 Australian Artificial Intelligence Institute, University of Technology Sydney 3 Centre for Frontier AI Research, A*STAR 4 Peng Cheng Laboratory |
| Pseudocode | Yes | The training algorithms of both RW methods are summarized in Algorithm 1. The only difference between the RW methods and the existing works is the generation of loss/gradient weights (i.e., Line 7 in Algorithm 1). |
| Open Source Code | Yes | The implementations of the RW methods and the baseline methods are based on the open-source LibMTL library (Lin & Zhang, 2022). |
| Open Datasets | Yes | On five Computer Vision (CV) datasets and two multilingual problems from the XTREME benchmark (Hu et al., 2020), we show that RW methods can consistently outperform EW and have competitive performance with existing SOTA methods. ... We consider three image classification datasets: the Multi-MNIST (Sabour et al., 2017), the Multi-Fashion MNIST, and the Multi-(Fashion+MNIST) datasets (Lin et al., 2019). ... The NYUv2 dataset (Silberman et al., 2012) is an indoor scene understanding dataset... The XTREME benchmark (Hu et al., 2020) is a large-scale multilingual multi-task benchmark... The datasets used in the PI and POS tasks are the PAWS-X dataset (Yang et al., 2019) and Universal Dependency v2.5 treebanks (Nivre et al., 2020), respectively. ... The Cityscapes dataset (Cordts et al., 2016) is a large-scale urban street scene understanding dataset... The CelebA dataset (Liu et al., 2015) is a large-scale face attributes dataset... The Office-31 dataset (Saenko et al., 2010)... The Office-Home dataset (Venkateswara et al., 2017) |
| Dataset Splits | Yes | The Multi-MNIST dataset...we use 120K and 20K images for training and testing, respectively. ... The NYUv2 dataset...It contains 795 and 654 images for training and testing, respectively. ... The XTREME benchmark...The statistics for each language are summarized in Table 2. ... The Cityscapes dataset...It contains 2,975 and 500 annotated images for training and test, respectively. ... The CelebA dataset...It is split into three parts: 162,770, 19,867, and 19,962 images for training, validation, and testing, respectively. ... The Office-31 dataset...We randomly split the whole dataset with 60% for training, 20% for validation, and the rest 20% for testing. The Office-Home dataset...We make the same split as the Office-31 dataset. |
| Hardware Specification | Yes | All the experiments are conducted on one single NVIDIA GeForce RTX 3090 GPU. |
| Software Dependencies | No | The implementations of the RW methods and the baseline methods are based on the open-source LibMTL library (Lin & Zhang, 2022). ... a pre-trained multilingual BERT (mBERT) model (Devlin et al., 2019) implemented via the open-source transformers library (Wolf et al., 2020)... |
| Experiment Setup | Yes | The SGD optimizer with the learning rate as 10⁻³ and the momentum as 0.9 is used for training, the batch size is set to 256, and the training epoch is set to 100. The cross-entropy loss is used for each task. ... For the NYUv2 dataset...The Adam optimizer (Kingma & Ba, 2015) with the learning rate as 10⁻⁴ and the weight decay as 10⁻⁵ is used for training and the batch size is set to 8. We use the cross-entropy loss, L1 loss, and cosine loss as the loss function... For each multilingual problem in the XTREME benchmark...The Adam optimizer with the learning rate as 2 × 10⁻⁵ and the weight decay as 10⁻⁸ is used for training and the batch size is set to 32. The cross-entropy loss is used for the two multilingual problems. ... We use the Adam optimizer with the learning rate as 10⁻⁴ and the weight decay as 10⁻⁵ and set the batch size to 128 for training. The cross-entropy loss is used for all tasks in both datasets. |
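The core idea quoted in the table (random loss weights regenerated at each training step, in place of equal weighting) can be sketched as follows. This is a minimal illustration, not the paper's exact Algorithm 1: the softmax-normalised Gaussian sampling and the function names `random_loss_weights` and `rw_step` are assumptions for the sketch.

```python
import torch

def random_loss_weights(num_tasks: int) -> torch.Tensor:
    """Sample fresh loss weights for one training step.

    Assumption: weights are drawn i.i.d. from a standard normal and
    softmax-normalised so they are positive and sum to 1; the paper's
    RW variants may use a different sampling distribution.
    """
    return torch.softmax(torch.randn(num_tasks), dim=0)

def rw_step(losses: list) -> torch.Tensor:
    """Combine per-task losses with freshly sampled random weights."""
    weights = random_loss_weights(len(losses))
    return sum(w * loss for w, loss in zip(weights, losses))

# Usage: inside an ordinary multi-task training loop, replace the
# equal-weight objective sum(losses) / len(losses) with rw_step(losses)
# before calling backward(); everything else stays unchanged.
```

Because a new weight vector is drawn every step, the weighted objective changes stochastically across iterations, which is the mechanism the quoted text credits with a higher probability of escaping local minima.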