Bayesian Uncertainty for Gradient Aggregation in Multi-Task Learning

Authors: Idan Achituve, Idit Diamant, Arnon Netzer, Gal Chechik, Ethan Fetaya

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate the benefits of our approach in a variety of datasets, achieving state-of-the-art performance. We evaluated BayesAgg-MTL on several MTL benchmarks differing in the number of tasks and their types. The test results for this dataset are presented in Table 1.
Researcher Affiliation | Collaboration | 1. Faculty of Engineering, Bar-Ilan University, Israel; 2. Sony Semiconductor Israel; 3. Department of Computer Science, Bar-Ilan University, Israel.
Pseudocode | Yes | Algorithm 1: BayesAgg-MTL. Input: a random batch of examples $B$; posterior distributions $p(w_k \mid \mathcal{D})$, $k \in \{1, \ldots, K\}$, over the task-specific parameters; a scaling hyper-parameter $s$. For $i = 1, \ldots, |B|$: for $k = 1, \ldots, K$: compute $\mathbb{E}[g_i^k]$ and $\mathbb{E}[g_i^k (g_i^k)^T]$ as in Eq. 4 for regression or Eq. 11 for classification; set (element-wise) $\mu_i^k := \mathbb{E}[g_i^k]$ and $\lambda_i^k := \left(\mathbb{E}[(g_i^k)^2] - \mathbb{E}[g_i^k]\,\mathbb{E}[g_i^k]\right)^{-1}$. End for. Compute $g_i = \sum_{k=1}^{K} \frac{(\lambda_i^k)^s}{\sum_{k'=1}^{K} (\lambda_i^{k'})^s}\, \mu_i^k$. End for. Compute the gradient w.r.t. the shared parameters via matrix multiplication: $\frac{1}{|B|} \sum_{i=1}^{|B|} g_i\, h_i$. (A PyTorch-style sketch of this aggregation step follows the table.)
Open Source Code | Yes | Our code is publicly available at https://github.com/ssi-research/BayesAgg_MTL.
Open Datasets | Yes | We demonstrate our method's effectiveness over baseline methods on the MTL benchmarks QM9 (Ramakrishnan et al., 2014), CIFAR-100 (Krizhevsky et al., 2009), Chest X-ray14 (Wang et al., 2017), and UTKFace (Zhang et al., 2017).
Dataset Splits | Yes | In all datasets, we pre-allocated a validation set from the training set for hyper-parameter tuning and early stopping for all methods. Specifically, we allocate approximately 110,000 examples for training, with separate validation and testing sets of 10,000 examples each. We use the official train-test split of 50,000 and 10,000 examples respectively, and allocate 5,000 examples from the training set for a validation set. We use the official split of 70%/10%/20% for training, validation, and test. We split the dataset according to 70%/10%/20% into train, validation, and test sets. (A sketch of such a 70/10/20 split follows the table.)
Hardware Specification | Yes | All the experiments were done using PyTorch on NVIDIA V100 and A100 GPUs with 32GB of memory.
Software Dependencies | No | The paper mentions PyTorch, the ADAM optimizer, and PyTorch Image Models but does not specify exact version numbers, which are needed for reproducibility.
Experiment Setup | Yes | Full experimental details are given in Appendix B. We used the ADAM optimizer (Kingma & Ba, 2014) with an initial lr of 1e-3. The batch size was set to 120. All methods were trained for 50 epochs using the ADAM optimizer, with an initial learning rate of 1e-3 and a scheduler that drops the learning rate by a factor of 0.1 at 60% and 80% of the training. We set the batch size to 128 and used a weight decay of 1e-4. Here we trained for 100 epochs, the batch size was set to 256, and we didn't use a weight decay. Here, we used ResNet-18 for the shared network. (An optimizer/scheduler sketch for the 50-epoch setup follows the table.)
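To make the aggregation step of Algorithm 1 concrete, here is a minimal PyTorch-style sketch, not the authors' released implementation. The function name, tensor shapes, and the eps stabilizer are assumptions, the per-task gradient moments (Eq. 4 / Eq. 11) are treated as given, and the final matrix multiplication is interpreted as backpropagating the aggregated gradient through the shared representation h_i.

```python
import torch

def bayes_agg_step(grad_mean, grad_sq_mean, h, shared_params, s=1.0, eps=1e-12):
    """Aggregate per-task gradients at the last shared representation (sketch).

    grad_mean:     (K, B, D) tensor of E[g_i^k] per task k and example i
    grad_sq_mean:  (K, B, D) tensor of element-wise E[(g_i^k)^2]
    h:             (B, D) shared representation, still attached to the autograd graph
    shared_params: iterable of shared parameters theta
    s:             scaling hyper-parameter from Algorithm 1
    """
    mu = grad_mean                               # mu_i^k := E[g_i^k]
    var = grad_sq_mean - grad_mean ** 2          # element-wise variance of g_i^k
    lam = 1.0 / (var + eps)                      # lambda_i^k := inverse variance (eps is an assumed stabilizer)
    w = lam.pow(s) / lam.pow(s).sum(dim=0, keepdim=True)   # normalize weights over the K tasks
    g = (w * mu).sum(dim=0)                      # aggregated per-example gradient, shape (B, D)

    # Chain rule through the shared trunk: grad_theta = (1/|B|) * sum_i g_i * dh_i/dtheta
    return torch.autograd.grad(h, list(shared_params), grad_outputs=g / h.shape[0])
```

The returned tuple holds one gradient tensor per shared parameter and could be written into each parameter's .grad field before an optimizer step.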
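For the benchmarks quoted above that use a 70%/10%/20% partition, a split could be produced as in the following sketch; the function name and the fixed seed are illustrative assumptions, not details from the paper.

```python
import torch
from torch.utils.data import random_split

def split_70_10_20(dataset, seed=0):
    """Partition a dataset into 70% train, 10% validation, 20% test."""
    n = len(dataset)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    n_test = n - n_train - n_val                   # remainder goes to the test set
    gen = torch.Generator().manual_seed(seed)      # fixed seed for a reproducible split
    return random_split(dataset, [n_train, n_val, n_test], generator=gen)
```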
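The 50-epoch configuration in the experiment-setup row (ADAM, initial lr 1e-3, weight decay 1e-4, lr drops of 0.1 at 60% and 80% of training) corresponds to a schedule like the sketch below; the helper name and the choice of MultiStepLR are assumptions about how the quoted schedule might be implemented.

```python
import torch

def make_optimizer_and_scheduler(model, epochs=50, lr=1e-3, weight_decay=1e-4):
    """Adam with step decays of 0.1 at 60% and 80% of training, per the quoted setup."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    milestones = [int(0.6 * epochs), int(0.8 * epochs)]   # epochs 30 and 40 when epochs=50
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=milestones, gamma=0.1)
    return optimizer, scheduler
```

With this kind of setup, scheduler.step() would be called once per epoch after the training loop for that epoch.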