RotoGrad: Gradient Homogenization in Multitask Learning

Authors: Adrián Javaloy, Isabel Valera

ICLR 2022

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Finally, we run extensive experiments to empirically demonstrate that RotoGrad leads to stable (convergent) learning, scales up to complex network architectures, and outperforms competing methods in multi-label classification settings in CIFAR10 and CelebA, as well as in computer vision tasks using the NYUv2 dataset." |
| Researcher Affiliation | Academia | "Adrián Javaloy, Department of Computer Science, Saarland University, Saarbrücken, Germany, ajavaloy@cs.uni-saarland.de" |
| Pseudocode | Yes | "Algorithm 1: Training step with RotoGrad." |
| Open Source Code | Yes | "A PyTorch implementation can be found in https://github.com/adrianjav/rotograd." |
| Open Datasets | Yes | "We test all methods on three different tasks of NYUv2 (Couprie et al., 2013)... We test RotoGrad on a 10-task classification problem on CIFAR10 (Krizhevsky et al., 2009)... we use a multitask version of MNIST (LeCun et al., 2010)... and SVHN (Netzer et al., 2011)... We use CelebA (Liu et al., 2015) as dataset with usual splits." |
| Dataset Splits | Yes | "For the single training of a model, we select the parameters of the model by taking those that obtained the best validation error after each training epoch." |
| Hardware Specification | Yes | "Computational resources. All experiments were performed on a shared cluster system with two NVIDIA DGX-A100. Therefore, all experiments were run with (up to) 4 cores of AMD EPYC 7742 CPUs and, for those trained on GPU (CIFAR10, CelebA, and NYUv2), a single NVIDIA A100 GPU. All experiments were restricted to 12 GB of RAM." |
| Software Dependencies | No | The paper mentions software and libraries such as PyTorch, RAdam, Adam, and GeoTorch, but it does not provide version numbers for these components, which reproducible software dependencies require. |
| Experiment Setup | Yes | "Model hyperparameters. For both datasets, we train the model for 300 epochs using a batch size of 1024. For the network parameters, we use RAdam (Liu et al., 2019a) with a learning rate of 1e-3." |
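The paper's Algorithm 1 is not reproduced in this report. As a loose illustration only of the gradient-homogenization idea named in the title (rotating per-task gradients so they no longer conflict), here is a minimal 2-D toy sketch; note that RotoGrad itself learns per-task rotations of the shared representation during training, whereas this toy computes closed-form rotations of two fixed gradient vectors, and all names below are hypothetical:

```python
import math

def rotate(v, theta):
    """Rotate a 2-D vector v by angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def angle(v):
    return math.atan2(v[1], v[0])

def cos_sim(a, b):
    dot = a[0] * b[0] + a[1] * b[1]
    return dot / (math.hypot(*a) * math.hypot(*b))

# Two conflicting task gradients on a shared 2-D representation (toy values).
g1, g2 = (1.0, 0.5), (0.2, -1.0)
print(round(cos_sim(g1, g2), 3))  # negative: the tasks pull in conflicting directions

# Target direction: the mean of the task gradients.
mean = ((g1[0] + g2[0]) / 2, (g1[1] + g2[1]) / 2)
target = angle(mean)

# Rotate each task gradient onto the target direction, preserving its magnitude.
h1 = rotate(g1, target - angle(g1))
h2 = rotate(g2, target - angle(g2))

# After rotation both gradients point the same way: the conflict is gone.
print(round(cos_sim(h1, h2), 6))  # → 1.0
```

The magnitudes of `h1` and `h2` are unchanged by the rotations; in the paper, gradient magnitudes are handled separately from directions, which this toy does not model.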