Routing Networks: Adaptive Selection of Non-Linear Functions for Multi-Task Learning

Authors: Clemens Rosenbaum, Tim Klinger, Matthew Riemer

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our model against cross-stitch networks and shared-layer baselines on multi-task settings of the MNIST, mini-imagenet, and CIFAR-100 datasets. Our experiments demonstrate a significant improvement in accuracy, with sharper convergence.
Researcher Affiliation | Collaboration | Clemens Rosenbaum, College of Information and Computer Sciences, University of Massachusetts Amherst, 140 Governors Dr., Amherst, MA 01003, cgbr@cs.umass.edu; Tim Klinger & Matthew Riemer, IBM Research AI, 1101 Kitchawan Rd, Yorktown Heights, NY 10598, {tklinger,mdriemer}@us.ibm.com
Pseudocode | Yes | Algorithm 1: Routing Algorithm; Algorithm 2: Router-Trainer (Training of a Routing Network); Algorithm 3: Weighted Policy Learner
Open Source Code | No | All dataset splits and the code will be released with the publication of this paper.
Open Datasets | Yes | We experiment with three datasets: multi-task versions of MNIST (MNIST-MTL) (LeCun et al., 1998), Mini-Imagenet (MIN-MTL) (Vinyals et al., 2016) as introduced by Ravi & Larochelle (2017), and CIFAR-100 (CIFAR-MTL) (Krizhevsky, 2009), where we treat the 20 superclasses as tasks.
Dataset Splits | No | The paper provides training and testing split sizes in Table 1 and within the text for each dataset, but it does not specify a separate validation split with quantitative details.
Hardware Specification | No | The paper mentions 'training time on a stable compute cluster' but does not provide specific hardware details such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions using SGD and Adam optimizers but does not provide specific version numbers for any software libraries, frameworks, or programming languages used (e.g., PyTorch, TensorFlow, Python version).
Experiment Setup | Yes | We use ρ = 0.0 (no collaboration reward) for CIFAR-MTL and MIN-MTL and ρ = 0.3 for MNIST-MTL. The learning rate is initialized to 10^-2 and annealed by dividing by 10 every 20 epochs. We tried both regular SGD as well as Adam (Kingma & Ba, 2014), but chose SGD as it resulted in marginally better performance. The Simple Conv Net has batch normalization layers but we use no dropout.
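
The Pseudocode row above names the paper's Algorithm 1 (Routing Algorithm), in which a router chooses one function block to apply at each routing step, conditioned on the task. As a point of reference, here is a minimal sketch of that per-layer selection, assuming a PyTorch-style interface; the block shapes, the tabular per-task router, and the sampling scheme are illustrative assumptions, not the authors' released implementation (in the paper the router is trained with reinforcement learning, e.g. the Weighted Policy Learner of Algorithm 3, which is not shown here).

```python
import torch
import torch.nn as nn


class RoutedLayer(nn.Module):
    """One routing step: a task-conditioned router picks one candidate block."""

    def __init__(self, num_modules: int, dim: int, num_tasks: int):
        super().__init__()
        # Candidate function blocks the router can choose between.
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(num_modules)
        )
        # A very simple router: one learnable score per (task, module) pair.
        self.router_logits = nn.Parameter(torch.zeros(num_tasks, num_modules))

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # Sample a block from the router's policy for this task; in the paper
        # this policy is learned with RL, not by backpropagating through the choice.
        probs = torch.softmax(self.router_logits[task_id], dim=-1)
        choice = int(torch.multinomial(probs, num_samples=1))
        return self.blocks[choice](x)


class RoutingNetwork(nn.Module):
    """Stack a few routed layers; the same idea applies to convolutional blocks."""

    def __init__(self, depth: int = 3, num_modules: int = 4, dim: int = 64, num_tasks: int = 20):
        super().__init__()
        self.layers = nn.ModuleList(
            RoutedLayer(num_modules, dim, num_tasks) for _ in range(depth)
        )

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x, task_id)
        return x


# Usage: route a batch of 8 feature vectors for task 5.
net = RoutingNetwork()
out = net(torch.randn(8, 64), task_id=5)
```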
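
The Open Datasets row notes that CIFAR-MTL treats the 20 CIFAR-100 superclasses as tasks. The sketch below shows one plausible way to regroup the data accordingly; `fine_to_coarse` is a stand-in for the fine-label-to-superclass mapping stored in the raw CIFAR-100 `coarse_labels` field (torchvision's CIFAR100 exposes only fine labels), and the 0-4 relabelling within each task is an assumption, not a detail taken from the paper.

```python
from collections import defaultdict


def build_cifar_mtl_tasks(samples, fine_to_coarse):
    """Group (image, fine_label) pairs into 20 five-way tasks by superclass.

    `fine_to_coarse` maps each of the 100 fine labels to its superclass id.
    Returns {task_id: list of (image, label_within_task)} dictionaries.
    """
    # Fix an ordering of the five fine labels inside each superclass so they
    # can be relabelled 0..4 within their task.
    fines_per_task = defaultdict(list)
    for fine in sorted(fine_to_coarse):
        fines_per_task[fine_to_coarse[fine]].append(fine)
    within_task_label = {
        fine: fines_per_task[coarse].index(fine)
        for fine, coarse in fine_to_coarse.items()
    }

    tasks = defaultdict(list)
    for image, fine in samples:
        tasks[fine_to_coarse[fine]].append((image, within_task_label[fine]))
    return dict(tasks)
```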
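
The Experiment Setup row quotes the optimization details: SGD with a learning rate of 10^-2, divided by 10 every 20 epochs. The snippet below wires those numbers into a standard PyTorch training skeleton; the stand-in model, dummy batch, and epoch count are placeholders, and the collaboration reward ρ is not modelled here.

```python
import torch

# Stand-in model; the paper's routed convolutional net is not reproduced here.
model = torch.nn.Linear(64, 20)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
# StepLR divides the learning rate by 10 every 20 epochs, matching the quoted
# "annealed by dividing by 10 every 20 epochs".
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(60):
    # Placeholder batch; real training would iterate over the per-task datasets.
    x, y = torch.randn(32, 64), torch.randint(0, 20, (32,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # anneal once per epoch
```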