Routing Networks: Adaptive Selection of Non-Linear Functions for Multi-Task Learning
Authors: Clemens Rosenbaum, Tim Klinger, Matthew Riemer
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our model against cross-stitch networks and shared-layer baselines on multi-task settings of the MNIST, mini-imagenet, and CIFAR-100 datasets. Our experiments demonstrate a significant improvement in accuracy, with sharper convergence. |
| Researcher Affiliation | Collaboration | Clemens Rosenbaum, College of Information and Computer Sciences, University of Massachusetts Amherst, 140 Governors Dr., Amherst, MA 01003, cgbr@cs.umass.edu; Tim Klinger & Matthew Riemer, IBM Research AI, 1101 Kitchawan Rd, Yorktown Heights, NY 10598, {tklinger,mdriemer}@us.ibm.com |
| Pseudocode | Yes (see the routing sketch below the table) | Algorithm 1: Routing Algorithm; Algorithm 2: Router-Trainer (Training of a Routing Network); Algorithm 3: Weighted Policy Learner |
| Open Source Code | No | All dataset splits and the code will be released with the publication of this paper. |
| Open Datasets | Yes (see the dataset-construction sketch below the table) | We experiment with three datasets: multi-task versions of MNIST (MNIST-MTL) (LeCun et al., 1998), Mini-Imagenet (MIN-MTL) (Vinyals et al., 2016) as introduced by Ravi & Larochelle (2017), and CIFAR-100 (CIFAR-MTL) (Krizhevsky, 2009), where we treat the 20 superclasses as tasks. |
| Dataset Splits | No | The paper provides training and testing split sizes in Table 1 and within the text for each dataset, but it does not specify a separate validation split with quantitative details. |
| Hardware Specification | No | The paper mentions 'training time on a stable compute cluster' but does not provide specific hardware details such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions using SGD and Adam optimizers but does not provide specific version numbers for any software libraries, frameworks, or programming languages used (e.g., PyTorch, TensorFlow, Python version). |
| Experiment Setup | Yes (see the optimizer sketch below the table) | We use ρ = 0.0 (no collaboration reward) for CIFAR-MTL and MIN-MTL and ρ = 0.3 for MNIST-MTL. The learning rate is initialized to 10^-2 and annealed by dividing by 10 every 20 epochs. We tried both regular SGD as well as Adam (Kingma & Ba, 2014), but chose SGD as it resulted in marginally better performance. The Simple Conv Net has batch normalization layers but we use no dropout. |
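
The paper's pseudocode (Algorithms 1 to 3) is not reproduced here, but the core idea of Algorithm 1, routing an input through a sequence of selected function blocks, can be illustrated with a minimal sketch. The sketch assumes a per-task tabular router with greedy block selection and hypothetical `num_blocks`/`depth`/`hidden_dim` parameters; in the paper the router is trained with reinforcement learning (e.g., the Weighted Policy Learner of Algorithm 3), which is not implemented below.

```python
import torch
import torch.nn as nn


class RoutingNetworkSketch(nn.Module):
    """Minimal sketch of the routing forward pass (Algorithm 1 in the paper).

    Assumptions: `num_blocks` candidate function blocks per routing depth and a
    per-task table of routing scores read greedily. The paper trains the router
    with an RL rule (Weighted Policy Learner); that update is omitted here.
    """

    def __init__(self, num_tasks: int, num_blocks: int, depth: int, hidden_dim: int):
        super().__init__()
        # Candidate function blocks for each routing depth, shared across tasks.
        self.blocks = nn.ModuleList([
            nn.ModuleList([nn.Linear(hidden_dim, hidden_dim) for _ in range(num_blocks)])
            for _ in range(depth)
        ])
        # Tabular routing scores indexed by (task, depth, block); in the paper
        # these are updated by the RL router, not by backpropagation.
        self.register_buffer("router_scores", torch.zeros(num_tasks, depth, num_blocks))

    def forward(self, v: torch.Tensor, task: int) -> torch.Tensor:
        for d, layer_blocks in enumerate(self.blocks):
            # Router decision: pick the highest-scoring block for this task and depth.
            a = int(self.router_scores[task, d].argmax())
            v = torch.relu(layer_blocks[a](v))
        return v


# Usage: route a batch of 64-dim features for task 3 through 3 routing decisions.
net = RoutingNetworkSketch(num_tasks=20, num_blocks=5, depth=3, hidden_dim=64)
out = net(torch.randn(8, 64), task=3)
```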
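
The CIFAR-MTL construction ("we treat the 20 superclasses as tasks") can likewise be sketched. This assumes the standard "CIFAR-100 python version" archive, whose pickled batches carry `data`, `fine_labels`, and `coarse_labels`; the paper's exact split sizes and released code are not reproduced here.

```python
import pickle
from collections import defaultdict


def load_cifar_mtl(path="cifar-100-python/train"):
    """Group CIFAR-100 examples by their 20 coarse superclasses, one task each."""
    with open(path, "rb") as f:
        batch = pickle.load(f, encoding="bytes")
    tasks = defaultdict(list)
    for img, fine, coarse in zip(batch[b"data"], batch[b"fine_labels"], batch[b"coarse_labels"]):
        # Each superclass becomes a separate classification task over the
        # fine labels it contains.
        tasks[coarse].append((img.reshape(3, 32, 32), fine))
    return tasks
```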
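
The reported optimizer settings (SGD, initial learning rate 10^-2, divided by 10 every 20 epochs) correspond to a step-decay schedule. The sketch below assumes PyTorch, which the paper does not name, and uses a placeholder model in place of the Simple Conv Net.

```python
import torch

# Placeholder model; the paper's "Simple Conv Net" (batch norm, no dropout)
# is not reproduced here.
model = torch.nn.Linear(32, 10)

# SGD with the reported initial learning rate of 1e-2 ...
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
# ... annealed by dividing by 10 every 20 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(60):
    # ... per-batch forward pass, loss, loss.backward(), optimizer.step() ...
    scheduler.step()  # anneal once per epoch
```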