Deciding How to Decide: Dynamic Routing in Artificial Neural Networks

Authors: Mason McGill, Pietro Perona

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose three approaches to training these networks, test them on small image datasets synthesized from MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky & Hinton, 2009), and quantify the accuracy/efficiency trade-off that occurs when the network parameters are tuned to yield more aggressive early classification policies. We compare approaches to dynamic routing by training 153 networks to classify small images, varying the policy-learning strategy, regularization strategy, optimization strategy, architecture, cost of computation, and details of the task. The results of these experiments are reported in Fig. 5-10.
Researcher Affiliation | Academia | Mason McGill and Pietro Perona, California Institute of Technology, Pasadena, California, USA.
Pseudocode | No | The paper describes its methods verbally and mathematically but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available via GitLab.
Open Datasets | Yes | we train networks to classify images from a small-image dataset synthesized from MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky & Hinton, 2009)
Dataset Splits | No | The paper mentions training iterations, mini-batch size, and the use of validation images, but does not provide specific train/validation/test dataset split percentages or counts.
Hardware Specification | No | The paper discusses computational cost and efficiency but does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for the experiments.
Software Dependencies | No | The paper mentions various techniques like batch normalization and Xavier initialization, but does not list any specific software dependencies with version numbers (e.g., PyTorch 1.9, Python 3.8).
Experiment Setup | Yes | In all of our experiments, we use a mini-batch size, n_ex, of 128, and run 80,000 training iterations. We perform stochastic gradient descent with initial learning rate 0.1/n_ex and momentum 0.9. The learning rate decays continuously with a half-life of 10,000 iterations. [...] τ is initialized to 1.0 for actor networks and 0.1 for critic networks, and decays with a half-life of 10,000 iterations. k_dec = 0.01, k_ure = 0.001, and k_L2 = 1×10^-4.
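
The quoted experiment setup can be sketched in plain Python as follows. This is an illustrative rendering of the stated hyperparameters and half-life decay schedule only, not the authors' released code; names such as half_life_decay, K_DEC, K_URE, and K_L2 are ours.

```python
# Illustrative sketch of the quoted training schedule (assumed rendering;
# identifiers below are ours, not taken from the authors' GitLab code).

N_EX = 128             # mini-batch size n_ex
N_ITERS = 80_000       # total training iterations
MOMENTUM = 0.9         # SGD momentum
LR_INIT = 0.1 / N_EX   # initial learning rate, 0.1 / n_ex
HALF_LIFE = 10_000     # iterations over which decayed quantities halve

# Regularization coefficients as quoted; the loss terms they weight are
# defined in the paper, not here.
K_DEC, K_URE, K_L2 = 0.01, 0.001, 1e-4


def half_life_decay(initial: float, iteration: int,
                    half_life: int = HALF_LIFE) -> float:
    """Continuous exponential decay with the given half-life."""
    return initial * 0.5 ** (iteration / half_life)


for it in range(N_ITERS):
    lr = half_life_decay(LR_INIT, it)       # decaying learning rate
    tau_actor = half_life_decay(1.0, it)    # temperature for actor networks
    tau_critic = half_life_decay(0.1, it)   # temperature for critic networks
    # ... the SGD-with-momentum parameter update would go here ...
```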