Deciding How to Decide: Dynamic Routing in Artificial Neural Networks
Authors: Mason McGill, Pietro Perona
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose three approaches to training these networks, test them on small image datasets synthesized from MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky & Hinton, 2009), and quantify the accuracy/efficiency trade-off that occurs when the network parameters are tuned to yield more aggressive early classification policies. We compare approaches to dynamic routing by training 153 networks to classify small images, varying the policy-learning strategy, regularization strategy, optimization strategy, architecture, cost of computation, and details of the task. The results of these experiments are reported in Figs. 5–10. |
| Researcher Affiliation | Academia | Mason McGill, Pietro Perona (California Institute of Technology, Pasadena, California, USA). |
| Pseudocode | No | The paper describes its methods verbally and mathematically but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available via GitLab. |
| Open Datasets | Yes | we train networks to classify images from a small-image dataset synthesized from MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky & Hinton, 2009) |
| Dataset Splits | No | The paper mentions training iterations, mini-batch size, and the use of validation images, but does not provide specific train/validation/test dataset split percentages or counts. |
| Hardware Specification | No | The paper discusses computational cost and efficiency but does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for the experiments. |
| Software Dependencies | No | The paper mentions various techniques like batch normalization and Xavier initialization, but does not list any specific software dependencies with version numbers (e.g., PyTorch 1.9, Python 3.8). |
| Experiment Setup | Yes | In all of our experiments, we use a mini-batch size, n_ex, of 128, and run 80,000 training iterations. We perform stochastic gradient descent with initial learning rate 0.1/n_ex and momentum 0.9. The learning rate decays continuously with a half-life of 10,000 iterations. [...] τ is initialized to 1.0 for actor networks and 0.1 for critic networks, and decays with a half-life of 10,000 iterations. k_dec = 0.01, k_ure = 0.001, and k_L2 = 1×10⁻⁴. |
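
To make the quoted training schedule concrete, here is a minimal sketch, assuming that "decays continuously with a half-life of 10,000 iterations" means exponential decay of the form value(t) = value₀ · 0.5^(t / half_life). The function and constant names are ours, not the authors'; the coefficients k_dec, k_ure, and k_L2 are copied from the setup above without reproducing their definitions from the paper.

```python
# Hyperparameters quoted in the Experiment Setup row above.
N_EX = 128           # mini-batch size, n_ex
N_ITER = 80_000      # total training iterations
LR_0 = 0.1 / N_EX    # initial SGD learning rate, 0.1 / n_ex
MOMENTUM = 0.9       # SGD momentum
HALF_LIFE = 10_000   # half-life (iterations) for learning-rate and temperature decay
TAU_0_ACTOR = 1.0    # initial temperature τ for actor networks
TAU_0_CRITIC = 0.1   # initial temperature τ for critic networks
K_DEC = 0.01         # k_dec (see the paper for its definition)
K_URE = 0.001        # k_ure (see the paper for its definition)
K_L2 = 1e-4          # k_L2 weight-decay coefficient


def decayed(initial: float, iteration: int, half_life: int = HALF_LIFE) -> float:
    """Continuous exponential decay: value(t) = initial * 0.5 ** (t / half_life)."""
    return initial * 0.5 ** (iteration / half_life)


if __name__ == "__main__":
    # Print the assumed schedule at a few checkpoints over the 80,000 iterations.
    for t in (0, 10_000, 40_000, N_ITER):
        print(f"iter {t:>6}: lr = {decayed(LR_0, t):.3e}, "
              f"tau_actor = {decayed(TAU_0_ACTOR, t):.3f}, "
              f"tau_critic = {decayed(TAU_0_CRITIC, t):.3f}")
```

Under this reading, the learning rate halves every 10,000 iterations (so it falls by a factor of 256 over the full 80,000-iteration run), and the actor/critic temperatures follow the same decay from their respective initial values.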