Shallow-Deep Networks: Understanding and Mitigating Network Overthinking
Authors: Yigitcan Kaya, Sanghyun Hong, Tudor Dumitras
ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply SDN to four modern architectures, trained on three image classification tasks, to characterize the overthinking problem. We show that SDNs can mitigate the wasteful effect of overthinking with confidence-based early exits, which reduce the average inference cost by more than 50% and preserve the accuracy. |
| Researcher Affiliation | Academia | University of Maryland, Maryland, USA. |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | We also release all of our source code (www.shallowdeep.network). |
| Open Datasets | Yes | In our experiments, we use three datasets for benchmarking: CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009) and Tiny ImageNet (Deng et al., 2009) |
| Dataset Splits | Yes | CIFAR-10 and CIFAR-100 images are drawn from 10 and 100 classes, respectively; containing 50,000 training and 10,000 validation images. The Tiny ImageNet dataset consists of a subset of ImageNet images (Deng et al., 2009), resized at 64x64 pixels. There are 200 classes, each of which has 500 training and 50 validation images. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or cloud instance types used for running experiments. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and refers to various network architectures, but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch/TensorFlow versions). |
| Experiment Setup | Yes | We train the CNNs for 100 epochs, using the hyper-parameters the original studies describe. To apply SDNs to pretrained networks, we train the internal classifiers for 25 epochs, using the Adam optimizer (Kingma & Ba, 2014). If we start training a modified network from scratch, we train for 100 epochs; the same as the original networks. |
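The "Research Type" row above quotes the paper's claim that confidence-based early exits cut average inference cost by more than 50% while preserving accuracy. The sketch below illustrates the general idea of such an exit rule; it is not the authors' released implementation, and the module names (`backbone_blocks`, `internal_classifiers`, `final_classifier`), the softmax-confidence rule, and the threshold value are all illustrative assumptions.

```python
import torch.nn.functional as F

def early_exit_inference(backbone_blocks, internal_classifiers, final_classifier,
                         x, confidence_threshold=0.9):
    """Forward a single input (batch size 1), exiting at the first internal
    classifier whose softmax confidence exceeds the threshold."""
    # Assumes one hypothetical exit head per listed backbone stage; the paper's
    # actual attachment points and thresholds are not reproduced here.
    for block, head in zip(backbone_blocks, internal_classifiers):
        x = block(x)                                    # next backbone stage
        logits = head(x)                                # internal classifier prediction
        confidence, prediction = F.softmax(logits, dim=1).max(dim=1)
        if confidence.item() >= confidence_threshold:
            return prediction, logits                   # early exit: skip remaining layers
    # No internal classifier was confident enough; use the final (deepest) output.
    logits = final_classifier(x)
    return logits.argmax(dim=1), logits
```

Inputs that exit early never pay for the deeper layers, which is where the reported reduction in average inference cost comes from.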
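The "Experiment Setup" row states that internal classifiers attached to pretrained networks are trained for 25 epochs with Adam. A minimal sketch of such a training loop is below, assuming a PyTorch-style model; the attribute names (`backbone`, `internal_classifiers`), the learning rate, and the plain sum of per-exit losses are placeholders rather than details taken from the paper or its released code.

```python
import torch
from torch import nn, optim

def train_internal_classifiers(sdn_model, train_loader, epochs=25, lr=1e-3, device="cuda"):
    """Train only the attached exit heads of a pretrained network
    (25 epochs, Adam, as described in the Experiment Setup row)."""
    sdn_model.to(device)
    for p in sdn_model.backbone.parameters():
        p.requires_grad = False                          # keep pretrained weights fixed

    params = [p for head in sdn_model.internal_classifiers for p in head.parameters()]
    optimizer = optim.Adam(params, lr=lr)                # lr is an assumed placeholder
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            # Assumes the model returns one logits tensor per internal classifier.
            all_logits = sdn_model(images)
            loss = sum(criterion(logits, labels) for logits in all_logits)
            loss.backward()
            optimizer.step()
```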