Capsules with Inverted Dot-Product Attention Routing
Authors: Yao-Hung Hubert Tsai, Nitish Srivastava, Hanlin Goh, Ruslan Salakhutdinov
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our model achieves comparable performance to state-of-the-art convolutional neural networks (CNNs), but with far fewer parameters, on CIFAR-10 (95.14% test accuracy) and CIFAR-100 (78.02% test accuracy). We also introduce a challenging task to recognize single and multiple overlapping objects simultaneously. Sections 5 and 6 are titled 'EXPERIMENTS ON CIFAR-10 AND CIFAR-100' and 'EXPERIMENTS ON DIVERSE MULTIMNIST' respectively, detailing empirical evaluations. |
| Researcher Affiliation | Collaboration | 1Apple Inc., 2Carnegie Mellon University |
| Pseudocode | Yes | Procedure 1 (Inverted Dot-Product Attention Routing) returns the updated poses of the capsules in layer L+1, given the poses in layers L and L+1 and the weights between layers L and L+1. Procedure 2 (Inference) returns class logits given input images and the model parameters. (An illustrative routing sketch appears after this table.) |
| Open Source Code | Yes | Our code is publicly available at: https://github.com/apple/ml-capsules-inverted-attention-routing. An alternative implementation is available at: https://github.com/yaohungt/Capsules-Inverted-Attention-Routing/blob/master/README.md. |
| Open Datasets | Yes | The CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009) consist of small 32 × 32 real-world color images, with 50,000 for training and 10,000 for evaluation. To this end, we construct the Diverse MultiMNIST dataset, which is extended from MNIST (LeCun et al., 1998). (A minimal data-loading sketch appears after this table.) |
| Dataset Splits | No | The paper specifies '50,000 for training and 10,000 for evaluation' for CIFAR-10/100, and '10,000 test images' for Diverse MultiMNIST, but does not explicitly define a separate validation split or a three-way split with counts/percentages for all datasets. |
| Hardware Specification | No | The paper states 'All the model is trained on a 8-GPU machine with batch size 128.' and 'For fairness, the numbers are benchmarked using the same 8-GPU machine with batch size 128.', but does not specify the type or model of the GPUs (e.g., NVIDIA A100, Tesla V100). |
| Software Dependencies | No | The paper mentions optimizers like 'Adam (Kingma & Ba, 2014)' and 'stochastic gradient descent' but does not provide specific version numbers for any software dependencies, libraries, or frameworks used (e.g., PyTorch, TensorFlow, or specific Adam implementation versions). |
| Experiment Setup | Yes | For the optimizers, we use stochastic gradient descent with learning rate 0.1... We use Adam... with learning rate 0.001... We decrease the learning rate by a factor of 10 at epochs 150 and 250, and there are 350 epochs in total. All models are trained on an 8-GPU machine with batch size 128. During training, we first pad four zero-value pixels to each image and randomly crop it to 32 × 32. Then, we horizontally flip the image with probability 0.5. Detailed model specifications are provided in Tables 2-12. (A training-setup sketch appears after this table.) |
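
The pseudocode summarized in the 'Pseudocode' row maps onto a short routine. Below is a minimal PyTorch-style sketch of one routing step, assuming the procedure as described there: votes are computed from lower-layer poses, agreements are dot products between higher-layer poses and votes, routing coefficients are a softmax over the higher-level capsules, and updated poses are a layer-normalized weighted sum of votes. The tensor layout, the zero initialization of the higher-layer poses, and the function name `inverted_dot_product_routing` are illustrative assumptions; the authors' actual implementation is in the linked repository.

```python
import torch
import torch.nn.functional as F

def inverted_dot_product_routing(pose_l, weight, num_iters=2):
    """Sketch of inverted dot-product attention routing (illustrative, not the authors' code).

    pose_l : (B, N_l, D_l)         poses of capsules in layer L
    weight : (N_l, N_h, D_l, D_h)  transformation weights between layers L and L+1
    returns: (B, N_h, D_h)         poses of capsules in layer L+1
    """
    B = pose_l.shape[0]
    N_h, D_h = weight.shape[1], weight.shape[3]

    # Votes: v_ij = W_ij p_i  ->  (B, N_l, N_h, D_h)
    votes = torch.einsum('bid,ijdh->bijh', pose_l, weight)

    # Initialize higher-level poses at zero, which makes the first routing pass uniform.
    pose_h = votes.new_zeros(B, N_h, D_h)

    for _ in range(num_iters):
        # Agreement: dot product between higher-level poses (queries) and votes.
        agreement = torch.einsum('bjh,bijh->bij', pose_h, votes)  # (B, N_l, N_h)

        # Routing coefficients: softmax over the higher-level capsules.
        routing = F.softmax(agreement, dim=-1)                    # (B, N_l, N_h)

        # Pose update: weighted sum of votes followed by layer normalization.
        pose_h = torch.einsum('bij,bijh->bjh', routing, votes)
        pose_h = F.layer_norm(pose_h, (D_h,))

    return pose_h
```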
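
The dataset counts in the 'Open Datasets' row (50,000 training and 10,000 evaluation images of size 32 × 32) match the standard CIFAR splits. A minimal loading sketch is below; the use of torchvision, the data path, and the plain ToTensor transform are assumptions for illustration, since the paper does not state its data-loading code.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# CIFAR-10: 50,000 training and 10,000 test images of size 32 x 32 (CIFAR-100 is analogous).
train_set = datasets.CIFAR10(root='./data', train=True, download=True,
                             transform=transforms.ToTensor())
test_set = datasets.CIFAR10(root='./data', train=False, download=True,
                            transform=transforms.ToTensor())

# Batch size 128 matches the training setup reported in the paper.
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = DataLoader(test_set, batch_size=128, shuffle=False)
```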
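
The optimizer, learning-rate schedule, and augmentation in the 'Experiment Setup' row translate into the sketch below. Only the learning rates (SGD with 0.1, or Adam with 0.001), the 10x decay at epochs 150 and 250 out of 350, batch size 128, 4-pixel zero padding with 32 × 32 random crop, and horizontal flip with probability 0.5 come from the paper; the momentum value, the stand-in model, and the loop body are placeholders.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR
from torchvision import transforms

# Augmentation as reported: pad 4 zero-value pixels, randomly crop to 32 x 32, flip with p=0.5.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # default padding fill is zero-valued pixels
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

model = torch.nn.Linear(3 * 32 * 32, 10)    # placeholder; the capsule model is specified in Tables 2-12

# SGD with learning rate 0.1 (the paper also reports Adam with learning rate 0.001).
# The momentum value here is an assumption, not stated in the quoted setup.
optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9)

# Decrease the learning rate by a factor of 10 at epochs 150 and 250; 350 epochs in total.
scheduler = MultiStepLR(optimizer, milestones=[150, 250], gamma=0.1)

for epoch in range(350):
    # ... one training epoch over batches of size 128 ...
    scheduler.step()
```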