High-Capacity Expert Binary Networks
Authors: Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Overall, our method improves upon prior work, with no increase in computational cost, by 6%, reaching a groundbreaking 71% on ImageNet classification. Fig. 1b confirms this experimentally by t-SNE embedding visualisation of the features before the classifier along with the corresponding expert that was activated for each sample of the ImageNet validation set. |
| Researcher Affiliation | Collaboration | Adrian Bulat, Samsung AI Cambridge, adrian@adrianbulat.com; Brais Martinez, Samsung AI Cambridge, brais.a@samsung.com; Georgios Tzimiropoulos, Samsung AI Cambridge and Queen Mary University of London, UK, g.tzimiropoulos@qmul.ac.uk |
| Pseudocode | No | Overall, our optimization policy can be summarized as follows: 1. Train one expert, parametrized by θ0, using real weights and binary activations. 2. Replicate θ0 to all θi, i ∈ {1, ..., N−1}, to initialize matrix Θ. 3. Train the model initialized in step 2 using real weights and binary activations. 4. Train the model obtained from step 3 using binary weights and activations. This is a descriptive list of steps, not pseudocode or an algorithm block. (A hedged code sketch of these steps appears below the table.) |
| Open Source Code | No | Code will be made available here. |
| Open Datasets | Yes | We compared our method against the current state-of-the-art in binary networks on the ImageNet dataset (Deng et al., 2009). Additional comparisons, including on CIFAR-100 (Krizhevsky et al., 2009), can be found in the supplementary material in Section A.2. |
| Dataset Splits | Yes | Fig. 1b confirms this experimentally by t-SNE embedding visualisation of the features before the classifier along with the corresponding expert that was activated for each sample of the ImageNet validation set. The images are augmented following the common strategy used in prior work (He et al., 2016) by randomly scaling and cropping the images to a resolution of 224 × 224px. (A sketch of this augmentation appears below the table.) |
| Hardware Specification | Yes | All models were trained on 4 V100 GPUs and implemented using PyTorch (Paszke et al., 2019). |
| Software Dependencies | No | All models were trained on 4 V100 GPUs and implemented using PyTorch (Paszke et al., 2019). The mention of 'PyTorch' lacks a specific version number. |
| Experiment Setup | Yes | The training procedure largely follows that of Martinez et al. (2020). In particular, we trained our networks using the Adam optimizer (Kingma & Ba, 2014) for 75 epochs using a learning rate of 10⁻³ that is decreased by 10 at epochs 40, 55 and 65. During Stage I, we set the weight decay to 10⁻⁵ and to 0 during Stage II. Furthermore, following Martinez et al. (2020), during the first 10 epochs, we apply a learning rate warm-up (Goyal et al., 2017). (A sketch of this schedule appears below the table.) |
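
The optimization policy quoted in the Pseudocode row is a plain list of steps. As a minimal, hypothetical illustration of its replication step (step 2), the sketch below clones one trained expert into an N-expert bank; the `nn.Conv2d` stand-in and `num_experts = 4` are assumptions for illustration, not the authors' implementation.

```python
import copy

import torch
import torch.nn as nn

# Step 1 (assumed already done): one expert trained with real-valued weights
# and binary activations; a plain Conv2d stands in for that expert here.
single_expert = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Step 2: replicate theta_0 to theta_1, ..., theta_{N-1} to initialise the bank.
num_experts = 4  # illustrative value, not taken from the paper
expert_bank = nn.ModuleList(copy.deepcopy(single_expert) for _ in range(num_experts))

# All experts start from identical weights, as the policy prescribes.
assert all(torch.equal(e.weight, single_expert.weight) for e in expert_bank)

# Steps 3-4 would then train this bank first with real weights / binary
# activations (Stage I) and finally with binary weights and activations (Stage II).
```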
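
For the augmentation quoted in the Dataset Splits row, a minimal sketch using standard torchvision transforms is given below; `RandomResizedCrop(224)` is an assumed stand-in for the described random scaling and cropping, not the authors' exact pipeline.

```python
from torchvision import transforms

# Random scaling and cropping to 224 x 224 px, roughly following He et al. (2016).
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])

# Usage: pass `train_transform` to an ImageNet-style dataset, e.g.
# torchvision.datasets.ImageFolder(train_dir, transform=train_transform).
```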
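
The schedule quoted in the Experiment Setup row can be mirrored with standard PyTorch utilities. The sketch below is an assumption-laden illustration: the placeholder `model`, the linear warm-up shape, and the omitted data loop are not from the paper, while the learning rate, milestones, epoch count and Stage I weight decay follow the quoted values.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 10)  # placeholder for the binary network

# Adam, lr 1e-3, weight decay 1e-5 for Stage I (set weight_decay=0 for Stage II).
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

# Learning rate decreased by 10x at epochs 40, 55 and 65 (75 epochs in total).
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[40, 55, 65], gamma=0.1)

warmup_epochs, base_lr, total_epochs = 10, 1e-3, 75
for epoch in range(total_epochs):
    # Warm-up over the first 10 epochs; the linear shape is an assumption.
    if epoch < warmup_epochs:
        for group in optimizer.param_groups:
            group["lr"] = base_lr * (epoch + 1) / warmup_epochs
    # ... forward/backward passes over the training data would go here ...
    scheduler.step()
```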