High-Capacity Expert Binary Networks

Authors: Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Overall, our method improves upon prior work, with no increase in computational cost, by 6%, reaching a groundbreaking 71% on ImageNet classification. Fig. 1b confirms this experimentally by t-SNE embedding visualisation of the features before the classifier along with the corresponding expert that was activated for each sample of the ImageNet validation set. (See the t-SNE sketch after the table.)
Researcher Affiliation | Collaboration | Adrian Bulat, Samsung AI Cambridge, adrian@adrianbulat.com; Brais Martinez, Samsung AI Cambridge, brais.a@samsung.com; Georgios Tzimiropoulos, Samsung AI Cambridge and Queen Mary University of London, UK, g.tzimiropoulos@qmul.ac.uk
Pseudocode | No | Overall, our optimization policy can be summarized as follows: 1. Train one expert, parametrized by θ0, using real weights and binary activations. 2. Replicate θ0 to all θi, i ∈ {1, ..., N−1}, to initialize the matrix Θ. 3. Train the model initialized in step 2 using real weights and binary activations. 4. Train the model obtained from step 3 using binary weights and activations. This is a descriptive list of steps, not pseudocode or an algorithm block. (A code sketch of this staged policy follows the table.)
Open Source Code | No | Code will be made available here.
Open Datasets | Yes | We compared our method against the current state-of-the-art in binary networks on the ImageNet dataset (Deng et al., 2009). Additional comparisons, including on CIFAR-100 (Krizhevsky et al., 2009), can be found in the supplementary material in Section A.2.
Dataset Splits | Yes | Fig. 1b confirms this experimentally by t-SNE embedding visualisation of the features before the classifier along with the corresponding expert that was activated for each sample of the ImageNet validation set. The images are augmented following the common strategy used in prior work (He et al., 2016) by randomly scaling and cropping the images to a resolution of 224×224px. (See the augmentation sketch after the table.)
Hardware Specification | Yes | All models were trained on 4 V100 GPUs and implemented using PyTorch (Paszke et al., 2019).
Software Dependencies | No | All models were trained on 4 V100 GPUs and implemented using PyTorch (Paszke et al., 2019). The mention of 'PyTorch' lacks a specific version number.
Experiment Setup | Yes | The training procedure largely follows that of Martinez et al. (2020). In particular, we trained our networks using the Adam optimizer (Kingma & Ba, 2014) for 75 epochs, using a learning rate of 10^-3 that is decreased by a factor of 10 at epochs 40, 55 and 65. During Stage I, we set the weight decay to 10^-5, and to 0 during Stage II. Furthermore, following Martinez et al. (2020), during the first 10 epochs we apply a learning rate warm-up (Goyal et al., 2017). (See the optimizer/schedule sketch after the table.)
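
The t-SNE visualisation referenced in the "Research Type" and "Dataset Splits" rows (Fig. 1b: pre-classifier features of ImageNet validation samples, coloured by the expert that was activated) can be reproduced in outline with scikit-learn and matplotlib. This is a minimal sketch under the assumption that the features and per-sample expert indices have already been extracted from the model; none of the names below come from the authors' code.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_expert_tsne(features: np.ndarray, expert_ids: np.ndarray) -> None:
    """features: (num_samples, dim) pre-classifier features of validation samples;
    expert_ids: (num_samples,) index of the expert activated for each sample.
    Both arrays are assumed to have been collected from the model beforehand."""
    embedding = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.figure(figsize=(6, 6))
    plt.scatter(embedding[:, 0], embedding[:, 1], c=expert_ids, s=2, cmap="tab10")
    plt.title("t-SNE of pre-classifier features, coloured by activated expert")
    plt.colorbar(label="expert index")
    plt.show()
```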
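
The four-step optimization policy quoted in the "Pseudocode" row can be expressed as code. The sketch below only illustrates the staged schedule (real weights and binary activations for a single expert, replication of θ0 across experts, then full binarization); `model_factory`, `train_one_stage`, `load_expert_weights` and `set_binary_weights` are hypothetical placeholders, not the authors' API.

```python
import copy

def train_expert_binary_network(model_factory, train_one_stage, num_experts):
    # Step 1: train one expert (theta_0) with real-valued weights and binary activations.
    single = model_factory(num_experts=1, binary_weights=False)   # hypothetical factory
    train_one_stage(single)
    theta_0 = copy.deepcopy(single.state_dict())

    # Step 2: replicate theta_0 to all experts to initialize the matrix Theta.
    multi = model_factory(num_experts=num_experts, binary_weights=False)
    for i in range(num_experts):
        multi.load_expert_weights(expert_idx=i, weights=theta_0)  # hypothetical helper

    # Step 3: train the replicated model, still with real weights and binary activations.
    train_one_stage(multi)

    # Step 4: switch to binary weights as well and train again.
    multi.set_binary_weights(True)                                # hypothetical toggle
    train_one_stage(multi)
    return multi
```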
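
The augmentation described in the "Dataset Splits" row (random scaling and cropping to 224×224px, following He et al. (2016)) matches the standard ImageNet recipe. A hedged torchvision sketch is given below; the horizontal flip, normalization statistics and validation-side preprocessing are conventional assumptions, not details quoted from the paper.

```python
from torchvision import transforms

# ImageNet normalization statistics (assumed; not stated in the excerpt).
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

# Training-time augmentation: random scaling and cropping to 224x224px.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),  # common companion augmentation (assumed)
    transforms.ToTensor(),
    normalize,
])

# Validation-side preprocessing typically paired with this recipe (assumed).
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize,
])
```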
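
The optimizer and schedule quoted in the "Experiment Setup" row map directly onto standard PyTorch components. The sketch below wires up Adam with a learning rate of 10^-3, a step decay by 10 at epochs 40, 55 and 65, the per-stage weight decay (10^-5 in Stage I, 0 in Stage II), and a warm-up over the first 10 epochs; the linear ramp shape and the helper names are assumptions, since the excerpt only says a warm-up is applied.

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR

def build_optimizer_and_scheduler(model, stage=1, base_lr=1e-3):
    # Stage I: weight decay 1e-5; Stage II: weight decay 0 (as quoted above).
    weight_decay = 1e-5 if stage == 1 else 0.0
    optimizer = Adam(model.parameters(), lr=base_lr, weight_decay=weight_decay)
    # Learning rate decreased by a factor of 10 at epochs 40, 55 and 65 (75 epochs total).
    scheduler = MultiStepLR(optimizer, milestones=[40, 55, 65], gamma=0.1)
    return optimizer, scheduler

def apply_warmup(optimizer, epoch, base_lr=1e-3, warmup_epochs=10):
    # Warm-up over the first 10 epochs, in the spirit of Goyal et al. (2017);
    # the exact (linear) ramp is an assumption.
    if epoch < warmup_epochs:
        lr = base_lr * (epoch + 1) / warmup_epochs
        for group in optimizer.param_groups:
            group["lr"] = lr
```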