Go Wider Instead of Deeper

Authors: Fuzhao Xue, Ziji Shi, Futao Wei, Yuxuan Lou, Yong Liu, Yang You

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To evaluate our plug-and-run framework, we design WideNet and conduct comprehensive experiments on popular computer vision and natural language processing benchmarks. On ImageNet-1K, our best model outperforms Vision Transformer (ViT) by 1.5% with 0.72× trainable parameters."
Researcher Affiliation | Academia | "Department of Computer Science, National University of Singapore, Singapore. {f.xue,ziji.shi}@u.nus.edu, weifutao2019@gmail.com, yuxuanlou@u.nus.edu, {liuyong,youy}@comp.nus.edu.sg"
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper.
Open Source Code | No | The paper does not provide a statement about releasing code or a link to a source code repository.
Open Datasets | Yes | "We use ILSVRC-2012 ImageNet (Deng et al. 2009) and Cifar10 as platforms to evaluate our framework."
Dataset Splits | No | While the paper mentions using a "development set" for GLUE, it does not provide the split ratios or sample counts for training, validation, or test sets needed to reproduce the data partitioning. It also notes following the baselines' fine-tuning hyperparameters, implying standard splits, but gives no explicit details.
Hardware Specification | Yes | "We pretrain our models on 256 TPUv3 cores."
Software Dependencies | Yes | "We first reimplement ViT by Tensorflow 2.x and tune it to a reasonable performance."
Experiment Setup | Yes | "For MoE based models (i.e., ViT-MoE and WideNet), we set the weight of load balance loss λ as 0.01. Without special instructions, we use 4 experts in total and Top 2 experts selected in each transformer block. The capacity ratio C is set as 1.2 for a trade-off between accuracy and speed. ... The learning rate is 0.00176, which is the same as ALBERT claimed (You et al. 2019a). During finetuning, we still follow (Dosovitskiy et al. 2020) and use SGD optimizer with momentum. Compared with pretraining on ImageNet-1K, label smoothing and warm-up are removed."
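
The Experiment Setup row fully pins down the MoE routing configuration: 4 experts, Top-2 selection, capacity ratio C = 1.2, and a load-balance loss weighted by λ = 0.01. Since the paper releases no code, the sketch below is a minimal illustration of that configuration in TensorFlow 2.x (the framework the authors report using); the function name, the Shazeer-style form of the balance loss, and all variable names are assumptions, not the authors' implementation.

```python
import tensorflow as tf

# Hyperparameters quoted in the Experiment Setup row.
NUM_EXPERTS = 4        # total experts per MoE block
TOP_K = 2              # Top-2 expert selection
CAPACITY_RATIO = 1.2   # capacity ratio C
BALANCE_WEIGHT = 0.01  # load-balance loss weight (lambda)

def top2_routing(tokens, router_kernel):
    """Assumed Top-2 router: returns expert choices, gate values,
    an auxiliary load-balance loss, and the per-expert capacity.

    tokens:        [num_tokens, d_model] token representations
    router_kernel: [d_model, NUM_EXPERTS] trainable routing matrix
    """
    logits = tf.matmul(tokens, router_kernel)              # [T, E]
    gates = tf.nn.softmax(logits, axis=-1)                 # routing probabilities
    gate_vals, expert_idx = tf.math.top_k(gates, k=TOP_K)  # [T, TOP_K]

    # Balance loss: fraction of tokens whose top-1 choice is expert e,
    # times the mean gate mass on expert e, summed over experts and
    # scaled so a perfectly uniform router yields loss = BALANCE_WEIGHT.
    top1_mask = tf.one_hot(expert_idx[:, 0], NUM_EXPERTS)  # [T, E]
    token_frac = tf.reduce_mean(top1_mask, axis=0)         # [E]
    gate_frac = tf.reduce_mean(gates, axis=0)              # [E]
    balance_loss = BALANCE_WEIGHT * NUM_EXPERTS * tf.reduce_sum(
        token_frac * gate_frac)

    # Expert capacity: tokens routed past this budget would be dropped.
    num_tokens = tf.cast(tf.shape(tokens)[0], tf.float32)
    capacity = tf.cast(
        tf.math.ceil(CAPACITY_RATIO * num_tokens * TOP_K / NUM_EXPERTS),
        tf.int32)
    return expert_idx, gate_vals, balance_loss, capacity
```

For the fine-tuning recipe the row describes (SGD with momentum, label smoothing and warm-up removed), a matching optimizer would be along the lines of tf.keras.optimizers.SGD(learning_rate=lr, momentum=0.9); the momentum value is an assumption, as the quoted text does not state it, and the 0.00176 learning rate applies to pretraining rather than fine-tuning.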