Go Wider Instead of Deeper
Authors: Fuzhao Xue, Ziji Shi, Futao Wei, Yuxuan Lou, Yong Liu, Yang You
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate our plug-and-play framework, we design WideNet and conduct comprehensive experiments on popular computer vision and natural language processing benchmarks. On ImageNet-1K, our best model outperforms Vision Transformer (ViT) by 1.5% with 0.72× trainable parameters. |
| Researcher Affiliation | Academia | Department of Computer Science, National University of Singapore, Singapore {f.xue,ziji.shi}@u.nus.edu, weifutao2019@gmail.com, yuxuanlou@u.nus.edu, {liuyong,youy}@comp.nus.edu.sg |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | No | The paper does not provide a statement about releasing code or a link to a source code repository. |
| Open Datasets | Yes | We use ILSVRC-2012 ImageNet (Deng et al. 2009) and Cifar10 as platforms to evaluate our framework. (See the data-loading sketch after this table.) |
| Dataset Splits | No | While the paper mentions using a "development set" for GLUE, it does not provide specific split ratios or sample counts for training, validation, or test sets needed to reproduce the data partitioning. It also notes that fine-tuning follows the baselines' hyperparameters, implying standard splits, but no explicit details are given. |
| Hardware Specification | Yes | We pretrain our models on 256 TPUv3 cores. |
| Software Dependencies | Yes | We first reimplement ViT by Tensorflow 2.x and tune it to a reasonable performance. |
| Experiment Setup | Yes | For MoE based models (i.e., ViT-MoE and WideNet), we set the weight of load balance loss λ as 0.01. Without special instructions, we use 4 experts in total and Top 2 experts selected in each transformer block. The capacity ratio C is set as 1.2 for a trade-off between accuracy and speed. ... The learning rate is 0.00176, which is the same as ALBERT claimed (You et al. 2019a). During finetuning, we still follow (Dosovitskiy et al. 2020) and use SGD optimizer with momentum. Compared with pretraining on ImageNet-1K, label smoothing and warm-up are removed. (See the routing and optimizer sketches after this table.) |
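The Open Datasets row names ILSVRC-2012 ImageNet and Cifar10. Below is a minimal sketch of loading both through TensorFlow Datasets; the preprocessing (resize to 224 and rescale to [0, 1]), batch size, and `load_split` helper are illustrative assumptions, not the authors' pipeline, and `imagenet2012` requires the ILSVRC-2012 archives to be downloaded manually.

```python
# Hedged sketch (not from the paper): loading the two evaluation datasets via
# TensorFlow Datasets. "cifar10" downloads automatically; "imagenet2012" needs
# the ILSVRC-2012 tar files placed in the TFDS manual-download directory.
import tensorflow as tf
import tensorflow_datasets as tfds


def load_split(name: str, split: str, image_size: int = 224, batch_size: int = 128):
    ds = tfds.load(name, split=split, as_supervised=True)

    def preprocess(image, label):
        # Resize and rescale to [0, 1]; assumed preprocessing, not the paper's.
        image = tf.image.resize(image, (image_size, image_size))
        image = tf.cast(image, tf.float32) / 255.0
        return image, label

    return (ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
              .batch(batch_size)
              .prefetch(tf.data.AUTOTUNE))


# Example usage (splits are TFDS defaults, not the paper's partitioning):
# train_ds = load_split("imagenet2012", "train")
# val_ds   = load_split("imagenet2012", "validation")
# cifar_ds = load_split("cifar10", "train")
```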
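The Experiment Setup row fixes the MoE hyperparameters: 4 experts, top-2 selection per transformer block, capacity ratio C = 1.2, and load-balance loss weight λ = 0.01. The routing sketch below, written in TensorFlow 2.x (the framework the paper reports using), shows how those numbers would plug into a generic top-2 router with an auxiliary balance loss. It is an assumption-laden illustration, not the authors' code (none is released); the capacity formula, the Switch-style form of the balance loss, and the names `top2_route` and `router_weights` are all assumptions.

```python
import tensorflow as tf

NUM_EXPERTS = 4             # "4 experts in total"
TOP_K = 2                   # "Top 2 experts selected in each transformer block"
CAPACITY_RATIO = 1.2        # "capacity ratio C is set as 1.2"
BALANCE_LOSS_WEIGHT = 0.01  # "weight of load balance loss λ as 0.01"


def top2_route(token_features, router_weights):
    """Return top-2 expert indices, their gate values, the per-expert
    capacity, and the weighted auxiliary load-balance loss."""
    logits = tf.matmul(token_features, router_weights)            # [tokens, experts]
    gate_probs = tf.nn.softmax(logits, axis=-1)

    top_gates, top_experts = tf.math.top_k(gate_probs, k=TOP_K)   # [tokens, 2]

    # Per-expert buffer size; the exact formula is an assumption
    # (capacity ratio * routed token slots / number of experts).
    num_tokens = tf.cast(tf.shape(token_features)[0], tf.float32)
    capacity = tf.cast(
        tf.math.ceil(CAPACITY_RATIO * num_tokens * TOP_K / NUM_EXPERTS), tf.int32)

    # Switch/Shazeer-style load-balance loss: product of the fraction of
    # tokens routed to each expert and that expert's mean gate probability.
    routed_mask = tf.reduce_sum(tf.one_hot(top_experts, NUM_EXPERTS), axis=1)
    fraction_routed = tf.reduce_mean(routed_mask, axis=0)          # [experts]
    mean_gate_prob = tf.reduce_mean(gate_probs, axis=0)            # [experts]
    balance_loss = BALANCE_LOSS_WEIGHT * NUM_EXPERTS * tf.reduce_sum(
        fraction_routed * mean_gate_prob)

    return top_experts, top_gates, capacity, balance_loss


# Example: route 196 patch tokens of width 768 through a random router.
tokens = tf.random.normal([196, 768])
router = tf.random.normal([768, NUM_EXPERTS])
experts, gates, cap, aux_loss = top2_route(tokens, router)
```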
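The same row quotes the pretraining learning rate (0.00176, citing You et al. 2019a) and fine-tuning with SGD plus momentum, with label smoothing and warm-up removed. A minimal sketch of those optimizer choices follows, assuming the LAMB implementation from TensorFlow Addons for pretraining; the momentum value (0.9) and fine-tuning learning rate (0.01) are placeholders the paper excerpt does not state.

```python
import tensorflow as tf
import tensorflow_addons as tfa  # provides a LAMB optimizer implementation

# Pretraining: LAMB with the learning rate quoted in the paper.
pretrain_opt = tfa.optimizers.LAMB(learning_rate=0.00176)

# Fine-tuning: SGD with momentum; momentum 0.9 and lr 0.01 are assumptions.
finetune_opt = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

# Label smoothing is removed during fine-tuning, so plain cross-entropy applies.
finetune_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
```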