OmniNet: Omnidirectional Representations from Transformers

Authors: Yi Tay, Mostafa Dehghani, Vamsi Aribandi, Jai Gupta, Philip M Pham, Zhen Qin, Dara Bahri, Da-Cheng Juan, Donald Metzler

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments are conducted on autoregressive language modeling (LM1B, C4), machine translation, Long Range Arena (LRA), and image recognition.
Researcher Affiliation | Industry | Google Research, Mountain View; Google Brain Team, Amsterdam; Google AI Resident.
Pseudocode | No | The paper does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for the described methodology.
Open Datasets | Yes | We use two large-scale datasets, language modeling one billion (LM1B) (Chelba et al., 2013) and the Colossal Cleaned Common Crawl corpus (C4) (Raffel et al., 2019). We use five collections/datasets from WMT-17... we pre-train our OmniNet models on the JFT dataset (Sun et al., 2017).
Dataset Splits | Yes | For both tasks, we use a max length of 256 subword tokens per example and report scores on subword perplexity on the validation set. We evaluate our models in the transfer setup (few-shot and fine-tuning) on several downstream tasks: ImageNet, CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), Oxford-IIIT Pets (Parkhi et al., 2012), and Oxford Flowers-102 (Nilsback & Zisserman, 2008).
Hardware Specification | Yes | For both tasks, we train all models for 30K steps for LM1B and 100K steps for C4 using 16 TPU-V3 chips.
Software Dependencies | No | Our implementation uses Flax (Heek et al., 2020) and Jax (Bradbury et al., 2018). While the frameworks are mentioned, specific version numbers are not provided.
Experiment Setup | Yes | Models are of base size and have an embedding dimension of 512, 8 heads, 6 layers, and hidden dimensions (MLP) of 2048. During pre-training, we use a batch size of 4096 using Adam with β1 = 0.9 and β2 = 0.999, and use a weight decay of 0.05 for OmniNet. We use a learning rate of 8e-4 with a linear decay and a linear warmup of 10K steps.
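
To make the reported training hyperparameters concrete, the following is a minimal Optax/JAX sketch (not the authors' code) of the described optimizer and learning-rate schedule. The 100K total-step count is borrowed from the C4 run above, and treating the 0.05 weight decay as decoupled (AdamW-style) is an assumption; the reported model dimensions are included only as a reference dictionary.

```python
# Minimal sketch of the reported pre-training optimizer, using Optax (JAX).
# Assumptions: 100K total steps (the C4 budget above) and decoupled
# (AdamW-style) weight decay; neither detail is stated explicitly in the text.
import optax

# Reported base-size model dimensions, for reference only.
BASE_CONFIG = {
    "embedding_dim": 512,
    "num_heads": 8,
    "num_layers": 6,
    "mlp_dim": 2048,
}

PEAK_LR = 8e-4          # reported learning rate
WARMUP_STEPS = 10_000   # reported linear warmup
TOTAL_STEPS = 100_000   # assumed: matches the C4 pre-training budget above

# Linear warmup from 0 to the peak learning rate, then linear decay back to 0.
lr_schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=PEAK_LR,
                              transition_steps=WARMUP_STEPS),
        optax.linear_schedule(init_value=PEAK_LR, end_value=0.0,
                              transition_steps=TOTAL_STEPS - WARMUP_STEPS),
    ],
    boundaries=[WARMUP_STEPS],
)

# Adam with the reported betas and a weight decay of 0.05.
optimizer = optax.adamw(learning_rate=lr_schedule, b1=0.9, b2=0.999,
                        weight_decay=0.05)
```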