OmniNet: Omnidirectional Representations from Transformers
Authors: Yi Tay, Mostafa Dehghani, Vamsi Aribandi, Jai Gupta, Philip M Pham, Zhen Qin, Dara Bahri, Da-Cheng Juan, Donald Metzler
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are conducted on autoregressive language modeling (LM1B, C4), Machine Translation, Long Range Arena (LRA), and Image Recognition. |
| Researcher Affiliation | Industry | ¹Google Research, Mountain View; ²Google Brain Team, Amsterdam; ³Google AI Resident. |
| Pseudocode | No | The paper does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for the described methodology. |
| Open Datasets | Yes | We use two large-scale datasets, language modeling one billion (LM1B) (Chelba et al., 2013) and the Colossal Cleaned Common Crawl corpus (C4) (Raffel et al., 2019). We use five collections/datasets from WMT-17... we pre-train our OmniNet models on the JFT dataset (Sun et al., 2017). |
| Dataset Splits | Yes | For both tasks, we use a max length of 256 subword tokens per example and report scores on subword perplexity on the validation set. We evaluate our models in the transfer setup (few-shot and fine-tuning) on several downstream tasks: ImageNet, CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), Oxford-IIIT Pets (Parkhi et al., 2012), and Oxford Flowers-102 (Nilsback & Zisserman, 2008). |
| Hardware Specification | Yes | For both tasks, we train all models for 30K steps for LM1B and 100K steps for C4 using 16 TPU-V3 chips. |
| Software Dependencies | No | Our implementation uses Flax (Heek et al., 2020) and Jax (Bradbury et al., 2018). While frameworks are mentioned, specific version numbers are not provided. |
| Experiment Setup | Yes | Models are of base size and have an embedding dimension of 512, 8 heads, 6 layers and hidden dimensions (MLP) of 2048. During pre-training, we use a batch size of 4096 using Adam with β1 = 0.9 and β2 = 0.999, and use a weight decay of 0.05 for OmniNet. We use a learning rate of 8e-4 with a linear decay and a linear warmup of 10K steps. A hedged configuration sketch follows the table. |
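
Since the paper states its implementation uses Flax and Jax, the quoted optimizer settings can be illustrated with a minimal Optax sketch. This is not the authors' released code; the total step count, the linear-warmup/linear-decay composition, and the function `make_optimizer` are assumptions inferred from the quoted text.

```python
# Minimal sketch of the quoted training configuration using Optax (JAX).
# NOT the authors' code; total_steps and the schedule composition are
# assumptions based on the quoted experiment setup.
import optax

# Reported base-size model hyperparameters (for reference only).
model_config = dict(emb_dim=512, num_heads=8, num_layers=6, mlp_dim=2048)

def make_optimizer(total_steps: int = 100_000,   # e.g. the 100K-step C4 setting
                   warmup_steps: int = 10_000,
                   peak_lr: float = 8e-4):
    """Linear warmup to peak_lr over warmup_steps, then linear decay to 0."""
    warmup = optax.linear_schedule(init_value=0.0, end_value=peak_lr,
                                   transition_steps=warmup_steps)
    decay = optax.linear_schedule(init_value=peak_lr, end_value=0.0,
                                  transition_steps=total_steps - warmup_steps)
    schedule = optax.join_schedules([warmup, decay], boundaries=[warmup_steps])
    # Adam with β1 = 0.9, β2 = 0.999 and decoupled weight decay of 0.05,
    # matching the hyperparameters quoted in the Experiment Setup row.
    return optax.adamw(learning_rate=schedule, b1=0.9, b2=0.999,
                       weight_decay=0.05)

optimizer = make_optimizer()
```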