OmniNet: Omnidirectional Representations from Transformers

Authors: Yi Tay, Mostafa Dehghani, Vamsi Aribandi, Jai Gupta, Philip M Pham, Zhen Qin, Dara Bahri, Da-Cheng Juan, Donald Metzler

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments are conducted on autoregressive language modeling (LM1B, C4), machine translation, Long Range Arena (LRA), and image recognition.
Researcher Affiliation | Industry | Google Research, Mountain View; Google Brain Team, Amsterdam; Google AI Resident.
Pseudocode | No | The paper does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for the described methodology.
Open Datasets | Yes | We use two large-scale datasets, language modeling one billion (LM1B) (Chelba et al., 2013) and the Colossal Cleaned Common Crawl corpus (C4) (Raffel et al., 2019). We use five collections/datasets from WMT-17... we pre-train our OmniNet models on the JFT dataset (Sun et al., 2017).
Dataset Splits | Yes | For both tasks, we use a max length of 256 subword tokens per example and report scores on subword perplexity on the validation set. We evaluate our models in the transfer setup (few-shot and fine-tuning) on several downstream tasks: ImageNet, CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), Oxford-IIIT Pets (Parkhi et al., 2012), and Oxford Flowers-102 (Nilsback & Zisserman, 2008).
Hardware Specification | Yes | For both tasks, we train all models for 30K steps for LM1B and 100K steps for C4 using 16 TPU-V3 chips.
Software Dependencies | No | Our implementation uses Flax (Heek et al., 2020) and Jax (Bradbury et al., 2018). While the frameworks are mentioned, specific version numbers are not provided.
Experiment Setup | Yes | Models are of base size and have an embedding dimension of 512, 8 heads, 6 layers, and hidden dimensions (MLP) of 2048. During pre-training, we use a batch size of 4096 using Adam with β1 = 0.9 and β2 = 0.999, and use a weight decay of 0.05 for OmniNet. We use a learning rate of 8e-4 with a linear decay and a linear warmup of 10K steps.
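
To make the reported training hyperparameters concrete, the following is a minimal Optax/JAX sketch (not the authors' code) of the described optimizer and learning-rate schedule. The 100K total-step count is borrowed from the C4 run above, and treating the 0.05 weight decay as decoupled (AdamW-style) is an assumption; the reported model dimensions are included only as a reference dictionary.

```python
# Minimal sketch of the reported pre-training optimizer, using Optax (JAX).
# Assumptions: 100K total steps (the C4 budget above) and decoupled
# (AdamW-style) weight decay; neither detail is stated explicitly in the text.
import optax

# Reported base-size model dimensions, for reference only.
BASE_CONFIG = {
    "embedding_dim": 512,
    "num_heads": 8,
    "num_layers": 6,
    "mlp_dim": 2048,
}

PEAK_LR = 8e-4          # reported learning rate
WARMUP_STEPS = 10_000   # reported linear warmup
TOTAL_STEPS = 100_000   # assumed: matches the C4 pre-training budget above

# Linear warmup from 0 to the peak learning rate, then linear decay back to 0.
lr_schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=PEAK_LR,
                              transition_steps=WARMUP_STEPS),
        optax.linear_schedule(init_value=PEAK_LR, end_value=0.0,
                              transition_steps=TOTAL_STEPS - WARMUP_STEPS),
    ],
    boundaries=[WARMUP_STEPS],
)

# Adam with the reported betas and a weight decay of 0.05.
optimizer = optax.adamw(learning_rate=lr_schedule, b1=0.9, b2=0.999,
                        weight_decay=0.05)
```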