Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
OmniNet: Omnidirectional Representations from Transformers
Authors: Yi Tay, Mostafa Dehghani, Vamsi Aribandi, Jai Gupta, Philip M Pham, Zhen Qin, Dara Bahri, Da-Cheng Juan, Donald Metzler
ICML 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are conducted on autoregressive language modeling (LM1B, C4), Machine Translation, Long Range Arena (LRA), and Image Recognition. |
| Researcher Affiliation | Industry | ¹Google Research, Mountain View; ²Google Brain Team, Amsterdam; ³Google AI Resident. |
| Pseudocode | No | The paper does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for the described methodology. |
| Open Datasets | Yes | We use two large-scale datasets, language modeling one billion (LM1B) (Chelba et al., 2013) and the Colossal Cleaned Common Crawl corpus (C4) (Raffel et al., 2019). We use five collections/datasets from WMT-17... we pre-train our OmniNet models on the JFT dataset (Sun et al., 2017). |
| Dataset Splits | Yes | For both tasks, we use a max length of 256 subword tokens per example and report scores on subword perplexity on the validation set. We evaluate our models in the transfer setup (few-shot and fine-tuning) on several downstream tasks: ImageNet, CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), Oxford-IIIT Pets (Parkhi et al., 2012), and Oxford Flowers-102 (Nilsback & Zisserman, 2008). |
| Hardware Specification | Yes | For both tasks, we train all models for 30K for LM1b and 100K steps for C4 using 16 TPU-V3 Chips. |
| Software Dependencies | No | Our implementation uses Flax (Heek et al., 2020) and Jax (Bradbury et al., 2018). While frameworks are mentioned, specific version numbers are not provided. |
| Experiment Setup | Yes | Models are of base size and have an embedding dimension of 512, 8 heads, 6 layers and hidden dimensions (MLP) of 2048. During pre-training, we use a batch size of 4096 using Adam with β1 = 0.9 and β2 = 0.999, and use a weight decay of 0.05 for OmniNet. We use a learning rate of 8e-4 with a linear decay and a linear warmup of 10K steps. |
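
The Experiment Setup row can be read as a small configuration plus a learning-rate schedule. The sketch below is not the authors' code (none was released). It collects the quoted values in a hypothetical `SetupConfig` dataclass and implements one plausible reading of "linear warmup of 10K steps" followed by "linear decay". The names `SetupConfig`, `learning_rate`, and `total_steps` are illustrative, the row does not state the total number of pre-training steps, and decaying all the way to zero is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetupConfig:
    """Values quoted in the Experiment Setup row (field names are hypothetical)."""
    embed_dim: int = 512        # embedding dimension
    num_heads: int = 8          # attention heads
    num_layers: int = 6         # transformer layers
    mlp_dim: int = 2048         # hidden dimension of the MLP
    batch_size: int = 4096      # pre-training batch size
    adam_b1: float = 0.9        # Adam beta1
    adam_b2: float = 0.999      # Adam beta2
    weight_decay: float = 0.05  # weight decay used for OmniNet
    peak_lr: float = 8e-4       # learning rate reached at the end of warmup
    warmup_steps: int = 10_000  # linear warmup length


def learning_rate(step: int, cfg: SetupConfig, total_steps: int) -> float:
    """Linear warmup to cfg.peak_lr, then linear decay.

    Assumptions: total_steps is supplied by the caller (the source does not
    give it), and the decay ends at exactly zero at total_steps.
    """
    if total_steps <= cfg.warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < cfg.warmup_steps:
        # Warmup phase: scale linearly from 0 up to the peak rate.
        return cfg.peak_lr * step / cfg.warmup_steps
    # Decay phase: scale linearly from the peak rate down to 0.
    remaining = max(total_steps - step, 0)
    return cfg.peak_lr * remaining / (total_steps - cfg.warmup_steps)


if __name__ == "__main__":
    cfg = SetupConfig()
    # 100_000 is a placeholder horizon chosen only for this demonstration.
    for step in (0, 5_000, 10_000, 55_000, 100_000):
        print(step, learning_rate(step, cfg, total_steps=100_000))
```

With these values each attention head has 512 / 8 = 64 dimensions. Anyone attempting a reproduction would need to replace the placeholder `total_steps` with the actual training length from the paper.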