MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers

Authors: Lili Yu, Daniel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, Mike Lewis

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments show that MEGABYTE allows byte-level models to perform competitively with subword models on long context language modeling, achieve state-of-the-art density estimation on ImageNet, and model audio from raw files."
Researcher Affiliation | Academia | Lili Yu, Dániel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, Mike Lewis. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).
Pseudocode | Yes | Appendix B ("Pseudocode"), Listing 1: "Pseudocode of Megabyte model".
Open Source Code | No | The paper states: "All models were trained using the Metaseq code base (Zhang et al., 2022b)", with a footnote linking to https://github.com/facebookresearch/metaseq. This refers to a third-party codebase the authors used, not an explicit release of their specific MEGABYTE implementation's source code.
Open Datasets | Yes | "We evaluated the performance of MEGABYTE on language modeling on a set of 5 diverse datasets emphasizing long-range dependencies: Project Gutenberg (PG-19), Books, Stories, arXiv, and Code." Also: "Datasets: We experiment on a range of long form text datasets. The PG-19 dataset (Rae et al., 2019b)..."
Dataset Splits | No | The paper reports validation results (e.g., the "Validation"/"Test" columns in Table 2) but does not describe how the training, validation, and test splits were created (percentages, sample counts, or split methodology).
Hardware Specification | No | The paper reports "GPU hours" but does not specify GPU models, CPU types, or any other hardware used to run the experiments.
Software Dependencies | No | The paper mentions the Metaseq code base, the PyTorch framework (Paszke et al., 2019), fairscale, and the Adam optimizer, but does not provide version numbers for PyTorch, Metaseq, or fairscale.
Experiment Setup | Yes | "We conduct experiments using a fixed compute and data budget across all models... All models were trained using the Metaseq code base... The training used the PyTorch framework... Mixed precision training was used... gradient clipping with a maximum norm of 1.0 and used the Adam optimizer with β1 = 0.9, β2 = 0.98... polynomial decay learning rate scheduler in Metaseq with 500 warmup updates and the end learning rate set to 0. All models are trained with pre-norm and using ReLU activation. We apply a dropout of 0.1 throughout, but we do not apply any dropout to embeddings. We also use weight decay of 0.1." Table 13 reports model size, embedding size (D), number of layers (L), total batch size (BS), learning rate (LR), and context length.
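The quoted setup pins the learning-rate schedule down precisely enough to sketch: polynomial decay with 500 linear warmup updates and an end learning rate of 0. Below is a minimal, dependency-free sketch of that schedule; the peak learning rate, total update count, and decay power used here are illustrative assumptions, not values from the paper (Table 13 lists the actual per-model learning rates).

```python
def polynomial_decay_lr(step, peak_lr, warmup=500, total_updates=100_000,
                        end_lr=0.0, power=1.0):
    """Polynomial-decay LR with linear warmup, as described in the paper's
    setup (500 warmup updates, end LR 0).

    Assumptions (not stated in the paper): total_updates and power=1.0
    (i.e., linear decay, the Metaseq-style default) are illustrative.
    """
    if step < warmup:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / warmup
    # Fraction of the decay phase still remaining, clamped at 0.
    remaining = max((total_updates - step) / (total_updates - warmup), 0.0)
    return end_lr + (peak_lr - end_lr) * remaining ** power
```

For example, with a (hypothetical) peak LR of 6e-4, the rate ramps linearly to 6e-4 over the first 500 updates, then decays linearly back to 0 by the final update.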