MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
Authors: Lili Yu, Dániel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, Mike Lewis
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that MEGABYTE allows byte-level models to perform competitively with subword models on long context language modeling, achieve state-of-the-art density estimation on ImageNet, and model audio from raw files. |
| Researcher Affiliation | Academia | Lili Yu, Dániel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, Mike Lewis. 37th Conference on Neural Information Processing Systems (NeurIPS 2023). |
| Pseudocode | Yes | Appendix B (Pseudocode), Listing 1: Pseudocode of Megabyte model (a hedged architecture sketch appears after this table) |
| Open Source Code | No | The paper states: "All models were trained using the Metaseq code base (Zhang et al., 2022b)." and provides a footnote linking to https://github.com/facebookresearch/metaseq. This refers to a third-party codebase they used, not an explicit release of their specific MEGABYTE implementation's source code. |
| Open Datasets | Yes | We evaluated the performance of MEGABYTE on language modeling on a set of 5 diverse datasets emphasizing long-range dependencies: Project Gutenberg (PG-19), Books, Stories, arXiv, and Code. Datasets: We experiment on a range of long-form text datasets. The PG-19 dataset (Rae et al., 2019b)... |
| Dataset Splits | No | The paper discusses validation results (e.g., "Validation Test" column in Table 2) but does not provide specific details on how the training, validation, and test splits were created (e.g., percentages, sample counts, or methodology for creating splits). |
| Hardware Specification | No | The paper mentions "GPU hours" but does not specify any particular GPU models, CPU types, or other hardware components used for running the experiments. |
| Software Dependencies | No | The paper mentions using "the Metaseq code base", the "PyTorch framework (Paszke et al., 2019)", "fairscale", and the "Adam optimizer". However, it does not provide specific version numbers for PyTorch, Metaseq, or fairscale. |
| Experiment Setup | Yes | We conduct experiments using a fixed compute and data budget across all models... All models were trained using the Metaseq code base... The training used the PyTorch framework... Mixed precision training was used... gradient clipping with a maximum norm of 1.0 and used the Adam optimizer with β1 = 0.9, β2 = 0.98... polynomial decay learning rate scheduler in Metaseq with 500 warmup updates and the end learning rate set to 0. All models are trained with pre-norm and using ReLU activation. We apply a dropout of 0.1 throughout, but we do not apply any dropout to embeddings. We also use weight decay of 0.1. Table 13: Model architecture details. We report the model size, the embedding size (D), number of layers (L), total batch size (BS), learning rate (LR), and context length. (Hedged sketches of the architecture and this training configuration follow the table.) |
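
The Pseudocode row references Listing 1 in Appendix B, but no released implementation is linked. The following is a minimal, hedged PyTorch sketch of the multiscale structure that listing describes (byte embeddings concatenated into patch embeddings, a global transformer over patches, and a small local transformer over bytes within each patch). All module names, dimensions, and layer counts here are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of MEGABYTE's multiscale decoder, assuming standard PyTorch
# modules. Dimensions, layer counts, and module names are illustrative; the
# input shifting/padding the paper uses for strict autoregressive training is
# omitted for brevity.
import torch
import torch.nn as nn


class MegabyteSketch(nn.Module):
    def __init__(self, vocab_size=256, patch_size=8, d_global=512, d_local=128,
                 n_global_layers=4, n_local_layers=2):
        super().__init__()
        self.patch_size = patch_size
        self.byte_embed = nn.Embedding(vocab_size, d_local)
        # Global model: one position per patch, formed by concatenating byte embeddings.
        self.to_global = nn.Linear(patch_size * d_local, d_global)
        g_layer = nn.TransformerEncoderLayer(d_global, nhead=8, activation="relu",
                                             norm_first=True, batch_first=True)
        self.global_model = nn.TransformerEncoder(g_layer, n_global_layers)
        # Local model: predicts bytes within a patch, conditioned on the global output.
        self.from_global = nn.Linear(d_global, patch_size * d_local)
        l_layer = nn.TransformerEncoderLayer(d_local, nhead=4, activation="relu",
                                             norm_first=True, batch_first=True)
        self.local_model = nn.TransformerEncoder(l_layer, n_local_layers)
        self.lm_head = nn.Linear(d_local, vocab_size)

    def forward(self, byte_ids):
        # byte_ids: (batch, seq_len) with seq_len divisible by patch_size.
        b, t = byte_ids.shape
        p = self.patch_size
        x = self.byte_embed(byte_ids)                          # (b, t, d_local)
        patches = x.view(b, t // p, p * x.size(-1))            # concat bytes per patch
        g_mask = nn.Transformer.generate_square_subsequent_mask(t // p)
        g_out = self.global_model(self.to_global(patches), mask=g_mask)
        # Broadcast each patch's global state back to its bytes, then run the
        # much smaller local model causally within every patch in parallel.
        l_in = self.from_global(g_out).view(b * (t // p), p, -1) \
               + x.reshape(b * (t // p), p, -1)
        l_mask = nn.Transformer.generate_square_subsequent_mask(p)
        l_out = self.local_model(l_in, mask=l_mask)
        return self.lm_head(l_out).view(b, t, -1)              # per-byte logits
```

For instance, `MegabyteSketch()(torch.randint(0, 256, (1, 1024)))` would return per-byte logits of shape (1, 1024, 256), with the expensive global attention applied only over 128 patch positions rather than 1,024 bytes.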
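
The Experiment Setup row quotes the optimizer and regularization settings. Below is a hedged sketch of how those quoted values (Adam with β1 = 0.9, β2 = 0.98, gradient clipping at 1.0, weight decay 0.1, polynomial decay to an end learning rate of 0 after 500 warmup updates) might be wired together in PyTorch. The peak learning rate, total update count, and decay power are placeholders: the paper reports per-model values in Table 13 and uses Metaseq's scheduler rather than this hand-rolled one.

```python
# Hedged training-configuration sketch using the hyperparameters quoted above.
# peak_lr, total_updates, and the linear decay power are assumptions; the paper
# relies on Metaseq's polynomial decay scheduler and per-model values (Table 13).
import torch

model = MegabyteSketch()                                     # sketch model from above
peak_lr, warmup_updates, total_updates = 2e-4, 500, 80_000   # illustrative only

optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr,
                             betas=(0.9, 0.98), weight_decay=0.1)

def polynomial_decay(step):
    # Linear warmup for 500 updates, then decay toward an end learning rate of 0.
    if step < warmup_updates:
        return step / max(1, warmup_updates)
    remaining = (total_updates - step) / max(1, total_updates - warmup_updates)
    return max(0.0, remaining)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, polynomial_decay)

# Per update, after computing the loss:
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip at 1.0
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```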