Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length

Authors: Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, Chunting Zhou

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In a controlled head-to-head comparison with LLAMA2, MEGALODON achieves better efficiency than Transformers at the scale of 7 billion parameters and 2 trillion training tokens. MEGALODON reaches a training loss of 1.70, landing mid-way between LLAMA2-7B (1.75) and LLAMA2-13B (1.67). The improvements of MEGALODON over Transformers are robust across a range of benchmarks spanning different tasks and modalities.
Researcher Affiliation | Collaboration | AI at Meta; University of Southern California; Carnegie Mellon University; University of California San Diego
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/XuezheMax/megalodon
Open Datasets | Yes | We use the same mix of publicly available data as LLAMA2, ensuring that the model is trained on exactly the same 2 trillion tokens. We also use the same tokenizer as LLAMA2, whose vocabulary size is 32K.
Dataset Splits | Yes | To demonstrate MEGALODON's capability to make use of very long contexts to improve next-token prediction, we begin by evaluating validation perplexity at different context lengths. Concretely, we construct a validation dataset of 1,920 selected books, each containing sequences of at least 2M tokens. The validation dataset is constructed by first randomly shuffling all the files and then concatenating them (a minimal construction sketch follows the table).
Hardware Specification | Yes | The global batch size is 4M tokens, distributed across 256 NVIDIA A100 GPUs (16K tokens per A100).
Software Dependencies | No | The paper mentions 'Flash-Attention V2 (Dao, 2024)' and 'PyTorch (Paszke et al., 2019)' but does not provide specific version numbers for general software dependencies (e.g., Python, PyTorch).
Experiment Setup | Yes | In our MEGALODON-7B model, we adopt most architectural hyperparameters from LLAMA2-7B to ensure a fair comparison: MEGALODON-7B consists of 32 blocks with feature dimension d = 4096. Following LLAMA2, we use the SwiGLU activation function (Shazeer, 2020) in the feed-forward layer and rotary positional embedding (RoPE; Su et al., 2021). We set the attention chunk size c = 4096, the same as the pretraining context length of LLAMA2. Benefiting from the attention gate (γ in Eq. (18)), we use a much smaller number of attention heads, h = 4, in MEGALODON-7B, compared to h = 32 in LLAMA2-7B. In addition, we apply pre-norm with two-hop residual (§3.4), using Timestep Normalization (§3.2) and Layer Normalization (Ba et al., 2016), while LLAMA2 models apply pre-normalization with RMSNorm (Zhang and Sennrich, 2019). We trained MEGALODON-7B using the AdamW optimizer (Loshchilov and Hutter, 2019) with β1 = 0.9, β2 = 0.95, ϵ = 1e-8. The learning rate is 3.5e-4 with a cosine learning rate schedule and a warmup of 2,500 steps. We use a weight decay of 0.1 and gradient clipping of 1.0, and no dropout is applied during training. The context length in pretraining is 32K (4 attention chunks). The global batch size is 4M tokens, distributed across 256 NVIDIA A100 GPUs (16K tokens per A100). We set the data parallel size to 128, the chunk parallel size to 2, and the tensor parallel size to 1. An optimizer and schedule sketch based on these settings follows the table.
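
The long-context validation construction quoted in the Dataset Splits row can be outlined in a few lines of code. The following is a minimal sketch, not the authors' released pipeline: the directory name, the `*.tokens` file format, and the evaluation context lengths are assumptions made for illustration.

```python
import random
from pathlib import Path

# Hypothetical layout: one pre-tokenized file per book. The paper only states
# that 1,920 selected books are shuffled and concatenated, and that perplexity
# is then evaluated at different context lengths.
BOOK_DIR = Path("data/long_context_books")
CONTEXT_LENGTHS = [4_096, 32_768, 262_144, 2_097_152]  # assumed evaluation points

def build_validation_stream(seed: int = 0) -> list[int]:
    """Shuffle the book files, then concatenate their token ids into one stream."""
    files = sorted(BOOK_DIR.glob("*.tokens"))
    random.Random(seed).shuffle(files)
    stream: list[int] = []
    for f in files:
        # each file is assumed to contain whitespace-separated token ids
        stream.extend(int(t) for t in f.read_text().split())
    return stream

def chunk_for_eval(stream: list[int], context_length: int) -> list[list[int]]:
    """Split the stream into non-overlapping contexts of the requested length;
    validation perplexity is then averaged over these chunks."""
    n_chunks = len(stream) // context_length
    return [stream[i * context_length:(i + 1) * context_length] for i in range(n_chunks)]
```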
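
The optimization recipe in the Experiment Setup row maps onto a standard AdamW plus warmup-cosine schedule. Below is a minimal PyTorch sketch under the reported settings; the stand-in `model`, the placeholder loss, the decay-to-zero cosine floor, and the derived total step count (2T training tokens / 4M tokens per step) are assumptions, and the data/chunk/tensor parallelism of the actual run is omitted.

```python
import math
import torch

# Reported settings: AdamW with β1 = 0.9, β2 = 0.95, ϵ = 1e-8, peak LR 3.5e-4,
# cosine schedule with 2,500 warmup steps, weight decay 0.1, gradient clip 1.0.
PEAK_LR, WARMUP_STEPS = 3.5e-4, 2_500
TOTAL_STEPS = 500_000  # derived: 2T training tokens / 4M tokens per global batch

# Batch arithmetic implied by the paper: 4M tokens per step at a 32K context
# = 128 sequences per step (data parallel 128); with chunk parallel 2 this
# occupies 256 A100 GPUs, i.e. 16K tokens per GPU.

model = torch.nn.Linear(4096, 4096)  # stand-in for the 7B MEGALODON model

optimizer = torch.optim.AdamW(
    model.parameters(), lr=PEAK_LR, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1
)

def lr_lambda(step: int) -> float:
    """Linear warmup for 2,500 steps, then cosine decay (to zero, assumed)."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

def train_step(batch: torch.Tensor) -> None:
    loss = model(batch).pow(2).mean()  # placeholder loss for illustration
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping of 1.0
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```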