Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention

Authors: Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, Yiran Zhong

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct rigorous testing on standard and self-collected datasets with varying model sizes and sequence lengths.
Researcher Affiliation | Collaboration | 1TapTap, 2OpenNLPLab, Shanghai AI Lab. Correspondence to: Yiran Zhong <zhongyiran@gmail.com>.
Pseudocode | Yes | Algorithm 1: Linear Attention Left Product (a reference sketch of the left-product form is given after the table).
Open Source Code | Yes | The source code is released at github.com/OpenNLPLab/TransnormerLLM.
Open Datasets | Yes | TNL records the lowest perplexity on the test set after being trained on the Wikitext-103 dataset.
Dataset Splits | Yes | Table 1. Results on Wikitext-103 (TNN (Qin et al., 2023a)'s setting); ↓ means lower is better. Columns: Model, PPL (val) ↓, PPL (test) ↓, Params (M). (A perplexity computation sketch is given after the table.)
Hardware Specification | Yes | All the experiments were conducted on A100 80G GPU clusters.
Software Dependencies | No | The paper mentions software components like "Metaseq framework," "Pytorch," and "Triton" but does not specify their version numbers, which are necessary for full reproducibility.
Experiment Setup | Yes | We conduct rigorous testing on standard and self-collected datasets with varying model sizes and sequence lengths. We also scaled up our model to 1B and 3B parameters and compared its training loss with top-tier LLM structures... all evaluation results being conducted with a 5-shot setup.
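
The pseudocode row refers to Algorithm 1, the left-product form of causal linear attention, O = [(QK^T) ⊙ M]V with M the lower-triangular causal mask. The PyTorch sketch below is a minimal reference for that formulation only; it is not the paper's Lightning Attention Triton kernel, which tiles the computation to avoid materializing the n × n score matrix. The function name and tensor shapes are illustrative assumptions, not taken from the released code.

```python
import torch

def linear_attention_left_product(q, k, v):
    """Reference left-product causal linear attention: O = [(Q K^T) * M] V.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    This materializes the full (seq_len x seq_len) score matrix, so it is
    O(n^2 d); tiled kernels such as Lightning Attention avoid this cost.
    """
    n = q.shape[-2]
    # Lower-triangular causal mask M.
    mask = torch.tril(torch.ones(n, n, dtype=q.dtype, device=q.device))
    scores = torch.matmul(q, k.transpose(-2, -1)) * mask  # (b, h, n, n)
    return torch.matmul(scores, v)                        # (b, h, n, d)

# Usage with random inputs.
b, h, n, d = 2, 4, 128, 64
q, k, v = (torch.randn(b, h, n, d) for _ in range(3))
out = linear_attention_left_product(q, k, v)
print(out.shape)  # torch.Size([2, 4, 128, 64])
```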
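
The dataset rows report perplexity (PPL) on the Wikitext-103 validation and test splits. For reference, perplexity is the exponential of the mean token-level cross-entropy; the sketch below shows this standard computation with hypothetical function names and random placeholder tensors, not the paper's evaluation code.

```python
import math
import torch
import torch.nn.functional as F

def perplexity(logits, targets, ignore_index=-100):
    """Perplexity = exp(mean negative log-likelihood in nats).

    logits: (num_tokens, vocab_size), targets: (num_tokens,).
    Tokens equal to ignore_index (e.g. padding) are excluded from the mean.
    """
    nll = F.cross_entropy(logits, targets,
                          ignore_index=ignore_index, reduction="mean")
    return math.exp(nll.item())

# Usage with random data standing in for a model's validation outputs.
vocab = 50257
logits = torch.randn(1024, vocab)
targets = torch.randint(0, vocab, (1024,))
print(perplexity(logits, targets))
```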