Block Transformer: Global-to-Local Language Modeling for Fast Inference

Authors: Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We pretrain vanilla and Block Transformers from scratch and demonstrate that Block Transformers achieve 10-20x the inference throughput of vanilla transformers with equivalent perplexity and zero-shot task performance.
Researcher Affiliation | Collaboration | Namgyu Ho (1,2), Sangmin Bae (1), Taehyeon Kim (1), Hyunjik Jo (2), Yireun Kim (2), Tal Schuster (3), Adam Fisch (3), James Thorne (1), Se-Young Yun (1). Affiliations: (1) KAIST AI, (2) LG AI Research, (3) Google DeepMind.
Pseudocode | No | The paper describes the architecture and mechanisms in prose and diagrams but does not contain structured pseudocode or algorithm blocks (an illustrative sketch of the global-to-local structure is given after this table).
Open Source Code | Yes | https://github.com/itsnamgyu/block-transformer
Open Datasets | Yes | We use the transformer architecture of Pythia [10], and train both vanilla and Block Transformer models on the Pile [30, 9] with a context length of 2048.
Dataset Splits | No | The paper uses external benchmarks for evaluation but does not specify internal training/validation/test splits for the primary training data (The Pile).
Hardware Specification | Yes | Eight A100 GPUs with 40 GiB of VRAM are used for training, while an H100 GPU is used for inference wall-time measurements.
Software Dependencies | No | The paper mentions software such as the Hugging Face training framework, the DeepSpeed library, and the GPT-NeoX library, but does not specify their version numbers.
Experiment Setup | Yes | We use the transformer architecture of Pythia [10], and train both vanilla and Block Transformer models on the Pile [30, 9] with a context length of 2048. The models are pretrained on 300B tokens, which corresponds to about 1.5 epochs.
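
Since the paper provides no pseudocode (see the Pseudocode row above), the following is a minimal, illustrative PyTorch sketch of the global-to-local structure it describes in prose: an embedder pools each block of tokens into a single block embedding, a global block decoder applies causal self-attention across block embeddings only, and a local token decoder decodes the tokens of each block conditioned on the preceding global context. All class names, layer counts, and hyperparameters below are placeholders assumed for illustration, not the authors' released configuration; see the linked repository for the actual implementation.

```python
import torch
import torch.nn as nn


def causal_mask(size: int, device: torch.device) -> torch.Tensor:
    # Additive attention mask: -inf above the diagonal blocks future positions.
    return torch.triu(torch.full((size, size), float("-inf"), device=device), diagonal=1)


class BlockTransformerSketch(nn.Module):
    """Illustrative global-to-local decoder: embedder -> block decoder -> token decoder."""

    def __init__(self, vocab_size=50304, d_model=512, block_len=4,
                 n_global_layers=4, n_local_layers=4, n_heads=8):
        super().__init__()
        self.block_len = block_len
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Embedder: concatenate the token embeddings of a block and project them
        # to a single block embedding.
        self.block_proj = nn.Linear(block_len * d_model, d_model)
        # Global block decoder: causal self-attention over block embeddings only.
        self.block_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True),
            num_layers=n_global_layers)
        # Local token decoder: attends only within one block, conditioned on the
        # global context embedding prepended as a prefix.
        self.token_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True),
            num_layers=n_local_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # input_ids: (batch, seq_len), with seq_len divisible by block_len.
        b, t = input_ids.shape
        n_blocks = t // self.block_len
        x = self.tok_emb(input_ids)                                    # (b, t, d)
        blocks = self.block_proj(x.view(b, n_blocks, -1))              # (b, n_blocks, d)
        ctx = self.block_decoder(blocks, mask=causal_mask(n_blocks, x.device))
        # Decode block i from the context of blocks < i: shift the global context
        # right by one block and use zeros for the first block (no prior context).
        ctx = torch.cat([torch.zeros_like(ctx[:, :1]), ctx[:, :-1]], dim=1)
        local_in = torch.cat(
            [ctx.unsqueeze(2), x.view(b, n_blocks, self.block_len, -1)], dim=2)
        local_in = local_in.view(b * n_blocks, self.block_len + 1, -1)
        h = self.token_decoder(local_in, mask=causal_mask(self.block_len + 1, x.device))
        # Output at position j predicts the block's token j: drop the last position.
        logits = self.lm_head(h[:, :-1])
        return logits.reshape(b, t, -1)
```

The paper attributes its throughput gains to this factoring: the expensive global attention and its KV cache operate at block rather than token granularity, while per-token computation is confined to the small local decoder whose context never exceeds one block.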