Block Transformer: Global-to-Local Language Modeling for Fast Inference
Authors: Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We pretrain vanilla and Block Transformers from scratch and demonstrate that Block Transformers achieve 10-20x higher inference throughput than vanilla transformers with equivalent perplexity and zero-shot task performance. |
| Researcher Affiliation | Collaboration | Namgyu Ho¹·², Sangmin Bae¹, Taehyeon Kim¹, Hyunjik Jo², Yireun Kim², Tal Schuster³, Adam Fisch³, James Thorne¹, Se-Young Yun¹ (¹KAIST AI, ²LG AI Research, ³Google DeepMind) |
| Pseudocode | No | The paper describes the architecture and mechanisms in prose and diagrams but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/itsnamgyu/block-transformer |
| Open Datasets | Yes | We use the transformer architecture of Pythia [10], and train both vanilla and Block Transformer models on the Pile [30, 9] with a context length of 2048. |
| Dataset Splits | No | The paper uses external benchmarks for evaluation but does not specify internal training/validation/test dataset splits for the primary training data (The Pile). |
| Hardware Specification | Yes | Eight A100 GPUs with 40 GiB of VRAM are used for training, while an H100 GPU is used for inference wall-time measurements. |
| Software Dependencies | No | The paper mentions software such as the Hugging Face training framework, the DeepSpeed library, and the GPT-NeoX library but does not specify their version numbers. |
| Experiment Setup | Yes | We use the transformer architecture of Pythia [10], and train both vanilla and Block Transformer models on the Pile [30, 9] with a context length of 2048. The models are pretrained on 300B tokens, which corresponds to about 1.5 epochs. |
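The Experiment Setup row pins down only a handful of training parameters. As a quick reference, the sketch below restates them as a plain Python dataclass; the field names and the per-epoch arithmetic are assumptions added for illustration, not the authors' configuration format (their actual configs live in the repository linked above).

```python
# Hedged sketch: the pretraining setup quoted above, restated as a plain
# dataclass. Field names and the epoch arithmetic are illustrative
# assumptions, not the authors' config schema.
from dataclasses import dataclass


@dataclass
class PretrainSetup:
    base_architecture: str = "Pythia"       # transformer architecture [10]
    dataset: str = "The Pile"               # pretraining corpus [30, 9]
    context_length: int = 2048              # tokens per training sequence
    total_tokens: int = 300_000_000_000     # 300B pretraining tokens
    approx_epochs: float = 1.5              # "about 1.5 epochs" per the paper

    def tokens_per_epoch(self) -> float:
        # 300B tokens over ~1.5 epochs implies roughly 200B tokens per pass.
        return self.total_tokens / self.approx_epochs


print(f"~{PretrainSetup().tokens_per_epoch() / 1e9:.0f}B tokens per epoch")
```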
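Since the paper offers no structured pseudocode (see the Pseudocode row), the following is a minimal, hypothetical PyTorch sketch of the general global-to-local pattern named in the title: block embeddings are decoded coarsely by a global decoder, and a local decoder then predicts the tokens of each block conditioned on the context of the preceding blocks. All module names, the additive conditioning, and the toy hyperparameters are assumptions for illustration; this is not the authors' implementation, which is available in the repository linked above.

```python
# Hedged sketch (not the authors' code): a toy global-to-local causal LM.
import torch
import torch.nn as nn


def causal_mask(size: int) -> torch.Tensor:
    # Additive attention mask: -inf above the diagonal blocks future positions.
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)


class ToyBlockTransformer(nn.Module):
    def __init__(self, vocab_size=50304, d_model=256, block_len=4, n_heads=4, n_layers=2):
        super().__init__()
        self.block_len = block_len
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Embedder: concatenate each block's token embeddings and project them
        # to a single block embedding (one of several possible embedder choices).
        self.embedder = nn.Linear(d_model * block_len, d_model)

        def make_decoder():
            layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
            return nn.TransformerEncoder(layer, n_layers)

        self.global_decoder = make_decoder()  # coarse: attends across block embeddings
        self.local_decoder = make_decoder()   # fine: attends within a single block
        self.start_ctx = nn.Parameter(torch.zeros(1, 1, d_model))  # context for block 0
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch, seq_len) with seq_len a multiple of block_len
        bsz, seq_len = ids.shape
        n_blocks = seq_len // self.block_len
        tok = self.tok_emb(ids)                              # (bsz, seq_len, d)
        blocks = self.embedder(tok.view(bsz, n_blocks, -1))  # (bsz, n_blocks, d)
        ctx = self.global_decoder(blocks, mask=causal_mask(n_blocks))
        # ctx[:, i] summarizes blocks 0..i, so it may only condition block i+1:
        # shift it right and use a learned start context for the first block.
        ctx = torch.cat([self.start_ctx.expand(bsz, 1, -1), ctx[:, :-1]], dim=1)
        # Local decoding: add each block's (shifted) global context to its token
        # embeddings, then attend only within the block.
        local = tok + ctx.repeat_interleave(self.block_len, dim=1)
        local = local.reshape(bsz * n_blocks, self.block_len, -1)
        local = self.local_decoder(local, mask=causal_mask(self.block_len))
        return self.lm_head(local.reshape(bsz, seq_len, -1))  # next-token logits
```

For example, `ToyBlockTransformer()(torch.randint(0, 50304, (2, 16)))` returns logits of shape `(2, 16, 50304)`. The right-shift of the global context is what keeps the sketch causally valid: tokens in block i condition only on blocks 0..i-1 through the global decoder and on earlier tokens within block i through the local decoder.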