Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Block Transformer: Global-to-Local Language Modeling for Fast Inference
Authors: Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We pretrain vanilla and Block Transformers from scratch and demonstrate that Block Transformers reach 10-20x inference throughput compared to vanilla transformers with equivalent perplexity and zero-shot task performance. |
| Researcher Affiliation | Collaboration | Namgyu Ho (1, 2), Sangmin Bae (1), Taehyeon Kim (1), Hyunjik Jo (2), Yireun Kim (2), Tal Schuster (3), Adam Fisch (3), James Thorne (1), Se-Young Yun (1). Affiliations: (1) KAIST AI, (2) LG AI Research, (3) Google DeepMind |
| Pseudocode | No | The paper describes the architecture and mechanisms in prose and diagrams but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/itsnamgyu/block-transformer |
| Open Datasets | Yes | We use the transformer architecture of Pythia [10], and train both vanilla and Block Transformer models on the Pile [30, 9] with a context length of 2048. |
| Dataset Splits | No | The paper uses external benchmarks for evaluation but does not specify internal training/validation/test dataset splits for the primary training data (The Pile). |
| Hardware Specification | Yes | Eight A100 GPUs with 40 GiB of VRAM are used for training, while an H100 GPU is used for inference wall-time measurements. |
| Software Dependencies | No | The paper mentions software like 'Hugging Face training framework', 'DeepSpeed library', and 'GPT-NeoX library' but does not specify their version numbers. |
| Experiment Setup | Yes | We use the transformer architecture of Pythia [10], and train both vanilla and Block Transformer models on the Pile [30, 9] with a context length of 2048. The models are pretrained on 300B tokens, which corresponds to about 1.5 epochs. |
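
The Experiment Setup row gives a token budget (300B) and an approximate epoch count (about 1.5), which together imply a rough per-epoch size for the tokenized training set. The sketch below only does that arithmetic; the function name `implied_tokens_per_epoch` is hypothetical, and the result is a derived estimate, not a figure stated in the paper.

```python
def implied_tokens_per_epoch(total_tokens: float, epochs: float) -> float:
    """Estimate the tokens in one pass over the data from a training budget.

    Assumes every epoch covers the same number of tokens, which is a
    simplification for illustration.
    """
    if epochs <= 0:
        raise ValueError("epochs must be positive")
    return total_tokens / epochs


# Values quoted in the Experiment Setup row: 300B tokens, about 1.5 epochs.
tokens_per_epoch = implied_tokens_per_epoch(300_000_000_000, 1.5)
print(f"Implied tokens per epoch: about {tokens_per_epoch / 1e9:.0f}B")
```

Because the epoch count is reported as approximate, the implied size of roughly 200B tokens per epoch should be read as an estimate as well.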