Block Transformer: Global-to-Local Language Modeling for Fast Inference

Authors: Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We pretrain vanilla and Block Transformers from scratch and demonstrate that Block Transformers achieve 10-20x the inference throughput of vanilla transformers with equivalent perplexity and zero-shot task performance.
Researcher Affiliation | Collaboration | Namgyu Ho (1,2), Sangmin Bae (1), Taehyeon Kim (1), Hyunjik Jo (2), Yireun Kim (2), Tal Schuster (3), Adam Fisch (3), James Thorne (1), Se-Young Yun (1). Affiliations: (1) KAIST AI, (2) LG AI Research, (3) Google DeepMind.
Pseudocode | No | The paper describes the architecture and mechanisms in prose and diagrams but does not contain structured pseudocode or algorithm blocks (an illustrative sketch of the global-to-local structure is given after this table).
Open Source Code | Yes | https://github.com/itsnamgyu/block-transformer
Open Datasets | Yes | We use the transformer architecture of Pythia [10], and train both vanilla and Block Transformer models on the Pile [30, 9] with a context length of 2048.
Dataset Splits | No | The paper uses external benchmarks for evaluation but does not specify internal training/validation/test splits for the primary training data (The Pile).
Hardware Specification | Yes | Eight A100 GPUs with 40 GiB of VRAM are used for training, while an H100 GPU is used for inference wall-time measurements.
Software Dependencies | No | The paper mentions software such as the Hugging Face training framework, the DeepSpeed library, and the GPT-NeoX library, but does not specify their version numbers.
Experiment Setup | Yes | We use the transformer architecture of Pythia [10], and train both vanilla and Block Transformer models on the Pile [30, 9] with a context length of 2048. The models are pretrained on 300B tokens, which corresponds to about 1.5 epochs.
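
Since the paper provides no pseudocode (see the Pseudocode row above), the following is a minimal, illustrative PyTorch sketch of the global-to-local structure it describes in prose: an embedder pools each block of tokens into a single block embedding, a global block decoder applies causal self-attention across block embeddings only, and a local token decoder decodes the tokens of each block conditioned on the preceding global context. All class names, layer counts, and hyperparameters below are placeholders assumed for illustration, not the authors' released configuration; see the linked repository for the actual implementation.

```python
import torch
import torch.nn as nn


def causal_mask(size: int, device: torch.device) -> torch.Tensor:
    # Additive attention mask: -inf above the diagonal blocks future positions.
    return torch.triu(torch.full((size, size), float("-inf"), device=device), diagonal=1)


class BlockTransformerSketch(nn.Module):
    """Illustrative global-to-local decoder: embedder -> block decoder -> token decoder."""

    def __init__(self, vocab_size=50304, d_model=512, block_len=4,
                 n_global_layers=4, n_local_layers=4, n_heads=8):
        super().__init__()
        self.block_len = block_len
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Embedder: concatenate the token embeddings of a block and project them
        # to a single block embedding.
        self.block_proj = nn.Linear(block_len * d_model, d_model)
        # Global block decoder: causal self-attention over block embeddings only.
        self.block_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True),
            num_layers=n_global_layers)
        # Local token decoder: attends only within one block, conditioned on the
        # global context embedding prepended as a prefix.
        self.token_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True),
            num_layers=n_local_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # input_ids: (batch, seq_len), with seq_len divisible by block_len.
        b, t = input_ids.shape
        n_blocks = t // self.block_len
        x = self.tok_emb(input_ids)                                    # (b, t, d)
        blocks = self.block_proj(x.view(b, n_blocks, -1))              # (b, n_blocks, d)
        ctx = self.block_decoder(blocks, mask=causal_mask(n_blocks, x.device))
        # Decode block i from the context of blocks < i: shift the global context
        # right by one block and use zeros for the first block (no prior context).
        ctx = torch.cat([torch.zeros_like(ctx[:, :1]), ctx[:, :-1]], dim=1)
        local_in = torch.cat(
            [ctx.unsqueeze(2), x.view(b, n_blocks, self.block_len, -1)], dim=2)
        local_in = local_in.view(b * n_blocks, self.block_len + 1, -1)
        h = self.token_decoder(local_in, mask=causal_mask(self.block_len + 1, x.device))
        # Output at position j predicts the block's token j: drop the last position.
        logits = self.lm_head(h[:, :-1])
        return logits.reshape(b, t, -1)
```

The paper attributes its throughput gains to this factoring: the expensive global attention and its KV cache operate at block rather than token granularity, while per-token computation is confined to the small local decoder whose context never exceeds one block.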