Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs

Authors: Qijun Luo, Mengqi Li, Lei Zhao, Xiao Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we evaluate StreamBP based on the Qwen 3 model series. The result directly applies for any causal language models such as Llama, Mistral, and Gemma, since StreamBP is not restricted to a certain model class. Our evaluation mainly consists of 3 parts, including 1) backpropagation cost; 2) training cost; and 3) distributed training. All the experiments are conducted using A800-80GB GPUs. The detailed setup is presented in Appendix D.
Researcher Affiliation	Academia	Qijun Luo (1), Mengqi Li (1), Lei Zhao (2), Xiao Li (1). (1) The Chinese University of Hong Kong, Shenzhen; (2) Shanghai Jiao Tong University.
Pseudocode	No	The paper describes the StreamBP method using mathematical formulations and descriptive text, but it does not include a clearly labeled pseudocode block or algorithm section.
Open Source Code	Yes	Our code can be easily integrated into the training pipeline of any transformer models and is available at https://github.com/Ledzy/StreamBP.
Open Datasets	No	The paper mentions datasets like "Capybara", "Ultrafeedback", and "TL;DR" in Table 7 in the context of SFT, DPO, and GRPO objectives. While these are common dataset names, the paper does not provide concrete access information such as specific links, DOIs, repositories, or formal citations (with authors and year) for these datasets, which are required to confirm public availability.
Dataset Splits	No	The paper discusses batch sizes and sequence lengths for training, but it does not provide specific details on how datasets were split into training, validation, or test sets (e.g., percentages, sample counts, or citations to standard splits).
Hardware Specification	Yes	All the experiments are conducted using A800-80GB GPUs.
Software Dependencies	No	The paper states "We implement our algorithm using Huggingface transformers library [31]." and mentions "PyTorch CUDA memory snapshot tool", but it does not specify version numbers for these software components, which is necessary for a reproducible description.
Experiment Setup	Yes	The partition size of language modeling head is set to 100 for all the experiments. The partition size for transformer layer is set to 500 for maximum sequence length measurement, and is set to T/3 for time measurement. We adopt the BF16 data type for storing model weight and gradient. We adopt pure BF16 Adam optimizer in full training and mixed-precision training in rank-32 LoRA model training. The batch size is set to 1 except for GRPO, where the group size is set to 8.
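
For readers who want to record the reported experiment setup in machine-readable form, the values quoted in the Experiment Setup row can be collected as in the sketch below. This is only an illustration: the class name, field names, and helper functions are hypothetical and do not come from the StreamBP repository, and rounding T/3 down to an integer is an assumption, since the quoted text does not say how a non-divisible sequence length is handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row (all names here are hypothetical)."""

    lm_head_partition: int = 100        # partition size of the language modeling head
    layer_partition_max_len: int = 500  # transformer-layer partition, max sequence length runs
    weight_dtype: str = "bfloat16"      # BF16 for model weights and gradients
    lora_rank: int = 32                 # rank used in the LoRA training runs
    batch_size: int = 1                 # all objectives except GRPO
    grpo_group_size: int = 8            # group size used for GRPO

    def layer_partition_for_timing(self, seq_len: int) -> int:
        """Transformer-layer partition size T/3 for time measurement.

        Floor division is an assumption; the source only states "T/3".
        """
        return max(1, seq_len // 3)


def num_partitions(seq_len: int, partition_size: int) -> int:
    """Number of chunks needed to cover seq_len tokens (ceiling division)."""
    return -(-seq_len // partition_size)


setup = ReportedSetup()
timing_partition = setup.layer_partition_for_timing(9000)
head_chunks = num_partitions(9000, setup.lm_head_partition)
```

With a hypothetical sequence length of 9000 tokens, this gives the transformer-layer partition size used for timing and the number of language modeling head chunks implied by a partition size of 100.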