Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Adjoint sharding for scalable training of large models

Authors: Xingzi Xu, Amir Tavanaei, Kavosh Asadi, Karim Bouyarmane

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable: Result — LLM Response
Research Type: Experimental — "Empirical results show the proposed adjoint sharding algorithm reduces memory usage by up to 3× on a large language model with 1.27B parameters trained on 1M-token context lengths. This reduction in memory usage allows increasing the maximum training context length of a 1.27B-parameter model from 35K tokens to above 100K tokens on a training infrastructure composed of five AWS P4 instances."
Researcher Affiliation: Industry — EMAIL
Pseudocode: Yes — "Algorithm 1: Evaluating adjoint states for token index t and ResNet index k with truncated adjoint sharding T. Algorithm 2: Evaluating the vjps for token index t and ResNet index k with truncated adjoint sharding T. Algorithm 3: Forward step in evaluation mode on a distributed system. Algorithm 4: Evaluating dL/dθ with truncated adjoint sharding T on τ devices."
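The algorithms themselves are not reproduced here, but the per-step vector-Jacobian products (vjps) that Algorithms 1–2 evaluate can be illustrated with a minimal sketch in JAX. The step function, shapes, and coefficients below are illustrative assumptions, not the paper's implementation:

```python
import jax
import jax.numpy as jnp

# Hypothetical single-step update of a diagonal SSM (element-wise A):
# h_t = a * h_{t-1} + b * x_t  -- a stand-in for one recurrent step.
def step(a, b, h_prev, x):
    return a * h_prev + b * x

a = jnp.full(4, 0.9)        # illustrative diagonal transition coefficients
b = jnp.full(4, 0.1)        # illustrative input coefficients
h_prev = jnp.arange(4.0)    # previous hidden state
x = jnp.ones(4)             # current input token embedding

# jax.vjp returns the primal output plus a pullback that maps an adjoint
# (cotangent) of the output to cotangents of each input -- the per-step
# vector-Jacobian products that the adjoint-sharding algorithms accumulate.
h, pullback = jax.vjp(step, a, b, h_prev, x)
adjoint = jnp.ones(4)       # stand-in for the incoming adjoint state
vjp_a, vjp_b, vjp_h, vjp_x = pullback(adjoint)
```

Because the step is linear, the pullback recovers the expected quantities: `vjp_h` equals the transition coefficients `a`, and `vjp_a` equals the previous state `h_prev`.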
Open Source Code: No — "We leave the efficient implementation of the parallel algorithm on a CUDA kernel for future work."
Open Datasets: No — "The paper mentions 'training with a dataset containing contexts of lengths T' but does not provide any specific dataset names, links, or formal citations for public access to the datasets used in their experiments."
Dataset Splits: No — "The paper states 'training with a dataset containing contexts of lengths T' but does not specify any training/test/validation splits (e.g., percentages, sample counts, or references to predefined splits)."
Hardware Specification: Yes — "This reduction in memory usage allows increasing the maximum context length of training a 1.27B-parameter model from 35K tokens to above 100K tokens on a training infrastructure composed of five AWS P4 instances. An NVIDIA H100 Tensor Core GPU has a memory bandwidth of 3.35 TB/s and a peak FP16 throughput of 1,979 teraFLOPS."
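The two H100 figures quoted in that response imply a roofline-style break-even point: dividing peak FP16 throughput by memory bandwidth gives the arithmetic intensity a kernel needs before it becomes compute-bound rather than bandwidth-bound. A quick back-of-envelope check:

```python
# Back-of-envelope from the quoted H100 specs.
peak_fp16_flops = 1_979e12   # 1,979 teraFLOPS (FP16)
mem_bandwidth = 3.35e12      # 3.35 TB/s

# Arithmetic intensity (FLOPs per byte) at which compute and memory
# transfer take equal time; below this a kernel is bandwidth-bound.
break_even_intensity = peak_fp16_flops / mem_bandwidth
print(round(break_even_intensity))  # ≈ 591 FLOPs per byte
```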
Software Dependencies: No — "The paper mentions software such as an 'autograd framework' and the use of a 'CUDA kernel' but does not specify any version numbers for these or other libraries/tools."
Experiment Setup: Yes — "When computing with a selective diagonal SSM with P = 128, N = 225, and bs = 8, while storing and performing computations in FP16, computing vjp_A, vjp_B, and vjp_C each takes around 0.6 MB of memory and 1,798,144 FLOPs."