Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Adjoint sharding for scalable training of large models

Authors: Xingzi Xu, Amir Tavanaei, Kavosh Asadi, Karim Bouyarmane

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable: Result — LLM Response
Research Type: Experimental — "Empirical results show the proposed adjoint sharding algorithm reduces memory usage by up to 3× on a large language model with 1.27B parameters trained on 1M-token context lengths. This reduction in memory usage allows increasing the maximum training context length of a 1.27B-parameter model from 35K tokens to above 100K tokens on a training infrastructure composed of five AWS P4 instances."
Researcher Affiliation: Industry — EMAIL
Pseudocode: Yes — "Algorithm 1: Evaluating adjoint states for token index t and ResNet index k with truncated adjoint sharding T. Algorithm 2: Evaluating the vjps for token index t and ResNet index k with truncated adjoint sharding T. Algorithm 3: Forward step in evaluation mode on a distributed system. Algorithm 4: Evaluating dL/dθ with truncated adjoint sharding T on τ devices."
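The algorithms themselves are not reproduced here, but the per-step vector-Jacobian products (vjps) that Algorithms 1–2 evaluate can be illustrated with a minimal sketch in JAX. The step function, shapes, and coefficients below are illustrative assumptions, not the paper's implementation:

```python
import jax
import jax.numpy as jnp

# Hypothetical single-step update of a diagonal SSM (element-wise A):
# h_t = a * h_{t-1} + b * x_t  -- a stand-in for one recurrent step.
def step(a, b, h_prev, x):
    return a * h_prev + b * x

a = jnp.full(4, 0.9)        # illustrative diagonal transition coefficients
b = jnp.full(4, 0.1)        # illustrative input coefficients
h_prev = jnp.arange(4.0)    # previous hidden state
x = jnp.ones(4)             # current input token embedding

# jax.vjp returns the primal output plus a pullback that maps an adjoint
# (cotangent) of the output to cotangents of each input -- the per-step
# vector-Jacobian products that the adjoint-sharding algorithms accumulate.
h, pullback = jax.vjp(step, a, b, h_prev, x)
adjoint = jnp.ones(4)       # stand-in for the incoming adjoint state
vjp_a, vjp_b, vjp_h, vjp_x = pullback(adjoint)
```

Because the step is linear, the pullback recovers the expected quantities: `vjp_h` equals the transition coefficients `a`, and `vjp_a` equals the previous state `h_prev`.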
Open Source Code: No — "We leave the efficient implementation of the parallel algorithm on a CUDA kernel for future work."
Open Datasets: No — "The paper mentions 'training with a dataset containing contexts of lengths T' but does not provide any specific dataset names, links, or formal citations for public access to the datasets used in their experiments."
Dataset Splits: No — "The paper states 'training with a dataset containing contexts of lengths T' but does not specify any training/test/validation splits (e.g., percentages, sample counts, or references to predefined splits)."
Hardware Specification: Yes — "This reduction in memory usage allows increasing the maximum context length of training a 1.27B-parameter model from 35K tokens to above 100K tokens on a training infrastructure composed of five AWS P4 instances. An NVIDIA H100 Tensor Core GPU has a memory bandwidth of 3.35 TB/s and a peak FP16 throughput of 1,979 teraFLOPS."
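The two H100 figures quoted in that response imply a roofline-style break-even point: dividing peak FP16 throughput by memory bandwidth gives the arithmetic intensity a kernel needs before it becomes compute-bound rather than bandwidth-bound. A quick back-of-envelope check:

```python
# Back-of-envelope from the quoted H100 specs.
peak_fp16_flops = 1_979e12   # 1,979 teraFLOPS (FP16)
mem_bandwidth = 3.35e12      # 3.35 TB/s

# Arithmetic intensity (FLOPs per byte) at which compute and memory
# transfer take equal time; below this a kernel is bandwidth-bound.
break_even_intensity = peak_fp16_flops / mem_bandwidth
print(round(break_even_intensity))  # ≈ 591 FLOPs per byte
```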
Software Dependencies: No — "The paper mentions software such as an 'autograd framework' and the use of a 'CUDA kernel' but does not specify any version numbers for these or other libraries/tools."
Experiment Setup: Yes — "When computing with a selective diagonal SSM with P = 128, N = 225, and bs = 8, while storing and performing computations in FP16, computing vjp_A, vjp_B, and vjp_C each takes around 0.6 MB of memory and 1,798,144 FLOPs."