Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Activation sharding for scalable training of large models
Authors: Xingzi Xu, Amir Tavanaei, Kavosh Asadi, Karim Bouyarmane
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results show the proposed adjoint sharding algorithm reduces memory usage by up to 3× on a large language model with 1.27B parameters on 1M context length training. This reduction in memory usage allows increasing the maximum context length of training a 1.27B parameter model from 35K tokens to above 100K tokens on a training infrastructure composed of five AWS P4 instances. |
| Researcher Affiliation | Industry | |
| Pseudocode | Yes | Algorithm 1: Evaluating adjoint states for token index t and ResNet index k with truncated adjoint sharding T; Algorithm 2: Evaluating the vjps for token index t and ResNet index k with truncated adjoint sharding T; Algorithm 3: Forward step in evaluation mode on a distributed system; Algorithm 4: Evaluating dL/dθ with truncated adjoint sharding T on Υ devices |
| Open Source Code | No | We leave the efficient implementation of the parallel algorithm on a CUDA kernel for future work. |
| Open Datasets | No | The paper mentions "training with a dataset containing contexts of lengths T" but does not provide any specific dataset names, links, or formal citations for public access to the datasets used in their experiments. |
| Dataset Splits | No | The paper states "training with a dataset containing contexts of lengths T", but does not specify any training/test/validation splits (e.g., percentages, sample counts, or references to predefined splits). |
| Hardware Specification | Yes | This reduction in memory usage allows increasing the maximum context length of training a 1.27B parameter model from 35K tokens to above 100K tokens on a training infrastructure composed of five AWS P4 instances. An NVIDIA H100 Tensor Core GPU has a GPU memory bandwidth 3.35TB/s and performs 1,979 tera FP16 FLOPS per second. |
| Software Dependencies | No | The paper mentions software like "autograd framework" and the use of a "CUDA kernel" but does not specify any version numbers for these or other libraries/tools. |
| Experiment Setup | Yes | When computing with a selective diagonal SSM with P = 128, N = 225, and bs = 8, while storing and performing computations in FP16, computing vjp A, vjp B, and vjp C each takes around 0.6MB memory and 1798144 FLOPs. |
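
The hardware and experiment-setup figures quoted in the table can be combined into a rough back-of-envelope estimate of whether a single vjp is limited by memory transfer or by compute. The sketch below is illustrative only and is not taken from the paper's code: it assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes), that peak H100 rates are sustained, and that the quoted 0.6 MB is moved once. The function name `estimate_times` is hypothetical.

```python
# Back-of-envelope estimate using only the figures quoted in the table above.
# Assumptions (not from the paper): decimal units, sustained peak rates,
# and the ~0.6 MB per vjp being transferred exactly once.

H100_BANDWIDTH_BYTES_PER_S = 3.35e12  # 3.35 TB/s memory bandwidth
H100_FP16_FLOPS_PER_S = 1979e12       # 1,979 tera FP16 FLOPS

VJP_MEMORY_BYTES = 0.6e6              # ~0.6 MB per vjp (A, B, or C)
VJP_FLOPS = 1_798_144                 # FLOPs per vjp


def estimate_times(memory_bytes, flops, bandwidth, peak_flops):
    """Return (memory_time_s, compute_time_s) under ideal peak rates."""
    return memory_bytes / bandwidth, flops / peak_flops


memory_time, compute_time = estimate_times(
    VJP_MEMORY_BYTES, VJP_FLOPS,
    H100_BANDWIDTH_BYTES_PER_S, H100_FP16_FLOPS_PER_S,
)
ratio = memory_time / compute_time

print(f"memory transfer: {memory_time:.3e} s")
print(f"compute:         {compute_time:.3e} s")
print(f"memory/compute:  {ratio:.0f}x")
```

Under these assumptions the memory transfer time exceeds the compute time by roughly two orders of magnitude, which suggests each vjp is memory-bound rather than compute-bound. Real kernels rarely reach peak rates, so the numbers should be read as an estimate only.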