Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Activation sharding for scalable training of large models
Authors: Xingzi Xu, Amir Tavanaei, Kavosh Asadi, Karim Bouyarmane
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results show the proposed adjoint sharding algorithm reduces memory usage by up to 3× on a large language model with 1.27B parameters at 1M context length training. This reduction in memory usage allows increasing the maximum context length of training a 1.27B parameter model from 35K tokens to above 100K tokens on a training infrastructure composed of five AWS P4 instances. |
| Researcher Affiliation | Industry | |
| Pseudocode | Yes | Algorithm 1: Evaluating adjoint states for token index t and ResNet index k with truncated adjoint sharding T. Algorithm 2: Evaluating the vjps for token index t and ResNet index k with truncated adjoint sharding T. Algorithm 3: Forward step in evaluation mode on a distributed system. Algorithm 4: Evaluating dL/dθ with truncated adjoint sharding T on τ devices. |
| Open Source Code | No | We leave the efficient implementation of the parallel algorithm on a CUDA kernel for future work. |
| Open Datasets | No | The paper mentions "training with a dataset containing contexts of lengths T" but does not provide any specific dataset names, links, or formal citations for public access to the datasets used in their experiments. |
| Dataset Splits | No | The paper states "training with a dataset containing contexts of lengths T", but does not specify any training/test/validation splits (e.g., percentages, sample counts, or references to predefined splits). |
| Hardware Specification | Yes | This reduction in memory usage allows increasing the maximum context length of training a 1.27B parameter model from 35K tokens to above 100K tokens on a training infrastructure composed of five AWS P4 instances. An NVIDIA H100 Tensor Core GPU has a GPU memory bandwidth of 3.35TB/s and performs 1,979 tera FP16 FLOPs per second. |
| Software Dependencies | No | The paper mentions software like "autograd framework" and the use of a "CUDA kernel" but does not specify any version numbers for these or other libraries/tools. |
| Experiment Setup | Yes | When computing with a selective diagonal SSM with P = 128, N = 225, and bs = 8, while storing and performing computations in FP16, computing vjp A, vjp B, and vjp C each takes around 0.6MB of memory and 1,798,144 FLOPs. |
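The per-step vjps quoted in the Experiment Setup row are cheap precisely because the SSM is diagonal: each Jacobian is diagonal, so a vector-Jacobian product reduces to elementwise multiplies. A minimal sketch, assuming a diagonal recurrence h_t = A⊙h_{t-1} + B⊙x_t with readout y_t = C⊙h_t (the paper's exact parameterization may differ, and the names `vjp_A`/`vjp_B`/`vjp_C` mirror the quote but are illustrative, not the paper's implementation):

```python
import numpy as np

def diag_ssm_step(A, B, C, h_prev, x):
    """One step of a diagonal SSM: h_t = A*h_{t-1} + B*x_t, y_t = C*h_t.
    All parameters and states are vectors; products are elementwise,
    which is what makes the Jacobians diagonal and the vjps cheap."""
    h = A * h_prev + B * x
    y = C * h
    return h, y

def vjp_step(A, B, C, h_prev, x, g_y):
    """Vector-Jacobian products of y_t w.r.t. A, B, C for a single step,
    given the incoming cotangent g_y (gradient of the loss w.r.t. y_t)."""
    h = A * h_prev + B * x
    g_h = C * g_y            # chain rule through y_t = C*h_t
    vjp_A = g_h * h_prev     # dh_t/dA is diagonal with entries h_{t-1}
    vjp_B = g_h * x          # dh_t/dB is diagonal with entries x_t
    vjp_C = g_y * h          # dy_t/dC is diagonal with entries h_t
    return vjp_A, vjp_B, vjp_C
```

With batched buffers of shape (bs, N, P) stored in FP16, each vjp occupies bs·N·P·2 bytes; for the quoted bs = 8, N = 225, P = 128 that is roughly 0.46MB, the same order as the ~0.6MB the paper reports.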