Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

SubTrack++: Gradient Subspace Tracking for Scalable LLM Training

Authors: Sahar Rajabi, Nayeema Nonta, Sirisha Rambhatla

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluated SubTrack++ across diverse models and datasets through pre-training and fine-tuning, measuring key metrics critical to LLM democratization. We pre-trained several Llama-based models on the C4 dataset, with results in Table 1. To ensure a fair comparison, we benchmarked against a diverse set of baselines.
Researcher Affiliation	Academia	Sahar Rajabi, Nayeema Nonta, Sirisha Rambhatla Critical ML, Department of Management Science and Engineering, University of Waterloo EMAIL
Pseudocode	Yes	Algorithm 1 SubTrack++ (Subspace Tracking, Projection-Aware Optimizer, Recovery Scaling, Regular Adam)
Open Source Code	Yes	Code is at https://github.com/criticalml-uw/SubTrack.
Open Datasets	Yes	We pre-trained several Llama-based models on the C4 dataset, with results in Table 1. RoBERTa-Base and RoBERTa-Large are fine-tuned on GLUE [Wang et al., 2019] and SuperGLUE [Sarlin et al., 2020] tasks; with the results presented in Table 7 and 8, respectively. We also conducted supervised fine-tuning of the Llama-2-7B-chat-hf model for one epoch on the Alpaca [Taori et al., 2023] dataset.
Dataset Splits	No	The paper mentions using well-known public datasets such as C4, GLUE, SuperGLUE, and Alpaca, which often come with predefined splits. However, it does not explicitly state the specific training/validation/test splits used for its experiments (e.g., exact percentages, sample counts, or explicit reference to using the 'standard splits' for these datasets).
Hardware Specification	Yes	The 60M to 3B models are conducted on an NVIDIA A100 GPU, while the 7B model experiments are run on an NVIDIA RTX A6000. The fine-tuning was performed for one epoch, and the corresponding hyperparameters are listed in Table 12. on an Nvidia-H100 GPU.
Software Dependencies	No	The paper does not provide specific version numbers for any software dependencies (e.g., programming languages, libraries, or frameworks).
Experiment Setup	Yes	Table 4: Hyperparameters of pre-training Llama-based architectures. This table lists specific hyperparameters such as Learning Rate, Batch Size, Gradient Accumulation, Iterations, Warmup Steps, Gradient Clipping, dtype, Low-Rank Optimizer Rank, Subspace Update Interval, and SubTrack++ Step-Size. Tables 10, 11, and 12 also detail hyperparameters for fine-tuning experiments.