Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Chain-of-Model Learning for Language Model

Authors: Xiaohua Wang, Kaitao Song, Xu Tan, Huiqiang Jiang, Chengruidong Zhang, Yongliang Shen, Cen Lu, Zihao Li, Zifan Song, Caihua Shan, Yansen Wang, Kan Ren, Xiaoqing Zheng, Tao Qin, Yuqing Yang, Dongsheng Li, Lili Qiu

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate our CoLM family can achieve comparable performance to the standard Transformer, while simultaneously enabling greater flexibility, such as progressive scaling to improve training efficiency and offering multiple varying model sizes for elastic inference, paving a new way toward building language models.
Researcher Affiliation | Collaboration | Fudan University (1); Microsoft Research (2); Zhejiang University (3); ShanghaiTech University (4); Idiap Research Institute (5); EPFL (6); UIUC (7); Tongji University (8)
Pseudocode | Yes | Algorithm 1: Pseudo Code for Chain-of-Linear Layer (Linear_chain.py)
Open Source Code | No | Our code will be released in the future at: https://github.com/microsoft/CoLM.
Open Datasets | Yes | In our experiments, we utilize the SlimPajama dataset [22] as the pre-training corpus, comprising 600 billion tokens. ... [22] Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama, June 2023.
Dataset Splits | Yes | We use the EleutherAI Language Model Evaluation Harness [25] to evaluate models on commonsense tasks in a zero-shot setting. ... We choose the GLUE benchmark [30] for faster validation. We report the results in Table 4, and more experimental details can be found in Appendix A.1. ... We used the default evaluation configuration.
Hardware Specification | Yes | Our training is built upon a cluster of 32 NVIDIA A100 40GB GPUs with a gradient accumulation of 4, yielding an effective batch size of 1024. Our prefilling is evaluated on a single NVIDIA A100 GPU with a batch size of 1.
Software Dependencies | No | We leverage Fully Sharded Data Parallel (FSDP2) for distributed training, facilitated by the TorchTitan framework [23]. BFloat16 is adopted for training, and the sequence length is set as 4096 tokens. We choose the AdamW optimizer [24] with a learning rate of 1.5 × 10^-4. FlashAttention-2 [38] is applied to speed up attention computation. The paper mentions software components such as the TorchTitan framework, the AdamW optimizer, and FlashAttention-2, but does not provide specific version numbers for these key components.
Experiment Setup | Yes | Our training is built upon a cluster of 32 NVIDIA A100 40GB GPUs with a gradient accumulation of 4, yielding an effective batch size of 1024. BFloat16 is adopted for training, and the sequence length is set as 4096 tokens. We choose the AdamW optimizer [24] with a learning rate of 1.5 × 10^-4. Due to resource limitations, our model is pre-trained for 50K steps, nearly 200 billion tokens. (Further details in Appendix A.1, Tables 7 and 8.)
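
The training figures quoted in the Hardware Specification and Experiment Setup rows can be cross-checked with a little arithmetic. The sketch below is illustrative only. It assumes the effective batch size of 1024 is counted in sequences, that every sequence is filled to the full 4096 tokens, and that the batch is split evenly across GPUs and accumulation steps. The per-GPU micro-batch is derived here, not stated in the quoted text, and the variable names are hypothetical.

```python
# Figures quoted in the rows above.
NUM_GPUS = 32
GRAD_ACCUM_STEPS = 4
EFFECTIVE_BATCH = 1024   # assumed to be sequences per optimizer step
SEQ_LEN = 4096           # tokens per sequence
TRAIN_STEPS = 50_000

# Derived value (an assumption, not stated in the quoted text):
# sequences each GPU processes per forward/backward pass.
per_gpu_micro_batch = EFFECTIVE_BATCH // (NUM_GPUS * GRAD_ACCUM_STEPS)

# Tokens consumed per optimizer step and over the whole run,
# assuming every sequence is packed to SEQ_LEN tokens.
tokens_per_step = EFFECTIVE_BATCH * SEQ_LEN
total_tokens = tokens_per_step * TRAIN_STEPS

print(f"per-GPU micro-batch: {per_gpu_micro_batch}")
print(f"tokens per step:     {tokens_per_step:,}")
print(f"total tokens:        {total_tokens / 1e9:.1f}B")
```

Under these assumptions the run consumes about 209.7 billion tokens, which is roughly in line with the "nearly 200 billion tokens" reported for 50K steps.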