Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Reasoning is Periodicity? Improving Large Language Models Through Effective Periodicity Modeling

Authors: Yihong Dong, Ge Li, Xue Jiang, Yongding Tao, Kechi Zhang, Lecheng Wang, Hao Zhu, Huanyu Liu, Jiazheng Ding, Jia Li, Jinliang Deng, Hong Mei

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental To comprehensively validate the effectiveness and scalability of FANformer, we conduct extensive experiments on language modeling tasks. The results of scaling both model parameters and training tokens highlight that FANformer consistently surpasses Transformer, requiring only 69.2% of model parameters or 79.7% of training tokens to achieve comparable performance. We also implement a complete pretraining pipeline to pretrain a 1.1-billion parameter FANformer (FANformer-1B) on 1 trillion tokens. Experiments on various downstream tasks demonstrate that FANformer-1B outperforms open-source LLMs of the same size with fewer training tokens, and exceeds LLMs with three times the parameters when using the same training token. Through further analysis, we reveal that FANformer is a superior choice compared to other variant architectures and discover three interesting findings: 1) By observing the training process, we discover the notable enhancements in FANformer's learning efficiency over Transformer as the model continues to learn from the data. 2) FANformer facilitates the rule-based reasoning paradigm, mitigating the occurrence of "holes" inherent in the case-based learning of Transformer [Hu et al., 2024]. Under the stress test of logical reasoning [Wang et al., 2024], FANformer-1B demonstrates superior performance compared to OLMo-1B and Qwen2.5-1.5B. 3) FANformer's representational capacity consistently surpasses that of Transformer across various layer depths, as evidenced by evaluations of the model's Lipschitz constant [Latorre et al., 2020].
Researcher Affiliation Collaboration (1) School of Computer Science, Peking University; (2) aiXcoder; (3) The Hong Kong University of Science and Technology; (4) Advanced Institute of Big Data
Pseudocode Yes Figure 2: Left: The illustration of FANformer's architecture. Right: The pseudocode of Multi-head ATF, where p is the hyperparameter that controls the proportion of periodicity modeling for X_p.
Open Source Code Yes Our code is available at https://github.com/YihongDong/FANformer.
Open Datasets Yes For pretraining FANformer-1B, we randomly sample 1T training tokens from OLMo's training data, i.e., Dolma [Soldaini et al., 2024]. For other experiments, we train models on a smaller sample of Dolma, i.e., Dolma v1_6-sample [Allen AI, 2023], with roughly 10B tokens.
Dataset Splits Yes We sample 400K training data from the function of mod 5 and train a 110M Transformer for 4K epochs. (...) Specifically, we extract a square comprising 441 samples (from a total of approximately 10,000 samples) with a side length of 20 to form our test set, leaving the remainder as the training set.
Hardware Specification Yes The experiments are conducted on 80 A100 GPUs. (...) The configuration of benchmark test: we run for 20 iterations on a single GPU of A100 80G with a fixed sequence length of 4096 tokens and float16 precision.
Software Dependencies No We train FANformer-1B using the ZeRO optimizer strategy [Rajbhandari et al., 2020] via PyTorch's DDP framework [Li, 2018]. Following OLMo [Groeneveld et al., 2024], we use a constant global batch size of approximately 4M tokens (2048 instances, each with a sequence length of 2048 tokens). To improve throughput, we employ PyTorch's amp module with the bfloat16 format. We employ the AdamW optimizer [Loshchilov and Hutter, 2019] for the model's training process. The learning rate for all LLMs is set to 4.0e-4. We warm up the learning rate over 2000 steps (~8B tokens) and then decay it in a cosine manner from there down to a tenth of the peak learning rate over the remainder of training. We employ FlashAttention [Dao et al., 2022] to accelerate the model training and inference processes, leveraging its ability to optimize memory usage and computational efficiency.
Experiment Setup Yes We build FANformer upon the foundation of OLMo [Groeneveld et al., 2024], as it provides a solid pretraining framework of LLMs, with the hyperparameter p set to 0.25 by default. (...) We train FANformer-1B using the ZeRO optimizer strategy [Rajbhandari et al., 2020] via PyTorch's DDP framework [Li, 2018]. Following OLMo [Groeneveld et al., 2024], we use a constant global batch size of approximately 4M tokens (2048 instances, each with a sequence length of 2048 tokens). To improve throughput, we employ PyTorch's amp module with the bfloat16 format. We employ the AdamW optimizer [Loshchilov and Hutter, 2019] for the model's training process. The learning rate for all LLMs is set to 4.0e-4. We warm up the learning rate over 2000 steps (~8B tokens) and then decay it in a cosine manner from there down to a tenth of the peak learning rate over the remainder of training.
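
The learning-rate schedule and batch arithmetic quoted in the Experiment Setup row can be sketched in plain Python. This is an illustrative reading of the quoted text, not the authors' training code. The peak rate (4.0e-4), warm-up length (2000 steps), decay floor (a tenth of the peak), and batch shape (2048 instances of 2048 tokens) come from the quote. The linear shape of the warm-up, the function names, and the `total_steps` argument are assumptions introduced here.

```python
import math

PEAK_LR = 4.0e-4               # peak learning rate, from the quoted setup
WARMUP_STEPS = 2000            # warm-up length, from the quoted setup
MIN_LR_RATIO = 0.1             # cosine decay ends at a tenth of the peak
TOKENS_PER_STEP = 2048 * 2048  # 2048 instances x 2048 tokens per step


def learning_rate(step: int, total_steps: int) -> float:
    """Learning rate at `step`; `total_steps` is a hypothetical run length."""
    if step < WARMUP_STEPS:
        # Assumption: warm-up is linear from 0; the source does not say.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the post-warm-up phase completed, clamped to [0, 1].
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(progress, 1.0)
    # Cosine factor falls from 1 (start of decay) to 0 (end of training).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = PEAK_LR * MIN_LR_RATIO
    return min_lr + (PEAK_LR - min_lr) * cosine


def warmup_tokens() -> int:
    """Tokens consumed during warm-up at the quoted global batch size."""
    return WARMUP_STEPS * TOKENS_PER_STEP
```

Under these assumptions each step consumes 4,194,304 tokens (the "approximately 4M" in the quote), and the 2000 warm-up steps consume 8,388,608,000 tokens, which matches the quoted figure of roughly 8B tokens.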