Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models
Authors: Zhijian Zhuo, Ya Wang, Yutao Zeng, Xiaoqing Li, Xun Zhou, Jinwen Ma
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct empirical experiments on the pre-training configurations of large language models (LLMs), including both dense and sparse architectures. By substituting conventional activation functions with PolyCom, we enable LLMs to capture higher-order interactions within the data, thus improving performance metrics in terms of accuracy and convergence rates. Extensive experimental results demonstrate the effectiveness of our method, showing substantial improvements over other activation functions. |
| Researcher Affiliation | Collaboration | 1. School of Mathematical Sciences, Peking University; 2. Seed-Foundation-Model, ByteDance; 3. Capital University of Economics and Business |
| Pseudocode | Yes | Algorithm 1: PyTorch-Style Implementation of PolyReLU; Algorithm 2: PyTorch-Style Implementation of PolyNorm |
| Open Source Code | Yes | Code is available at https://github.com/BryceZhuo/PolyCom. |
| Open Datasets | Yes | The dense model is trained on the RedPajama-1T dataset (Computer, 2023), which was developed by the open-source AI community to enable competitive performance against proprietary models; it is available at https://github.com/togethercomputer/RedPajama-Data. The MoE model is trained on the OLMoE Mix dataset (Muennighoff et al., 2024), available at https://huggingface.co/datasets/allenai/OLMoE-mix-0924. |
| Dataset Splits | No | The paper does not explicitly provide specific training/validation/test splits (e.g., percentages or sample counts) for the datasets used for its main experiments (Red Pajama-1T and OLMoE Mix). While it mentions training loss and validation perplexity, implying the existence of a validation set, the method or proportion of splitting is not detailed. For evaluation, it refers to standard benchmarks but does not specify their splits within the paper. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA A100-80G GPUs: 32 GPUs for the dense model and 64 GPUs for the MoE model. |
| Software Dependencies | No | The paper mentions PyTorch implementations (Appendix D), the AdamW optimizer (Section 4.1), timm (Appendix H), and LM Eval Harness (Section 4.1). However, it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | Unless otherwise specified, we use a 3-order PolyCom by default and initialize the coefficients as ai = 1/3 for i = 1, 2, 3 and set a0 = 0. Model weights are randomly initialized. For optimization, we apply the AdamW optimizer with β1 = 0.9 and β2 = 0.95. All models are trained on sequences of 4096 tokens. For the dense model, we set the initial learning rate to 3e-4, decaying to 1.5e-5 using a cosine scheduler. The MoE model starts with a learning rate of 4e-4, also decaying according to a cosine schedule. We summarize the hyperparameters in Table 7. |
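The pseudocode row refers to PyTorch-style implementations of PolyReLU (Algorithm 1 in the paper). A minimal numpy sketch of a PolyReLU-style activation is given below, using the default initialization reported in the setup row (a0 = 0, ai = 1/3 for i = 1, 2, 3); the function name and numpy framing are illustrative, not the authors' code, and in training these coefficients would be learnable parameters.

```python
import numpy as np

def poly_relu(x, coeffs=(0.0, 1/3, 1/3, 1/3)):
    """Sketch of a PolyReLU-style activation: a weighted sum of
    integer powers of ReLU(x). Coefficients follow the paper's
    default initialization (a0 = 0, a_i = 1/3 for i = 1..3)."""
    r = np.maximum(x, 0.0)  # elementwise ReLU
    # sum_i a_i * ReLU(x)^i; the i = 0 term is the constant a0
    return sum(a * r**i for i, a in enumerate(coeffs))
```

For x = 3 this yields (3 + 9 + 27) / 3 = 13, while negative inputs map to the constant term a0 = 0, matching ReLU's behavior on the negative half-line.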
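The setup row also describes a cosine learning-rate decay for the dense model, from 3e-4 down to 1.5e-5. A small sketch of such a schedule is shown below; the endpoint values come from the paper, but the function name and the absence of a warmup phase are assumptions of this sketch.

```python
import math

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=1.5e-5):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps,
    using the dense-model endpoints reported in the experiment setup."""
    t = min(step, total_steps) / total_steps  # progress in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```

The MoE model would use the same shape with lr_max = 4e-4.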