Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models
Authors: Zhijian Zhuo, Ya Wang, Yutao Zeng, Xiaoqing Li, Xun Zhou, Jinwen Ma
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct empirical experiments on the pre-training configurations of large language models (LLMs), including both dense and sparse architectures. By substituting conventional activation functions with PolyCom, we enable LLMs to capture higher-order interactions within the data, thus improving performance metrics in terms of accuracy and convergence rates. Extensive experimental results demonstrate the effectiveness of our method, showing substantial improvements over other activation functions. |
| Researcher Affiliation | Collaboration | 1. School of Mathematical Sciences, Peking University; 2. Seed-Foundation-Model, ByteDance; 3. Capital University of Economics and Business |
| Pseudocode | Yes | Algorithm 1: PyTorch-Style Implementation of PolyReLU; Algorithm 2: PyTorch-Style Implementation of PolyNorm |
| Open Source Code | Yes | Code is available at https://github.com/BryceZhuo/PolyCom. |
| Open Datasets | Yes | The dense model is trained on the RedPajama-1T dataset (Computer, 2023), which was developed by the open-source AI community to enable competitive performance against proprietary models; it is available at https://github.com/togethercomputer/RedPajama-Data. The MoE model is trained on the OLMoE Mix dataset (Muennighoff et al., 2024), available at https://huggingface.co/datasets/allenai/OLMoE-mix-0924. |
| Dataset Splits | No | The paper does not explicitly provide specific training/validation/test splits (e.g., percentages or sample counts) for the datasets used for its main experiments (Red Pajama-1T and OLMoE Mix). While it mentions training loss and validation perplexity, implying the existence of a validation set, the method or proportion of splitting is not detailed. For evaluation, it refers to standard benchmarks but does not specify their splits within the paper. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA A100-80G GPUs: 32 GPUs for the dense model and 64 GPUs for the MoE model. |
| Software Dependencies | No | The paper mentions PyTorch implementations (Appendix D), the AdamW optimizer (Section 4.1), timm (Appendix H), and LM Eval Harness (Section 4.1). However, it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | Unless otherwise specified, we use a 3-order PolyCom by default and initialize the coefficients as ai = 1/3 for i = 1, 2, 3 and set a0 = 0. Model weights are randomly initialized. For optimization, we apply the AdamW optimizer with β1 = 0.9 and β2 = 0.95. All models are trained on sequences of 4096 tokens. For the dense model, we set the initial learning rate to 3e-4, decaying to 1.5e-5 using a cosine scheduler. The MoE model starts with a learning rate of 4e-4, also decaying according to a cosine schedule. We summarize the hyperparameters in Table 7. |
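The pseudocode row refers to PyTorch-style implementations of PolyReLU (Algorithm 1 in the paper). A minimal numpy sketch of a PolyReLU-style activation is given below, using the default initialization reported in the setup row (a0 = 0, ai = 1/3 for i = 1, 2, 3); the function name and numpy framing are illustrative, not the authors' code, and in training these coefficients would be learnable parameters.

```python
import numpy as np

def poly_relu(x, coeffs=(0.0, 1/3, 1/3, 1/3)):
    """Sketch of a PolyReLU-style activation: a weighted sum of
    integer powers of ReLU(x). Coefficients follow the paper's
    default initialization (a0 = 0, a_i = 1/3 for i = 1..3)."""
    r = np.maximum(x, 0.0)  # elementwise ReLU
    # sum_i a_i * ReLU(x)^i; the i = 0 term is the constant a0
    return sum(a * r**i for i, a in enumerate(coeffs))
```

For x = 3 this yields (3 + 9 + 27) / 3 = 13, while negative inputs map to the constant term a0 = 0, matching ReLU's behavior on the negative half-line.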
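The setup row also describes a cosine learning-rate decay for the dense model, from 3e-4 down to 1.5e-5. A small sketch of such a schedule is shown below; the endpoint values come from the paper, but the function name and the absence of a warmup phase are assumptions of this sketch.

```python
import math

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=1.5e-5):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps,
    using the dense-model endpoints reported in the experiment setup."""
    t = min(step, total_steps) / total_steps  # progress in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```

The MoE model would use the same shape with lr_max = 4e-4.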