Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
PaceLLM: Brain-Inspired Large Language Models for Long-Context Understanding
Authors: Kangcong Li, Peng Ye, Chongjun Tu, Lin Zhang, Chunfeng Song, Jiamin Wu, Tao Yang, Qihao Zheng, Tao Chen
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations show that PaceLLM achieves 6% improvement on LongBench's Multi-document QA and 12.5-17.5% performance gains on ∞-Bench tasks, while extending measurable context length to 200K tokens in Needle-In-A-Haystack (NIAH) tests. |
| Researcher Affiliation | Collaboration | Kangcong Li¹, Peng Ye²˒³, Chongjun Tu¹, Lin Zhang¹, Chunfeng Song², Jiamin Wu², Tao Yang¹, Qihao Zheng², Tao Chen¹. ¹School of Information Science and Technology, Fudan University; ²Shanghai Artificial Intelligence Laboratory; ³The Chinese University of Hong Kong |
| Pseudocode | Yes | Algorithm 1 describes the working memory mechanism of PaceLLM, which dynamically enhances current FFN activations using a memory bank. It consists of three key phases: retrieval, enhancement, and memory update. |
| Open Source Code | No | All of the code and data used in this study as well as the necessary documentation to run it will be released upon acceptance. |
| Open Datasets | Yes | We evaluate PaceLLM on three established long-context benchmarks: LongBench [2], ∞-Bench [46] and Needle-In-A-Haystack (NIAH) [18]. To evaluate the generalization ability of our method beyond long-context tasks, we also evaluate on MMLU [13], which features shorter context lengths. |
| Dataset Splits | Yes | We evaluate PaceLLM on three established long-context benchmarks: LongBench [2], ∞-Bench [46] and Needle-In-A-Haystack (NIAH) [18]. To evaluate the generalization ability of our method beyond long-context tasks, we also evaluate on MMLU [13], which features shorter context lengths. We further evaluate on Needle-In-A-Haystack (NIAH) following the official settings [18]. |
| Hardware Specification | Yes | All experiments are conducted with 4 A100-40G GPUs. |
| Software Dependencies | No | The paper mentions using Qwen-2-7B-Instruct [39] and Llama-2-7B-chat [28] as base models, and discusses compatibility with FlashAttention. However, it does not provide specific version numbers for software libraries or dependencies such as Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | Implementation. We apply PaceLLM to Llama-2-7B-chat [28] and Qwen-2-7B-Instruct [39] in training-free and low-cost fine-tuning settings. For low-cost fine-tuning, we follow the setting of Activation Beacon [45]. All experiments are conducted with 4 A100-40G GPUs. In the base setting, the bank capacity M is set to 100, the fusion threshold θ_high is 0.7 and θ_low is 0.3, with AMB applied to the 13th and 27th layers. |
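
The Pseudocode and Experiment Setup rows name three phases (retrieval, enhancement, memory update) and a base setting (M = 100, thresholds 0.7 and 0.3, layers 13 and 27), but the table does not reproduce Algorithm 1 itself. The toy Python sketch below is only an illustration of how such a capacity-bounded memory bank could be wired up. The class names, the cosine-similarity retrieval, the averaging rule, and the way the two thresholds are used are all assumptions made for this example, not the paper's method; only the four default values come from the table.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseSetting:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    bank_capacity: int = 100       # bank capacity M
    theta_high: float = 0.7        # upper fusion threshold
    theta_low: float = 0.3         # lower fusion threshold
    amb_layers: tuple = (13, 27)   # layers where AMB is applied


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyMemoryBank:
    """Hypothetical stand-in for a memory bank; NOT the paper's Algorithm 1."""

    def __init__(self, cfg):
        self.cfg = cfg
        self.entries = []  # stored activation vectors, oldest first

    def step(self, activation):
        # Phase 1, retrieval: find the most similar stored entry.
        best_idx, best_sim = None, -1.0
        for i, entry in enumerate(self.entries):
            sim = cosine(activation, entry)
            if sim > best_sim:
                best_idx, best_sim = i, sim

        output = list(activation)
        if best_idx is not None and best_sim >= self.cfg.theta_high:
            # Phase 2, enhancement (assumed rule): average with the match.
            match = self.entries[best_idx]
            output = [(x + m) / 2 for x, m in zip(activation, match)]
            # Phase 3, memory update (assumed rule): refresh the matched entry.
            self.entries[best_idx] = list(output)
        elif best_idx is None or best_sim <= self.cfg.theta_low:
            # Phase 3, memory update (assumed rule): store a novel activation,
            # evicting the oldest entry once the bank holds M vectors.
            if len(self.entries) >= self.cfg.bank_capacity:
                self.entries.pop(0)
            self.entries.append(list(activation))
        return output


bank = ToyMemoryBank(BaseSetting())
first = bank.step([1.0, 0.0])    # empty bank: stored unchanged
second = bank.step([1.0, 0.1])   # close to the stored entry: fused with it
third = bank.step([0.0, 1.0])    # dissimilar: stored as a new entry
```

In this sketch, activations whose best similarity falls strictly between the two thresholds pass through without touching the bank. That middle band is a design choice of the example, not something the table states.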