Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling

Authors: Zhuo Chen, Oriol Comas, Zhuotao Jin, Di Luo, Marin Soljacic

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We validate the framework and its predictions on transformer and state-space models. Our work provides a principled foundation to understand long-context modeling and to design more efficient architectures with stronger long-context capabilities, with potential applications beyond natural language. ... 4. We validate our framework and its predictions across transformer and state-space model (SSM) architectures on both synthetic and natural language datasets of varying lengths. ... 6 Empirical Verification Sub-volume-law Gaussian Test. We first validate our theory using a synthetic dataset comprising a family of multivariate Gaussian distributions (see Appx. C for details). ... PG19 Test. We then extend our analysis to the PG19 dataset [90], a high-quality collection of pre-1919 books exhibiting long contextual dependencies. In Fig. 4, we show the position-wise conditional negative log likelihood (NLL) of models trained on the PG19 dataset [90] with 8192-token sequences, where calculating KL-divergence is not feasible.
Researcher Affiliation Academia Zhuo Chen$^{1,2}$, Oriol Mayné i Comas$^{2,3}$, Zhuotao Jin$^{2,4}$, Di Luo$^{1,2,4,5}$, Marin Soljačić$^{1,2}$; $^1$NSF AI Institute for Artificial Intelligence and Fundamental Interactions, $^2$Massachusetts Institute of Technology, $^3$Polytechnic University of Catalonia, $^4$Harvard University, $^5$University of California, Los Angeles. EMAIL
Pseudocode No The paper describes methods and derivations but does not include any explicitly labeled pseudocode or algorithm blocks. The procedural descriptions are integrated into the main text.
Open Source Code Yes F.VII Code Availability The code for reproducing our mutual information estimation and the PG19 results is available at https://github.com/LSquaredM/mutual_info_scaling_law.
Open Datasets Yes Our empirical analysis in Fig. 2(b) focuses on equal-length partitions of X and Y (ℓ = L/2), where the bipartite mutual information tends to maximize for fixed L. Nevertheless, the same analysis can be carried out using other partitions where similar results can be obtained (with the results in Appx. B.II). Using both the bias-corrected direct estimator [Eq. (6)] and vCLUB estimator [Eq. (7)], we measure scaling on the PG19 dataset [90] (a collection of books before 1919), employing the LLaMA 3.1 405B model [89] as density estimator q. All measurements robustly demonstrate a clear power-law scaling that extends across thousands of tokens. Additional measurements on WIKIPEDIA [101] and using additional LLMs, along with varying ℓ/L ratios, can be found in Appx. B.I and B.II.
Dataset Splits Yes For the PG19 dataset, we train on standard average negative log likelihood. We first split the dataset into samples with a length of approximately 1.2 times the target length, ensuring each sample starts at the beginning of a sentence. We then train the models for 5 epochs (approximately 450k iterations) with a batch size of 16,384 tokens. To maintain consistency across different models, we always use the same tokenizer from GPT-NeoX [111]. ... The results reported are at the end of training using a separate evaluation dataset containing 10,000 samples.
Hardware Specification Yes F.VI Computational Resources and Implementation Details Our experiments are performed primarily on H100 GPUs, with varying VRAM sizes between 80GB and 96GB. Some experiments use A100 GPUs with 80GB VRAM instead.
Software Dependencies No The paper mentions several software components such as the "vLLM library [68]", "PyTorch [113]", and the "Hugging Face transformers library". However, it does not provide specific version numbers for these key software dependencies, which are necessary for a reproducible description.
Experiment Setup Yes F.III Training Configuration For the Gaussian distribution training, during each iteration, we use a batch size of 4 (4 times sequence length number of tokens) with freshly generated samples, meaning we never reuse any sample. We therefore have effectively a single epoch, thanks to the infinite dataset size. We train all neural networks using the AdamW optimizer and a cosine decay scheduler with warmup. We use a peak learning rate of $5 \times 10^{-5}$, a weight decay of 0.01, 2000 warmup steps, and 500,000 training steps in total. The results reported are at the end of training. For the PG19 dataset experiments, we use similar hyperparameters: AdamW optimizer with a cosine decay scheduler with warmup, peak learning rate of $5 \times 10^{-5}$, weight decay of 0.01, 2000 warmup steps, and 500,000 steps in total. The results reported are at the end of training using a separate evaluation dataset containing 10,000 samples.
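
The learning-rate schedule quoted in the Experiment Setup row can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the excerpt gives the peak learning rate, warmup steps, and total steps, but it does not say how the warmup is shaped or what the final learning rate is. The sketch assumes a linear warmup from zero and a cosine decay to zero; the names `lr_at`, `PEAK_LR`, `WARMUP_STEPS`, `TOTAL_STEPS`, and `MIN_LR` are hypothetical.

```python
import math

# Values quoted in the Experiment Setup row.
PEAK_LR = 5e-5
WARMUP_STEPS = 2_000
TOTAL_STEPS = 500_000

# Assumption: the excerpt does not state the final learning rate.
MIN_LR = 0.0


def lr_at(step: int) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay.

    The linear warmup shape and the decay target (MIN_LR) are assumptions;
    only the peak value and the step counts come from the quoted text.
    """
    if step < WARMUP_STEPS:
        # Ramp linearly from 0 up to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    # Cosine factor goes from 1 (at the peak) down to 0 (at the end).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


if __name__ == "__main__":
    for step in (0, 1_000, 2_000, 251_000, 500_000):
        print(f"step {step:>7}: lr = {lr_at(step):.3e}")
```

Under these assumptions the rate reaches its peak at step 2000, falls to half the peak midway through the decay phase (step 251,000), and reaches zero at step 500,000. A function of this shape could be passed as a per-step multiplier to a framework scheduler after dividing by `PEAK_LR`, but which scheduler the paper actually used is not stated in the excerpt.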