Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models
Authors: Tyler Chang, Benjamin Bergen
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find that bigram subnetworks can be found in fully trained language models up to 1B parameters, and these subnetworks are critical for model performance even when they consist of less than 0.2% of model parameters. Bigram subnetworks are concentrated in the first Transformer MLP layer, and they overlap significantly with subnetworks trained to optimally prune a given model. Mechanistically, the bigram subnetworks often recreate a pattern from the full models where the first layer induces a sharp change that aligns activations with next token predictions rather than current token representations. Our results demonstrate that bigram subnetworks comprise a minimal subset of parameters that are both necessary and sufficient for basic next token predictions in language models, and they help drive the transformation from current to next token activations in the residual stream. |
| Researcher Affiliation | Academia | Tyler A. Chang Department of Cognitive Science University of California San Diego EMAIL Benjamin K. Bergen Department of Cognitive Science University of California San Diego EMAIL |
| Pseudocode | No | The paper describes methods and procedures in prose, but does not include any explicitly labeled 'Pseudocode', 'Algorithm', or structured code-like blocks. |
| Open Source Code | Yes | Code and trained bigram subnetworks at: https://github.com/tylerachang/bigram-subnetworks. |
| Open Datasets | Yes | All bigram subnetworks are trained and evaluated using English web text data from OSCAR (Abadji et al., 2021). ... We estimate the bigram distribution by counting bigram frequencies in 1.28B tokens of OSCAR web text (10M sequences of 128 tokens; Abadji et al., 2021) |
| Dataset Splits | Yes | We train each subnetwork on sequences of 128 tokens with batch size 32 and learning rate 5e-5, and we set the mask sigmoid temperature to divide by 1.001 per training step (dictating how fast M approaches a binary mask). ... We evaluate this loss on a held out subset of 1.2M tokens from OSCAR (Abadji et al., 2021). |
| Hardware Specification | Yes | Each subnetwork takes approximately four to twelve hours to train on a single NVIDIA RTX A6000 (48GB) GPU, depending primarily on model size. |
| Software Dependencies | No | The paper mentions methods like 'continuous sparsification' and references models like 'Pythia' and 'GPT-2', but does not specify software library names with version numbers (e.g., PyTorch 1.9, Python 3.8). |
| Experiment Setup | Yes | To find bigram subnetworks, we use continuous sparsification (§3.1; Savarese et al., 2020; Lepori et al., 2023b), which optimizes a mask M over frozen model parameters to minimize a given loss function. ... To optimize M to mimic the bigram distribution P, we use the following loss function for model input x: $\mathrm{Loss}(M, x) = \mathrm{CrossEntropy}\big(P(x), \mathrm{MaskedModel}_M(x)\big) + \lambda \lVert M \rVert_1$ ... we train subnetworks for $\lambda \in [0, 1, 5, 10, 50, 100, 500, 1000]$ to evaluate the effects of sparsity on subnetwork performance. ... We train each subnetwork on sequences of 128 tokens with batch size 32 and learning rate 5e-5, and we set the mask sigmoid temperature to divide by 1.001 per training step (dictating how fast M approaches a binary mask). |
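
The quoted setup combines three pieces: a bigram distribution estimated from counts, a sigmoid mask whose temperature shrinks each step, and a loss that adds an L1 sparsity penalty to a cross entropy term. The toy sketch below illustrates them in pure Python. It is not the authors' implementation (see their repository for that). The function names, the parameterization `M = sigmoid(score / temperature)`, the starting temperature of 1.0, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def bigram_distribution(sequences):
    """Estimate P(next | current) by counting adjacent token pairs."""
    pair_counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            pair_counts[current][nxt] += 1
    return {
        current: {tok: n / sum(counts.values()) for tok, n in counts.items()}
        for current, counts in pair_counts.items()
    }


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_mask(scores, temperature):
    """Assumed parameterization: M = sigmoid(score / temperature).

    As the temperature shrinks, each entry moves toward 0 or 1.
    """
    return [sigmoid(s / temperature) for s in scores]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross entropy between two next-token distributions (dicts)."""
    return -sum(p * math.log(predicted.get(tok, 0.0) + eps)
                for tok, p in target.items())


def subnetwork_loss(target, predicted, mask, lam):
    """Cross entropy to the bigram target plus lam times the mask's L1 norm."""
    return cross_entropy(target, predicted) + lam * sum(abs(m) for m in mask)


# Toy corpus standing in for the OSCAR sequences (hypothetical data).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
P = bigram_distribution(corpus)

# Temperature divided by 1.001 per step, as in the quoted setup;
# the initial value of 1.0 is an assumption.
temperature = 1.0
for _ in range(1000):
    temperature /= 1.001

mask = soft_mask([2.0, -1.0, 0.5], temperature)  # hypothetical mask scores
uniform_guess = {"cat": 0.5, "dog": 0.5}         # stand-in for the masked model's output
loss = subnetwork_loss(P["the"], uniform_guess, mask, lam=10.0)
```

In this sketch, a larger `lam` penalizes every mask entry that stays near 1, which is how the λ sweep in the quoted setup trades subnetwork size against closeness to the bigram distribution.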