Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Composing Linear Layers from Irreducibles

Authors: Travis Pence, Daisuke Yamada, Vikas Singh

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We empirically evaluate rotors by replacing key, query, and value linear layers in pretrained LLMs and measuring downstream performance on perplexity (PPL) and accuracy. Our experiments span multiple models and datasets. The main goals are to: (G-1) Demonstrate the feasibility of composing linear layers from bivector primitives by assessing whether rotors match baseline performance of Low-Rank and Block-Hadamard approximations across diverse settings. (G-2) Quantify rotor parameter efficiency compared to dense and approximate alternatives. (G-3) Analyze how rotor architectural choices such as width and depth affect performance.
Researcher Affiliation Academia University of Wisconsin-Madison
Pseudocode Yes Algorithm 1: Differentiable Inv. Decomp.; Algorithm 2: GA Power Iteration
Open Source Code Yes Our full codebase, including datasets and hyperparameters for all experiments, is available at https://github.com/vsingh-group/ComposingLinearLayers.
Open Datasets Yes We evaluate on two pre-trained LLMs: LLaMa-3.2 1B [Touvron et al., 2023] and Qwen-2.5 1.5B [Qwen et al., 2025]. Metrics include log perplexity (↓) on three language modeling datasets (Wikitext2, C4 [Dodge et al., 2021], and PTB [Marcus et al., 1993]) and accuracy (↑) on two multiple-choice benchmarks (Arc Challenge [Clark et al., 2018] and HellaSwag [Zellers et al., 2019]). ... FMNIST [Xiao et al., 2017]
Dataset Splits Yes To fit each substitute layer (rotor, LR, or BH), we extract hidden states from the pre-trained model and minimize MSE between the projected outputs of the original and approximated layers. Each variant is trained independently using the Adam optimizer [Kingma and Ba, 2017]. In our rotor architecture, depth refers to the number of stacked rotor maps ψ, while width denotes the number of parallel rotor maps within each layer. ... For each layer, we first replace all earlier trained layers (e.g., I before J), and then extract the input-output data for training the new layer under this modified model. This is to ensure that each replacement layer is trained with respect to the distribution induced by preceding replacements. Also, whenever we replace layer L, we retrain the output linear projection W_o^L within the same attention block, using the same MSE and Adam optimizer for consistency. ... We trained both dense and rotor-based MLPs under identical conditions. ... We performed a search over learning rates η ∈ (0.001, 0.1) and selected η = 0.005 for the rotor-based model and η = 0.002 for the dense baseline based on validation accuracy.
Hardware Specification Yes The experiments for LLaMa-3.2 1B and Qwen-2.5 1.5B each took around 1500 GPU hours, Fox-1.0 1.6B around 1000 GPU hours, and LLaMa-3.2 3B around 500 GPU hours, for a total of around 4500 GPU hours. This was spread across 8 NVIDIA A100 PCIe GPUs with 40 GB of HBM2 memory.
Software Dependencies No All Clifford algebraic operations, including exponentiation of simple bivectors and sandwich products, are implemented entirely in PyTorch using the torch_ga library [Alesiani], which supports differentiation.
Experiment Setup Yes Hyperparameters such as depth, width, learning rate, and weight decay are selected via grid search; the final values along with the values we explored are listed in Tab. 6. ... Table 6: Hyperparameter settings used for each method. ... All architectural details and hyperparameters are provided in Appendix C.
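
The Software Dependencies row mentions two Clifford-algebra operations: exponentiating a simple bivector to obtain a rotor, and applying that rotor to a vector through a sandwich product. The sketch below illustrates both in plain Python for a Euclidean algebra. It is not the authors' code and does not use torch_ga; the dictionary representation and the helper names (`blade_sign`, `gp`, `reverse`, `exp_simple_bivector`, `sandwich`) are assumptions made only for this illustration.

```python
import math

def blade_sign(a, b):
    """Sign picked up when reordering basis blades a, b (bitmasks) into canonical order."""
    a >>= 1
    swaps = 0
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1.0 if swaps % 2 else 1.0

def gp(x, y):
    """Geometric product of multivectors stored as {blade_bitmask: coeff}.
    Assumes a Euclidean metric (every basis vector squares to +1)."""
    out = {}
    for a, ca in x.items():
        for b, cb in y.items():
            blade = a ^ b
            out[blade] = out.get(blade, 0.0) + blade_sign(a, b) * ca * cb
    return out

def reverse(x):
    """Reversion: a grade-k blade is multiplied by (-1)^(k(k-1)/2)."""
    out = {}
    for blade, c in x.items():
        k = bin(blade).count("1")
        out[blade] = c * (-1.0) ** (k * (k - 1) // 2)
    return out

def exp_simple_bivector(B):
    """exp(B) = cos|B| + (sin|B| / |B|) B, valid when B is simple (B*B is a scalar <= 0)."""
    mag = math.sqrt(max(0.0, -gp(B, B).get(0, 0.0)))
    if mag == 0.0:
        return {0: 1.0}
    R = {0: math.cos(mag)}
    for blade, c in B.items():
        R[blade] = math.sin(mag) / mag * c
    return R

def sandwich(R, x):
    """Apply rotor R to x via the sandwich product R x R~."""
    return gp(gp(R, x), reverse(R))

# Basis bitmasks: e1 = 0b001, e2 = 0b010, e3 = 0b100, so e1^e2 = 0b011.
theta = math.pi / 2
R = exp_simple_bivector({0b011: -theta / 2})  # rotor for a rotation in the e1-e2 plane
rotated = sandwich(R, {0b001: 1.0})           # rotate e1 by theta toward e2
```

With this sign convention, `rotated` has its weight on the e2 component, i.e. e1 is rotated a quarter turn toward e2. A differentiable version would replace the Python floats with tensors, which is the role the row attributes to PyTorch and torch_ga.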