Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

CrossSpectra: Exploiting Cross-Layer Smoothness for Parameter-Efficient Fine-Tuning

Authors: Yifei Zhang, Hao Zhu, Junhao Dong, Haoran Shi, Ziqiao Meng, Piotr Koniusz, Han Yu

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide theoretical and empirical evidence that skip connections in transformers create smooth gradient propagation across layers. This smoothness leads to weight adaptations that concentrate most of their energy in low-frequency spectral components, especially along the layer dimension. Empirical analysis confirms this effect, showing that most of adaptation energy lies in low frequencies. Through extensive experiments across natural language understanding, instruction tuning, and image classification tasks, we show that CrossSpectra matches or exceeds baseline performance while using a fraction of the parameters.
Researcher Affiliation | Collaboration | (1) School of Computer Science, Northwestern Polytechnical University; (2) CCDS, Nanyang Technological University; (3) National University of Singapore; (4) Data61 CSIRO; (5) University of New South Wales; (6) Australian National University
Pseudocode | Yes | Algorithm 1: CrossSpectra Forward Pass; Algorithm 2: CrossSpectra Backward Pass
Open Source Code | No | v. Open access to data and code Answer: [No] Justification: We promise code will be public available when paper get accepted
Open Datasets | Yes | Image Classification (IC): We fine-tune CLIP ViT-B/32 [Radford et al., 2021] on 7 standard image datasets including Stanford Cars, DTD, EuroSAT, GTSRB, RESISC45, SUN397, and SVHN [Ilharco et al., 2023]. Natural Language Understanding (NLU): We fine-tune RoBERTa-large [Liu, 2019] on the GLUE benchmark [Raffel et al., 2020b]. Commonsense Reasoning (CR): We use LLaMA2-7B [Touvron et al., 2023] on 8 reasoning benchmarks: BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, and OBQA. Arithmetic Reasoning (AR): Using LLaMA2-7B, we evaluate mathematical reasoning capabilities on GSM8K [Cobbe et al., 2021], MAWPS [Koncel-Kedziorski et al., 2016], SVAMP [Patel et al., 2021], and AQuA [Ling et al., 2017] benchmarks.
Dataset Splits | Yes | Commonsense Reasoning (CR): We use LLaMA2-7B [Touvron et al., 2023] on 8 reasoning benchmarks: BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, and OBQA. Following Hu et al. [2023], we combine training datasets from all tasks and evaluate on each test set separately. These diverse tasks let us verify that the cross-layer spectral structure we exploit exists across various model architectures and domains.
Hardware Specification | No | The paper mentions 'Modern FFT implementations on GPU further reduce this overhead through optimized memory access patterns and parallelization.' and Tables 9 and 10 provide 'Time/Epoch (s)' and 'FFT Time (s)' respectively, but no specific GPU models, CPU types, or detailed computer specifications are provided.
Software Dependencies | No | The paper mentions 'All models are trained using Adam optimizer [Kingma and Ba, 2014] with batch size 64 and cosine learning rate scheduling.' and 'Using torch.fft.ifftn'. It mentions specific optimizers and functions, but not specific version numbers for software libraries like PyTorch or Python itself.
Experiment Setup | Yes | For CrossSpectra, we adapt all query, key, and value projection matrices in transformer attention blocks, except in image classification tasks where we only adapt query and key matrices following standard practice. For frequency sparsity, we set the number of non-zero coefficients |Ω| = 3000 (corresponding to approximately k1 = 1000 samples per layer slice and k2 = 3 frequencies in the layer dimension). This represents just 0.1-0.5% of the full parameter space depending on model size. For baseline comparisons, we use LoRA with rank r = 16 and r = 32. All models are trained using Adam optimizer [Kingma and Ba, 2014] with batch size 64 and cosine learning rate scheduling. For image classification, we use separate learning rates: 1e-3 for the classification layer and 1e-5 for adaptation parameters.
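
The setup above describes a sparse set of |Ω| = 3000 trainable frequency coefficients (about k1 = 1000 positions in each of k2 = 3 layer frequencies) that are mapped back to weight updates with an inverse FFT. The following is a minimal sketch of that idea, not the authors' implementation, since no code has been released. It assumes the updates for all layers are stacked into one tensor of shape (layers, d_out, d_in), that the k2 layer frequencies are the lowest indices 0..k2-1, that in-plane positions are drawn uniformly at random, and that the real part of the inverse transform is used as the update. The function names `sample_support` and `reconstruct_delta` are hypothetical, and NumPy's `np.fft.ifftn` stands in for the `torch.fft.ifftn` call mentioned in the paper.

```python
import numpy as np


def sample_support(d_out, d_in, k1=1000, k2=3, seed=0):
    """Choose the sparse index set (assumed scheme, not from the paper).

    For each of the k2 lowest layer frequencies, draw k1 distinct
    in-plane positions uniformly at random.
    """
    rng = np.random.default_rng(seed)
    layer_freq = np.repeat(np.arange(k2), k1)
    flat = np.concatenate(
        [rng.choice(d_out * d_in, size=k1, replace=False) for _ in range(k2)]
    )
    rows, cols = np.unravel_index(flat, (d_out, d_in))
    return layer_freq, rows, cols


def reconstruct_delta(coeffs, support, shape):
    """Scatter the coefficients into an otherwise zero 3D spectrum and
    apply an inverse FFT over all three axes.

    Returns the real part, one (d_out, d_in) update per layer.
    """
    spectrum = np.zeros(shape, dtype=np.complex128)
    spectrum[support] = coeffs
    return np.fft.ifftn(spectrum).real


# Toy sizes for illustration only; real models are far larger.
num_layers, d_out, d_in = 12, 64, 64
support = sample_support(d_out, d_in, k1=1000, k2=3)
coeffs = np.random.default_rng(1).normal(size=support[0].size)
delta = reconstruct_delta(coeffs, support, (num_layers, d_out, d_in))

trainable = coeffs.size
full = num_layers * d_out * d_in
print(f"trainable coefficients: {trainable}, dense entries: {full}")
```

Because only the lowest layer frequencies are populated, the reconstructed updates vary slowly from one layer to the next, which is the cross-layer smoothness the paper sets out to exploit. The fraction of trainable coefficients in this toy example is much larger than the 0.1-0.5% quoted above, since that figure depends on the size of the adapted model.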