Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

QuadEnhancer: Leveraging Quadratic Transformations to Enhance Deep Neural Networks

Authors: Qian Chen, Linxin Yang, Akang Wang, Xiaodong Luo, Yin Zhang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct a set of proof-of-concept experiments for the proposed method across three tasks: image classification, text classification, and fine-tuning large-language models. In all tasks, the proposed approach demonstrates clear and substantial performance gains.
Researcher Affiliation	Academia	Qian Chen1,2, Linxin Yang2,3, Akang Wang2,3, Xiaodong Luo2,3, and Yin Zhang3. 1School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China; 2Shenzhen International Center for Industrial and Applied Mathematics, Shenzhen Research Institute of Big Data, China; 3School of Data Science, The Chinese University of Hong Kong, Shenzhen, China
Pseudocode	No	No explicit pseudocode or algorithm blocks are provided. The methodology is described using mathematical equations and a workflow diagram (Figure 2).
Open Source Code	Yes	The experimental code is publicly available at https://github.com/chitar/QuadEnhancer.
Open Datasets	Yes	Our experiments begin with ImageNet-1k for the initial pre-training stage. For downstream evaluation, we use six widely recognized benchmarks: Caltech [9], CIFAR-10, CIFAR-100 [20], Flowers [32], Food [2], and Pets [33]. For pre-training, we use the WikiText-2 dataset [30]... For downstream text classification, we utilize six standard benchmarks: IMDB (movie review sentiment analysis) [27], Yelp (restaurant review sentiment) [17], AG-News (topic classification) [49], SST-2 (Stanford Sentiment Treebank) [41], and Emotion (emotion recognition) [39]. We use several benchmark datasets including BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, and OBQA. Detailed descriptions of these datasets are provided in the appendix of [16].
Dataset Splits	Yes	Following common practice, our experiments involve initial pre-training on a large-scale dataset, subsequently fine-tuning the pre-trained models on various target datasets. Specifically, we first pre-train on ImageNet-1k [20], then fine-tune and evaluate the models across several diverse downstream datasets. Each model is pre-trained on WikiText-2 for 20 epochs... Following pre-training, models are fine-tuned on each classification dataset for 10 epochs... We use several benchmark datasets including BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, and OBQA.
Hardware Specification	Yes	All experiments were conducted using four NVIDIA A100 80GB.
Software Dependencies	No	The paper discusses FP16 precision and the LoRA algorithm, but does not provide specific software names with version numbers for libraries or frameworks used in the implementation.
Experiment Setup	Yes	The training parameters, including batch size, learning rate, number of epochs, and total training duration, are consistent with the settings outlined in [29]. Each model is pre-trained on WikiText-2 for 20 epochs, with batch size 128, learning rate 0.0001, and a maximum sequence length of 256 tokens... Following pre-training, models are fine-tuned on each classification dataset for 10 epochs, with learning rate 0.00005, batch size 16, and other optimizer settings unchanged. Our training configurations follow the established settings from prior works [16, 26].
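
The text-classification hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is an illustration only, not the authors' code: the `StageConfig` class, the `steps_per_epoch` helper, and the placeholder dataset name for fine-tuning are hypothetical, and values the row does not report (optimizer, fine-tuning sequence length) are left unset rather than guessed.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage (hypothetical container)."""
    dataset: str
    epochs: int
    batch_size: int
    learning_rate: float
    max_seq_len: Optional[int] = None  # None where the source does not report it


# Values as quoted in the Experiment Setup row for text classification.
PRETRAIN = StageConfig(
    dataset="WikiText-2",
    epochs=20,
    batch_size=128,
    learning_rate=1e-4,
    max_seq_len=256,
)

# The fine-tuning dataset varies per benchmark, so a placeholder name is used.
FINETUNE = StageConfig(
    dataset="<classification dataset>",
    epochs=10,
    batch_size=16,
    learning_rate=5e-5,
)


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of optimizer steps per epoch, keeping the last partial batch."""
    return math.ceil(num_examples / batch_size)
```

For a dataset of N training examples, the total number of fine-tuning steps would then be `FINETUNE.epochs * steps_per_epoch(N, FINETUNE.batch_size)`, assuming one optimizer step per batch (no gradient accumulation), which the row does not confirm.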