Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

GSPN-2: Efficient Parallel Sequence Modeling

Authors: Hongjun Wang, Yitong Jiang, Collin McCarthy, David Wehr, Hanrong Ye, Xinhao Li, Ka Chun Cheung, Wonmin Byeon, Jinwei Gu, Ke Chen, Kai Han, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Sifei Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate GSPN-2's effectiveness across image classification and text-to-image synthesis tasks, matching transformer-level accuracy with significantly lower computational cost. Our experimental evaluation comprehensively validates GSPN-2. Rigorous efficiency analysis demonstrates that GSPN-2 runs up to 30x faster than GSPN-1 across diverse input configurations, with performance profiling confirming near-optimal hardware utilization (over 90% of theoretical peak memory bandwidth). We then validate GSPN-2's effectiveness across vision tasks: image classification and text-to-image synthesis.
Researcher Affiliation | Collaboration | Hongjun Wang (1, 2), Yitong Jiang (1), Collin McCarthy (1), David Wehr (1), Hanrong Ye (1), Xinhao Li (3), Ka Chun Cheung (1), Wonmin Byeon (1), Jinwei Gu (1), Ke Chen (1), Kai Han (2), Hongxu Yin (1), Pavlo Molchanov (1), Jan Kautz (1), Sifei Liu (1); (1) NVIDIA, (2) The University of Hong Kong, (3) University of California, San Diego
Pseudocode | No | The paper describes the GSPN recurrence relation in Equations (1) and (2) and discusses the CUDA implementation details, but it does not present a structured pseudocode or algorithm block.
Open Source Code | No | Besides, we will release the code upon publication.
Open Datasets | Yes | On ImageNet, GSPN-2 achieves accuracy comparable to transformer models at significantly lower computational cost. In text-to-image synthesis, GSPN-2 significantly improves semantic consistency and visual quality when integrated with existing diffusion models. ... We evaluate GSPN-2's text-to-image generation on the COCO benchmark. ... Semantic segmentation via linear probe evaluation on ADE20K [64] and Pascal VOC [65]. ... Segmentation and depth estimation on NYUDv2 [68] and Pascal Context [69]. ... trained the model on a 1M subsample of the DataComp-1B dataset
Dataset Splits | Yes | In Table 2, we present a comparative analysis of ImageNet-1K classification performance across three architectural paradigms: ConvNet-based [29, 30], Transformer-based [31, 33, 36, 35, 46, 49], and sequential-based (RS scan) models [25, 26, 37, 27, 40, 38, 39] of varying sizes. ... We compare GSPN-2 with several relevant baselines and its predecessor, GSPN-1, on the COCO benchmark, with all models generating images at a 1024 x 1024 resolution.
Hardware Specification | Yes | on an NVIDIA A100 GPU, runtime for a representative 1024x1024x8 input drops from over 71.4 ms in GSPN-1 down to just 1.8 ms in GSPN-2, achieving a 40x speedup (as shown in Figure 3). ... NVIDIA Nsight Compute profiling indicates that GSPN-2 achieves memory throughput near the theoretical limit, with global-memory efficiency reaching 93% on A100 GPUs. ... We use 32 nodes with 8x A100-80GB GPUs each.
Software Dependencies | No | The paper discusses 'CUDA implementation' and mentions 'MMSegmentation [66]' as a tool used, but does not provide specific version numbers for these software components, nor for any other key libraries or programming languages.
Experiment Setup | Yes | For GSPN-2 models, the ImageNet experiments incorporate several key design choices: propagation weights w_i are shared across channels in all GSPN modules, and a compressive proxy dimension C_proxy is set to 2. This reduction in channel dimensionality allows the saved parameters to be reallocated for deeper or wider network architectures. Additionally, we integrate the Local Perception Unit (LPU) [52] at the beginning of each block and FFN. The MESA [53] technique is also employed to mitigate overfitting, contributing a further 0.2% accuracy improvement to some variants. ... We train for 100K iterations using the LAMB optimizer [63], a base learning rate of 1e-3, a weight decay of 0.01, and a cosine learning rate decay. ... trained the model on a 1M subsample of the DataComp-1B dataset with batch size 8096 and 12k iterations and evaluated performance on ImageNet Zero-shot.
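
The Experiment Setup entry quotes a base learning rate of 1e-3 with cosine decay over 100K iterations. As a rough illustration only, the Python sketch below computes such a schedule. The function name cosine_lr, the absence of a warmup phase, and the final learning rate of 0 are assumptions made here; the quoted text does not state them, and this is not the authors' training code.

```python
import math

BASE_LR = 1e-3          # base learning rate quoted in the Experiment Setup entry
TOTAL_ITERS = 100_000   # "100K iterations" quoted in the Experiment Setup entry
MIN_LR = 0.0            # assumption: the final learning rate is not stated


def cosine_lr(step, base_lr=BASE_LR, total_iters=TOTAL_ITERS, min_lr=MIN_LR):
    """Learning rate at `step` under plain cosine decay (assumed: no warmup)."""
    # Clamp the step so values outside [0, total_iters] stay on the schedule ends.
    progress = min(max(step, 0), total_iters) / total_iters
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print the schedule at a few checkpoints across training.
    for step in (0, 25_000, 50_000, 75_000, 100_000):
        print(step, cosine_lr(step))
```

Under these assumptions the rate starts at the base value, passes through half of it at the midpoint, and reaches the assumed floor at the last iteration. A real run would pair a schedule like this with the LAMB optimizer and the weight decay of 0.01 mentioned in the same entry.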