Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

CAT: Circular-Convolutional Attention for Sub-Quadratic Transformers

Authors: Yoshihiro Yamada

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate CAT on large-scale vision (ImageNet-1k) and language (WikiText-103) tasks, demonstrating consistent speedups over standard attention and comparable or improved accuracy. In addition, an ablation study reveals key design factors, such as merging query and key projections, that enable CAT to serve as a drop-in replacement under the EIT principles. Our main contributions are as follows. Empirical validation. On ImageNet-1k and WikiText-103, CAT consistently matches or exceeds standard attention under simpler token mixing (e.g. average pooling, masked inputs), providing speedup in naive implementations.
Researcher Affiliation | Industry | Yoshihiro Yamada, Preferred Networks
Pseudocode | No | The paper describes the methods textually and mathematically using equations and descriptive paragraphs, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | Justification: We use publicly available datasets (ImageNet-1K, WikiText-103), but we have not released our code due to ongoing internal requirements.
Open Datasets | Yes | We validate CAT on large-scale vision (ImageNet-1k) and language (WikiText-103) tasks... We evaluate our CAT on two major benchmarks: ImageNet-1k Russakovsky et al. [2015] for image classification and WikiText-103 Merity et al. [2016] for language modeling.
Dataset Splits | Yes | We train our models on the ImageNet-1k training dataset Russakovsky et al. [2015]... For language modeling on the WikiText-103 training dataset Merity et al. [2016]... We report standard validation accuracy on ImageNet-1k and validation word perplexity (word PPL) on WikiText-103.
Hardware Specification | Yes | We train our models on the ImageNet-1k training dataset Russakovsky et al. [2015] using a batch size of 256 and a standard input resolution of 224 × 224 on 4 NVIDIA V100 GPUs. For language modeling on the WikiText-103 training dataset Merity et al. [2016], we train with a batch size of 128 and an initial learning rate of 2.5 × 10⁻⁴ on 4 NVIDIA V100 GPUs. All measurements were conducted on NVIDIA V100 GPUs (FP16 precision, batch size = 32) using the AdamW optimizer.
Software Dependencies | Yes | Experiments used PyTorch 2.7.0 + cu126 + cuDNN 9.5.1, with cuFFT as the FFT backend for CAT and FlashAttention (torch.nn.functional.scaled_dot_product_attention) for Self-Attention baselines.
Experiment Setup | Yes | We train our models on the ImageNet-1k training dataset Russakovsky et al. [2015] using a batch size of 256 and a standard input resolution of 224 × 224 on 4 NVIDIA V100 GPUs. The initial learning rate is set to 2 × 10⁻⁵, with weight decay of 1 × 10⁻⁴. All models are randomly initialized. We train for 50 epochs, applying a 10-epoch warmup phase followed by a cosine-annealing scheduler. We use AdamW with default hyperparameters (i.e., β1 = 0.9, β2 = 0.999). Data augmentation consists of random cropping and horizontal flipping. For language modeling on the WikiText-103 training dataset Merity et al. [2016], we train with a batch size of 128 and an initial learning rate of 2.5 × 10⁻⁴ on 4 NVIDIA V100 GPUs. We run 50 total epochs, employing a 1000-iteration warmup. We set the maximum sequence length to 256. A dropout rate of 0.1 is applied, and gradient norms are clipped at a maximum of 0.25. As with ImageNet-1k, we use AdamW Loshchilov and Hutter [2019] under default settings unless stated otherwise. Models are also randomly initialized in this setup. Finally, for masked language modeling experiments, we use a masking probability of 0.15.
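
The ImageNet-1k schedule quoted under Experiment Setup (initial learning rate 2 × 10⁻⁵, 50 epochs, 10-epoch warmup, then cosine annealing) can be illustrated with a short sketch. This is not the authors' code, which has not been released. The function name lr_at_epoch, the linear shape of the warmup, the per-epoch (rather than per-iteration) update, and the final learning rate of 0 are all assumptions made for illustration; the quoted text does not specify them.

```python
import math


def lr_at_epoch(epoch, base_lr=2e-5, warmup_epochs=10, total_epochs=50, min_lr=0.0):
    """Learning rate for a 0-indexed epoch: warmup, then cosine annealing.

    base_lr, warmup_epochs and total_epochs follow the quoted ImageNet-1k setup.
    The linear warmup shape and min_lr=0.0 are assumptions, not stated in the source.
    """
    if epoch < warmup_epochs:
        # Assumed linear ramp that reaches base_lr at the last warmup epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warmup phase already completed, in [0, 1).
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    # Cosine decay from base_lr toward min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# One learning rate per training epoch.
schedule = [lr_at_epoch(e) for e in range(50)]
```

Under these assumptions the rate climbs to 2 × 10⁻⁵ by the end of warmup and then decays smoothly over the remaining 40 epochs. The WikiText-103 setup differs (a 1000-iteration warmup and a 2.5 × 10⁻⁴ initial rate) and would need a per-iteration variant of the same idea.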