Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Pool Me Wisely: On the Effect of Pooling in Transformer-Based Models

Authors: Sofiane Ennadir, Levente Zólyomi, Oleg Smirnov, Tianze Wang, John Pertoft, Filip Cornell, Lele Cao

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically evaluate pooling strategies across tasks requiring both global and local contextual understanding, spanning three major modalities: computer vision, natural language processing, and time-series analysis. Results reveal consistent trends in how pooling choices affect accuracy, sensitivity, and optimization behavior.
Researcher Affiliation | Industry | Sofiane Ennadir, King AI Labs, Microsoft Gaming, EMAIL; Levente Zólyomi, NXAI GmbH, EMAIL; Oleg Smirnov, King AI Labs, Microsoft Gaming, EMAIL; Tianze Wang, Kreditz AB, EMAIL; John Pertoft, King AI Labs, Microsoft Gaming, EMAIL; Filip Cornell, Amazon, EMAIL; Lele Cao, King AI Labs, Microsoft Gaming, EMAIL
Pseudocode | No | The paper describes methods mathematically and textually (e.g., Section 3 Preliminaries and Section 4 On the Expressivity of Transformer-Based Models), but it does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | Our code and implementation to reproduce the results is available in the following link: https://github.com/king/transformer-pooling.
Open Datasets | Yes | For each modality, we select a diverse set of established benchmarks with tasks requiring global and local contexts. ... (a) computer vision, (b) natural language processing, and (c) time series analysis. ... For classification (CIFAR-10/100 [17], ImageNet-100 [34], MiniPlaces [47], Caltech-UCSD Birds (CUB) [42]), ... For inpainting (CelebA [25], Oxford Flowers [29], Oxford-IIIT Pet [30]), ... For segmentation (Pascal-VOC [9]), ... For STSB (Spearman) [3], ... In the HellaSwag [46] task, ... For next-token prediction, we used the TinyStories [7] corpus ... The model was pretrained on the OpenWebText [11] corpus ... We evaluate three pretrained checkpoints (AutonLab/MOMENT-1-{small, base, large}) trained on the Time Series Pile dataset [12].
Dataset Splits | No | Each dataset's training split was used for fine-tuning. Hyperparameters were selected based on validation performance (where available), and final results were reported on the held-out test set. To maintain consistent input dimensions, all sequences were padded or truncated to a predefined maximum length. Tokenization was done using each model's default tokenizer, and [PAD] tokens were used for padding. For TinyStories [7] corpus ... The training set comprised 4000 batches randomly sampled from the corpus. A randomly initialized language modeling head was trained to predict the next token based on preceding context. We used a held out test-set and randomly sampled tokens to predict.
Hardware Specification | Yes | All the experiments were run using a single NVIDIA L4 GPU and took 25 GPU hours to obtain all results. ... Experiments were conducted on an instance with 2 NVIDIA L4 GPUs using PyTorch [31] with the Distributed Data Parallel framework and a batch size of 32 per GPU. Running all experiments took 1832 GPU hours on L4 GPUs. ... Training was conducted on 8 NVIDIA L4 GPUs and took about 960 GPU hours. ... with a batch size of 64 on a single NVIDIA L4 GPU.
Software Dependencies | No | Experiments were conducted on an instance with 2 NVIDIA L4 GPUs using PyTorch [31] with the Distributed Data Parallel framework... Optimization was performed using the Adam [16] optimizer... The paper mentions software like PyTorch and Adam optimizer but does not provide specific version numbers for any of these dependencies.
Experiment Setup | Yes | We optimized with Adam [16] at a learning rate of 1 × 10⁻³. All the tasks were trained for 10 epochs... Optimization was performed using the Adam [16] optimizer with a learning rate of 1 × 10⁻³. Ten epochs of fine-tuning consistently yielded stable convergence across tasks. Each experiment was repeated five times with fixed random seeds... batch size of 32 per GPU... GPT-2 Pretraining: ... for 60 000 iterations using a batch size of 12, block size of 1024, and 40 gradient accumulation steps... Optimization was performed using Adam [16] with a learning rate of 1 × 10⁻³ with a batch size of 64 on a single NVIDIA L4 GPU. For classification, we run optimization for 20 epochs... For forecasting, we trained the prediction head for 10 epochs... For imputation, we trained the prediction head for 10 epochs...
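
The GPT-2 pretraining figures and GPU-hour counts quoted above can be combined into a rough compute estimate. The sketch below is illustrative arithmetic only, not the authors' code. It assumes that "batch size of 12" means 12 sequences of 1024 tokens each, and that the 40 gradient accumulation steps are the total per optimizer step rather than per GPU; neither is confirmed by the excerpts. The variable names are hypothetical.

```python
# Figures quoted in the Experiment Setup row (GPT-2 pretraining).
BATCH_SIZE = 12          # sequences per micro-batch (assumed interpretation)
BLOCK_SIZE = 1024        # tokens per sequence
GRAD_ACCUM_STEPS = 40    # assumed to be the total per optimizer step
ITERATIONS = 60_000

# Tokens consumed per optimizer step and over the whole run,
# under the assumptions stated above.
tokens_per_step = BATCH_SIZE * BLOCK_SIZE * GRAD_ACCUM_STEPS
total_tokens = tokens_per_step * ITERATIONS

# GPU-hour figures quoted in the Hardware Specification row.
# The fourth setup (batch size 64 on a single L4) has no hour count
# in the excerpt, so this sum covers only the three reported figures.
REPORTED_GPU_HOURS = [25, 1832, 960]
total_reported_gpu_hours = sum(REPORTED_GPU_HOURS)

print(f"tokens per optimizer step: {tokens_per_step:,}")
print(f"tokens over {ITERATIONS:,} iterations: {total_tokens:,}")
print(f"reported L4 GPU hours (sum): {total_reported_gpu_hours:,}")
```

Under these assumptions, each optimizer step covers 491,520 tokens, and the three reported hardware figures sum to 2,817 L4 GPU hours. If the accumulation steps were instead applied per GPU, the token counts would scale by the number of GPUs.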