Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Attention-based clustering

Authors: Rodrigo Maulen Soto, Pierre Marion, Claire Boyer

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our theoretical findings are supported by numerical experiments under varying conditions, including different initialization regimes, mixture separability levels, and problem dimensionalities. Overall, we show that attention-based predictors can successfully adapt to mixture models by learning the underlying centroids through training.
Researcher Affiliation Academia Rodrigo Maulen-Soto (Sorbonne Université, LPSM, Paris, France); Pierre Marion (EPFL, Institute of Mathematics, Lausanne, Switzerland); Claire Boyer (Laboratoire de Mathématiques d'Orsay, Université Paris-Saclay, Institut Universitaire de France)
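Pseudocode No The method Projected Stochastic Gradient Descent (PSGD), which we run for our numerical experiments. PSGD iterates for linear attention heads. Given the objective function $R_\rho : (\mathbb{S}^{d-1})^2 \to \mathbb{R}$ defined in ($P_\rho$), we define $h$ over $(\mathbb{S}^{d-1})^2$ and $\mathbb{R}^{L \times d}$ as ... Then, given an initialization $(\mu_0^0, \mu_1^0) \in (\mathbb{S}^{d-1})^2$ and a stepsize $\gamma$, we define $(\mu_0^k, \mu_1^k) \in (\mathbb{S}^{d-1})^2$ recursively by: $\mu_0^{k+1} = \frac{\mu_0^k - \gamma (I_d - \mu_0^k (\mu_0^k)^\top) g_0^k}{\| \mu_0^k - \gamma (I_d - \mu_0^k (\mu_0^k)^\top) g_0^k \|_2}$, $\mu_1^{k+1} = \frac{\mu_1^k - \gamma (I_d - \mu_1^k (\mu_1^k)^\top) g_1^k}{\| \mu_1^k - \gamma (I_d - \mu_1^k (\mu_1^k)^\top) g_1^k \|_2}$,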
Open Source Code Yes Code availability Our code is available at https://github.com/rodrigomaulen/Attention-based-clustering
Open Datasets No input data is generated from a Gaussian mixture model. We use input sequences of length L = 30 of 5-dimensional tokens (d = 5), and define the true centroids as $\mu_0^\star = (0, 0, 0, 0, 1)$ and $\mu_1^\star = (-1, 0, 0, 0, 0)$.
Dataset Splits No In what follows, we use the metric referred to as distance to the centroids (up to a sign), given by $\min_{\pi \in S_2} \min_{s \in \{-1, 1\}^2} \sum_{i=0}^{1} \| \hat{\mu}_{\pi(i)} - s_i \mu_i^\star \|_2$, where $S_2$ is the permutation group of two elements, $\mu_0^\star, \mu_1^\star$ denote the true centroids, and $\hat{\mu}_0, \hat{\mu}_1$ are the parameters returned by (PSGD).
Hardware Specification No All experiments in Section A and 3 can be run on a standard laptop. Most complete within a few minutes, with the exception of those in Figures 6a and 2a, which require approximately 20 minutes and up to an hour, respectively, due to repeated problem-solving across a grid of regularization strengths.
Software Dependencies Yes Gradient computations in the numerical experiments were carried out using JAX (Bradbury et al., 2018).
Experiment Setup Yes We use input sequences of length L = 30 of 5-dimensional tokens (d = 5)... we perform $10^4$ (PSGD) iterations without regularization (ρ = 0) with a learning rate of γ = 0.01, λ = 0.6, batch size M = 256. The experiment is repeated across 10 independent runs, each initialized randomly on the manifold M.
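
The projected update on the unit sphere and the sign- and permutation-invariant distance quoted in the table can be sketched as follows. This is a minimal illustration in plain Python, not the authors' JAX code. The helper names (`normalize`, `psgd_step`, `centroid_distance`) are hypothetical, the quadratic toy objective is an assumption that stands in for the paper's risk (which is elided above), the gradient is exact rather than stochastic, and the metric is assumed to use the Euclidean norm.

```python
import itertools
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def psgd_step(mu, grad, gamma):
    """One projected step on the unit sphere.

    Projects `grad` onto the tangent space at `mu`, i.e. (I - mu mu^T) grad,
    moves against it with stepsize `gamma`, then renormalizes.
    """
    inner = sum(m * g for m, g in zip(mu, grad))
    tangent = [g - inner * m for m, g in zip(mu, grad)]
    return normalize([m - gamma * t for m, t in zip(mu, tangent)])


def centroid_distance(est, true):
    """Sum of Euclidean distances, minimized over label swaps and sign flips."""
    best = math.inf
    for perm in itertools.permutations(range(2)):
        for signs in itertools.product((-1.0, 1.0), repeat=2):
            total = 0.0
            for i in range(2):
                diff = [e - signs[i] * t for e, t in zip(est[perm[i]], true[i])]
                total += math.sqrt(sum(x * x for x in diff))
            best = min(best, total)
    return best


# Hypothetical toy objective (NOT the paper's risk): f(mu) = -<mu, target>^2,
# minimized on the sphere at +/- target; its gradient is -2 <mu, target> target.
target = [0.0, 0.0, 0.0, 0.0, 1.0]
mu = normalize([1.0, 1.0, 1.0, 1.0, 2.0])  # arbitrary starting point
for _ in range(10**4):  # iteration count and stepsize taken from the setup row
    c = sum(m * t for m, t in zip(mu, target))
    grad = [-2.0 * c * t for t in target]
    mu = psgd_step(mu, grad, gamma=0.01)

# Illustrative unit vectors; signs are immaterial for this metric.
true = [[0.0, 0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0, 0.0]]
# Estimates with labels swapped and one sign flipped.
est = [[-1.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 1.0]]
dist = centroid_distance(est, true)
print(mu, dist)
```

In this toy run `mu` stays on the unit sphere and approaches `target`, while `dist` evaluates to zero because the metric ignores which head recovers which centroid and with which sign.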