Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Attention-based clustering

Authors: Rodrigo Maulen Soto, Pierre Marion, Claire Boyer

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our theoretical findings are supported by numerical experiments under varying conditions, including different initialization regimes, mixture separability levels, and problem dimensionalities. Overall, we show that attention-based predictors can successfully adapt to mixture models by learning the underlying centroids through training.
Researcher Affiliation	Academia	Rodrigo Maulen-Soto, Sorbonne Université, LPSM, Paris, France; Pierre Marion, EPFL, Institute of Mathematics, Lausanne, Switzerland; Claire Boyer, Laboratoire de Mathématiques d'Orsay, Université Paris-Saclay, Institut Universitaire de France
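Pseudocode	No	The method Projected Stochastic Gradient Descent (PSGD), which we run for our numerical experiments. PSGD iterates for linear attention heads. Given the objective function $R^\rho : (\mathbb{S}^{d-1})^2 \to \mathbb{R}$ defined in $(P_\rho)$, we define $h : (\mathbb{S}^{d-1})^2 \times \mathbb{R}^{L \times d}$ as ... Then, given and an initialization $(\mu_0^0, \mu_1^0) \in (\mathbb{S}^{d-1})^2$, a stepsize $\gamma$, we define $(\mu_0^k, \mu_1^k) \in (\mathbb{S}^{d-1})^2$ recursively by: $\mu_0^{k+1} = \frac{\mu_0^k - \gamma (I_d - \mu_0^k (\mu_0^k)^\top) g_0^k}{\| \mu_0^k - \gamma (I_d - \mu_0^k (\mu_0^k)^\top) g_0^k \|_2}$, $\mu_1^{k+1} = \frac{\mu_1^k - \gamma (I_d - \mu_1^k (\mu_1^k)^\top) g_1^k}{\| \mu_1^k - \gamma (I_d - \mu_1^k (\mu_1^k)^\top) g_1^k \|_2}$,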
Open Source Code	Yes	Code availability Our code is available at https://github.com/rodrigomaulen/Attention-based-clustering
Open Datasets	No	input data is generated from a Gaussian mixture model. We use input sequences of length L = 30 of 5-dimensional tokens (d = 5), and define the true centroids as $\mu_0^\star = (0, 0, 0, 0, 1)$ and $\mu_1^\star = (-1, 0, 0, 0, 0)$.
Dataset Splits	No	In what follows, we use the metric referred to as distance to the centroids (up to a sign), given by $\min_{\pi \in S_2} \min_{s \in \{-1,1\}^2} \sum_{i=0}^{1} \| \hat{\mu}_{\pi(i)} - s_i \mu_i^\star \|_2$, where $S_2$ is the permutation group of two elements, $\mu_0^\star, \mu_1^\star$ denote the true centroids, respectively, while $\hat{\mu}_0, \hat{\mu}_1$ are the parameters returned by (PSGD).
Hardware Specification	No	All experiments in Section A and 3 can be run on a standard laptop. Most complete within a few minutes, with the exception of those in Figures 6a and 2a, which require approximately 20 minutes and up to an hour, respectively, due to repeated problem-solving across a grid of regularization strengths.
Software Dependencies	Yes	Gradient computations in the numerical experiments were carried out using JAX (Bradbury et al., 2018).
Experiment Setup	Yes	We use input sequences of length L = 30 of 5-dimensional tokens (d = 5)... we perform $10^4$ (PSGD) iterations without regularization (ρ = 0) with a learning rate of γ = 0.01, λ = 0.6, batch size M = 256. The experiment is repeated across 10 independent runs, each initialized randomly on the manifold M.
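
The PSGD update and the distance-to-centroids metric quoted in the table can be illustrated with a minimal sketch. It is not the authors' code: their repository uses JAX, while this version uses only the Python standard library so that it runs anywhere. The function names `psgd_step` and `centroid_distance`, the gradient passed in as a plain list, and the toy vectors are all assumptions made for illustration. The step projects the gradient onto the tangent space of the unit sphere, moves against it, and renormalizes; the metric takes the minimum over permutations and sign flips of the summed Euclidean distances.

```python
import itertools
import math


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def psgd_step(mu, grad, gamma):
    """One projected gradient step on the unit sphere (hypothetical helper).

    Computes mu - gamma * (I - mu mu^T) grad and rescales it to unit norm.
    `grad` stands in for the minibatch gradient, which is not computed here.
    """
    coeff = dot(mu, grad)
    # (I - mu mu^T) grad = grad - (mu . grad) mu
    tangent = [g - coeff * m for g, m in zip(grad, mu)]
    moved = [m - gamma * t for m, t in zip(mu, tangent)]
    norm = math.sqrt(dot(moved, moved))
    return [x / norm for x in moved]


def centroid_distance(estimates, centroids):
    """Minimum over permutations and signs of sum_i ||est[pi(i)] - s_i * true[i]||_2."""
    k = len(centroids)
    best = math.inf
    for perm in itertools.permutations(range(k)):
        for signs in itertools.product((-1.0, 1.0), repeat=k):
            total = 0.0
            for i in range(k):
                diff = [e - signs[i] * t
                        for e, t in zip(estimates[perm[i]], centroids[i])]
                total += math.sqrt(dot(diff, diff))
            best = min(best, total)
    return best


# Illustrative values only: a unit vector, an arbitrary gradient, step size 0.01.
updated = psgd_step([0.0, 0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0, 2.0], 0.01)

# Estimates that equal the toy centroids up to a swap and a sign flip.
toy_centroids = [[0.0, 0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0, 0.0]]
toy_estimates = [[1.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, -1.0]]
dist = centroid_distance(toy_estimates, toy_centroids)
```

In this toy case the updated vector stays on the unit sphere, and the metric is zero because the estimates match the centroids once labels are swapped and one sign is flipped. That invariance is what "up to a sign" refers to in the table.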