Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Towards Understanding the Universality of Transformers for Next-Token Prediction

Authors: Michael Sander, Gabriel Peyré

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In Section 5, we first present experimental results that validate our theoretical findings and extend them to a more general class of mappings f beyond those studied in Sections 3 and 4." The paper includes a dedicated section titled 'EXPERIMENTS' where empirical results are presented and discussed, including error curves and training details.
Researcher Affiliation | Academia | "Michaël E. Sander & Gabriel Peyré, Ecole Normale Supérieure, CNRS, Paris, France, EMAIL, EMAIL." Both authors are affiliated with academic institutions (Ecole Normale Supérieure, CNRS), and their email domains (.polytechnique.org, .ens.fr) correspond to academic institutions.
Pseudocode | No | The paper describes its methods mathematically and conceptually (e.g., equation (7) for causal kernel descent and Figure 1 for illustration), but it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | "Our code will be open-sourced." This statement indicates a future intention to release code, but it does not provide concrete access to source code at the time of publication.
Open Datasets | No | "We take d = 15, n = 6, and consider instance (2) with randomly generated Ωs and x1s, for a dataset with 2^12 elements, that we split into train, validation, and test sets with respective sizes of 60%, 20%, and 20% of the original dataset." The dataset was generated by the authors ('randomly generated'), and no public access information (link, DOI, repository, or citation) is provided.
Dataset Splits | Yes | "We take d = 15, n = 6, and consider instance (2) with randomly generated Ωs and x1s, for a dataset with 2^12 elements, that we split into train, validation, and test sets with respective sizes of 60%, 20%, and 20% of the original dataset."
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions using 'Adam (Kingma & Ba, 2014)' for training but does not provide specific version numbers for any software, libraries, or frameworks used (e.g., PyTorch, TensorFlow, Python version).
Experiment Setup | Yes | "We take d = 15, n = 6, and consider instance (2)... We train the model using Adam (Kingma & Ba, 2014) on the Mean Squared Error (MSE) loss for next-token prediction on sequences of length T = 100... We train for 5000 epochs with early stopping."
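The 60% / 20% / 20% split quoted in the Dataset Splits row can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name `split_indices` is hypothetical, and the dataset size assumes the quoted "212 elements" is a lost superscript for 2^12 = 4096.

```python
import numpy as np

def split_indices(n_total, seed=0):
    """Shuffle indices and cut them 60% / 20% / 20% into
    train / validation / test, matching the proportions in the quote."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_total)
    n_train = int(0.6 * n_total)
    n_val = int(0.2 * n_total)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# Assumed dataset size: 2^12 = 4096 randomly generated examples.
train, val, test = split_indices(2 ** 12)
```

Splitting by shuffled indices (rather than slicing the raw array) keeps the three subsets disjoint regardless of how the examples were generated.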
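The training protocol quoted in the Experiment Setup row (Adam on an MSE loss, up to 5000 epochs with early stopping) can be sketched as below. This is a minimal stand-in on a toy least-squares problem, not the paper's Transformer: the hand-rolled Adam update uses the standard Kingma & Ba defaults, while the learning rate, patience, and synthetic data are assumptions made for illustration.

```python
import numpy as np

def adam_step(w, g, state, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2014) with bias correction."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * g
    state["v"] = b2 * state["v"] + (1 - b2) * g * g
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

rng = np.random.default_rng(0)
d = 15                                    # input dimension, as in the paper
X, w_true = rng.normal(size=(256, d)), rng.normal(size=d)
y = X @ w_true                            # noiseless targets for the toy task
X_tr, y_tr, X_va, y_va = X[:200], y[:200], X[200:], y[200:]

w = np.zeros(d)
state = {"t": 0, "m": np.zeros(d), "v": np.zeros(d)}
best_loss, best_w, wait, patience = np.inf, w.copy(), 0, 50

for epoch in range(5000):                 # epoch budget from the paper
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)   # MSE gradient
    w = adam_step(w, grad, state)
    val_loss = np.mean((X_va @ w - y_va) ** 2)
    if val_loss < best_loss:              # early stopping on validation MSE
        best_loss, best_w, wait = val_loss, w.copy(), 0
    else:
        wait += 1
        if wait >= patience:
            break
```

Early stopping here restores `best_w`, the weights with the lowest validation MSE, rather than the final iterate; the patience of 50 epochs is an assumed value, since the paper's stopping criterion is not specified in the quote.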