Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Towards Understanding the Universality of Transformers for Next-Token Prediction
Authors: Michaël E. Sander, Gabriel Peyré
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present experimental results that validate our theoretical findings and suggest their applicability to more general mappings f. The paper includes a dedicated section titled 'EXPERIMENTS' where empirical results are presented and discussed, including error curves and training details. |
| Researcher Affiliation | Academia | Michaël E. Sander & Gabriel Peyré Ecole Normale Supérieure, CNRS Paris, France EMAIL, EMAIL. Both authors are affiliated with academic institutions (Ecole Normale Supérieure, CNRS), and their email domains (.polytechnique.org, .ens.fr) correspond to academic institutions. |
| Pseudocode | No | The paper describes methods mathematically and conceptually (e.g., equation (7) for causal kernel descent and Figure 1 for illustration), but it does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Our code will be open-sourced. This statement indicates a future intention to release code, but does not provide concrete access to source code at the time of publication. |
| Open Datasets | No | We take d = 15, n = 6, and consider instance (2) with randomly generated Ωs and x1s, for a dataset with 2^12 elements, that we split into train, validation, and test sets with respective sizes of 60%, 20%, and 20% of the original dataset. The dataset used was generated by the authors ('randomly generated') and no public access information (link, DOI, repository, or citation) is provided. |
| Dataset Splits | Yes | We take d = 15, n = 6, and consider instance (2) with randomly generated Ωs and x1s, for a dataset with 2^12 elements, that we split into train, validation, and test sets with respective sizes of 60%, 20%, and 20% of the original dataset. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions using 'Adam (Kingma & Ba, 2014)' for training but does not provide specific version numbers for any software, libraries, or frameworks used (e.g., PyTorch, TensorFlow, Python version). |
| Experiment Setup | Yes | We take d = 15, n = 6, and consider instance (2)... We train the model using Adam (Kingma & Ba, 2014) on the Mean Squared Error (MSE) loss for next-token prediction on sequences of length T = 100... We train for 5000 epochs with early stopping. |
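For reference, the 60/20/20 split reported in the Dataset Splits and Experiment Setup rows can be sketched as below. This is a minimal illustration only: the paper's actual data generation (instance (2) with randomly generated Ωs and x1s) is not reproduced, so standard-normal placeholder data is used, and the dataset size 2^12 is the figure quoted from the paper.

```python
import numpy as np

# Placeholder data: the paper's instance (2) generation is NOT reproduced
# here; we only illustrate the reported 60/20/20 split on 2^12 examples
# of dimension d = 15 (values quoted from the paper).
rng = np.random.default_rng(0)
N, d = 2**12, 15
data = rng.standard_normal((N, d))

# Shuffle indices, then carve out 60% train / 20% validation / 20% test.
perm = rng.permutation(N)
n_train = int(0.6 * N)
n_val = int(0.2 * N)
train = data[perm[:n_train]]
val = data[perm[n_train:n_train + n_val]]
test = data[perm[n_train + n_val:]]

print(len(train), len(val), len(test))  # 2457 819 820
```

Note that with N = 4096 the 60/20/20 proportions do not divide evenly, so the test set absorbs the one leftover example; the paper does not specify how rounding was handled.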