Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].
From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
Authors: Mathurin VIDEAU, Badr Youbi Idrissi, Alessandro Leite, Marc Schoenauer, Olivier Teytaud, David Lopez-Paz
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 3 Experimental Results |
| Researcher Affiliation | Collaboration | Alessandro Leite INSA Rouen Normandy, LITIS Marc Schoenauer Olivier Teytaud Thales cortAIx-Labs |
| Pseudocode | No | The paper describes methods and formulas but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor structured steps formatted like code. |
| Open Source Code | Yes | Code open-sourced at https://github.com/facebookresearch/lingua/tree/main/apps/aunet |
| Open Datasets | Yes | For all experiments, DCLM [13] served as the pretraining dataset, with a small portion held out for validation, totalling around 4T tokens (of GPT-NeoX tokenizer). ... [13] Jeffrey Li, Alex Fang, Georgios Smyrnis, et al. DataComp-LM: In search of the next generation of training sets for language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 14200-14282, 2024. |
| Dataset Splits | No | For all experiments, DCLM [13] served as the pretraining dataset, with a small portion held out for validation, totalling around 4T tokens (of GPT-NeoX tokenizer). The paper mentions a validation split for the DCLM dataset but does not provide specific percentages or sample counts for the splits, nor does it explicitly state the use of standard predefined splits with citations for downstream tasks. |
| Hardware Specification | Yes | measured in bytes per second per GPU (bps) on H100 80GB GPUs (internal cluster) during the actual training. |
| Software Dependencies | No | Our experiments are run on a public code, namely meta Lingua [2], and the entire model is compiled with torch.compile. However, specific version numbers for software libraries like PyTorch are not provided. |
| Experiment Setup | Yes | Section 3.1 'Experimental Setup' describes the data used, baselines, and AU-Net architecture details, including embedding dimensionality, layer allocation, hidden dimensions, and contraction rates. Section 2.3 provides formulas for batch size and learning rate: '$\mathrm{BSZ}_{\text{AU-Net}}(C) = 0.66 \cdot C^{0.321}$, $\mathrm{LR}_{\text{AU-Net}}(C) = 6.6 \cdot C^{-0.176}$. And we run the same tuning for the BPE baseline, for which we find: $\mathrm{BSZ}_{\text{BPE}}(C) = 29.9 \cdot C^{0.231}$, $\mathrm{LR}_{\text{BPE}}(C) = 19.3 \cdot C^{-0.177}$.' |
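
The batch-size and learning-rate formulas quoted in the Experiment Setup row are power laws in the compute budget C, and a minimal sketch of how they might be evaluated is given below. It assumes the learning-rate exponents are negative (so the learning rate shrinks as compute grows) and that the batch-size exponents are positive. The units of C and of the batch size are not stated in this summary, so the sketch treats both as plain numbers. The names `FITS`, `power_law`, and `hyperparams` are hypothetical and do not come from the paper or from the Lingua code base.

```python
# Illustrative only: evaluates the quoted power-law fits coeff * C**exponent.
# ASSUMPTION: learning-rate exponents are negative; units of C are unspecified here.

# (coefficient, exponent) pairs as quoted from Section 2.3 of the paper.
FITS = {
    "AU-Net": {"batch_size": (0.66, 0.321), "learning_rate": (6.6, -0.176)},
    "BPE": {"batch_size": (29.9, 0.231), "learning_rate": (19.3, -0.177)},
}


def power_law(coeff: float, exponent: float, compute: float) -> float:
    """Return coeff * compute**exponent for a positive compute budget."""
    if compute <= 0:
        raise ValueError("compute budget must be positive")
    return coeff * compute ** exponent


def hyperparams(model: str, compute: float) -> dict:
    """Evaluate both fitted formulas for one model family at a given budget."""
    fit = FITS[model]
    return {name: power_law(c, e, compute) for name, (c, e) in fit.items()}


if __name__ == "__main__":
    # Hypothetical budgets, chosen only to show the trend.
    for budget in (1e19, 1e20, 1e21):
        for model in FITS:
            print(model, f"C={budget:.0e}", hyperparams(model, budget))
```

Under these assumptions, raising the budget tenfold multiplies the AU-Net batch size by 10^0.321 (about 2.1) and the learning rate by 10^-0.176 (about 0.67), which is the qualitative behaviour the formulas describe.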