Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

URLs Help, Topics Guide: Understanding Metadata Utility in LLM Training

Authors: Dongyang Fan, Vinko Sabolčec, Martin Jaggi

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we conduct a systematic evaluation and find that not all metadata types contribute equally. Only URL context speeds up training, whereas quality scores and topic/format domain information offer no clear benefit. Furthermore, the improved downstream performances of URL conditioning emerge only when longer prompts are used at inference time. We conduct a comprehensive study on context-conditioned LLM pretraining, highlighting its potential benefits and limitations at both pretraining and inference stages (Section 3). Section 4 is titled 'Experiments' and details various experimental setups, evaluation benchmarks, and performance results.
Researcher Affiliation | Academia | Dongyang Fan, Vinko Sabolčec, Martin Jaggi; EPFL, Switzerland; EMAIL
Pseudocode | No | The paper describes mathematical formulas for the different sampling methods (Context-Free, Context-Conditioned, Context-Guided sampling) and general steps for context-conditioned pretraining and generation (Figure 1, Figure 2), but it does not present a clearly labeled pseudocode block or algorithm.
Open Source Code | No | We will make our training codes public when the paper is made public.
Open Datasets | Yes | We use FineWeb-Edu dataset [14], which is a high-quality English-only dataset. The dataset has a lot of meta data available, such as URL source, quality score and token counts per document. [14] Anton Lozhkov, Loubna Ben Allal, Leandro von Werra, and Thomas Wolf. FineWeb-Edu: the finest collection of educational content, 2024. URL https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.
Dataset Splits | No | The paper mentions evaluating models on general knowledge understanding using LM-Eval-Harness [15], which typically uses standard splits. However, for the FineWeb-Edu dataset used for pretraining, the paper only states 'Throughout training, we randomly sample 100B tokens' and a '90%:10% ratio' for mixing context-prepended and context-free documents, but does not explicitly specify a train/validation/test split for this dataset in the context of their own experiments.
Hardware Specification | Yes | Each of the model training run takes around 800 GPU hours on GH200 (120GB).
Software Dependencies | No | The paper mentions using 'Nemo tokenizer', 'Megatron-LM framework [13]', 'AdamW optimizer [12]', 'LM-Eval-Harness [15]', and 'gpt-4o from OpenAI API' but does not provide specific version numbers for these software components.
Experiment Setup | Yes | Model. We adopt the Llama model architecture [11] with 16 layers, a hidden size of 2048, a sequence length of 4096, and a batch size of 504 (resulting in 2.06 million tokens). The model has 1.5 billion parameters. We follow the Cosine learning schedule, applying 2000 warmup steps. AdamW optimizer is used with regularization strength 0.1 [12]. A max learning rate of 3e-4 is applied and cools down to learning rate 3e-5 at the end of training. To train the models, we use the Megatron-LM framework [13].
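
The numbers quoted in the Experiment Setup row can be checked with a short calculation. The sketch below is illustrative only and is not the authors' Megatron-LM configuration. It reproduces the tokens-per-step figure from the stated batch size and sequence length, and evaluates a cosine schedule between the stated maximum and final learning rates. Three things are assumptions that the page does not state: the warmup is linear, the 100B sampled tokens are the full training budget (used only to derive a step count), and the function names `tokens_per_step`, `total_steps`, and `lr_at` are hypothetical.

```python
import math

# Values quoted in the Experiment Setup row.
SEQ_LEN = 4096
BATCH_SIZE = 504
WARMUP_STEPS = 2000
MAX_LR = 3e-4
MIN_LR = 3e-5

# Assumption: the 100B sampled tokens are the whole training budget.
TOTAL_TOKENS = 100_000_000_000


def tokens_per_step(batch_size: int = BATCH_SIZE, seq_len: int = SEQ_LEN) -> int:
    """Tokens consumed by one optimizer step (sequences x tokens per sequence)."""
    return batch_size * seq_len


def total_steps(total_tokens: int = TOTAL_TOKENS) -> int:
    """Whole optimizer steps that fit in the assumed token budget."""
    return total_tokens // tokens_per_step()


def lr_at(step: int, n_steps: int, warmup: int = WARMUP_STEPS,
          max_lr: float = MAX_LR, min_lr: float = MIN_LR) -> float:
    """Learning rate at a given step: linear warmup (assumed), then cosine decay."""
    if step < warmup:
        # Assumed linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup) / max(1, n_steps - warmup))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    n = total_steps()
    print(f"tokens per step: {tokens_per_step():,}")
    print(f"steps under the 100B-token assumption: {n:,}")
    for s in (0, WARMUP_STEPS, n // 2, n):
        print(f"step {s:>6}: lr = {lr_at(s, n):.2e}")
```

The product 504 × 4096 is 2,064,384, which matches the "2.06 million tokens" quoted above. The derived step count depends entirely on the token-budget assumption and should not be read as a figure from the paper.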