Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
URLs Help, Topics Guide: Understanding Metadata Utility in LLM Training
Authors: Dongyang Fan, Vinko Sabolčec, Martin Jaggi
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we conduct a systematic evaluation and find that not all metadata types contribute equally. Only URL context speeds up training, whereas quality scores and topic/format domain information offer no clear benefit. Furthermore, the improved downstream performances of URL conditioning emerge only when longer prompts are used at inference time. We conduct a comprehensive study on context-conditioned LLM pretraining, highlighting its potential benefits and limitations at both pretraining and inference stages (Section 3). Section 4 is titled 'Experiments' and details various experimental setups, evaluation benchmarks, and performance results. |
| Researcher Affiliation | Academia | Dongyang Fan Vinko Sabolčec Martin Jaggi EPFL, Switzerland EMAIL |
| Pseudocode | No | The paper describes mathematical formulas for the different sampling methods (Context-Free, Context-Conditioned, Context-Guided sampling) and general steps for context-conditioned pretraining and generation (Figure 1, Figure 2), but it does not present a clearly labeled pseudocode block or algorithm. |
| Open Source Code | No | We will make our training codes public when the paper is made public. |
| Open Datasets | Yes | We use FineWeb-Edu dataset [14], which is a high-quality English-only dataset. The dataset has a lot of metadata available, such as URL source, quality score and token counts per document. [14] Anton Lozhkov, Loubna Ben Allal, Leandro von Werra, and Thomas Wolf. Fineweb-edu: the finest collection of educational content, 2024. URL https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu. |
| Dataset Splits | No | The paper evaluates models on general knowledge understanding using LM-Eval-Harness [15], which typically uses standard splits. For the FineWeb-Edu pretraining data, however, it states only 'Throughout training, we randomly sample 100B tokens' and a '90%:10% ratio' for mixing context-prepended and context-free documents; it does not specify a train/validation/test split of this dataset for its own experiments. |
| Hardware Specification | Yes | Each of the model training run takes around 800 GPU hours on GH200(120GB). |
| Software Dependencies | No | The paper mentions using 'Nemo tokenizer', 'Megatron-LM framework [13]', 'AdamW optimizer [12]', 'LM-Eval-Harness [15]', and 'gpt-4o from OpenAI API' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Model. We adopt the Llama model architecture [11] with 16 layers, a hidden size of 2048, a sequence length of 4096, and a batch size of 504 (resulting in 2.06 million tokens). The model has 1.5 billion parameters. We follow the Cosine learning schedule, applying 2000 warmup steps. AdamW optimizer is used with regularization strength 0.1 [12]. A max learning rate of 3e-4 is applied and cools down to learning rate 3e-5 at the end of training. To train the models, we use the Megatron-LM framework [13]. |
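
The figures quoted in the Experiment Setup row can be checked with a short sketch. The Python below is illustrative only and is not the authors' training code or Megatron-LM's scheduler. It assumes a linear warmup (the quoted text gives only the number of warmup steps) and derives the total step count from the 100B-token budget quoted in the Dataset Splits row. All function and constant names are hypothetical.

```python
import math

# Values quoted in the table above.
SEQ_LEN = 4096
BATCH_SIZE = 504
MAX_LR = 3e-4
MIN_LR = 3e-5
WARMUP_STEPS = 2000

# Assumption: the 100B sampled tokens are the full training budget.
TOTAL_TOKENS = 100_000_000_000


def tokens_per_step(batch_size: int = BATCH_SIZE, seq_len: int = SEQ_LEN) -> int:
    """Tokens consumed by one optimizer step."""
    return batch_size * seq_len


def total_steps(total_tokens: int = TOTAL_TOKENS) -> int:
    """Number of full steps that fit in the token budget (assumed)."""
    return total_tokens // tokens_per_step()


def learning_rate(step: int, total: int,
                  warmup: int = WARMUP_STEPS,
                  max_lr: float = MAX_LR,
                  min_lr: float = MIN_LR) -> float:
    """Cosine schedule with warmup; the linear warmup shape is an assumption."""
    if step < warmup:
        # Ramp linearly up to max_lr over the warmup steps.
        return max_lr * (step + 1) / warmup
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    # Cosine decay from max_lr down to min_lr.
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    n = total_steps()
    print(tokens_per_step())  # 504 * 4096 tokens per step
    print(n)
    for s in (0, WARMUP_STEPS - 1, n // 2, n):
        print(s, learning_rate(s, n))
```

The product 504 × 4096 = 2,064,384 matches the "2.06 million tokens" per batch stated in the table. Under the stated assumption about the token budget, training would run for roughly 48,000 steps, so the 2000 warmup steps would cover about 4% of training.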