Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Zero-Shot Performance Prediction for Probabilistic Scaling Laws
Authors: Viktoria Schram, Markus Hiller, Daniel Beck, Trevor Cohn
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our framework on three small-scale NLP datasets with up to 30 LCs. These are obtained from nanoGPT models, from bilingual translation using mBART and Transformer models, and from multilingual translation using M2M100 models of varying sizes. ... 5 Experiments |
| Researcher Affiliation | Academia | Viktoria Schram¹, Markus Hiller¹, Daniel Beck², Trevor Cohn¹. ¹School of Computing and Information Systems, The University of Melbourne; ²School of Computing Technologies, Royal Melbourne Institute of Technology |
| Pseudocode | No | The paper describes the methodologies and models used, such as Gaussian Process models, in narrative and mathematical forms without explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The code for all experiments will be made publicly available. ... 5. Open access to data and code: All code as well as created data will be made publicly available upon acceptance, accompanied with detailed instructions of how to use code and data. |
| Open Datasets | Yes | We validate our framework on three small-scale NLP datasets with up to 30 LCs. These are obtained from nanoGPT models, from bilingual translation using mBART and Transformer models, and from multilingual translation using M2M100 models of varying sizes. ... The bilingual dataset is created by fine-tuning mBART50 and training a Transformer on the EMNLP2021 dataset [65, 66, 67]. ... Consistent with the original shared task, we employ the Flores101 dataset [89] as the test set and report BLEU [90] and chrF scores. ... We train all nanoGPT models [63] on the 10B token split of the FineWeb-Edu dataset [75]. |
| Dataset Splits | Yes | We employ the train-test splits Quad and Tri, as illustrated in Figure 5. Additionally, we consider a train-test split containing only the learning curves of the largest five models in the test set, referred to as T1. |
| Hardware Specification | Yes | We train all nanoGPT models [63] on the 10B token split of the FineWeb-Edu dataset [75] using four NVIDIA A100 (80GB) GPUs. ... Fine-tuning is performed on NVIDIA A100 GPUs ... train on NVIDIA A100 GPUs. ... The MaGP framework and DHGP model were trained and evaluated on an Intel Core i7-8565U CPU. Baseline methods were trained and evaluated on an Intel Xeon Gold 6448H CPU. |
| Software Dependencies | No | fine-tune the mBART50 model using the Hugging Face implementation [85]. ... we use the implementation provided by fairseq [87] ... Input sequences are tokenised into subword units using SentencePiece [88]. |
| Experiment Setup | Yes | A learning rate of 6 × 10⁻⁴ is applied with linear warm-up and cosine decay scheduling, a minimum learning rate of 5 × 10⁻⁴, and a total batch size of 524,288. ... Fine-tuning is performed on NVIDIA A100 GPUs for one epoch [86], with a learning rate of 5 × 10⁻⁵, a dropout rate of 0.1, maximum sequence lengths of 200 for both source and target texts, and a batch size of 10. ... Training is initiated with a learning rate of 1 × 10⁻³, a weight decay of 1 × 10⁻⁴, a dropout rate of 0.4, and a batch size of 32. ... We employ the Adam optimizer with parameters β₁ = 0.90, β₂ = 0.98, and a weight decay of 0.0001. The training objective is the label-smoothed cross-entropy criterion with label smoothing of 0.1. The initial learning rate is set to 0.0003, scheduled by the inverse square root learning rate scheduler with 2,500 warm-up steps. The batch size (number of tokens) is 4096 × 32 for the 175M model, and 2048 × 64 for both the 615M and 1.2B models. |
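
The Experiment Setup row names two learning-rate schedules: linear warm-up with cosine decay, and an inverse square root schedule with 2,500 warm-up steps. The sketch below shows one common way such schedules are computed; it is not the authors' code, nor the nanoGPT or fairseq implementation. The function names are hypothetical, and the warm-up length and total step count passed to the cosine schedule are placeholder assumptions, since the quoted text does not state them. The peak and minimum learning rates are the values quoted in the table.

```python
import math


def warmup_cosine_lr(step, peak_lr, min_lr, warmup_steps, total_steps):
    """Linear warm-up to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly; step 0 already uses a small non-zero rate.
        return peak_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + cosine * (peak_lr - min_lr)


def inverse_sqrt_lr(step, peak_lr, warmup_steps):
    """Linear warm-up to peak_lr, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)  # avoid division by zero at step 0
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)


if __name__ == "__main__":
    # ASSUMPTION: warmup_steps=100 and total_steps=1000 are placeholders,
    # not values from the paper.
    for s in (0, 100, 550, 1000):
        print(s, warmup_cosine_lr(s, 6e-4, 5e-4, 100, 1000))

    # 0.0003 and 2,500 warm-up steps are the values quoted in the table.
    for s in (1, 2500, 10000):
        print(s, inverse_sqrt_lr(s, 3e-4, 2500))
```

If the two batch-size expressions in the table are read as products, both settings give the same number of tokens per batch: 4096 × 32 = 2048 × 64 = 131,072.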