Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Drag-and-Drop LLMs: Zero-Shot Prompt-to-Weights
Authors: Zhiyuan Liang, Dongwen Tang, Yuhao Zhou, Xuanlei Zhao, Mingjia Shi, Wangbo Zhao, Zekai Li, Peihao Wang, Konstantin Schürholt, Damian Borth, Michael Bronstein, Yang You, Zhangyang "Atlas" Wang, Kai Wang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 3 Experiments, 3.1 Implementation Details: We choose Qwen2.5 [63] series as foundation model and conduct experiments on common sense reasoning, coding, math, and multimodal tasks. ... We report the average accuracy of training LoRAs and our generated ones in Table 1. ... Ablation Studies ... The empirical results and findings are as belows. |
| Researcher Affiliation | Academia | ¹National University of Singapore, ²UT Austin, ³University of St. Gallen, ⁴Oxford University |
| Pseudocode | No | The paper describes the methodology and architecture through textual descriptions and diagrams (e.g., Figure 3: Block details of parameter generator), but it does not contain a dedicated pseudocode or algorithm block. |
| Open Source Code | Yes | Our project is available at https://jerryliang24.github.io/DnD. |
| Open Datasets | Yes | Tasks, model sizes (B), and datasets. Common sense (0.5B): ARC-e [11], ARC-c [11], BoolQ [10], OBQA [41], HellaSwag [65], PIQA [6], WinoGrande [50]. Coding (1.5B, 7B): Evol-Instruct-68K-V1 [40], Glaive-Assistant-V2 [18], Python-Codes-25K [16], Code-74k-ShareGPT [2], Rosetta-Code [13], LLaMA-Python-Codes-30K [15], CodeAlpaca-20K [8]. Math (1.5B): Competition-Math [24], Math-QA [3], Math-IIO-68K-Mini [45], Math-Plus [56], Mu-Math [64], ToT-Math-V1 [42]. Multimodal (3B): MathV360K [55]. |
| Dataset Splits | Yes | In every column of Table 1, we use the specified dataset as test set (i.e., not used in training) and train DnD on other datasets LoRAs. ... Train-test set arrangements are: 6-1, 4-3, 3-4 and 2-5. |
| Hardware Specification | Yes | We show the cost of generating one single model on a NVIDIA A100 80G GPU in Table 15. ... All metrics are measured on a single NVIDIA A100 80G GPU. |
| Software Dependencies | No | The paper mentions software components like 'optimizer AdamW' and specific models like 'Sentence-BERT [47]' and 'Qwen2.5', but it does not provide specific version numbers for these software dependencies or libraries required to replicate the experiment. |
| Experiment Setup | Yes | In this section, we provide details of our training recipe and various hyper-parameter settings. We incorporate multiple tasks in language models, each involves different foundation model sizes, different generator architecture, and training schedules. We report settings for every task in Table 8. Table 8: Training recipe for different tasks in Section 3.2 and Section 3.3. ... batch size, optimizer, learning rate, length of prompt batch, training step, weight decay, max grad norm, noise aug. amplitude |
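
The Dataset Splits row describes a held-out-dataset protocol: each common-sense dataset serves once as the test set while the others supply training LoRAs, with further train-test arrangements of 6-1, 4-3, 3-4 and 2-5. The sketch below only illustrates how such splits could be enumerated; it is not the authors' code. The helper names `leave_one_out` and `fixed_split` are hypothetical, and because the quoted text does not say which datasets fall on each side of the 4-3, 3-4 and 2-5 arrangements, the ordering used here is an assumption.

```python
# Dataset names taken from the Open Datasets row (common-sense task).
COMMON_SENSE = ["ARC-e", "ARC-c", "BoolQ", "OBQA", "HellaSwag", "PIQA", "WinoGrande"]


def leave_one_out(datasets):
    """Yield (train, test) pairs in which each dataset is held out exactly once."""
    for held_out in datasets:
        train = [d for d in datasets if d != held_out]
        yield train, [held_out]


def fixed_split(datasets, n_train):
    """Hypothetical helper: the first n_train datasets train, the rest test.

    Assumption: the actual assignment of datasets to each side is not
    given in the quoted excerpt, so list order is used as a stand-in.
    """
    return datasets[:n_train], datasets[n_train:]


# One split per column of Table 1 (each dataset held out once).
splits = list(leave_one_out(COMMON_SENSE))

# The four train-test arrangements named in the Dataset Splits row.
arrangements = {
    f"{n}-{len(COMMON_SENSE) - n}": fixed_split(COMMON_SENSE, n)
    for n in (6, 4, 3, 2)
}

if __name__ == "__main__":
    for train, test in splits:
        print(f"test={test[0]:<11} train={', '.join(train)}")
    for name, (train, test) in arrangements.items():
        print(f"{name}: {len(train)} train / {len(test)} test")
```

Running the script lists the seven held-out splits followed by the sizes of the four arrangements.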