Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
On Giant's Shoulders: Effortless Weak to Strong by Dynamic Logits Fusion
Authors: Chenghao Fan, Zhenyi Lu, Wei Wei, Jie Tian, Xiaoye Qu, Dangyang Chen, Yu Cheng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments across various benchmarks in both single-task and multi-task settings, achieving leading results. |
| Researcher Affiliation | Collaboration | School of Computer Science & Technology, Huazhong University of Science and Technology, Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL), Ping An Property & Casualty Insurance Company of China, Ltd., The Chinese University of Hong Kong |
| Pseudocode | Yes | Our algorithm is outlined in pseudo-code in Algorithm 1 in Appendix F. |
| Open Source Code | Yes | Our paper uses publicly available datasets and provides the complete code and execution scripts in the supplementary material. |
| Open Datasets | Yes | We evaluate on the following datasets: mathematical reasoning (GSM8K [8]); factual accuracy (TruthfulQA [32]); realistic knowledge (TriviaQA [21]); multi-domain general knowledge (MMLU benchmark [13]); summarization (CNN-DailyMail (CNN/DM) [47]). |
| Dataset Splits | No | The paper states 'All datasets are tested using a 0-shot setting' and mentions training models on respective training sets, but does not provide specific train/validation/test dataset splits or percentages required for reproduction. |
| Hardware Specification | Yes | All experiments are performed on H100 GPUs. |
| Software Dependencies | No | The paper mentions using 'VLLM' for inference but does not provide specific version numbers for it or any other key software dependencies like programming languages or deep learning frameworks. |
| Experiment Setup | Yes | For full fine-tuning, we set the batch size to 128, learning rate to 2e-5, optimizer to Adam. For LoRA tuning, we set the rank to 64, learning rate to 1e-4, optimizer to Adam. We train for 3 epochs. During inference, we use greedy decoding and set batch size to 256, top_p to 1.0 and temperature to 0.05. |
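
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch for anyone attempting a reproduction. This is only an illustrative layout, not the authors' code: the class and field names (`FullFineTuneConfig`, `LoRATuneConfig`, `InferenceConfig`, `n_shots`) are hypothetical, and applying the 3-epoch setting to both tuning modes is an assumption, since the quoted text does not say which mode it refers to. Values the table does not report, such as the LoRA batch size, are deliberately left out.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class FullFineTuneConfig:
    """Full fine-tuning settings quoted in the Experiment Setup row."""
    batch_size: int = 128
    learning_rate: float = 2e-5
    optimizer: str = "Adam"
    epochs: int = 3  # assumption: "3 epochs" applies to full fine-tuning


@dataclass(frozen=True)
class LoRATuneConfig:
    """LoRA settings quoted in the Experiment Setup row.

    The LoRA batch size is not reported there, so no field exists for it.
    """
    rank: int = 64
    learning_rate: float = 1e-4
    optimizer: str = "Adam"
    epochs: int = 3  # assumption: "3 epochs" applies to LoRA tuning as well


@dataclass(frozen=True)
class InferenceConfig:
    """Inference settings quoted in the Experiment Setup row."""
    decoding: str = "greedy"
    batch_size: int = 256
    top_p: float = 1.0
    temperature: float = 0.05
    n_shots: int = 0  # from the Dataset Splits row: "0-shot setting"


if __name__ == "__main__":
    # Print each group of settings as a plain dictionary.
    for cfg in (FullFineTuneConfig(), LoRATuneConfig(), InferenceConfig()):
        print(type(cfg).__name__, asdict(cfg))
```

Frozen dataclasses keep the reported values immutable, so a reproduction script cannot silently drift from the settings stated in the paper; any deviation has to be made explicit by constructing a new instance.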