PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
Authors: Yidong Wang, Zhuohao Yu, Wenjin Yao, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, Yue Zhang
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On evaluations using our collected test dataset, our findings reveal that PandaLM-7B offers performance comparable to both GPT-3.5 and GPT-4. Impressively, PandaLM-70B surpasses their performance. |
| Researcher Affiliation | Collaboration | Yidong Wang1,2, Zhuohao Yu1, Wenjin Yao1, Zhengran Zeng1, Linyi Yang2, Cunxiang Wang2, Hao Chen3, Chaoya Jiang1, Rui Xie1, Jindong Wang3, Xing Xie3, Wei Ye1, Shikun Zhang1, Yue Zhang2 (1Peking University, 2Westlake University, 3Microsoft Research Asia) |
| Pseudocode | No | Appendix A (Figures 5 and 6) illustrates a training data example and the prompt for training PandaLM, but these show a data format and a prompt, not structured pseudocode or an algorithm block detailing the method's steps. |
| Open Source Code | No | The paper states, 'PandaLM introduces unique advantages that are not present in models like GPT-3.5 and GPT-4. It offers open-source availability, enabling reproducibility, and protecting data privacy.' However, it does not provide a direct link to the code repository or an explicit statement confirming the release of their code for the described methodology. |
| Open Datasets | Yes | The instructions and inputs in the input tuple are sampled from the Alpaca 52K dataset (Taori et al., 2023). [...] The test data is sampled from the diverse human evaluation dataset of self-instruct (Wang et al., 2022c). [...] Specifically, we assess PandaLM's proficiency using the LSAT (Law School Admission Test) dataset, which serves as an entrance exam question set for American law schools. [...] In the realm of biomedicine, we use the PubMedQA dataset [...] Additionally, we tap into the BioASQ dataset |
| Dataset Splits | Yes | The assessment is conducted on a validation set comprising 170 distinct instructions and inputs obtained from our 1K test set introduced in Section 4. |
| Hardware Specification | Yes | We train PandaLM with the DeepSpeed (Rasley et al., 2020) library, and Zero Redundancy Optimizer (ZeRO) (Rajbhandari et al., 2020; Ren et al., 2021) Stage 2, on 8 NVIDIA A100-SXM4-80GB GPUs. (A hedged configuration sketch follows this table.) |
| Software Dependencies | No | The paper mentions key software components like the 'DeepSpeed library', 'Zero Redundancy Optimizer (ZeRO) Stage 2', and 'AdamW optimizer' but does not specify their version numbers, which are necessary for full reproducibility. |
| Experiment Setup | Yes | Regarding the training hyperparameters, we apply the AdamW (Loshchilov & Hutter, 2017) optimizer with a learning rate of 2e-5 and a cosine learning rate scheduler. The model is trained for 2 epochs. The training process uses a warmup ratio of 0.03 to avoid large gradients at the beginning of training. We use a batch size of 2 per GPU with all inputs truncated to a maximum of 1024 tokens and employ a gradient accumulation strategy with 8 steps. [...] Specifically, we explore checkpoints from each epoch (ranging from epoch 1 to epoch 5), four different learning rates (2e-6, 1e-5, 2e-5, 2e-4), two types of optimizers (SGD (Goodfellow et al., 2016) and AdamW), and two learning rate schedulers (cosine and linear). (See the configuration sketch after this table.) |
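
To make the quoted training setup concrete, below is a minimal sketch of how the reported hyperparameters and the DeepSpeed ZeRO Stage 2 setting could be expressed with Hugging Face `TrainingArguments`. This is an illustration, not the authors' training script: the output directory, the bf16 precision flag, the extra ZeRO sub-options, and the use of the `transformers` Trainer stack are assumptions; only the numeric values (epochs, learning rate, scheduler, warmup ratio, per-GPU batch size, gradient accumulation, 1024-token truncation) come from the quotes above.

```python
from transformers import TrainingArguments

# Hypothetical DeepSpeed ZeRO Stage 2 config consistent with the quoted setup
# (8x A100-80GB, batch size 2 per GPU, 8 gradient-accumulation steps);
# NOT the authors' released configuration.
ds_zero2_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 8,
    "zero_optimization": {
        "stage": 2,                   # shard optimizer states and gradients across the 8 GPUs
        "overlap_comm": True,         # assumption: common default, not stated in the paper
        "contiguous_gradients": True, # assumption: common default, not stated in the paper
    },
    "bf16": {"enabled": True},        # assumption: precision is not specified in the paper
}

# Hypothetical mapping of the quoted hyperparameters onto TrainingArguments.
training_args = TrainingArguments(
    output_dir="pandalm-7b",            # hypothetical output path
    num_train_epochs=2,                 # "trained for 2 epochs"
    per_device_train_batch_size=2,      # "batch size of 2 per GPU"
    gradient_accumulation_steps=8,      # "gradient accumulation strategy with 8 steps"
    learning_rate=2e-5,                 # AdamW with a learning rate of 2e-5
    optim="adamw_torch",                # AdamW optimizer (Loshchilov & Hutter, 2017)
    lr_scheduler_type="cosine",         # cosine learning rate scheduler
    warmup_ratio=0.03,                  # warmup ratio of 0.03
    bf16=True,                          # assumption; kept consistent with the DeepSpeed config above
    deepspeed=ds_zero2_config,          # ZeRO Stage 2 (TrainingArguments accepts a dict or a JSON path)
)

# The 1024-token limit is applied at tokenization time rather than here, e.g.
# tokenizer(example, truncation=True, max_length=1024).
```

Under these assumptions, the arguments would be handed to a `transformers.Trainer` together with the tokenized Alpaca-derived training pairs and launched across the 8 GPUs (e.g. with the `deepspeed` or `torchrun` launcher).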