PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization

Authors: Yidong Wang, Zhuohao Yu, Wenjin Yao, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, Yue Zhang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On evaluations using our collected test dataset, our findings reveal that PandaLM-7B offers performance comparable to both GPT-3.5 and GPT-4. Impressively, PandaLM-70B surpasses their performance.
Researcher Affiliation | Collaboration | Yidong Wang1,2, Zhuohao Yu1, Wenjin Yao1, Zhengran Zeng1, Linyi Yang2, Cunxiang Wang2, Hao Chen3, Chaoya Jiang1, Rui Xie1, Jindong Wang3, Xing Xie3, Wei Ye1, Shikun Zhang1, Yue Zhang2 (1Peking University; 2Westlake University; 3Microsoft Research Asia)
Pseudocode | No | Appendix A (Figures 5 and 6) illustrates a training data example and the prompt for training PandaLM, but these are examples of a data structure and a prompt, not a structured pseudocode or algorithm block detailing the method's steps.
Open Source Code | No | The paper states, 'PandaLM introduces unique advantages that are not present in models like GPT-3.5 and GPT-4. It offers open-source availability, enabling reproducibility, and protecting data privacy.' However, it does not provide a direct link to a code repository or an explicit statement confirming the release of the code for the described methodology.
Open Datasets | Yes | The instructions and inputs in the input tuple are sampled from the Alpaca 52K dataset (Taori et al., 2023). [...] The test data is sampled from the diverse human evaluation dataset of self-instruct (Wang et al., 2022c). [...] Specifically, we assess PandaLM's proficiency using the LSAT (Law School Admission Test) dataset, which serves as an entrance exam question set for American law schools. [...] In the realm of biomedicine, we use the PubMedQA dataset [...] Additionally, we tap into the BioASQ dataset
Dataset Splits | Yes | The assessment is conducted on a validation set comprising 170 distinct instructions and inputs obtained from our 1K test set introduced in Section 4.
Hardware Specification | Yes | We train PandaLM with the DeepSpeed (Rasley et al., 2020) library and the Zero Redundancy Optimizer (ZeRO) (Rajbhandari et al., 2020; Ren et al., 2021) Stage 2, on 8 NVIDIA A100-SXM4-80GB GPUs.
Software Dependencies | No | The paper mentions key software components such as the DeepSpeed library, the Zero Redundancy Optimizer (ZeRO) Stage 2, and the AdamW optimizer, but does not specify their version numbers, which are necessary for full reproducibility.
Experiment Setup | Yes | Regarding the training hyperparameters, we apply the AdamW (Loshchilov & Hutter, 2017) optimizer with a learning rate of 2e-5 and a cosine learning rate scheduler. The model is trained for 2 epochs. The training process uses a warmup ratio of 0.03 to avoid large gradients at the beginning of training. We use a batch size of 2 per GPU with all inputs truncated to a maximum of 1024 tokens and employ a gradient accumulation strategy with 8 steps. [...] Specifically, we explore checkpoints from each epoch (ranging from epoch 1 to epoch 5), four different learning rates (2e-6, 1e-5, 2e-5, 2e-4), two types of optimizers (SGD (Goodfellow et al., 2016) and AdamW), and two learning rate schedulers (cosine and linear).
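For concreteness, the hyperparameters quoted above pin down the effective training configuration fairly precisely. The sketch below collects them into a single config dict and derives the resulting global batch size and warmup-step count; the helper functions and variable names are ours for illustration, not from the paper, and the warmup-step formula follows the common trainer convention of total steps times warmup ratio.

```python
import math

def effective_batch_size(per_gpu_batch: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Samples consumed per optimizer step across all GPUs."""
    return per_gpu_batch * grad_accum_steps * num_gpus

def warmup_steps(total_steps: int, warmup_ratio: float) -> int:
    """Warmup steps from a warmup ratio (rounded up), as common trainers compute it."""
    return math.ceil(total_steps * warmup_ratio)

# Hyperparameters as quoted from the paper.
config = {
    "optimizer": "AdamW",
    "learning_rate": 2e-5,
    "lr_scheduler": "cosine",
    "epochs": 2,
    "warmup_ratio": 0.03,
    "per_gpu_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "max_seq_len": 1024,       # inputs truncated to 1024 tokens
    "num_gpus": 8,             # 8x NVIDIA A100-SXM4-80GB
    "deepspeed_zero_stage": 2, # ZeRO Stage 2
}

global_batch = effective_batch_size(
    config["per_gpu_batch_size"],
    config["gradient_accumulation_steps"],
    config["num_gpus"],
)
print(global_batch)  # 128 samples per optimizer step
```

With these settings each optimizer step consumes 128 samples (2 per GPU x 8 accumulation steps x 8 GPUs), so a hypothetical run of, say, 1,000 optimizer steps would spend the first 30 of them warming up the learning rate.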