PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
Authors: Yidong Wang, Zhuohao Yu, Wenjin Yao, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, Yue Zhang
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On evaluations using our collected test dataset, our findings reveal that PandaLM-7B offers performance comparable to both GPT-3.5 and GPT-4. Impressively, PandaLM-70B surpasses their performance. |
| Researcher Affiliation | Collaboration | Yidong Wang1,2, Zhuohao Yu1, Wenjin Yao1, Zhengran Zeng1, Linyi Yang2, Cunxiang Wang2, Hao Chen3, Chaoya Jiang1, Rui Xie1, Jindong Wang3, Xing Xie3, Wei Ye1, Shikun Zhang1, Yue Zhang2 (1Peking University, 2Westlake University, 3Microsoft Research Asia) |
| Pseudocode | No | Appendix A (Figures 5 and 6) illustrates a training data example and the prompt for training PandaLM, but these show a data format and a prompt, not structured pseudocode or an algorithm block detailing the method's steps. |
| Open Source Code | No | The paper states, 'PandaLM introduces unique advantages that are not present in models like GPT-3.5 and GPT-4. It offers open-source availability, enabling reproducibility, and protecting data privacy.' However, it does not provide a direct link to the code repository or an explicit statement confirming the release of their code for the described methodology. |
| Open Datasets | Yes | The instructions and inputs in the input tuple are sampled from the Alpaca 52K dataset (Taori et al., 2023). [...] The test data is sampled from the diverse human evaluation dataset of self-instruct (Wang et al., 2022c). [...] Specifically, we assess PandaLM's proficiency using the LSAT (Law School Admission Test) dataset, which serves as an entrance exam question set for American law schools. [...] In the realm of biomedicine, we use the PubMedQA dataset [...] Additionally, we tap into the BioASQ dataset |
| Dataset Splits | Yes | The assessment is conducted on a validation set comprising 170 distinct instructions and inputs obtained from our 1K test set introduced in Section 4. |
| Hardware Specification | Yes | We train PandaLM with the DeepSpeed (Rasley et al., 2020) library, and Zero Redundancy Optimizer (ZeRO) (Rajbhandari et al., 2020; Ren et al., 2021) Stage 2, on 8 NVIDIA A100-SXM4-80GB GPUs. (A hedged configuration sketch follows this table.) |
| Software Dependencies | No | The paper mentions key software components like the 'DeepSpeed library', 'Zero Redundancy Optimizer (ZeRO) Stage 2', and 'AdamW optimizer' but does not specify their version numbers, which are necessary for full reproducibility. |
| Experiment Setup | Yes | Regarding the training hyperparameters, we apply the AdamW (Loshchilov & Hutter, 2017) optimizer with a learning rate of 2e-5 and a cosine learning rate scheduler. The model is trained for 2 epochs. The training process uses a warmup ratio of 0.03 to avoid large gradients at the beginning of training. We use a batch size of 2 per GPU with all inputs truncated to a maximum of 1024 tokens and employ a gradient accumulation strategy with 8 steps. [...] Specifically, we explore checkpoints from each epoch (ranging from epoch 1 to epoch 5), four different learning rates (2e-6, 1e-5, 2e-5, 2e-4), two types of optimizers (SGD (Goodfellow et al., 2016) and AdamW), and two learning rate schedulers (cosine and linear). (See the configuration sketch after this table.) |
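
To make the quoted training setup concrete, below is a minimal sketch of how the reported hyperparameters and the DeepSpeed ZeRO Stage 2 setting could be expressed with Hugging Face `TrainingArguments`. This is an illustration, not the authors' training script: the output directory, the bf16 precision flag, the extra ZeRO sub-options, and the use of the `transformers` Trainer stack are assumptions; only the numeric values (epochs, learning rate, scheduler, warmup ratio, per-GPU batch size, gradient accumulation, 1024-token truncation) come from the quotes above.

```python
from transformers import TrainingArguments

# Hypothetical DeepSpeed ZeRO Stage 2 config consistent with the quoted setup
# (8x A100-80GB, batch size 2 per GPU, 8 gradient-accumulation steps);
# NOT the authors' released configuration.
ds_zero2_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 8,
    "zero_optimization": {
        "stage": 2,                   # shard optimizer states and gradients across the 8 GPUs
        "overlap_comm": True,         # assumption: common default, not stated in the paper
        "contiguous_gradients": True, # assumption: common default, not stated in the paper
    },
    "bf16": {"enabled": True},        # assumption: precision is not specified in the paper
}

# Hypothetical mapping of the quoted hyperparameters onto TrainingArguments.
training_args = TrainingArguments(
    output_dir="pandalm-7b",            # hypothetical output path
    num_train_epochs=2,                 # "trained for 2 epochs"
    per_device_train_batch_size=2,      # "batch size of 2 per GPU"
    gradient_accumulation_steps=8,      # "gradient accumulation strategy with 8 steps"
    learning_rate=2e-5,                 # AdamW with a learning rate of 2e-5
    optim="adamw_torch",                # AdamW optimizer (Loshchilov & Hutter, 2017)
    lr_scheduler_type="cosine",         # cosine learning rate scheduler
    warmup_ratio=0.03,                  # warmup ratio of 0.03
    bf16=True,                          # assumption; kept consistent with the DeepSpeed config above
    deepspeed=ds_zero2_config,          # ZeRO Stage 2 (TrainingArguments accepts a dict or a JSON path)
)

# The 1024-token limit is applied at tokenization time rather than here, e.g.
# tokenizer(example, truncation=True, max_length=1024).
```

Under these assumptions, the arguments would be handed to a `transformers.Trainer` together with the tokenized Alpaca-derived training pairs and launched across the 8 GPUs (e.g. with the `deepspeed` or `torchrun` launcher).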