Dynamic Evaluation of Large Language Models by Meta Probing Agents
Authors: Kaijie Zhu, Jindong Wang, Qinlin Zhao, Ruochen Xu, Xing Xie
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted extensive evaluations using MPA and found that most LLMs achieve poorer performance, indicating room for improvement. We conducted extensive evaluations and analysis on popular LLMs: GPT-4-Turbo, GPT-3.5-Turbo, Gemini-Pro (Gemini Team, 2023), Llama2-70b-chat (Touvron et al., 2023), Yi-34b-chat (01-ai, 2024), and Mixtral-8x7b-Instruct (Mistral AI Team, 2023). |
| Researcher Affiliation | Collaboration | Kaijie Zhu¹, Jindong Wang¹, Qinlin Zhao², Ruochen Xu¹, Xing Xie¹. ¹Microsoft Research, ²University of Science and Technology of China. |
| Pseudocode | No | The paper describes the Meta Probing Agents workflow using a diagram in Figure 2(b) and descriptive text, but it does not include formal pseudocode or an algorithm block. |
| Open Source Code | Yes | Code is available at: https://github.com/microsoft/promptbench. |
| Open Datasets | Yes | We selected four popular datasets for evaluation: MMLU (Hendrycks et al., 2021), ARC-Challenge (ARC-C) (Clark et al., 2018), GSM8K (Cobbe et al., 2021), and Big-Bench Hard (BBH) (Suzgun et al., 2022; Srivastava et al., 2022). (See the dataset-loading sketch after this table.) |
| Dataset Splits | No | The paper mentions using 'test sets' for evaluation and 'training split' for data augmentation, but it does not specify the exact percentages or counts for training, validation, and test splits for reproducibility. |
| Hardware Specification | No | The paper mentions models like GPT-4-Turbo and Gemini-Pro being used, and details some settings like generation temperature and token length, but it does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, scikit-learn, with their corresponding versions) used for the experiments. |
| Experiment Setup | Yes | To ensure a standardized comparison, we set the generation temperature to 0 for all models, with the generation length as 1000 tokens. We utilized GPT-4-Turbo as probing and judging agents, with temperatures of 0.7 and 0, respectively. The maximum token generation for each agent is set as 1000. (See the decoding-settings sketch after this table.) |
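
The four benchmarks listed under Open Datasets are all publicly available. Below is a minimal loading sketch using the Hugging Face `datasets` library; the hub IDs and config names are assumptions based on commonly used mirrors, not paths taken from the paper, whose own loaders ship with the promptbench repository.

```python
# Hedged sketch (not the authors' code) for pulling the four public benchmarks
# named in the Open Datasets row. Hub IDs and config names are assumptions.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="test")
arc_c = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")
gsm8k = load_dataset("gsm8k", "main", split="test")
# BIG-Bench Hard is distributed per task; one task config is shown as an example.
bbh_example = load_dataset("lukaemon/bbh", "boolean_expressions", split="test")

print(len(mmlu), len(arc_c), len(gsm8k), len(bbh_example))
```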
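
The Experiment Setup row pins down the decoding parameters but not the client code. The sketch below shows one way those settings could be applied through the OpenAI Python client; the model identifiers and prompts are illustrative placeholders, and only the temperatures (0 for evaluated models and the judging agent, 0.7 for the probing agent) and the 1000-token cap come from the paper.

```python
# Minimal sketch of the reported decoding settings, not the authors' pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate(model: str, prompt: str, temperature: float) -> str:
    """Query a chat model with the paper's reported sampling parameters."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=1000,  # generation length reported for all models and agents
    )
    return response.choices[0].message.content

# Evaluated model: temperature 0 for a standardized comparison.
answer = generate("gpt-3.5-turbo", "Question: ...", temperature=0)

# Probing agent (GPT-4-Turbo, temperature 0.7) transforms the question;
# judging agent (GPT-4-Turbo, temperature 0) verifies the transformation.
# The prompts here are placeholders, not the paper's actual agent prompts.
probe = generate("gpt-4-turbo", "Paraphrase this question: ...", temperature=0.7)
verdict = generate("gpt-4-turbo", "Does the paraphrase keep the meaning? ...", temperature=0)
```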