Dynamic Evaluation of Large Language Models by Meta Probing Agents
Authors: Kaijie Zhu, Jindong Wang, Qinlin Zhao, Ruochen Xu, Xing Xie
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted extensive evaluations using MPA and found that most LLMs achieve poorer performance, indicating room for improvement. We conducted extensive evaluations and analysis on popular LLMs: GPT-4-Turbo, GPT-3.5-Turbo, Gemini-Pro (Gemini Team, 2023), Llama2-70b-chat (Touvron et al., 2023), Yi-34b-chat (01-ai, 2024), and Mixtral-8x7b-Instruct (Mistral AI Team, 2023). |
| Researcher Affiliation | Collaboration | Kaijie Zhu¹, Jindong Wang¹, Qinlin Zhao², Ruochen Xu¹, Xing Xie¹. ¹Microsoft Research, ²University of Science and Technology of China. |
| Pseudocode | No | The paper describes the Meta Probing Agents workflow using a diagram in Figure 2(b) and descriptive text, but it does not include formal pseudocode or an algorithm block. |
| Open Source Code | Yes | Code is available at: https://github.com/microsoft/promptbench. |
| Open Datasets | Yes | We selected four popular datasets for evaluation: MMLU (Hendrycks et al., 2021), ARC-Challenge (ARC-C) (Clark et al., 2018), GSM8K (Cobbe et al., 2021), and Big-Bench Hard (BBH) (Suzgun et al., 2022; Srivastava et al., 2022). (See the dataset-loading sketch after this table.) |
| Dataset Splits | No | The paper mentions using 'test sets' for evaluation and 'training split' for data augmentation, but it does not specify the exact percentages or counts for training, validation, and test splits for reproducibility. |
| Hardware Specification | No | The paper mentions models like GPT-4-Turbo and Gemini-Pro being used, and details some settings like generation temperature and token length, but it does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, scikit-learn, with their corresponding versions) used for the experiments. |
| Experiment Setup | Yes | To ensure a standardized comparison, we set the generation temperature to 0 for all models, with the generation length as 1000 tokens. We utilized GPT-4-Turbo as probing and judging agents, with temperatures of 0.7 and 0, respectively. The maximum token generation for each agent is set as 1000. (See the decoding-settings sketch after this table.) |
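
The four benchmarks listed under Open Datasets are all publicly available. Below is a minimal loading sketch using the Hugging Face `datasets` library; the hub IDs and config names are assumptions based on commonly used mirrors, not paths taken from the paper, whose own loaders ship with the promptbench repository.

```python
# Hedged sketch (not the authors' code) for pulling the four public benchmarks
# named in the Open Datasets row. Hub IDs and config names are assumptions.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="test")
arc_c = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")
gsm8k = load_dataset("gsm8k", "main", split="test")
# BIG-Bench Hard is distributed per task; one task config is shown as an example.
bbh_example = load_dataset("lukaemon/bbh", "boolean_expressions", split="test")

print(len(mmlu), len(arc_c), len(gsm8k), len(bbh_example))
```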
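
The Experiment Setup row pins down the decoding parameters but not the client code. The sketch below shows one way those settings could be applied through the OpenAI Python client; the model identifiers and prompts are illustrative placeholders, and only the temperatures (0 for evaluated models and the judging agent, 0.7 for the probing agent) and the 1000-token cap come from the paper.

```python
# Minimal sketch of the reported decoding settings, not the authors' pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate(model: str, prompt: str, temperature: float) -> str:
    """Query a chat model with the paper's reported sampling parameters."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=1000,  # generation length reported for all models and agents
    )
    return response.choices[0].message.content

# Evaluated model: temperature 0 for a standardized comparison.
answer = generate("gpt-3.5-turbo", "Question: ...", temperature=0)

# Probing agent (GPT-4-Turbo, temperature 0.7) transforms the question;
# judging agent (GPT-4-Turbo, temperature 0) verifies the transformation.
# The prompts here are placeholders, not the paper's actual agent prompts.
probe = generate("gpt-4-turbo", "Paraphrase this question: ...", temperature=0.7)
verdict = generate("gpt-4-turbo", "Does the paraphrase keep the meaning? ...", temperature=0)
```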