DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks

Authors: Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang Gong, Diyi Yang, Xing Xie

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate various LLMs ranging from Flan-T5-large to GPT-3.5-Turbo and GPT-4. Experiments show that LLMs perform worse in DYVAL-generated evaluation samples with different complexities, highlighting the significance of dynamic evaluation. We conduct extensive experiments to provide insights for evaluating and improving LLMs.
Researcher Affiliation | Collaboration | Kaijie Zhu (1), Jiaao Chen (2), Jindong Wang (1), Neil Zhenqiang Gong (3), Diyi Yang (4), Xing Xie (1); (1) Microsoft Research, (2) Georgia Tech, (3) Duke University, (4) Stanford University
Pseudocode | No | The paper describes algorithms and processes in narrative text and figures, but does not include any formal pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at: https://github.com/microsoft/promptbench.
Open Datasets | Yes | To further demonstrate the effectiveness of our generated data, we test the models with few-shot examples on existing benchmarks including GSM8K (Cobbe et al., 2021) and SVAMP (Patel et al., 2021) to evaluate math abilities, FOLIO (Han et al., 2022) and RACO (bench authors, 2023) to evaluate the logical reasoning abilities, and DP (Dziri et al., 2023) and LCS (bench authors, 2023) to evaluate the algorithm abilities. (See the dataset-loading sketch after this table.)
Dataset Splits | No | The paper mentions 'training data' and 'testing data' (or 'test sets') with specific sample counts and difficulty levels, but does not explicitly define or refer to a 'validation' split for its own experiments or generated datasets.
Hardware Specification | Yes | All experiments are conducted on a workstation equipped with an NVIDIA V100 GPU with 16GB memory and an A100 GPU with 80GB memory. (See the GPU-check sketch after this table.)
Software Dependencies | No | The paper mentions specific OpenAI API versions ('gpt-3.5-turbo-0613' and 'gpt-4-0613') and states 'All implementations are based on Huggingface', but it does not provide version numbers for general software dependencies such as Python, PyTorch, or the Huggingface library itself. (See the version-pinning sketch after this table.)
Experiment Setup | Yes | Temperature is set to 0 to avoid randomness. We set the generation length to be directly proportional to the input length: for GPT-3.5-Turbo and GPT-4, the generation length is set to twice the input length; for the remaining models, it is set to five times the input length. We fine-tuned Llama2-13b-chat with LoRA (Hu et al., 2022) for 3 epochs, where the rank was 8, the scaling factor was 16, and the dropout rate was 0.05. We used a learning rate of 0.0003 with batch size 128. (See the fine-tuning sketch after this table.)
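
Regarding the Open Datasets row: the cited benchmarks are publicly distributed, and the sketch below shows one way to load two of them with the Hugging Face datasets library. The GSM8K hub ID and config are standard; the SVAMP hub ID used here is an assumption (a community mirror) and may differ from the copy the authors used.

```python
# Minimal sketch: loading GSM8K and SVAMP via the Hugging Face `datasets` library.
from datasets import load_dataset

# GSM8K: grade-school math word problems with train/test splits.
gsm8k = load_dataset("gsm8k", "main")
print(gsm8k["test"][0]["question"])

# SVAMP: the hub ID below is an assumption, not stated in the paper.
svamp = load_dataset("ChilleD/SVAMP")
print(svamp["train"][0])
```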
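
Regarding the Hardware Specification row: a quick way to check that local GPUs match the reported V100 (16GB) and A100 (80GB) is to query PyTorch's CUDA device properties, as in this sketch.

```python
# Minimal sketch: list the available GPUs and their memory in GiB.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GiB")
```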
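
Regarding the Software Dependencies row: the paper pins only the OpenAI model snapshots, so a reproduction would need to record local library versions itself. The sketch below assumes the openai>=1.0 Python client; the paper does not state which client version was used, and the dated snapshots may no longer be served.

```python
# Minimal sketch: record local library versions and call a pinned OpenAI snapshot.
import torch
import transformers
from openai import OpenAI

print("transformers", transformers.__version__, "| torch", torch.__version__)

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-3.5-turbo-0613",  # dated snapshot named in the paper (may since be retired)
    messages=[{"role": "user", "content": "What is 2 + 3 * 4?"}],
    temperature=0,               # deterministic decoding, as in the paper
)
print(resp.choices[0].message.content)
```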
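
Regarding the Experiment Setup row: the reported LoRA hyperparameters map directly onto a peft LoraConfig, and the generation-length rule is a simple multiplier on the input length. The sketch below is a reading of those reported settings, not the authors' script; the checkpoint name, target modules, trainer, and prompt handling are not specified in the paper and are assumed or omitted.

```python
# Minimal sketch of the reported fine-tuning and decoding settings.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-13b-chat-hf"  # assumption: standard HF checkpoint for Llama2-13b-chat
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_cfg = LoraConfig(
    r=8,               # rank, as reported
    lora_alpha=16,     # scaling factor, as reported
    lora_dropout=0.05, # dropout rate, as reported
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
# Training loop omitted; reported values: learning rate 3e-4, batch size 128, 3 epochs.

def max_generation_length(input_length: int, is_gpt_family: bool) -> int:
    """Generation budget from the paper: 2x the input length for GPT-3.5-Turbo/GPT-4,
    5x for the remaining models. Temperature is set to 0 throughout."""
    return 2 * input_length if is_gpt_family else 5 * input_length
```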