DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph

Authors: Zhehao Zhang, Jiaao Chen, Diyi Yang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply our DARG framework to diverse reasoning tasks in four domains with 15 state-of-the-art LLMs. Experimental results show that almost all LLMs experience a performance decrease with increased complexity and certain LLMs exhibit significant drops.
Researcher Affiliation | Academia | Zhehao Zhang, Dartmouth College, zhehao.zhang.gr@dartmouth.edu; Jiaao Chen, Georgia Institute of Technology, jiaaochen@gatech.edu; Diyi Yang, Stanford University, diyiy@cs.stanford.edu
Pseudocode | Yes | Algorithm 1: Algorithm of DARG
Open Source Code | Yes | The code is available at https://github.com/SALT-NLP/DARG.
Open Datasets | Yes | For each of the tasks, we utilized the most used datasets, specifically, GSM8K [19] for math reasoning, BBQ [2] for social reasoning, the BBH Navigate [91] dataset for spatial reasoning, and BBH Dyck Language for symbolic reasoning.
Dataset Splits | Yes | We construct a hold-out validation set, which contains 0.05% of the data points from each complexity dimension generated by DARG, and the others are used for training.
Hardware Specification | Yes | Other models are run locally on a machine with an NVIDIA A100 GPU with 40 GB of GPU memory and a 12-core CPU.
Software Dependencies | Yes | For fine-tuning and subsequent inference, we employ LitGPT [3] along with its default hyperparameters (learning_rate=0.0003, weight_decay=0.02, beta1=0.9, beta2=0.95, max_norm=None, min_lr=6e-05, epochs=5) and LoRA [37].
Experiment Setup | Yes | For graph construction and graph-to-text decoding, we set the temperature to 1. For all evaluation experiments, we set the temperature to 0.1 to ensure reproducibility and the top_p to 0.95. For fine-tuning and subsequent inference, we employ LitGPT [3] along with its default hyperparameters (learning_rate=0.0003, weight_decay=0.02, beta1=0.9, beta2=0.95, max_norm=None, min_lr=6e-05, epochs=5) and LoRA [37].
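
To make the reported settings easier to reuse, the sketch below collects the sampling and fine-tuning hyperparameters quoted in the Experiment Setup and Software Dependencies rows above into one place. This is a minimal illustration under assumptions, not the authors' code: the `query_model` helper and the OpenAI-style chat client are hypothetical, and only the numeric values (temperatures, top_p, learning rate, weight decay, betas, min LR, epochs) come from the excerpts in the table.

```python
from dataclasses import dataclass

# Sampling settings quoted in the "Experiment Setup" row.
GRAPH_CONSTRUCTION_TEMPERATURE = 1.0   # graph construction and graph-to-text decoding
EVAL_TEMPERATURE = 0.1                 # evaluation runs, chosen for reproducibility
EVAL_TOP_P = 0.95


@dataclass
class LoRAFinetuneConfig:
    """LitGPT default hyperparameters reported for LoRA fine-tuning."""
    learning_rate: float = 3e-4
    weight_decay: float = 0.02
    beta1: float = 0.9
    beta2: float = 0.95
    max_norm: float | None = None
    min_lr: float = 6e-5
    epochs: int = 5


def query_model(client, model: str, prompt: str, *, evaluation: bool = True) -> str:
    """Hypothetical helper: send one prompt with the reported sampling settings.

    `client` is assumed to be an OpenAI-compatible chat client; a locally hosted
    model would need its own interface but the same temperature/top_p values.
    """
    temperature = EVAL_TEMPERATURE if evaluation else GRAPH_CONSTRUCTION_TEMPERATURE
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        top_p=EVAL_TOP_P,
    )
    return response.choices[0].message.content
```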