Diff-eRank: A Novel Rank-Based Metric for Evaluating Large Language Models

Authors: Lai Wei, Zhiquan Tan, Chenghai Li, Jindong Wang, Weiran Huang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To verify the effectiveness of our approach, we conduct experiments on the contexts of both uni-modal LLMs and multi-modal LLMs.
Researcher Affiliation | Academia | Lai Wei (1), Zhiquan Tan (2), Chenghai Li (4), Jindong Wang (3), Weiran Huang (1); (1) MIFA Lab, Qing Yuan Research Institute, SEIEE, Shanghai Jiao Tong University; (2) Department of Mathematical Sciences, Tsinghua University; (3) William & Mary; (4) Independent
Pseudocode | No | The paper describes definitions and mathematical formulations but does not include any structured pseudocode or algorithm blocks (a hedged sketch of the metric computation is given after this table).
Open Source Code | Yes | Our code is publicly available at https://github.com/waltonfuture/Diff-eRank.
Open Datasets | Yes | Specifically, we consider including pre-training datasets such as Wikipedia [14] and openwebtext2 [15], instruction dataset dolly-15k [8], and preference dataset hh-rlhf [2] for the diversity of their usage.
Dataset Splits | No | The paper mentions using specific datasets for evaluation (e.g., 'evaluation set of openbookqa [22] and piqa [3]') and refers to 'random sampling 10 thousand pieces of data' for subset selection, but it does not provide explicit training, validation, and test splits with percentages, sample counts, or a detailed splitting methodology for all datasets used.
Hardware Specification | Yes | We conduct our experiments using NVIDIA A800-80G GPUs.
Software Dependencies | No | The paper implies common software such as PyTorch (given the nature of the LLM experiments), but it does not specify version numbers for any software dependencies needed for replication.
Experiment Setup | No | The paper describes model choices (e.g., OPT family, LLaVA-1.5, MiniGPT-v2) and datasets, but does not provide specific hyperparameters (e.g., learning rate, batch size, optimizer) or detailed system-level training settings needed to reproduce the experimental setup.
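
For orientation on what such pseudocode could look like: the paper's metric is built on the effective rank (eRank) of model representations, with Diff-eRank taken as the change in eRank between an untrained and a trained model. The sketch below is a minimal illustration under the standard effective-rank definition (exponentiated Shannon entropy of the normalized eigenvalue spectrum of the representation covariance); it is not the authors' released implementation, and the function names and toy inputs are hypothetical.

```python
import numpy as np

def effective_rank(reps: np.ndarray) -> float:
    """eRank of a representation matrix (tokens x hidden_dim):
    exponentiated Shannon entropy of the normalized eigenvalue
    spectrum of the mean-centered covariance matrix."""
    centered = reps - reps.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / centered.shape[0]
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    p = eigvals / eigvals.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

def diff_erank(reps_untrained: np.ndarray, reps_trained: np.ndarray) -> float:
    """Diff-eRank as the eRank gap between an untrained and a trained
    model on the same input (a larger gap = more redundancy removed)."""
    return effective_rank(reps_untrained) - effective_rank(reps_trained)

# Toy usage: random matrices stand in for hidden states of the same
# sentence under an untrained and a trained model.
rng = np.random.default_rng(0)
untrained = rng.normal(size=(256, 64))               # close to full-rank noise
trained = untrained @ np.diag(0.9 ** np.arange(64))  # decaying spectrum -> lower eRank
print(diff_erank(untrained, trained))
```

In the paper's setting the two inputs would be hidden states extracted from a randomly initialized and a pre-trained checkpoint of the same architecture on the same text; the random matrices above are used only to keep the snippet self-contained.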