Diff-eRank: A Novel Rank-Based Metric for Evaluating Large Language Models
Authors: Lai Wei, Zhiquan Tan, Chenghai Li, Jindong Wang, Weiran Huang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To verify the effectiveness of our approach, we conduct experiments on the contexts of both uni-modal LLMs and multi-modal LLMs. |
| Researcher Affiliation | Academia | Lai Wei (1), Zhiquan Tan (2), Chenghai Li (4), Jindong Wang (3), Weiran Huang (1); (1) MIFA Lab, Qing Yuan Research Institute, SEIEE, Shanghai Jiao Tong University; (2) Department of Mathematical Sciences, Tsinghua University; (3) William & Mary; (4) Independent |
| Pseudocode | No | The paper presents definitions and mathematical formulations of the metric but does not include any structured pseudocode or algorithm blocks (a hedged sketch of the metric computation is given after this table). |
| Open Source Code | Yes | Our code is publicly available at https://github.com/waltonfuture/Diff-eRank. |
| Open Datasets | Yes | Specifically, we consider including pre-training datasets such as Wikipedia [14] and openwebtext2 [15], instruction dataset dolly-15k [8], and preference dataset hh-rlhf [2] for the diversity of their usage. |
| Dataset Splits | No | The paper mentions using specific datasets for evaluation (e.g., the 'evaluation set of openbookqa [22] and piqa [3]') and refers to 'random sampling 10 thousand pieces of data' for subset selection, but it does not provide explicit training, validation, and test splits with percentages, sample counts, or a detailed splitting methodology for all datasets used (see the hedged data-loading sketch after this table). |
| Hardware Specification | Yes | We conduct our experiments using NVIDIA A800-80G GPUs. |
| Software Dependencies | No | The paper does not name specific software dependencies or version numbers needed for replication; frameworks such as PyTorch are only implied by the nature of the LLM experiments. |
| Experiment Setup | No | The paper describes model choices (e.g., OPT family, LLaVA-1.5, MiniGPT-v2) and datasets, but does not provide specific hyperparameters (e.g., learning rate, batch size, optimizer) or detailed system-level training settings needed to reproduce the experimental setup. |
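The "Pseudocode" row above notes that the metric is given only as mathematical formulations. The minimal sketch below illustrates one way such formulations are commonly operationalized: the effective rank (eRank) of a representation matrix is the exponential of the Shannon entropy of its normalized singular values, and Diff-eRank is assumed here to be the drop in eRank from an untrained (randomly initialized) model to its trained counterpart. The function names, the mean-centering step, and the aggregation over samples are assumptions, not details taken from the paper; the official repository linked above is authoritative.

```python
import torch

def effective_rank(hidden_states: torch.Tensor) -> float:
    """Effective rank (eRank) of a representation matrix:
    exp of the Shannon entropy of its normalized singular values."""
    # Mean-center the token representations (assumed preprocessing step).
    R = hidden_states - hidden_states.mean(dim=0, keepdim=True)
    sigma = torch.linalg.svdvals(R.float())          # singular values
    p = sigma / sigma.sum()                          # normalize to a distribution
    entropy = -(p * torch.log(p + 1e-12)).sum()      # Shannon entropy (nats)
    return torch.exp(entropy).item()

def diff_erank(hidden_untrained: torch.Tensor, hidden_trained: torch.Tensor) -> float:
    """Hypothetical Diff-eRank: reduction in effective rank from an
    untrained (randomly initialized) model to its trained counterpart."""
    return effective_rank(hidden_untrained) - effective_rank(hidden_trained)
```

In practice the hidden states would be obtained by running tokenized text through the model with `output_hidden_states=True` and taking a chosen layer's token embeddings; which layer is used and whether the metric is averaged over the sampled data are details settled by the repository, not by this sketch.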
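The "Open Datasets" and "Dataset Splits" rows refer to public corpora and to randomly sampling 10 thousand pieces of data. The sketch below shows one way such a subset could be drawn with the Hugging Face `datasets` library; the Hub identifiers, split names, and random seed are illustrative assumptions rather than settings reported in the paper.

```python
from datasets import load_dataset

# Hypothetical loading of two of the corpora named in the paper; the Hub
# identifiers and split names are assumptions, not taken from the paper.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
hh_rlhf = load_dataset("Anthropic/hh-rlhf", split="train")

# The paper reports randomly sampling 10 thousand pieces of data per corpus;
# a seeded shuffle followed by a selection is one straightforward way to do that.
subset = hh_rlhf.shuffle(seed=42).select(range(10_000))
```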