Diff-eRank: A Novel Rank-Based Metric for Evaluating Large Language Models

Authors: Lai Wei, Zhiquan Tan, Chenghai Li, Jindong Wang, Weiran Huang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To verify the effectiveness of our approach, we conduct experiments on the contexts of both uni-modal LLMs and multi-modal LLMs.
Researcher Affiliation | Academia | Lai Wei (1), Zhiquan Tan (2), Chenghai Li (4), Jindong Wang (3), Weiran Huang (1); (1) MIFA Lab, Qing Yuan Research Institute, SEIEE, Shanghai Jiao Tong University; (2) Department of Mathematical Sciences, Tsinghua University; (3) William & Mary; (4) Independent
Pseudocode | No | The paper describes definitions and mathematical formulations but does not include any structured pseudocode or algorithm blocks (a hedged sketch of the metric computation is given after this table).
Open Source Code | Yes | Our code is publicly available at https://github.com/waltonfuture/Diff-eRank.
Open Datasets | Yes | Specifically, we consider including pre-training datasets such as Wikipedia [14] and openwebtext2 [15], instruction dataset dolly-15k [8], and preference dataset hh-rlhf [2] for the diversity of their usage.
Dataset Splits | No | The paper mentions using specific datasets for evaluation (e.g., 'evaluation set of openbookqa [22] and piqa [3]') and refers to 'random sampling 10 thousand pieces of data' for subset selection, but it does not provide explicit training, validation, and test splits with percentages, sample counts, or a detailed splitting methodology for all datasets used.
Hardware Specification | Yes | We conduct our experiments using NVIDIA A800-80G GPUs.
Software Dependencies | No | The paper implies common software such as PyTorch (given the nature of the LLM experiments), but it does not specify version numbers for any software dependencies needed for replication.
Experiment Setup | No | The paper describes model choices (e.g., OPT family, LLaVA-1.5, MiniGPT-v2) and datasets, but does not provide specific hyperparameters (e.g., learning rate, batch size, optimizer) or detailed system-level training settings needed to reproduce the experimental setup.
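
For orientation on what such pseudocode could look like: the paper's metric is built on the effective rank (eRank) of model representations, with Diff-eRank taken as the change in eRank between an untrained and a trained model. The sketch below is a minimal illustration under the standard effective-rank definition (exponentiated Shannon entropy of the normalized eigenvalue spectrum of the representation covariance); it is not the authors' released implementation, and the function names and toy inputs are hypothetical.

```python
import numpy as np

def effective_rank(reps: np.ndarray) -> float:
    """eRank of a representation matrix (tokens x hidden_dim):
    exponentiated Shannon entropy of the normalized eigenvalue
    spectrum of the mean-centered covariance matrix."""
    centered = reps - reps.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / centered.shape[0]
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    p = eigvals / eigvals.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

def diff_erank(reps_untrained: np.ndarray, reps_trained: np.ndarray) -> float:
    """Diff-eRank as the eRank gap between an untrained and a trained
    model on the same input (a larger gap = more redundancy removed)."""
    return effective_rank(reps_untrained) - effective_rank(reps_trained)

# Toy usage: random matrices stand in for hidden states of the same
# sentence under an untrained and a trained model.
rng = np.random.default_rng(0)
untrained = rng.normal(size=(256, 64))               # close to full-rank noise
trained = untrained @ np.diag(0.9 ** np.arange(64))  # decaying spectrum -> lower eRank
print(diff_erank(untrained, trained))
```

In the paper's setting the two inputs would be hidden states extracted from a randomly initialized and a pre-trained checkpoint of the same architecture on the same text; the random matrices above are used only to keep the snippet self-contained.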