reproducibilityindex.ai

Cost-efficient Knowledge-based Question Answering with Large Language Models

Authors: Junnan Dong, Qinggang Zhang, Chuang Zhou, Hao Chen, Daochen Zha, Xiao Huang

NeurIPS 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments showcase the superior performance of Coke, which moves the Pareto frontier with up to 20.89% saving of GPT-4 fees while achieving a 2.74% higher accuracy on the benchmark datasets.
Researcher Affiliation	Academia	Junnan Dong¹, Qinggang Zhang¹, Chuang Zhou¹, Hao Chen¹, Daochen Zha², Xiao Huang¹. ¹The Hong Kong Polytechnic University; ²Rice University
Pseudocode	No	The paper provides mathematical formulations and descriptions of the algorithm but does not include a clearly labeled pseudocode or algorithm block.
Open Source Code	Yes	To contribute and inspire more valuable research in the community, we have open-sourced our main codes for reproducibility. The codes could be found from this anonymous link: https://anonymous.4open.science/r/NeurIPS-24-Coke-Anonymous13626/main.py
Open Datasets	Yes	We conduct experiments on three domain-specific datasets: (i) Commonsense knowledge domain: CommonsenseQA [35]; (ii) Scientific Openbook domain: OpenBookQA [28]; (iii) Medical Domain: MedQA-USMLE [23].
Dataset Splits	Yes	Table 1: Performance comparison among state-of-the-art baselines and Coke on three benchmark datasets in terms of both inferential accuracy and cost saving ($ API fees). Column headers: Model; CommonsenseQA (IHdev-Acc., IHtest-Acc.); OpenBookQA (Dev-Acc., Test-Acc.); MedQA (Dev-Acc., Test-Acc.).
Hardware Specification	Yes	To accelerate the matrix computation, we adopt Torch to boost the selection on an NVIDIA GeForce RTX 4090 GPU.
Software Dependencies	No	The paper mentions using 'Torch' but does not provide specific version numbers for it or any other software dependencies required for reproducibility.
Experiment Setup	Yes	In this subsection, we conduct a detailed analysis of the important hyperparameters, i.e., λ and B. We decrease the budget from 1 to 0.5 until Coke has a higher error rate than GPT-4, B ∈ {0.5, 0.6, 0.7, ..., 1}.
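
The budget sweep quoted under Experiment Setup can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `budget_grid` and `smallest_safe_budget` are hypothetical, and `error_at` is a placeholder for whatever evaluation would return Coke's error rate at a given budget B. The sketch only shows the stated procedure of lowering B from 1 to 0.5 in steps of 0.1 and stopping once the error rate exceeds the GPT-4 baseline.

```python
from typing import Callable, List, Optional


def budget_grid(start: float = 1.0, stop: float = 0.5, step: float = 0.1) -> List[float]:
    """Budgets from `start` down to `stop` inclusive (hypothetical helper).

    Values are rounded so that floating-point drift does not produce
    entries such as 0.7000000000000001.
    """
    n = int(round((start - stop) / step))
    return [round(start - i * step, 10) for i in range(n + 1)]


def smallest_safe_budget(
    error_at: Callable[[float], float],  # assumed: error rate measured at budget B
    baseline_error: float,               # assumed: error rate of the GPT-4 baseline
) -> Optional[float]:
    """Walk the budgets downward and stop at the first one whose error
    rate exceeds the baseline; return the last budget before that point,
    or None if even the full budget is worse than the baseline."""
    last_ok = None
    for b in budget_grid():
        if error_at(b) > baseline_error:
            break
        last_ok = b
    return last_ok
```

In practice `error_at` would wrap a full evaluation run at each budget, so the early stop also avoids paying for budgets that are already known to underperform the baseline.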