Cost-efficient Knowledge-based Question Answering with Large Language Models

Authors: Junnan Dong, Qinggang Zhang, Chuang Zhou, Hao Chen, Daochen Zha, Xiao Huang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments showcase the superior performance of Coke, which moves the Pareto frontier with up to 20.89% saving of GPT-4 fees while achieving a 2.74% higher accuracy on the benchmark datasets.
Researcher Affiliation | Academia | Junnan Dong¹, Qinggang Zhang¹, Chuang Zhou¹, Hao Chen¹, Daochen Zha², Xiao Huang¹ (¹ The Hong Kong Polytechnic University, ² Rice University)
Pseudocode | No | The paper provides mathematical formulations and descriptions of the algorithm but does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | To contribute and inspire more valuable research in the community, we have open-sourced our main codes for reproducibility. The codes could be found from this anonymous link: https://anonymous.4open.science/r/NeurIPS-24-Coke-Anonymous13626/main.py
Open Datasets | Yes | We conduct experiments on three domain-specific datasets: (i) Commonsense knowledge domain: CommonsenseQA [35]; (ii) Scientific Openbook domain: OpenBookQA [28]; (iii) Medical Domain: MedQA-USMLE [23].
Dataset Splits | Yes | Table 1: Performance comparison among state-of-the-art baselines and Coke on three benchmark datasets in terms of both inferential accuracy and cost saving ($ API fees). Columns: Model; CommonsenseQA (IHdev-Acc., IHtest-Acc.); OpenBookQA (Dev-Acc., Test-Acc.); MedQA (Dev-Acc., Test-Acc.).
Hardware Specification | Yes | To accelerate the matrix computation, we adopt Torch to boost the selection on an NVIDIA GeForce RTX 4090 GPU.
Software Dependencies | No | The paper mentions using 'Torch' but does not provide specific version numbers for it or any other software dependencies required for reproducibility.
Experiment Setup | Yes | In this subsection, we conduct a detailed analysis of the important hyperparameters, i.e., λ and B. We decrease the budget from 1 to 0.5 until Coke has a higher error rate than GPT-4, B ∈ {0.5, 0.6, 0.7, ..., 1}.
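
The Experiment Setup row describes sweeping the budget B from 1 down to 0.5 in steps of 0.1 and stopping once Coke's error rate exceeds that of the GPT-4 baseline. Below is a minimal sketch of such a sweep; `run_coke` and `run_gpt4_baseline` are hypothetical stand-ins for the paper's pipeline (they are not part of the released code), each assumed to return an accuracy and an API fee on a fixed evaluation split.

```python
# Hedged sketch of the budget sweep quoted in the Experiment Setup row.
# `run_coke` and `run_gpt4_baseline` are assumed callables, not the authors' API:
# each returns (accuracy, api_cost) on the same fixed evaluation split.

def sweep_budget(run_coke, run_gpt4_baseline,
                 budgets=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5)):
    """Decrease the budget from 1 to 0.5 and stop once Coke's error rate
    exceeds that of the GPT-4 baseline, as described in the paper."""
    gpt4_acc, gpt4_cost = run_gpt4_baseline()        # reference accuracy and fee
    results = []
    for b in budgets:                                # B in {1.0, 0.9, ..., 0.5}
        coke_acc, coke_cost = run_coke(budget=b)     # run Coke under budget b
        results.append({
            "budget": b,
            "accuracy": coke_acc,
            "cost_saving": 1.0 - coke_cost / gpt4_cost,  # fraction of GPT-4 fees saved
        })
        if (1.0 - coke_acc) > (1.0 - gpt4_acc):      # higher error rate than GPT-4
            break                                    # stop decreasing the budget
    return results
```

This only illustrates the stopping rule from the quoted setup; the actual selection policy, λ, and cost accounting are defined in the paper and its open-sourced code.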