Cost-efficient Knowledge-based Question Answering with Large Language Models
Authors: Junnan Dong, Qinggang Zhang, Chuang Zhou, Hao Chen, Daochen Zha, Xiao Huang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments showcase the superior performance of Coke, which moves the Pareto frontier with up to 20.89% saving of GPT-4 fees while achieving a 2.74% higher accuracy on the benchmark datasets. |
| Researcher Affiliation | Academia | Junnan Dong1, Qinggang Zhang1, Chuang Zhou1, Hao Chen1, Daochen Zha2, Xiao Huang1 — 1 The Hong Kong Polytechnic University, 2 Rice University |
| Pseudocode | No | The paper provides mathematical formulations and descriptions of the algorithm but does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | To contribute and inspire more valuable research in the community, we have open-sourced our main codes for reproducibility. The codes could be found from this anonymous link: https://anonymous.4open.science/r/NeurIPS-24-Coke-Anonymous13626/main.py |
| Open Datasets | Yes | We conduct experiments on three domain-specific datasets: (i) Commonsense knowledge domain: CommonsenseQA [35]; (ii) Scientific Openbook domain: OpenBookQA [28]; (iii) Medical Domain: MedQA-USMLE [23]. |
| Dataset Splits | Yes | Table 1: Performance comparison among state-of-the-art baselines and Coke on three benchmark datasets in terms of both inferential accuracy and cost saving ($ API fees). Columns report, per model, CommonsenseQA (IHdev-Acc., IHtest-Acc.), OpenBookQA (Dev-Acc., Test-Acc.), and MedQA (Dev-Acc., Test-Acc.). |
| Hardware Specification | Yes | To accelerate the matrix computation, we adopt Torch to boost the selection on an NVIDIA GeForce RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions using 'Torch' but does not provide version numbers for it or for any other software dependencies required for reproducibility. |
| Experiment Setup | Yes | In this subsection, we conduct a detailed analysis of the important hyperparameters, i.e., λ and B. We decrease the budget from 1 to 0.5 until Coke has a higher error rate than GPT-4, with B ∈ {0.5, 0.6, 0.7, ..., 1}. |
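
The budget sweep quoted in the Experiment Setup row can be illustrated with a simple loop. The sketch below is purely illustrative and is not taken from the released Coke code: `coke_error_rate` and `GPT4_ERROR_RATE` are hypothetical placeholders standing in for the actual evaluation pipeline and the reference GPT-4 error rate.

```python
# Minimal illustrative sketch of the described budget analysis:
# decrease the budget B from 1.0 toward 0.5 in steps of 0.1 and stop
# once Coke's error rate exceeds GPT-4's error rate.
# NOTE: coke_error_rate() and GPT4_ERROR_RATE are hypothetical
# placeholders, not part of the paper's released implementation.

GPT4_ERROR_RATE = 0.20  # assumed reference error rate of plain GPT-4


def coke_error_rate(budget: float) -> float:
    """Placeholder: evaluate Coke under the given cost budget B
    and return its error rate on the dev set."""
    # In the real experiments this would run the Coke pipeline on a
    # benchmark; here we fabricate a monotone trend for illustration.
    return 0.18 + 0.10 * (1.0 - budget)


for b in (1.0, 0.9, 0.8, 0.7, 0.6, 0.5):
    err = coke_error_rate(b)
    print(f"B={b:.1f}  error rate={err:.3f}")
    if err > GPT4_ERROR_RATE:
        print(f"Stopping: error rate exceeds GPT-4's at B={b:.1f}")
        break
```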