Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Cost-efficient Knowledge-based Question Answering with Large Language Models
Authors: Junnan Dong, Qinggang Zhang, Chuang Zhou, Hao Chen, Daochen Zha, Xiao Huang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments showcase the superior performance of Coke, which moves the Pareto frontier with up to 20.89% saving of GPT-4 fees while achieving a 2.74% higher accuracy on the benchmark datasets. |
| Researcher Affiliation | Academia | Junnan Dong¹, Qinggang Zhang¹, Chuang Zhou¹, Hao Chen¹, Daochen Zha², Xiao Huang¹; ¹The Hong Kong Polytechnic University, ²Rice University |
| Pseudocode | No | The paper provides mathematical formulations and descriptions of the algorithm but does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | To contribute and inspire more valuable research in the community, we have open-sourced our main codes for reproducibility. The codes could be found from this anonymous link: https://anonymous.4open.science/r/NeurIPS-24-Coke-Anonymous13626/main.py |
| Open Datasets | Yes | We conduct experiments on three domain-specific datasets: (i) Commonsense knowledge domain: CommonsenseQA [35]; (ii) Scientific Openbook domain: OpenBookQA [28]; (iii) Medical Domain: MedQA-USMLE [23]. |
| Dataset Splits | Yes | Table 1: Performance comparison among state-of-the-art baselines and Coke on three benchmark datasets in terms of both inferential accuracy and cost saving ($ API fees). Column headers: Model; CommonsenseQA (IHdev-Acc., IHtest-Acc.); OpenBookQA (Dev-Acc., Test-Acc.); MedQA (Dev-Acc., Test-Acc.). |
| Hardware Specification | Yes | To accelerate the matrix computation, we adopt Torch to boost the selection on an NVIDIA GeForce RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions using 'Torch' but does not provide specific version numbers for it or any other software dependencies required for reproducibility. |
| Experiment Setup | Yes | In this subsection, we conduct a detailed analysis of the important hyperparameters, i.e., λ and B. We decrease the budget from 1 to 0.5 until Coke has a higher error rate than GPT-4: B ∈ {0.5, 0.6, 0.7, ..., 1}. |