ChatQA: Surpassing GPT-4 on Conversational QA and RAG
Authors: Zihan Liu, Wei Ping, Rajarshi Roy, Peng Xu, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we introduce ChatQA, a suite of models that outperform GPT-4 on retrieval-augmented generation (RAG) and conversational question answering (QA). ... We also present the CHATRAG BENCH, which encompasses ten datasets covering comprehensive evaluations on RAG, table-related QA, arithmetic calculations, and scenarios involving unanswerable questions. Our ChatQA-1.0-70B (score: 54.14), built on Llama2, a weaker foundation model than GPT-4, can slightly outperform GPT-4-0613 (score: 53.90) and GPT-4-Turbo-2024-04-09 (score: 54.03) on the CHATRAG BENCH, without relying on any synthetic data from OpenAI GPT models. Notably, the Llama3-ChatQA-1.5-70B model surpasses the accuracy of GPT-4-Turbo-2024-04-09 by a margin. |
| Researcher Affiliation | Industry | Zihan Liu, Wei Ping, Rajarshi Roy, Peng Xu, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro (NVIDIA). Correspondence to: Zihan Liu <zihanl@nvidia.com>, Wei Ping <wping@nvidia.com> |
| Pseudocode | No | The paper describes methods in text and uses diagrams, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | To advance research in this field, we open-sourced the model weights, instruction tuning data, CHATRAG BENCH, and retriever for the community: https://chatqa-project.github.io/. |
| Open Datasets | Yes | To construct a large and comprehensive supervised fine-tuning (SFT) dataset, we follow Xu et al. (2023b), Wang et al. (2024) and gather a combined set of 128K SFT samples from high-quality instruction tuning datasets. It consists of 1) a social dialogue dataset SODA (Kim et al., 2022), 2) a long-form QA dataset ELI5 containing elaborate answers (Fan et al., 2019), 3) FLAN and chain-of-thought datasets (Wei et al., 2022b; Chung et al., 2022; Longpre et al., 2023), 4) LLM synthetic instruction tuning datasets, including Self-Instruct (Wang et al., 2022b) and Unnatural Instructions (Honovich et al., 2022), and 5) a private crowd-sourced conversational dataset, as well as two public human-written conversation datasets: OpenAssistant (Köpf et al., 2023) and Dolly (Conover et al., 2023a). Finally, the training blend for stage-2 consists of: 1) a conversational QA dataset: Human Annotated ConvQA or Synthetic ConvQA, 2) single-turn QA datasets: DROP (Dua et al., 2019), NarrativeQA (Kočiský et al., 2018), Quoref (Dasigi et al., 2019), ROPES (Lin et al., 2019), SQuAD 1.1 (Rajpurkar et al., 2016), SQuAD 2.0 (Rajpurkar et al., 2018), NewsQA (Trischler et al., 2017), TAT-QA (Zhu et al., 2021), and 3) all of the SFT datasets from stage-1. For the training of Llama3-ChatQA-1.5, we additionally incorporate HybriDial (Nakamura et al., 2022) and around 2K QA pairs we collected within the financial domain to further improve the model's capability in tabular understanding and arithmetic calculations. (A sketch of the stage-2 blend follows the table.) |
| Dataset Splits | Yes | We use the validation set of QuAC for the evaluation since its test set cannot be directly obtained. Its validation set consists of 1000 dialogs with 7354 user-agent turns. ... We use the validation set of TopiOCQA since its test set is not available yet. Its validation set consists of 205 dialogs with 2514 user-agent turns. ... We use the validation set of CoQA since its test set cannot be directly obtained. Its validation set consists of 500 dialogues with 7983 user-agent turns. (A loading sketch for these splits follows the table.) |
| Hardware Specification | Yes | We use 256 NVIDIA A100 GPUs for training the ChatQA-1.0-70B and Llama3-ChatQA-1.5-70B models, and it takes around three hours for stage-1 training and around six hours for stage-2 training. We use 64 NVIDIA A100 GPUs for training the ChatQA-1.0-7B and Llama3-ChatQA-1.5-8B models, and it takes around one and a half hours for stage-1 training and around three hours for stage-2 training. (A GPU-hours estimate follows the table.) |
| Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | For all ChatQA models, in stage-1 SFT, we use a learning rate of 5e-6 and train for 1000 iterations with a global batch size of 128; in stage-2 instruction tuning, we use a learning rate of 3e-7 and train for 3300 iterations with a global batch size of 64. (A config sketch of this schedule follows the table.) |
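
The stage-2 training blend described in the Open Datasets row can be summarized as a simple configuration. The sketch below is illustrative only: the group and dataset identifiers are assumptions chosen for readability, not the authors' released data-loading code, and no mixing weights are implied.

```python
# Hypothetical summary of the stage-2 blend from the "Open Datasets" row.
# Identifiers are illustrative; they are not the authors' released config.
STAGE2_BLEND = {
    # 1) conversational QA (the paper uses either the human-annotated or the
    #    synthetic variant)
    "conversational_qa": ["human_annotated_convqa", "synthetic_convqa"],
    # 2) single-turn QA datasets
    "single_turn_qa": [
        "drop", "narrativeqa", "quoref", "ropes",
        "squad_v1.1", "squad_v2.0", "newsqa", "tatqa",
    ],
    # 3) all stage-1 SFT datasets (social dialogue, long-form QA, FLAN/CoT,
    #    synthetic instructions, human-written conversations)
    "stage1_sft": [
        "soda", "eli5", "flan_cot", "self_instruct",
        "unnatural_instructions", "open_assistant", "dolly",
    ],
}

if __name__ == "__main__":
    for group, names in STAGE2_BLEND.items():
        print(f"{group}: {', '.join(names)}")
```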
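
For the Dataset Splits row, a minimal way to materialize the three validation sets is via the Hugging Face `datasets` library. The hub identifiers below are assumptions (the paper does not name specific hosted copies); swap in whichever mirrors of QuAC, TopiOCQA, and CoQA you actually use.

```python
# Hedged sketch: load the validation splits named in the "Dataset Splits" row.
# The hub identifiers are assumptions, not taken from the paper.
from datasets import load_dataset

SPLITS = {
    "quac": ("allenai/quac", "validation"),             # assumed hub id
    "topiocqa": ("McGill-NLP/TopiOCQA", "validation"),  # assumed hub id
    "coqa": ("stanfordnlp/coqa", "validation"),         # assumed hub id
}

for name, (hub_id, split) in SPLITS.items():
    ds = load_dataset(hub_id, split=split)
    # The paper reports 1000 QuAC dialogs (7354 turns), 205 TopiOCQA dialogs
    # (2514 turns), and 500 CoQA dialogues (7983 turns) for these splits.
    print(f"{name}: {len(ds)} rows in the '{split}' split")
```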
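
The figures in the Hardware Specification row imply a rough compute budget. The back-of-the-envelope calculation below simply multiplies GPU count by the reported wall-clock hours; the times are approximate as quoted.

```python
# Rough GPU-hours implied by the "Hardware Specification" row
# (GPU count x reported stage-1 + stage-2 wall-clock hours).
runs = {
    "ChatQA-1.0-70B / Llama3-ChatQA-1.5-70B": (256, 3.0 + 6.0),
    "ChatQA-1.0-7B / Llama3-ChatQA-1.5-8B": (64, 1.5 + 3.0),
}

for name, (gpus, hours) in runs.items():
    print(f"{name}: ~{gpus * hours:,.0f} A100 GPU-hours")
# ~2,304 GPU-hours for the 70B models, ~288 for the 7B/8B models.
```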
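
The hyperparameters in the Experiment Setup row fit naturally into a small two-stage config. This is a sketch only, assuming a generic trainer interface; the dataclass and field names are illustrative, while the values are those quoted above.

```python
from dataclasses import dataclass

# Sketch of the two-stage schedule from the "Experiment Setup" row.
# Field names are illustrative; values are those reported for all ChatQA models.
@dataclass(frozen=True)
class StageConfig:
    learning_rate: float
    iterations: int
    global_batch_size: int

STAGE1_SFT = StageConfig(learning_rate=5e-6, iterations=1000, global_batch_size=128)
STAGE2_TUNING = StageConfig(learning_rate=3e-7, iterations=3300, global_batch_size=64)

for name, cfg in [("stage-1 SFT", STAGE1_SFT), ("stage-2 tuning", STAGE2_TUNING)]:
    # iterations x global batch size gives the approximate number of samples seen
    print(f"{name}: lr={cfg.learning_rate}, "
          f"~{cfg.iterations * cfg.global_batch_size:,} samples seen")
```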