Chain-of-Thought Improves Text Generation with Citations in Large Language Models

Authors: Bin Ji, Huijun Liu, Mingzhe Du, See-Kiong Ng

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on the ALCE benchmark with six open-source LLMs. Experimental results demonstrate that: (1) the CoT prompting strategy significantly improves the quality of text generation with citations; (2) the Citation Insurance Mechanism delivers impressive gains in citation quality at a low cost; (3) our best approach performs comparably to previous best ChatGPT-based baselines.
Researcher Affiliation | Academia | Bin Ji, Huijun Liu*, Mingzhe Du, See-Kiong Ng; National University of Singapore; {jibin, mingzhe, seekiong}@nus.edu.sg, liuhuijun01@gmail.com
Pseudocode | Yes | Algorithm 1: Citation Insurance Mechanism (a hedged sketch of this mechanism appears after the table)
Open Source Code | Yes | The Appendix section can be found in the full paper at https://github.com/jibin5167/ALCE-CoT.
Open Datasets | Yes | The ALCE Benchmark: ALCE (Gao et al. 2023) is the first reproducible benchmark for automatically evaluating LLMs' text generation with citations, and it allows multiple citations for individual statements. It includes three datasets: ASQA (Stelmakh et al. 2022), QAMPARI (Rubin et al. 2022), and ELI5 (Fan et al. 2019).
Dataset Splits | Yes | The ALCE Benchmark: ALCE (Gao et al. 2023) is the first reproducible benchmark for automatically evaluating LLMs' text generation with citations... It pre-defines three automatic evaluation metrics: Fluency, Correctness (Correct.), and Citation Quality.
Hardware Specification | Yes | We use four NVIDIA A100 40GB GPUs to evaluate our approach. Specifically, we use one GPU to run 13B LLMs, two GPUs to run 33B LLMs, and four GPUs to run the 70B LLM. (A loading sketch appears after the table.)
Software Dependencies | No | The paper mentions using open-source LLMs such as LLaMA-2 and IR methods such as GTR and BM25, but it does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) required for replication.
Experiment Setup | Yes | For all experiments, we set the seed to 42, which is the default setting of ALCE. For each prompting strategy, we evaluate our approach on the six LLMs with temperature values of 0.001, 0.1, 0.3, 0.5, 0.7, 0.9, and 1. For each type of experiment, we average the results across the temperature settings and report the averaged performance. (A sweep-and-average sketch appears after the table.)
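
On the Pseudocode row: the paper only names Algorithm 1 (Citation Insurance Mechanism), so the following is a minimal Python sketch of one plausible citation-insurance step, assuming the mechanism appends a citation to the most similar retrieved passage whenever a generated statement lacks one. The TF-IDF similarity, the `[k]` citation format, and the name `insure_citations` are illustrative assumptions, not the paper's exact algorithm.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def insure_citations(statements, passages):
    """For each statement without a [k] citation, cite the most similar
    retrieved passage. TF-IDF cosine similarity is an assumption here;
    the paper's Algorithm 1 may score statement-passage similarity
    differently."""
    vectorizer = TfidfVectorizer().fit(passages + statements)
    passage_vecs = vectorizer.transform(passages)
    insured = []
    for stmt in statements:
        if re.search(r"\[\d+\]", stmt):  # statement is already cited
            insured.append(stmt)
            continue
        sims = cosine_similarity(vectorizer.transform([stmt]), passage_vecs)[0]
        best = int(sims.argmax())               # most similar passage index
        insured.append(f"{stmt} [{best + 1}]")  # ALCE-style 1-based citation
    return insured
```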
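On the Hardware Specification row: the paper states the GPU counts per model size but not its loading code. Below is a minimal sketch of one common way to realize that allocation with Hugging Face Transformers; the checkpoint name and the fp16 choice are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM

# The paper reports one A100 40GB for 13B LLMs, two for 33B, and four for
# the 70B LLM. One way to mirror this (an assumption, not the paper's
# code) is to expose the matching GPUs, e.g.
#   CUDA_VISIBLE_DEVICES=0,1,2,3
# and let accelerate shard the weights across whatever is visible:
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-chat-hf",  # assumed checkpoint name
    torch_dtype=torch.float16,         # fp16 so 70B fits on 4 x 40GB
    device_map="auto",                 # shard layers across visible GPUs
)
```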
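On the Experiment Setup row: a minimal sketch of the reported protocol (seed 42, seven temperature values, mean over temperatures). The `evaluate(llm, temperature, seed)` interface is hypothetical, not part of the ALCE codebase.

```python
import numpy as np

SEED = 42  # ALCE's default seed, as reported in the paper
TEMPERATURES = [0.001, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]

def averaged_score(evaluate, llm):
    """Run one evaluation per temperature and report the mean, mirroring
    the paper's averaging protocol. `evaluate` is a hypothetical callable
    that returns a single metric value for one run."""
    scores = [evaluate(llm, temperature=t, seed=SEED) for t in TEMPERATURES]
    return float(np.mean(scores))
```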