Chain-of-Thought Improves Text Generation with Citations in Large Language Models
Authors: Bin Ji, Huijun Liu, Mingzhe Du, See-Kiong Ng
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on the ALCE benchmark with six open-source LLMs. Experimental results demonstrate that: (1) the CoT prompting strategy significantly improves the quality of text generation with citations; (2) the Citation Insurance Mechanism delivers impressive gains in citation quality at a low cost; (3) our best approach performs comparably to previous best ChatGPT-based baselines. |
| Researcher Affiliation | Academia | Bin Ji, Huijun Liu*, Mingzhe Du, See-Kiong Ng National University of Singapore {jibin, mingzhe, seekiong}@nus.edu.sg, liuhuijun01@gmail.com |
| Pseudocode | Yes | Algorithm 1: Citation Insurance Mechanism (a hedged sketch of such a mechanism follows the table) |
| Open Source Code | Yes | The Appendix section can be found in the full paper at https://github.com/jibin5167/ALCE-CoT. |
| Open Datasets | Yes | The ALCE Benchmark ALCE (Gao et al. 2023) is the first reproducible benchmark for automatically evaluating LLMs' text generation with citations and allows for multiple citations for individual statements. It includes three datasets, i.e., ASQA (Stelmakh et al. 2022), QAMPARI (Rubin et al. 2022), and ELI5 (Fan et al. 2019). |
| Dataset Splits | Yes | The ALCE Benchmark ALCE (Gao et al. 2023) is the first reproducible benchmark for automatically evaluating LLMs' text generation with citations... It pre-defines three automatic evaluation metrics, i.e., Fluency, Correctness (Correct.), and Citation Quality. |
| Hardware Specification | Yes | We use four NVIDIA A100 40GB GPUs to evaluate our approach. Specifically, we use one GPU to run 13B LLMs, two GPUs to run 33B LLMs, and four GPUs to run the 70B LLM. (A multi-GPU loading sketch follows the table.) |
| Software Dependencies | No | The paper mentions using open-source LLMs like LLaMA-2 and IR methods like GTR and BM25, but it does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) required for replication. |
| Experiment Setup | Yes | For all experiments, we set the seed to 42, which is the default setting of ALCE. For each prompting strategy, we evaluate our approach on the six LLMs by setting the temperature value to 0.001, 0.1, 0.3, 0.5, 0.7, 0.9, and 1, respectively. For each type of experiment, we average the results of different temperature settings and report the averaged performance. (A sketch of this protocol follows the table.) |
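
The Pseudocode row points to Algorithm 1 (Citation Insurance Mechanism), which is not reproduced here. Below is a minimal illustrative sketch of what such a mechanism could look like, assuming it scans generated statements for missing ALCE-style `[k]` citation markers and appends the index of the best-matching retrieved passage. The regex, the `score` callable, and the appending behavior are all assumptions for illustration, not the authors' Algorithm 1.

```python
import re

CITE_PATTERN = re.compile(r"\[\d+\]")  # ALCE-style citation markers like [1], [2]

def insure_citations(statements, passages, score):
    """Hypothetical citation-insurance pass: for every generated statement
    that carries no citation marker, append the index of the retrieved
    passage that a scoring function deems most relevant.

    statements: list[str]  -- sentences produced by the LLM
    passages:   list[str]  -- retrieved candidate documents
    score:      callable   -- score(statement, passage) -> float (assumed)
    """
    insured = []
    for stmt in statements:
        if not CITE_PATTERN.search(stmt):
            # Pick the passage the scorer considers most relevant (the real
            # Algorithm 1 may use NLI or retrieval similarity instead).
            best = max(range(len(passages)), key=lambda i: score(stmt, passages[i]))
            stmt = f"{stmt.rstrip('.')} [{best + 1}]."
        insured.append(stmt)
    return insured
```

In ALCE's citation format, `[k]` refers to the k-th retrieved passage, which is why the sketch appends a 1-based index.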
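The Hardware Specification row maps model size to GPU count (one A100 40GB for 13B models, two for 33B, four for the 70B model). The paper does not show loading code; the following is a minimal sketch of how such sharding is commonly done with Hugging Face `transformers`, assuming fp16 weights and `device_map="auto"` to spread layers across the visible GPUs. The model name is illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Restrict visible GPUs per model size before launching, e.g.:
#   13B -> CUDA_VISIBLE_DEVICES=0
#   33B -> CUDA_VISIBLE_DEVICES=0,1
#   70B -> CUDA_VISIBLE_DEVICES=0,1,2,3
model_name = "meta-llama/Llama-2-70b-hf"  # illustrative; the paper evaluates six LLMs

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # fp16 is an assumption; it fits 70B on 4x40GB
    device_map="auto",          # shard layers across visible GPUs (needs `accelerate`)
)
```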
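The Experiment Setup row fixes the seed to 42 and sweeps seven temperatures, averaging results across them. A minimal sketch of that protocol is below; `run_alce_eval` is a hypothetical stand-in for the benchmark's evaluation driver, not an actual ALCE API.

```python
import random

import numpy as np
import torch

SEED = 42  # ALCE's default seed, per the paper
TEMPERATURES = [0.001, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]

def set_seed(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def averaged_score(run_alce_eval, model, temperatures=TEMPERATURES):
    """Evaluate once per temperature and report the mean, mirroring the
    paper's protocol. `run_alce_eval(model, temperature)` -> float is a
    hypothetical stand-in for the actual benchmark call."""
    scores = []
    for t in temperatures:
        set_seed(SEED)  # re-seed so runs differ only in temperature
        scores.append(run_alce_eval(model, temperature=t))
    return sum(scores) / len(scores)
```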