Chain-of-Thought Improves Text Generation with Citations in Large Language Models
Authors: Bin Ji, Huijun Liu, Mingzhe Du, See-Kiong Ng
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on the ALCE benchmark with six open-source LLMs. Experimental results demonstrate that: (1) the CoT prompting strategy significantly improves the quality of text generation with citations; (2) the Citation Insurance Mechanism delivers impressive gains in citation quality at a low cost; (3) our best approach performs comparably to previous best ChatGPT-based baselines. |
| Researcher Affiliation | Academia | Bin Ji, Huijun Liu*, Mingzhe Du, See-Kiong Ng National University of Singapore {jibin, mingzhe, seekiong}@nus.edu.sg, liuhuijun01@gmail.com |
| Pseudocode | Yes | Algorithm 1: Citation Insurance Mechanism (a hedged sketch of such a mechanism follows the table) |
| Open Source Code | Yes | The Appendix section can be found in the full paper at https://github.com/jibin5167/ALCE-CoT. |
| Open Datasets | Yes | The ALCE Benchmark ALCE (Gao et al. 2023) is the first reproducible benchmark for automatically evaluating LLMs' text generation with citations and allows for multiple citations for individual statements. It includes three datasets, i.e., ASQA (Stelmakh et al. 2022), QAMPARI (Rubin et al. 2022), and ELI5 (Fan et al. 2019). |
| Dataset Splits | Yes | The ALCE Benchmark ALCE (Gao et al. 2023) is the first reproducible benchmark for automatically evaluating LLMs' text generation with citations... It pre-defines three automatic evaluation metrics, i.e., Fluency, Correctness (Correct.), and Citation Quality. |
| Hardware Specification | Yes | We use four NVIDIA A100 40GB GPUs to evaluate our approach. Specifically, we use one GPU to run 13B LLMs, two GPUs to run 33B LLMs, and four GPUs to run the 70B LLM. (A multi-GPU loading sketch follows the table.) |
| Software Dependencies | No | The paper mentions using open-source LLMs like LLaMA-2 and IR methods like GTR and BM25, but it does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) required for replication. |
| Experiment Setup | Yes | For all experiments, we set the seed to 42, which is the default setting of ALCE. For each prompting strategy, we evaluate our approach on the six LLMs by setting the temperature value to 0.001, 0.1, 0.3, 0.5, 0.7, 0.9, and 1, respectively. For each type of experiment, we average the results of different temperature settings and report the averaged performance. (A sketch of this protocol follows the table.) |
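
The Pseudocode row points to Algorithm 1 (Citation Insurance Mechanism), which is not reproduced here. Below is a minimal illustrative sketch of what such a mechanism could look like, assuming it scans generated statements for missing ALCE-style `[k]` citation markers and appends the index of the best-matching retrieved passage. The regex, the `score` callable, and the appending behavior are all assumptions for illustration, not the authors' Algorithm 1.

```python
import re

CITE_PATTERN = re.compile(r"\[\d+\]")  # ALCE-style citation markers like [1], [2]

def insure_citations(statements, passages, score):
    """Hypothetical citation-insurance pass: for every generated statement
    that carries no citation marker, append the index of the retrieved
    passage that a scoring function deems most relevant.

    statements: list[str]  -- sentences produced by the LLM
    passages:   list[str]  -- retrieved candidate documents
    score:      callable   -- score(statement, passage) -> float (assumed)
    """
    insured = []
    for stmt in statements:
        if not CITE_PATTERN.search(stmt):
            # Pick the passage the scorer considers most relevant (the real
            # Algorithm 1 may use NLI or retrieval similarity instead).
            best = max(range(len(passages)), key=lambda i: score(stmt, passages[i]))
            stmt = f"{stmt.rstrip('.')} [{best + 1}]."
        insured.append(stmt)
    return insured
```

In ALCE's citation format, `[k]` refers to the k-th retrieved passage, which is why the sketch appends a 1-based index.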
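The Hardware Specification row maps model size to GPU count (one A100 40GB for 13B models, two for 33B, four for the 70B model). The paper does not show loading code; the following is a minimal sketch of how such sharding is commonly done with Hugging Face `transformers`, assuming fp16 weights and `device_map="auto"` to spread layers across the visible GPUs. The model name is illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Restrict visible GPUs per model size before launching, e.g.:
#   13B -> CUDA_VISIBLE_DEVICES=0
#   33B -> CUDA_VISIBLE_DEVICES=0,1
#   70B -> CUDA_VISIBLE_DEVICES=0,1,2,3
model_name = "meta-llama/Llama-2-70b-hf"  # illustrative; the paper evaluates six LLMs

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # fp16 is an assumption; it fits 70B on 4x40GB
    device_map="auto",          # shard layers across visible GPUs (needs `accelerate`)
)
```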
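The Experiment Setup row fixes the seed to 42 and sweeps seven temperatures, averaging results across them. A minimal sketch of that protocol is below; `run_alce_eval` is a hypothetical stand-in for the benchmark's evaluation driver, not an actual ALCE API.

```python
import random

import numpy as np
import torch

SEED = 42  # ALCE's default seed, per the paper
TEMPERATURES = [0.001, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]

def set_seed(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def averaged_score(run_alce_eval, model, temperatures=TEMPERATURES):
    """Evaluate once per temperature and report the mean, mirroring the
    paper's protocol. `run_alce_eval(model, temperature)` -> float is a
    hypothetical stand-in for the actual benchmark call."""
    scores = []
    for t in temperatures:
        set_seed(SEED)  # re-seed so runs differ only in temperature
        scores.append(run_alce_eval(model, temperature=t))
    return sum(scores) / len(scores)
```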