Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Mind the Quote: Enabling Quotation-Aware Dialogue in LLMs via Plug-and-Play Modules

Authors: Yueqi Zhang, Peiwen Yuan, Yiwei Li, Shaoxiong Feng, Xinglin Wang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments across models show that QuAda is suitable for all scenarios and generalizes to unseen topics, offering an effective, plug-and-play solution for quotation-aware dialogue.
Researcher Affiliation	Collaboration	Yueqi Zhang¹, Peiwen Yuan¹, Yiwei Li¹, Shaoxiong Feng², Xinglin Wang¹, Jiayi Shi¹, Chuyi Tan¹, Boyuan Pan², Yao Hu², Kan Li¹. ¹School of Computer Science and Technology, Beijing Institute of Technology; ²Xiaohongshu Inc
Pseudocode	No	The paper describes the QUADA method using mathematical equations and text, but it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code	Yes	Our code is available at https://github.com/marvelcell/Mindthe Quote
Open Datasets	Yes	For the Coref scenario, we directly utilize the gold pronoun antecedent spans from the CONLL-2012 corpus [Pradhan et al., 2012], embedding them verbatim into H to keep offsets valid.
Dataset Splits	Yes	Table 3: Number of samples per scenario, given as MCQ / Open-Ended / Total. Training set: Base 2200 / 2200 / 4400; Multi-Span 2400 / 2400 / 4800; Exclude 2400 / 2400 / 4800; Info-Combine 2300 / 2300 / 4600; Coref 2200 / 2200 / 4400. Benchmark: Base 500 / 500 / 1000; Multi-Span 500 / 500 / 1000; Exclude 500 / 500 / 1000; Info-Combine 500 / 500 / 1000; Coref 500 / 500 / 1000.
Hardware Specification	Yes	Both training and inference were conducted on eight H20 GPUs (96 GB each).
Software Dependencies	Yes	We adopt two instruction-tuned LLMs with different scales and architectures: QWEN2.5-3B-INSTRUCT [Yang et al., 2024] and LLAMA-3.1-8B-INSTRUCT [Grattafiori et al., 2024].
Experiment Setup	Yes	For QUADA, we set the query and value side bottleneck width to r = 256, which introduces 75 M trainable parameters on Qwen (2.8% of the model) and 130 M on Llama (1.6%). All backbone weights are frozen. ... Training Qwen-2.5-3B-Instruct for three epochs completed in 1 h 25 min on the same 8 H20 setup... The inference temperature was set to 1.