Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
A Controllable Model of Grounded Response Generation
Authors: Zeqiu Wu, Michel Galley, Chris Brockett, Yizhe Zhang, Xiang Gao, Chris Quirk, Rik Koncel-Kedziorski, Jianfeng Gao, Hannaneh Hajishirzi, Mari Ostendorf, Bill Dolan
Pages: 14085-14093
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Quantitative and qualitative results show that, using this framework, a transformer based model with a novel inductive attention mechanism, trained on a conversation-like Reddit dataset, outperforms strong generation baselines. We show through qualitative and quantitative evaluations that CGRG outperforms strong baselines where: a) the control phrases are provided by a (simulated) user, and b) automatically extracted by a control phrase prediction model. |
| Researcher Affiliation | Collaboration | 1University of Washington, Seattle, WA, USA 2Microsoft Research, Redmond, WA, USA 3Allen Institute for AI, Seattle, WA, USA |
| Pseudocode | No | The paper describes mathematical formulas and procedural steps in prose but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | To further facilitate reproducibility, we release our data preparation and modeling code at https://github.com/ellenmellon/CGRG. |
| Open Datasets | Yes | We start with the grounded Reddit conversation dataset described in Qin et al. (2019). This dataset is a filtered version of (Qin et al. 2019)'s public dataset. |
| Dataset Splits | Yes | The number of utterances of train, dev and test are 390K, 6.7K and 21K, respectively. |
| Hardware Specification | Yes | Each training process is run on 2 Tesla K-80 nodes. |
| Software Dependencies | No | The paper mentions using the "GPT-2" and "DialoGPT" models/architectures, but does not provide version numbers for any underlying software dependencies (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | We set the maximum number of sentences in GC to be 20 and maximum number of phrases in C to be 10, then we have 0 for X; 1-20 for GC; 21-30 for C and 31 for R tokens as type embedding. We use the small version of GPT-2 with 117M parameters, with the maximum length of the input or target response sequence to be 512. We use batch size 32. Learning rate (1e-5) and warmup steps (1600) are tuned on the dev set perplexity, with all other parameters being the same as DialoGPT. |
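
The type-embedding scheme quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. It is not taken from the released CGRG code: the function name `build_type_ids`, the constant names, the `TRAINING_CONFIG` dictionary, and the segment order (X, then GC, then C, then R) are assumptions made for illustration. Only the numeric ranges and hyperparameter values come from the quoted text.

```python
from typing import Dict, List, Sequence

# Values quoted in the Experiment Setup row.
MAX_GC_SENTENCES = 20   # grounding sentences get type ids 1-20
MAX_C_PHRASES = 10      # control phrases get type ids 21-30
X_TYPE_ID = 0           # dialogue context tokens
GC_FIRST_TYPE_ID = 1
C_FIRST_TYPE_ID = 21
R_TYPE_ID = 31          # response tokens

# Hyperparameters as quoted; the key names are hypothetical.
TRAINING_CONFIG: Dict[str, float] = {
    "batch_size": 32,
    "learning_rate": 1e-5,
    "warmup_steps": 1600,
    "max_sequence_length": 512,
}


def build_type_ids(
    x_tokens: Sequence[str],
    gc_sentences: Sequence[Sequence[str]],
    c_phrases: Sequence[Sequence[str]],
    r_tokens: Sequence[str],
) -> List[int]:
    """Assign one type id per token, assuming the order X, GC, C, R.

    Sentences and phrases beyond the quoted maxima are dropped here;
    how the original code handles overflow is not stated in the source.
    """
    type_ids: List[int] = [X_TYPE_ID] * len(x_tokens)
    for i, sentence in enumerate(gc_sentences[:MAX_GC_SENTENCES]):
        type_ids.extend([GC_FIRST_TYPE_ID + i] * len(sentence))
    for j, phrase in enumerate(c_phrases[:MAX_C_PHRASES]):
        type_ids.extend([C_FIRST_TYPE_ID + j] * len(phrase))
    type_ids.extend([R_TYPE_ID] * len(r_tokens))
    return type_ids


example = build_type_ids(
    x_tokens=["hi", "there"],
    gc_sentences=[["a", "b"], ["c"]],
    c_phrases=[["p"]],
    r_tokens=["r1", "r2"],
)
```

In this sketch each grounding sentence and each control phrase receives its own id, so the model can tell segments apart; the ranges 1-20 and 21-30 follow directly from the maxima of 20 sentences and 10 phrases.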