Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
MultiTalk: A Highly-Branching Dialog Testbed for Diverse Conversations
Authors: Yao Dou, Maxwell Forbes, Ari Holtzman, Yejin Choi (pp. 12760-12767)
AAAI 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments and Results |
| Researcher Affiliation | Collaboration | 1 University of Washington, 2 Allen Institute for AI |
| Pseudocode | No | The paper describes algorithms and models conceptually but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states 'We collect and release [1] a large dataset of highly-branching written conversations. [1] https://uwnlp.github.io/multitalk/' but this link is given for the dataset, not explicitly for open-source code of the methodology itself. |
| Open Datasets | Yes | We collect and release [1] a large dataset of highly-branching written conversations. The dataset contains 320,804 individual responses in a conversation tree. [...] [1] https://uwnlp.github.io/multitalk/ |
| Dataset Splits | No | The paper mentions a 'validation set' in the context of an 'oracle' baseline (Table 6) but does not explicitly provide specific dataset split information (percentages, sample counts, or detailed methodology) for train/validation/test splits needed for reproduction. |
| Hardware Specification | No | The paper mentions 'available resources' for training but does not provide specific hardware details such as GPU/CPU models or memory amounts. |
| Software Dependencies | No | The paper mentions several software components like BERT-Large, GPT-2, SciPy, and GloVe, but it does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | At inference time, we sample from all models using top-p sampling with p = 0.9 (Holtzman et al. 2019). [...] To prevent biasing the language model to utterances higher in the dialog tree, we compute loss for the model only for tokens in the final utterance. [...] for the theory of mind task, and set γ = 0 to account only for the emotion of a response's immediate children. [...] We fine-tune GPT-2 M (345M params.). |
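
Two of the settings quoted in the Experiment Setup row, top-p sampling with p = 0.9 and computing loss only on the final utterance, can be illustrated with a minimal sketch. This is not the authors' code: the function names (`top_p_filter`, `sample_top_p`, `final_utterance_loss`), the plain-list inputs, and the use of a mean over tokens are all assumptions made for illustration, and the paper's models operate on GPT-2 outputs rather than hand-written probability lists.

```python
import random


def top_p_filter(probs, p=0.9):
    """Keep the smallest set of highest-probability tokens whose cumulative
    mass reaches p, and renormalize them (hypothetical helper)."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {index: prob / total for index, prob in kept}


def sample_top_p(probs, p=0.9, rng=random):
    """Draw one token index from the renormalized top-p set."""
    nucleus = top_p_filter(probs, p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


def final_utterance_loss(token_log_probs, final_start):
    """Mean negative log-likelihood over tokens at positions >= final_start.

    Tokens before final_start (the earlier dialog context) are ignored, which
    is one assumed way to compute loss only on the final utterance.
    """
    scored = token_log_probs[final_start:]
    if not scored:
        raise ValueError("final utterance contains no tokens")
    return -sum(scored) / len(scored)


# Toy usage: a 4-token vocabulary and a 4-token sequence whose last two
# tokens form the final utterance.
toy_probs = [0.5, 0.3, 0.15, 0.05]
nucleus = top_p_filter(toy_probs, p=0.9)
token = sample_top_p(toy_probs, p=0.9, rng=random.Random(0))
loss = final_utterance_loss([-1.0, -2.0, -0.5, -1.5], final_start=2)
```

In the toy distribution, the lowest-probability token falls outside the top-p set and can never be sampled, and the loss depends only on the last two log-probabilities. In a framework such as PyTorch the same masking is commonly done by setting context-token labels to an ignored index, though the paper excerpt does not say how the authors implemented it.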