Paragraph-level Commonsense Transformers with Recurrent Memory

Authors: Saadia Gabriel, Chandra Bhagavatula, Vered Shwartz, Ronan Le Bras, Maxwell Forbes, Yejin Choi

AAAI 2021, pp. 12857-12865

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show that PARA-COMET outperforms the sentence-level baselines, particularly in generating inferences that are both coherent and novel. We show that PARA-COMET generates coherent discourse-aware inferences and performs better than discourse-agnostic baselines in both automated and manual evaluation.
Researcher Affiliation | Collaboration | (1) Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, USA; (2) Allen Institute for Artificial Intelligence, Seattle, USA. Emails: {skgabrie, mbforbes, yejin}@cs.washington.edu, {chandrab, vereds, ronanlb}@allenai.org
Pseudocode | No | The paper describes the model's architecture and steps (e.g., the "Memory-augmented model" section with equations and descriptions) but does not provide any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and data is available at https://github.com/skgabriel/paracomet.
Open Datasets | Yes | The basis for our dataset are English stories from the ROCStories corpus (Mostafazadeh et al. 2016).
Dataset Splits | Yes | We split the original ROCStories train set into train, dev, and test sets in a 90/5/5 ratio. (A minimal split sketch follows the table.)
Hardware Specification | Yes | For both training and decoding, all experiments are run using 64 Intel(R) Xeon(R) Gold 6130 x86-64 CPUs at 2.10GHz and a Quadro RTX 8000 GPU.
Software Dependencies | No | The paper states: "All models are implemented using the Transformers package (Wolf et al. 2020)," but it does not provide specific version numbers for this package or any other software dependencies.
Experiment Setup | Yes | All models are implemented using the Transformers package (Wolf et al. 2020), and trained for a maximum of 20 epochs. Training is performed using an Adam optimizer with linear warmup (Kingma and Ba 2015). We also simulate a batch size of 16 using gradient accumulation and an actual batch size of 4. The learning rate is 2 × 10⁻⁵ for GPT2. For GPT we use a learning rate of 6.25 × 10⁻⁵. All other hyperparameters follow (Radford et al. 2019; Radford 2018). We retrieve the top k = 1 inferences from memory. We use the 124M parameter version of the GPT2 model. (A hedged configuration sketch follows the table.)
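The Dataset Splits row reports a 90/5/5 split of the original ROCStories train set. The sketch below is a minimal illustration of such a split, not the authors' released code; the file name ROCStories_train.csv, its CSV layout, and the random seed are assumptions.

```python
# Hypothetical 90/5/5 split of the ROCStories train set into train/dev/test.
# File name, CSV layout, and seed are assumptions, not taken from the paper.
import csv
import random

def split_rocstories(path="ROCStories_train.csv", seed=0):
    with open(path, newline="", encoding="utf-8") as f:
        stories = list(csv.DictReader(f))
    random.Random(seed).shuffle(stories)
    n_train = int(0.90 * len(stories))
    n_dev = int(0.05 * len(stories))
    train = stories[:n_train]
    dev = stories[n_train:n_train + n_dev]
    test = stories[n_train + n_dev:]
    return train, dev, test
```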
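The Experiment Setup row lists enough hyperparameters to outline the training configuration. The snippet below is a rough sketch under the assumption of a standard PyTorch + Hugging Face Transformers setup; the dataloader, the PARA-COMET memory-retrieval component, and the number of warmup steps are not given in the excerpt and appear here only as placeholders.

```python
# Sketch of the reported configuration: GPT-2 (124M), Adam with linear warmup,
# actual batch size 4 with 4 accumulation steps (simulated batch size 16),
# learning rate 2e-5, at most 20 epochs. Dataloader construction and the
# recurrent-memory retrieval of PARA-COMET are omitted.
import torch
from transformers import GPT2LMHeadModel, get_linear_schedule_with_warmup

MAX_EPOCHS = 20        # "trained for a maximum of 20 epochs"
BATCH_SIZE = 4         # actual batch size
ACCUM_STEPS = 4        # 4 x 4 = simulated batch size of 16
LEARNING_RATE = 2e-5   # reported for GPT-2 (6.25e-5 for GPT)
WARMUP_STEPS = 100     # assumption: not reported in the excerpt

model = GPT2LMHeadModel.from_pretrained("gpt2")  # 124M-parameter checkpoint
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

def train(dataloader):
    """Standard language-modeling loop with gradient accumulation."""
    total_updates = MAX_EPOCHS * len(dataloader) // ACCUM_STEPS
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=total_updates)
    model.train()
    for epoch in range(MAX_EPOCHS):
        for step, batch in enumerate(dataloader):
            # batch is assumed to hold input_ids / attention_mask tensors
            outputs = model(**batch, labels=batch["input_ids"])
            (outputs.loss / ACCUM_STEPS).backward()  # accumulate gradients
            if (step + 1) % ACCUM_STEPS == 0:
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()
```

Note that the top-k memory retrieval (k = 1) reported in the setup belongs to PARA-COMET's forward pass and is not reflected in this outline.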