Paragraph-level Commonsense Transformers with Recurrent Memory

Authors: Saadia Gabriel, Chandra Bhagavatula, Vered Shwartz, Ronan Le Bras, Maxwell Forbes, Yejin Choi

AAAI 2021, pp. 12857-12865

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show that PARA-COMET outperforms the sentence-level baselines, particularly in generating inferences that are both coherent and novel. We show that PARA-COMET generates coherent discourse-aware inferences and performs better than discourse-agnostic baselines in both automated and manual evaluation.
Researcher Affiliation | Collaboration | (1) Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, USA; (2) Allen Institute for Artificial Intelligence, Seattle, USA. Emails: {skgabrie, mbforbes, yejin}@cs.washington.edu, {chandrab, vereds, ronanlb}@allenai.org
Pseudocode | No | The paper describes the model's architecture and steps (e.g., the "Memory-augmented model" section with equations and descriptions) but does not provide any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and data is available at https://github.com/skgabriel/paracomet.
Open Datasets | Yes | The basis for our dataset are English stories from the ROCStories corpus (Mostafazadeh et al. 2016).
Dataset Splits | Yes | We split the original ROCStories train set into train, dev, and test sets in a 90/5/5 ratio. (A minimal split sketch follows the table.)
Hardware Specification | Yes | For both training and decoding, all experiments are run using 64 Intel(R) Xeon(R) Gold 6130 x86-64 CPUs at 2.10GHz and a Quadro RTX 8000 GPU.
Software Dependencies | No | The paper states: "All models are implemented using the Transformers package (Wolf et al. 2020)," but it does not provide specific version numbers for this package or any other software dependencies.
Experiment Setup | Yes | All models are implemented using the Transformers package (Wolf et al. 2020), and trained for a maximum of 20 epochs. Training is performed using an Adam optimizer with linear warmup (Kingma and Ba 2015). We also simulate a batch size of 16 using gradient accumulation and an actual batch size of 4. The learning rate is 2 × 10⁻⁵ for GPT2. For GPT we use a learning rate of 6.25 × 10⁻⁵. All other hyperparameters follow (Radford et al. 2019; Radford 2018). We retrieve the top k = 1 inferences from memory. We use the 124M parameter version of the GPT2 model. (A hedged configuration sketch follows the table.)
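The Dataset Splits row reports a 90/5/5 split of the original ROCStories train set. The sketch below is a minimal illustration of such a split, not the authors' released code; the file name ROCStories_train.csv, its CSV layout, and the random seed are assumptions.

```python
# Hypothetical 90/5/5 split of the ROCStories train set into train/dev/test.
# File name, CSV layout, and seed are assumptions, not taken from the paper.
import csv
import random

def split_rocstories(path="ROCStories_train.csv", seed=0):
    with open(path, newline="", encoding="utf-8") as f:
        stories = list(csv.DictReader(f))
    random.Random(seed).shuffle(stories)
    n_train = int(0.90 * len(stories))
    n_dev = int(0.05 * len(stories))
    train = stories[:n_train]
    dev = stories[n_train:n_train + n_dev]
    test = stories[n_train + n_dev:]
    return train, dev, test
```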
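The Experiment Setup row lists enough hyperparameters to outline the training configuration. The snippet below is a rough sketch under the assumption of a standard PyTorch + Hugging Face Transformers setup; the dataloader, the PARA-COMET memory-retrieval component, and the number of warmup steps are not given in the excerpt and appear here only as placeholders.

```python
# Sketch of the reported configuration: GPT-2 (124M), Adam with linear warmup,
# actual batch size 4 with 4 accumulation steps (simulated batch size 16),
# learning rate 2e-5, at most 20 epochs. Dataloader construction and the
# recurrent-memory retrieval of PARA-COMET are omitted.
import torch
from transformers import GPT2LMHeadModel, get_linear_schedule_with_warmup

MAX_EPOCHS = 20        # "trained for a maximum of 20 epochs"
BATCH_SIZE = 4         # actual batch size
ACCUM_STEPS = 4        # 4 x 4 = simulated batch size of 16
LEARNING_RATE = 2e-5   # reported for GPT-2 (6.25e-5 for GPT)
WARMUP_STEPS = 100     # assumption: not reported in the excerpt

model = GPT2LMHeadModel.from_pretrained("gpt2")  # 124M-parameter checkpoint
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

def train(dataloader):
    """Standard language-modeling loop with gradient accumulation."""
    total_updates = MAX_EPOCHS * len(dataloader) // ACCUM_STEPS
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=total_updates)
    model.train()
    for epoch in range(MAX_EPOCHS):
        for step, batch in enumerate(dataloader):
            # batch is assumed to hold input_ids / attention_mask tensors
            outputs = model(**batch, labels=batch["input_ids"])
            (outputs.loss / ACCUM_STEPS).backward()  # accumulate gradients
            if (step + 1) % ACCUM_STEPS == 0:
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()
```

Note that the top-k memory retrieval (k = 1) reported in the setup belongs to PARA-COMET's forward pass and is not reflected in this outline.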