Paragraph-level Commonsense Transformers with Recurrent Memory
Authors: Saadia Gabriel, Chandra Bhagavatula, Vered Shwartz, Ronan Le Bras, Maxwell Forbes, Yejin Choi (pp. 12857–12865)
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show that PARA-COMET outperforms the sentence-level baselines, particularly in generating inferences that are both coherent and novel. We show that PARA-COMET generates coherent discourse-aware inferences and performs better than discourse-agnostic baselines in both automated and manual evaluation. |
| Researcher Affiliation | Collaboration | 1Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, USA; 2Allen Institute for Artificial Intelligence, Seattle, USA. {skgabrie, mbforbes, yejin}@cs.washington.edu, {chandrab, vereds, ronanlb}@allenai.org |
| Pseudocode | No | The paper describes the model's architecture and steps (e.g., "Memory-augmented model" section with equations and descriptions) but does not provide any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and data is available at https://github.com/skgabriel/paracomet. |
| Open Datasets | Yes | The basis for our dataset are English stories from the ROCStories corpus (Mostafazadeh et al. 2016) |
| Dataset Splits | Yes | We split the original ROCStories train set into train, dev, and test sets in a 90/5/5 ratio. (A minimal split sketch appears below the table.) |
| Hardware Specification | Yes | For both training and decoding, all experiments are run using 64 Intel(R) Xeon(R) Gold 6130 x86-64 CPUs at 2.10GHz and a Quadro RTX 8000 GPU. |
| Software Dependencies | No | The paper states: "All models are implemented using the Transformers package (Wolf et al. 2020)," but it does not provide specific version numbers for this package or any other software dependencies. |
| Experiment Setup | Yes | All models are implemented using the Transformers package (Wolf et al. 2020), and trained for a maximum of 20 epochs. Training is performed using an Adam optimizer with linear warmup (Kingma and Ba 2015). We also simulate a batch size of 16 using gradient accumulation and an actual batch size of 4. The learning rate is 2 × 10⁻⁵ for GPT2. For GPT we use a learning rate of 6.25 × 10⁻⁵. All other hyperparameters follow (Radford et al. 2019; Radford 2018). We retrieve the top k = 1 inferences from memory. We use the 124M parameter version of the GPT2 model. (A hedged training-configuration sketch also appears below the table.) |
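
For the 90/5/5 split reported in the Dataset Splits row, the following is a minimal sketch. The use of scikit-learn's `train_test_split` and the fixed random seed are illustrative assumptions, not details taken from the paper or the released code.

```python
# Hedged sketch of a 90/5/5 split of the ROCStories train set.
# scikit-learn and the seed value are assumptions for illustration only.
from sklearn.model_selection import train_test_split

def split_rocstories(stories, seed=42):
    """Split a list of ROCStories examples into 90% train, 5% dev, 5% test."""
    train, heldout = train_test_split(stories, test_size=0.10, random_state=seed)
    dev, test = train_test_split(heldout, test_size=0.50, random_state=seed)
    return train, dev, test
```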
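The Experiment Setup row can also be read as a training configuration. The sketch below uses the Hugging Face Transformers and PyTorch APIs to reflect the reported settings (GPT-2 124M, Adam with linear warmup, learning rate 2 × 10⁻⁵, effective batch size 16 via gradient accumulation over actual batches of 4, up to 20 epochs). The warmup length, the toy placeholder data, and the loop structure are assumptions for illustration; the authors' actual implementation is in the repository linked above.

```python
# A minimal sketch of the reported fine-tuning setup, not the authors' implementation.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import GPT2LMHeadModel, GPT2Tokenizer, get_linear_schedule_with_warmup

model = GPT2LMHeadModel.from_pretrained("gpt2")      # the 124M-parameter GPT2
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Placeholder data so the sketch is self-contained; the real data is built from ROCStories.
texts = ["A short example story, used only to make this sketch runnable."] * 8
enc = tokenizer(texts, return_tensors="pt", padding=True)
train_loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"]),
                          batch_size=4, shuffle=True)          # actual batch size of 4

epochs, accum_steps = 20, 4                                    # 4 * 4 = simulated batch of 16
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)      # lr for GPT2 (6.25e-5 for GPT)
total_steps = max(1, len(train_loader) // accum_steps) * epochs
scheduler = get_linear_schedule_with_warmup(                   # linear warmup; the warmup
    optimizer, num_warmup_steps=total_steps // 100,            # length is an assumption
    num_training_steps=total_steps)

model.train()
for epoch in range(epochs):
    for step, (input_ids, attention_mask) in enumerate(train_loader):
        loss = model(input_ids=input_ids, attention_mask=attention_mask,
                     labels=input_ids).loss                    # standard LM objective
        (loss / accum_steps).backward()                        # accumulate gradients
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```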