Improving Coherence and Consistency in Neural Sequence Models with Dual-System, Neuro-Symbolic Reasoning
Authors: Maxwell Nye, Michael Henry Tessler, Joshua B. Tenenbaum, Brenden M. Lake
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Results in robust story generation and grounded instruction-following show that this approach can increase the coherence and accuracy of neurally-based generations. |
| Researcher Affiliation | Collaboration | Maxwell Nye (MIT); Michael Henry Tessler (MIT, DeepMind); Joshua B. Tenenbaum (MIT); Brenden M. Lake (NYU, Facebook AI Research) |
| Pseudocode | No | The paper includes schematic diagrams (Figure 1, Figure 6) illustrating the system's flow but does not contain any formal pseudocode blocks or algorithm listings. |
| Open Source Code | No | The paper does not provide any explicit statements or links indicating that source code for the described methodology is available. |
| Open Datasets | Yes | We first illustrate the approach by generating short stories based on the bAbI dataset (Weston et al., 2015); this pedagogical, synthetic example illustrates how basic commonsense knowledge of objects, agents, and places can inform a text generation model. We then test our approach on rich, natural language vignettes based on CLUTRR (Sinha et al., 2019), focusing on ensuring consistency of family and interpersonal relationships. We use the gSCAN benchmark (Ruis et al., 2020), a recently proposed grounded instruction following dataset designed to measure compositional generalization in neural systems. |
| Dataset Splits | No | The paper mentions training data sizes for gSCAN ("5000 datapoints, 8000 datapoints, and 20000 datapoints") and refers to a "dev" split in Table 2, but it does not provide specific percentages or counts for training/validation/test splits, nor does it detail a cross-validation setup for full reproducibility. |
| Hardware Specification | No | The paper does not provide any specific details regarding the hardware used for running the experiments (e.g., GPU models, CPU types, or memory specifications). |
| Software Dependencies | No | The paper mentions several models and tools used (e.g., GPT-3, BART, Z3 solver, RoBERTa MNLI), but it does not specify version numbers for these or other software dependencies, which are necessary for reproducibility. |
| Experiment Setup | Yes | For the bAbI examples, we use GPT-3 as our System 1 proposal model through few-shot prompting with 10 example bAbI stories as context, generating a new story one candidate sentence at a time. For all System 1 generations, we used model temperature of 1.0. For the neural NLI baseline, we used 0.9 probability of contradiction as the cutoff for rejection. Our dual-system model uses a sampling budget of 10 System 1 samples per sentence. In our experiments, we use a sample-based search with a maximum budget of 50 samples. |
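The experiment-setup details quoted above describe a propose-and-check loop: System 1 (a neural language model such as GPT-3 at temperature 1.0) samples candidate sentences, and System 2 (a symbolic world model, or the neural NLI baseline with a 0.9 contradiction-probability cutoff) accepts or rejects each candidate, with a budget of 10 samples per sentence. The sketch below is a minimal illustration of that loop under stated assumptions; the function names (`propose_fn`, `is_consistent`, `contradiction_prob`) and overall structure are hypothetical and are not the authors' released code.

```python
# Hypothetical sketch of the dual-system generation loop summarized in the table.
# System 1 is a neural proposer (e.g., GPT-3 sampled at temperature 1.0);
# System 2 is a consistency checker (e.g., a symbolic world model, or an NLI
# model that rejects candidates whose contradiction probability exceeds 0.9).
from typing import Callable, List, Optional


def generate_story(
    context: List[str],
    propose_fn: Callable[[List[str]], str],           # System 1: sample one candidate sentence
    is_consistent: Callable[[List[str], str], bool],  # System 2: accept or reject the candidate
    num_sentences: int = 5,
    samples_per_sentence: int = 10,                   # per-sentence sampling budget from the paper
) -> List[str]:
    """Generate a story one sentence at a time, rejecting inconsistent candidates."""
    story: List[str] = []
    for _ in range(num_sentences):
        accepted: Optional[str] = None
        for _ in range(samples_per_sentence):
            candidate = propose_fn(context + story)
            if is_consistent(context + story, candidate):
                accepted = candidate
                break
        if accepted is None:
            break  # budget exhausted; stop rather than emit an inconsistent sentence
        story.append(accepted)
    return story


def nli_check(contradiction_prob: Callable[[str, str], float],
              threshold: float = 0.9) -> Callable[[List[str], str], bool]:
    """Neural NLI baseline check: reject a candidate if its contradiction probability
    against any previous sentence exceeds the threshold. `contradiction_prob` is a
    stand-in for a RoBERTa-MNLI style scorer."""
    def check(history: List[str], candidate: str) -> bool:
        return all(contradiction_prob(prev, candidate) < threshold for prev in history)
    return check
```

For the grounded instruction-following (gSCAN) experiments, the paper instead reports a sample-based search with a maximum budget of 50 samples; swapping the per-sentence loop for a single search over candidate action sequences would reflect that setting.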