Measuring Systematic Generalization in Neural Proof Generation with Transformers
Authors: Nicolas Gontier, Koustuv Sinha, Siva Reddy, Chris Pal
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate the systematic generalization abilities of Transformer language models (TLMs) on a logical reasoning task in natural language, which involves reasoning over relationships between entities grounded in first-order logical proofs. Specifically, we perform soft theorem-proving by leveraging TLMs to generate natural language proofs. We test the generated proofs for logical consistency, along with the accuracy of the final inference. We observe length-generalization issues when models are evaluated on longer-than-trained sequences. However, TLMs improve their generalization performance after being exposed to longer, exhaustive proofs. In addition, we discover that TLMs generalize better with backward-chaining proofs than with their forward-chaining counterparts, while they find it easier to generate forward-chaining proofs. We also observe that models not trained to generate proofs are better at generalizing to problems based on longer proofs, which suggests that Transformers have efficient internal reasoning strategies that are harder to interpret. These results highlight the systematic generalization behavior of TLMs in the context of logical reasoning, and we believe this work motivates deeper inspection of their underlying reasoning strategies. |
| Researcher Affiliation | Collaboration | Nicolas Gontier (Quebec Artificial Intelligence Institute (Mila), Polytechnique Montréal; gontiern@mila.quebec); Koustuv Sinha (Quebec Artificial Intelligence Institute (Mila), McGill University; Facebook AI Research); Siva Reddy (Quebec Artificial Intelligence Institute (Mila), McGill University; Facebook CIFAR AI Chair); Christopher Pal (Quebec Artificial Intelligence Institute (Mila), Polytechnique Montréal; Element AI; Canada CIFAR AI Chair) |
| Pseudocode | Yes | Pseudo-code for generating this type of proof can be found in Appendix 6.2 (a hedged backward-chaining sketch follows this table). |
| Open Source Code | Yes | Dataset and code can be downloaded at https://github.com/NicolasAG/SGinPG |
| Open Datasets | Yes | For natural language theorem proving, we use the question answering CLUTRR benchmark suite (Sinha et al., 2019) to perform controlled studies. |
| Dataset Splits | No | The paper specifies training and test set details but does not explicitly describe a separate validation set or its split percentage/size, which would be needed for exact reproduction. It mentions interpolation levels for evaluation, but these are part of the test set. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU/CPU models or specific cloud instances. |
| Software Dependencies | No | The paper describes the model architecture (Transformer-Decoder) and mentions using GPT-2, but it does not provide version numbers for any software libraries or dependencies (e.g., PyTorch 1.x, TensorFlow 2.x). |
| Experiment Setup | Yes | We use a Transformer-Decoder architecture (Liu et al., 2018) with 4 layers, 8 attention heads, and a model dimension of 256 for facts stories, and 12 layers, 12 attention heads, and a model dimension of 768 for amt stories. We train with the Adam optimizer (Kingma and Ba, 2015) at a learning rate of 1e-4, for 300k steps with a batch size of 200, using a linear learning-rate warm-up for the first 10k steps and label smoothing with parameter 0.1. (A hedged configuration sketch follows this table.) |
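
The paper defers its proof-generation pseudo-code to Appendix 6.2, so we cannot quote it here. As a rough illustration of the backward-chaining proof strategy the paper contrasts with forward chaining, here is a minimal Python sketch: the goal relation is resolved recursively into sub-goals until known facts are reached. The `FACTS`/`RULES` encodings, the `prove` helper, and the kinship compositions are illustrative assumptions, not the authors' code.

```python
# A minimal backward-chaining sketch, in the spirit of the pseudo-code the
# paper defers to Appendix 6.2. The fact/rule encodings below are
# illustrative assumptions, not the authors' exact structures.

# FACTS[(a, b)] == r means "b is a's r", e.g. Bea is Ann's mother.
FACTS = {("Ann", "Bea"): "mother", ("Bea", "Cal"): "mother"}

# Composition rules: if b is a's r1 and c is b's r2, then c is a's RULES[(r1, r2)].
RULES = {("mother", "mother"): "grandmother", ("father", "mother"): "grandmother"}


def prove(a, c, visited=frozenset()):
    """Resolve the goal rel(a, c) backwards; return (relation, proof steps) or None."""
    if (a, c) in FACTS:                   # base case: a known fact closes the goal
        return FACTS[(a, c)], []
    for (x, b), r1 in FACTS.items():
        if x != a or b in visited:
            continue
        sub = prove(b, c, visited | {a})  # backward step: reduce to sub-goal rel(b, c)
        if sub is None:
            continue
        r2, steps = sub
        r = RULES.get((r1, r2))
        if r is not None:                 # compose the two relations
            return r, steps + [f"{r1}({a},{b}) & {r2}({b},{c}) -> {r}({a},{c})"]
    return None


print(prove("Ann", "Cal"))
# -> ('grandmother', ['mother(Ann,Bea) & mother(Bea,Cal) -> grandmother(Ann,Cal)'])
```

The reported hyper-parameters from the Experiment Setup row can likewise be collected into a small configuration sketch. The paper does not name its framework or versions (see the Software Dependencies row), so the PyTorch classes, the vocabulary size, and the warm-up code below are assumptions; only the numeric values come from the paper.

```python
# A hedged sketch of the reported training configuration in PyTorch.
# Only the numbers and the Adam/label-smoothing/warm-up choices come from
# the paper; class names, vocabulary size, and schedule code are assumptions.
import torch
from torch import nn

# Reported model sizes per story type.
CONFIGS = {
    "facts": dict(num_layers=4, nhead=8, d_model=256),
    "amt":   dict(num_layers=12, nhead=12, d_model=768),
}
cfg = CONFIGS["facts"]
VOCAB_SIZE = 1000  # hypothetical; the paper does not report a vocabulary size

# A decoder-only LM is commonly built from encoder layers plus a causal
# attention mask at forward time; nn.TransformerDecoder expects encoder memory.
layer = nn.TransformerEncoderLayer(d_model=cfg["d_model"], nhead=cfg["nhead"])
model = nn.TransformerEncoder(layer, num_layers=cfg["num_layers"])
lm_head = nn.Linear(cfg["d_model"], VOCAB_SIZE)

# Reported: Adam at lr 1e-4, label smoothing 0.1, 300k steps, batch size 200,
# linear warm-up over the first 10k steps.
optimizer = torch.optim.Adam(
    list(model.parameters()) + list(lm_head.parameters()), lr=1e-4
)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
WARMUP, TOTAL_STEPS, BATCH_SIZE = 10_000, 300_000, 200
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / WARMUP)
)
```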
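Note that `nn.TransformerEncoder` with a causal mask is one standard way to realize a decoder-only language model in PyTorch; the authors may have used GPT-2 code directly, which the paper mentions but does not detail.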