Measuring Systematic Generalization in Neural Proof Generation with Transformers
Authors: Nicolas Gontier, Koustuv Sinha, Siva Reddy, Chris Pal
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate the systematic generalization abilities of Transformer language models (TLMs) on a logical reasoning task in natural language, which involves reasoning over relationships between entities grounded in first-order logical proofs. Specifically, we perform soft theorem-proving by leveraging TLMs to generate natural language proofs. We test the generated proofs for logical consistency, along with the accuracy of the final inference. We observe length-generalization issues when models are evaluated on longer-than-trained sequences. However, TLMs improve their generalization performance after being exposed to longer, exhaustive proofs. In addition, we discover that TLMs generalize better with backward-chaining proofs than with their forward-chaining counterparts, while they find it easier to generate forward-chaining proofs. We also observe that models not trained to generate proofs are better at generalizing to problems based on longer proofs, which suggests that Transformers have efficient internal reasoning strategies that are harder to interpret. These results highlight the systematic generalization behavior of TLMs in the context of logical reasoning, and we believe this work motivates deeper inspection of their underlying reasoning strategies. |
| Researcher Affiliation | Collaboration | Nicolas Gontier (Quebec Artificial Intelligence Institute (Mila), Polytechnique Montréal; gontiern@mila.quebec); Koustuv Sinha (Quebec Artificial Intelligence Institute (Mila), McGill University; Facebook AI Research); Siva Reddy (Quebec Artificial Intelligence Institute (Mila), McGill University; Facebook CIFAR AI Chair); Christopher Pal (Quebec Artificial Intelligence Institute (Mila), Polytechnique Montréal; Element AI; Canada CIFAR AI Chair) |
| Pseudocode | Yes | Pseudo-code for generating this type of proof can be found in Appendix 6.2 (a hedged backward-chaining sketch follows this table). |
| Open Source Code | Yes | Dataset and code can be downloaded at https://github.com/NicolasAG/SGinPG |
| Open Datasets | Yes | For natural language theorem proving, we use the question answering CLUTRR benchmark suite (Sinha et al., 2019) to perform controlled studies. |
| Dataset Splits | No | The paper specifies training and test set details but does not explicitly describe a separate validation set or its split percentage/size, which would be needed for exact reproduction. It mentions interpolation levels for evaluation, but these are part of the test set. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU/CPU models or specific cloud instances. |
| Software Dependencies | No | The paper describes the model architecture (Transformer-Decoder) and mentions using GPT-2, but it does not provide version numbers for any software libraries or dependencies (e.g., PyTorch 1.x, TensorFlow 2.x). |
| Experiment Setup | Yes | We use a Transformer-Decoder architecture (Liu et al., 2018) with 4 layers, 8 attention heads, and a model dimension of 256 for facts stories, and 12 layers, 12 attention heads, and a model dimension of 768 for amt stories. We train with the Adam optimizer (Kingma and Ba, 2015) at a learning rate of 1e-4, for 300k steps with a batch size of 200, using a linear learning-rate warm-up for the first 10k steps and label smoothing with parameter 0.1. (A hedged configuration sketch follows this table.) |
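
The paper defers its proof-generation pseudo-code to Appendix 6.2, so we cannot quote it here. As a rough illustration of the backward-chaining proof strategy the paper contrasts with forward chaining, here is a minimal Python sketch: the goal relation is resolved recursively into sub-goals until known facts are reached. The `FACTS`/`RULES` encodings, the `prove` helper, and the kinship compositions are illustrative assumptions, not the authors' code.

```python
# A minimal backward-chaining sketch, in the spirit of the pseudo-code the
# paper defers to Appendix 6.2. The fact/rule encodings below are
# illustrative assumptions, not the authors' exact structures.

# FACTS[(a, b)] == r means "b is a's r", e.g. Bea is Ann's mother.
FACTS = {("Ann", "Bea"): "mother", ("Bea", "Cal"): "mother"}

# Composition rules: if b is a's r1 and c is b's r2, then c is a's RULES[(r1, r2)].
RULES = {("mother", "mother"): "grandmother", ("father", "mother"): "grandmother"}


def prove(a, c, visited=frozenset()):
    """Resolve the goal rel(a, c) backwards; return (relation, proof steps) or None."""
    if (a, c) in FACTS:                   # base case: a known fact closes the goal
        return FACTS[(a, c)], []
    for (x, b), r1 in FACTS.items():
        if x != a or b in visited:
            continue
        sub = prove(b, c, visited | {a})  # backward step: reduce to sub-goal rel(b, c)
        if sub is None:
            continue
        r2, steps = sub
        r = RULES.get((r1, r2))
        if r is not None:                 # compose the two relations
            return r, steps + [f"{r1}({a},{b}) & {r2}({b},{c}) -> {r}({a},{c})"]
    return None


print(prove("Ann", "Cal"))
# -> ('grandmother', ['mother(Ann,Bea) & mother(Bea,Cal) -> grandmother(Ann,Cal)'])
```

The reported hyper-parameters from the Experiment Setup row can likewise be collected into a small configuration sketch. The paper does not name its framework or versions (see the Software Dependencies row), so the PyTorch classes, the vocabulary size, and the warm-up code below are assumptions; only the numeric values come from the paper.

```python
# A hedged sketch of the reported training configuration in PyTorch.
# Only the numbers and the Adam/label-smoothing/warm-up choices come from
# the paper; class names, vocabulary size, and schedule code are assumptions.
import torch
from torch import nn

# Reported model sizes per story type.
CONFIGS = {
    "facts": dict(num_layers=4, nhead=8, d_model=256),
    "amt":   dict(num_layers=12, nhead=12, d_model=768),
}
cfg = CONFIGS["facts"]
VOCAB_SIZE = 1000  # hypothetical; the paper does not report a vocabulary size

# A decoder-only LM is commonly built from encoder layers plus a causal
# attention mask at forward time; nn.TransformerDecoder expects encoder memory.
layer = nn.TransformerEncoderLayer(d_model=cfg["d_model"], nhead=cfg["nhead"])
model = nn.TransformerEncoder(layer, num_layers=cfg["num_layers"])
lm_head = nn.Linear(cfg["d_model"], VOCAB_SIZE)

# Reported: Adam at lr 1e-4, label smoothing 0.1, 300k steps, batch size 200,
# linear warm-up over the first 10k steps.
optimizer = torch.optim.Adam(
    list(model.parameters()) + list(lm_head.parameters()), lr=1e-4
)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
WARMUP, TOTAL_STEPS, BATCH_SIZE = 10_000, 300_000, 200
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / WARMUP)
)
```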
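Note that `nn.TransformerEncoder` with a causal mask is one standard way to realize a decoder-only language model in PyTorch; the authors may have used GPT-2 code directly, which the paper mentions but does not detail.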