Symbolic Brittleness in Sequence Models: On Systematic Generalization in Symbolic Mathematics
Authors: Sean Welleck, Peter West, Jize Cao, Yejin Choi
AAAI 2022, pp. 8629-8637 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We develop a methodology for evaluating generalization that takes advantage of the problem domain's structure and access to a verifier. Despite promising in-distribution performance of sequence-to-sequence models in this domain, we demonstrate challenges in achieving robustness, compositionality, and out-of-distribution generalization, through both carefully constructed manual test suites and a genetic algorithm that automatically finds large collections of failures in a controllable manner. |
| Researcher Affiliation | Collaboration | Paul G. Allen School of Computer Science & Engineering, University of Washington; Allen Institute for Artificial Intelligence. wellecks@uw.edu |
| Pseudocode | Yes | Algorithm 1: SAGGA. Each seed problem is denoted x̂, each mutated problem x̃, and each archived problem x̄. (A simplified sketch of the failure-search loop follows the table.) |
| Open Source Code | No | We use the implementation and pretrained model from Lample and Charton (2019) for all of our experiments, specifically the FWD+BWD+IBP model which obtained top-10 accuracies of 95.6%, 99.5%, and 99.6% on their publicly available test sets. |
| Open Datasets | Yes | Neural sequence integrator. Lample and Charton (2019) frame symbolic integration as a sequence-to-sequence problem. In this view, input and output equations x and y are prefix-notation sequences. The neural sequence integrator uses a 6-layer transformer (Vaswani et al. 2017) to model the distribution $p_\theta(y|x) = \prod_{t=1}^{T_y} p_\theta(y_t | y_{<t}, x)$ by training the model to maximize the log-likelihood of a set of training problems, $\arg\max_\theta \sum_{(x,y) \in D} \log p_\theta(y|x)$. (A minimal sketch of this factorization follows the table.) |
| Dataset Splits | Yes | We use the validation set, and perturb validation problems that the model correctly integrates using the neighborhoods $X_{N_1} = \{kf,\ k + f\}$, $X_{N_2} = \{f + e^x,\ f + \ln(x)\}$, where $k \sim U(1, 100)$. (A SymPy sketch of these perturbations follows the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | Following the authors, we use Sympy to check whether the derivative of a prediction is equal to the original problem. (No version is specified for SymPy; a verification sketch follows the table.) |
| Experiment Setup | Yes | Experimental setup. We use the implementation and pretrained model from Lample and Charton (2019) for all of our experiments, specifically the FWD+BWD+IBP model... Our evaluation is based on their code; we use their utilities for inputs and outputs, and by default use beam search with beam size 10. Following the authors, we use Sympy to check whether the derivative of a prediction is equal to the original problem. |
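
As context for the Open Datasets row, here is a minimal sketch of the autoregressive factorization $p_\theta(y|x) = \prod_{t=1}^{T_y} p_\theta(y_t | y_{<t}, x)$ over prefix-notation sequences. The `token_log_prob` callable is a hypothetical stand-in for the 6-layer transformer's per-token conditional distribution, not the authors' implementation.

```python
import math

def sequence_log_prob(token_log_prob, x_tokens, y_tokens):
    """Sum the per-token conditional log-probabilities log p(y_t | y_<t, x)
    of a prefix-notation target sequence y given the input problem x."""
    total = 0.0
    for t in range(len(y_tokens)):
        # Condition on the input x and the already-generated prefix y_<t.
        total += token_log_prob(x_tokens, y_tokens[:t], y_tokens[t])
    return total

# Toy example with a uniform distribution over a tiny vocabulary,
# so every token contributes log(1/4) regardless of context.
vocab = ["add", "x", "2", "INT+"]
uniform = lambda x_toks, prefix, tok: math.log(1.0 / len(vocab))
print(sequence_log_prob(uniform, ["x"], ["add", "x", "2"]))  # 3 * log(0.25)
```

Training then maximizes the sum of this quantity over the (x, y) pairs in the training set.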
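
The Software Dependencies and Experiment Setup rows describe checking a prediction by differentiating it and comparing against the original problem. Below is a minimal SymPy sketch of that check, assuming both expressions are already parsed into SymPy form (the authors use the parsing utilities from Lample and Charton's released code).

```python
import sympy as sp

x = sp.Symbol("x")

def is_correct_integral(problem, prediction):
    """Return True if `prediction` is an antiderivative of `problem`,
    i.e. d/dx prediction - problem simplifies to zero."""
    return sp.simplify(sp.diff(prediction, x) - problem) == 0

print(is_correct_integral(sp.cos(x), sp.sin(x)))   # True
print(is_correct_integral(x, x**2 / 2 + 7))        # True (constants of integration are fine)
print(is_correct_integral(sp.cos(x), sp.cos(x)))   # False
```

With beam search of size 10, this check would be applied to each beam candidate when computing top-10 accuracy.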
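
For the Dataset Splits row, here is a small sketch of how the two perturbation neighborhoods could be materialized for a correctly-integrated validation integrand f. The membership of the first neighborhood is taken here to be {k*f, k + f}, which is an assumption; the second neighborhood and the k ~ U(1, 100) draw follow the quoted description.

```python
import random
import sympy as sp

x = sp.Symbol("x")

def perturbation_neighborhoods(f):
    """Build perturbed variants of a validation integrand f.
    N1 (assumed form): rescale by, or add, a random integer coefficient k ~ U(1, 100).
    N2 (as quoted): add e^x or ln(x) to the integrand."""
    k = random.randint(1, 100)           # k ~ U(1, 100)
    neighborhood_1 = [k * f, k + f]      # assumed membership {k*f, k + f}
    neighborhood_2 = [f + sp.exp(x), f + sp.log(x)]
    return neighborhood_1 + neighborhood_2

# Example: perturbations of f(x) = sin(x).
for g in perturbation_neighborhoods(sp.sin(x)):
    print(g)
```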
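
Finally, the Pseudocode row refers to Algorithm 1 (SAGGA). The paper's algorithm is not reproduced in this report, so the following is only a highly simplified mutate-verify-archive loop in the same spirit: mutate seed problems, keep mutants the model fails to integrate, and reuse those failures as the next round's seeds. `model_integrate` and `mutate` are hypothetical stand-ins.

```python
import sympy as sp

x = sp.Symbol("x")

def find_failures(seeds, model_integrate, mutate, n_rounds=10, mutants_per_seed=4):
    """Simplified failure search: each round mutates the current seed problems,
    archives mutants whose predicted integral does not verify, and uses the
    archived failures to seed the next round."""
    archive = []               # archived failing problems
    population = list(seeds)   # current seed problems
    for _ in range(n_rounds):
        mutants = [mutate(f) for f in population for _ in range(mutants_per_seed)]
        failures = []
        for f in mutants:
            prediction = model_integrate(f)
            # Verifier: the prediction is wrong if its derivative differs from f.
            if sp.simplify(sp.diff(prediction, x) - f) != 0:
                failures.append(f)
        archive.extend(failures)
        population = failures or population  # concentrate on weak spots
    return archive
```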