Symbolic Brittleness in Sequence Models: On Systematic Generalization in Symbolic Mathematics

Authors: Sean Welleck, Peter West, Jize Cao, Yejin Choi

AAAI 2022, pp. 8629-8637 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We develop a methodology for evaluating generalization that takes advantage of the problem domain's structure and access to a verifier. Despite promising in-distribution performance of sequence-to-sequence models in this domain, we demonstrate challenges in achieving robustness, compositionality, and out-of-distribution generalization, through both carefully constructed manual test suites and a genetic algorithm that automatically finds large collections of failures in a controllable manner.
Researcher Affiliation | Collaboration | Paul G. Allen School of Computer Science & Engineering, University of Washington; Allen Institute for Artificial Intelligence. wellecks@uw.edu
Pseudocode | Yes | Algorithm 1: SAGGA. The pseudocode uses distinct accented markers for seed, mutated, and archived problems (e.g., a seed problem is denoted x̂). (An illustrative sketch of this kind of archive-based genetic-algorithm loop follows the table.)
Open Source Code | No | We use the implementation and pretrained model from Lample and Charton (2019) for all of our experiments, specifically the FWD+BWD+IBP model which obtained top-10 accuracies of 95.6%, 99.5%, and 99.6% on their publicly available test sets.
Open Datasets | Yes | Neural sequence integrator. Lample and Charton (2019) frame symbolic integration as a sequence-to-sequence problem. In this view, input and output equations x and y are prefix-notation sequences. The neural sequence integrator uses a 6-layer transformer (Vaswani et al. 2017) to model the distribution p_θ(y | x) = ∏_{t=1}^{T_y} p_θ(y_t | y_{<t}, x) by training the model to maximize the log-likelihood of a set of training problems, argmax_θ Σ_{(x,y) ∈ D} log p_θ(y | x). (See the log-likelihood sketch below the table.)
Dataset Splits | Yes | We use the validation set, and perturb validation problems that the model correctly integrates using the neighborhoods X_N1 = {k · f, k + f}, X_N2 = {f + e^x, f + ln(x)}, where k ~ U(1, 100). (See the perturbation sketch below the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | Following the authors, we use SymPy to check whether the derivative of a prediction is equal to the original problem. (No version specified for SymPy.)
Experiment Setup | Yes | Experimental setup. We use the implementation and pretrained model from Lample and Charton (2019) for all of our experiments, specifically the FWD+BWD+IBP model... Our evaluation is based on their code; we use their utilities for inputs and outputs, and by default use beam search with beam size 10. Following the authors, we use SymPy to check whether the derivative of a prediction is equal to the original problem. (See the derivative-check sketch below the table.)
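
The following Python sketch only illustrates the archive-based genetic-algorithm pattern that the SAGGA pseudocode describes: mutate seed problems, archive the ones the model fails on, and reuse failures as new parents. The names mutate, model_fails, and the loop parameters are placeholders assumed for illustration, not the authors' implementation.

    import random

    def sagga_sketch(seeds, mutate, model_fails, n_iters=1000, max_archive=500):
        # Illustrative archive-based genetic-algorithm loop (not the authors' SAGGA code).
        # seeds       : list of seed problems (e.g., SymPy expressions)
        # mutate      : callable returning a perturbed copy of a problem
        # model_fails : callable returning True if the model fails to integrate the problem
        population = list(seeds)   # pool of parents to mutate
        archive = []               # discovered failure cases
        for _ in range(n_iters):
            parent = random.choice(population)
            child = mutate(parent)            # apply a random perturbation
            if model_fails(child):
                archive.append(child)         # keep the failure
                population.append(child)      # and reuse it as a future parent
            if len(archive) >= max_archive:
                break
        return archive

In the paper, the mutation and selection steps are additionally constrained so that the kinds of failures found can be controlled; that control is omitted in this sketch.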
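
As a worked illustration of the factorized likelihood quoted in the Open Datasets row, the sketch below computes log p(y|x) = Σ_t log p(y_t | y_{<t}, x) for a single prefix-notation problem. The next_token_probs callable stands in for the transformer p_θ and is an assumption made for illustration only.

    import math

    def sequence_log_likelihood(next_token_probs, x_tokens, y_tokens):
        # log p(y | x) = sum_t log p(y_t | y_<t, x); training maximizes this
        # quantity summed over training pairs (x, y) in D.
        # next_token_probs(prefix, x_tokens) is assumed to return a dict
        # mapping each candidate next token to its probability.
        total = 0.0
        for t, y_t in enumerate(y_tokens):
            prefix = y_tokens[:t]  # y_<t
            total += math.log(next_token_probs(prefix, x_tokens)[y_t])
        return total

    # Prefix-notation framing: the integrand 2*x might be tokenized as
    # x_tokens = ['mul', '2', 'x'] and its integral x**2 as y_tokens = ['pow', 'x', '2'].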
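
A minimal SymPy sketch of the perturbation neighborhoods quoted in the Dataset Splits row is given below; the operators in X_N1 are a reconstruction of a garbled quote, so the neighborhood definitions should be read as assumptions rather than the authors' code.

    import random
    import sympy

    x = sympy.Symbol('x')

    def perturb(f):
        # Sample a perturbed problem from the neighborhoods described above:
        # X_N1 combines f with a random integer coefficient k ~ U(1, 100) (reconstructed);
        # X_N2 adds a simple primitive (e^x or ln x).
        k = random.randint(1, 100)
        if random.random() < 0.5:
            return random.choice([k * f, k + f])                          # X_N1
        return random.choice([f + sympy.exp(x), f + sympy.log(x)])        # X_N2

    # Example: perturb a validation problem the model integrates correctly.
    print(perturb(sympy.cos(x)))   # e.g. 37*cos(x) or cos(x) + exp(x)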
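
The derivative check described in the setup (differentiate the predicted antiderivative and compare it to the original problem) can be sketched with SymPy as follows; simplify-based equality is one reasonable way to do the comparison and may differ from the exact check used in the evaluation code.

    import sympy

    x = sympy.Symbol('x')

    def is_correct_integral(problem, prediction):
        # True if d(prediction)/dx equals the original integrand: differentiate
        # the predicted antiderivative and check the difference simplifies to 0.
        difference = sympy.simplify(sympy.diff(prediction, x) - problem)
        return difference == 0

    # Example: sin(x) is a correct integral of cos(x); cos(x) is not.
    print(is_correct_integral(sympy.cos(x), sympy.sin(x)))   # True
    print(is_correct_integral(sympy.cos(x), sympy.cos(x)))   # False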