Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Symbolic Brittleness in Sequence Models: On Systematic Generalization in Symbolic Mathematics
Authors: Sean Welleck, Peter West, Jize Cao, Yejin Choi (pp. 8629-8637)
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We develop a methodology for evaluating generalization that takes advantage of the problem domain's structure and access to a verifier. Despite promising in-distribution performance of sequence-to-sequence models in this domain, we demonstrate challenges in achieving robustness, compositionality, and out-of-distribution generalization, through both carefully constructed manual test suites and a genetic algorithm that automatically finds large collections of failures in a controllable manner. |
| Researcher Affiliation | Collaboration | ¹Paul G. Allen School of Computer Science & Engineering, University of Washington; ²Allen Institute for Artificial Intelligence |
| Pseudocode | Yes | Algorithm 1: SAGGA. Each seed problem denoted as x̂, mutated problem as x, archived problem as x. |
| Open Source Code | No | We use the implementation and pretrained model from Lample and Charton (2019) for all of our experiments, specifically the FWD+BWD+IBP model which obtained top-10 accuracies of 95.6%, 99.5%, and 99.6% on their publicly available test sets. |
| Open Datasets | Yes | Neural sequence integrator. Lample and Charton (2019) frame symbolic integration as a sequence-to-sequence problem. In this view, input and output equations x and y are prefix-notation sequences. The neural sequence integrator uses a 6-layer transformer (Vaswani et al. 2017) to model the distribution $p_\theta(y \mid x) = \prod_{t=1}^{T_y} p_\theta(y_t \mid y_{<t}, x)$ by training the model to maximize the log-likelihood of a set of training problems, $\arg\max_\theta \sum_{(x,y) \in \mathcal{D}} \log p_\theta(y \mid x)$. |
| Dataset Splits | Yes | We use the validation set, and perturb validation problems that the model correctly integrates using the neighborhoods $X_{N_1} = \{\tfrac{1}{k} f,\ k \cdot f\}$, $X_{N_2} = \{f + e^x,\ f + \ln(x)\}$, where $k \sim U(1, 100)$. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | Following the authors, we use Sympy to check whether the derivative of a prediction is equal to the original problem. (No version specified for Sympy) |
| Experiment Setup | Yes | Experimental setup. We use the implementation and pretrained model from Lample and Charton (2019) for all of our experiments, specifically the FWD+BWD+IBP model... Our evaluation is based on their code, we use their utilities for inputs and outputs, and by default use beam search with beam-size 10. Following the authors, we use Sympy to check whether the derivative of a prediction is equal to the original problem. |
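
The verification step and the perturbation neighborhood quoted in the table can be illustrated with a small sketch. The paper checks predictions symbolically with Sympy; the code below deliberately swaps in a dependency-free numerical spot check (central finite differences at random points), so it is only an approximation of that idea and not the authors' evaluation code. All function names, the sampling interval, and the tolerance are hypothetical choices made for this illustration.

```python
import math
import random


def numeric_derivative(func, x, h=1e-5):
    """Central finite-difference estimate of func'(x)."""
    return (func(x + h) - func(x - h)) / (2 * h)


def looks_like_antiderivative(problem, prediction, trials=20, tol=1e-4, seed=0):
    """Spot-check that d/dx prediction(x) matches problem(x) at random points.

    Numerical stand-in (assumption) for the symbolic Sympy check in the paper.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.uniform(0.5, 2.0)  # positive interval keeps ln(x) defined
        target = problem(x)
        if abs(numeric_derivative(prediction, x) - target) > tol * (1 + abs(target)):
            return False
    return True


def n2_neighborhood(f):
    """Perturbed problems f + e^x and f + ln(x), as in the quoted neighborhood."""
    return [lambda x: f(x) + math.exp(x), lambda x: f(x) + math.log(x)]


# cos(x) integrates to sin(x); cos(x) + e^x integrates to sin(x) + e^x.
plus_exp, plus_log = n2_neighborhood(math.cos)
ok_base = looks_like_antiderivative(math.cos, math.sin)
ok_perturbed = looks_like_antiderivative(plus_exp, lambda x: math.sin(x) + math.exp(x))
# A "prediction" that ignores the added ln(x) term should be rejected.
wrong = looks_like_antiderivative(plus_log, math.sin)
```

A numerical check like this can only reject a candidate with certainty; agreement at sampled points is evidence rather than proof, which is one reason a symbolic verifier is the stronger choice when it is available.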