Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Constraint Reasoning Embedded Structured Prediction

Authors: Nan Jiang, Maosen Zhang, Willem-Jan van Hoeve, Yexiang Xue

JMLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate Core-Sp on three applications: vehicle dispatching service planning, if-then program synthesis, and text2SQL generation. The proposed Core-Sp module demonstrates superior performance over state-of-the-art approaches in all three applications. The structures generated with Core-Sp satisfy 100% of the constraints when using exact decision diagrams. In addition, Core-Sp boosts learning performance by reducing the modeling space via constraint satisfaction.
Researcher Affiliation Collaboration Nan Jiang, Department of Computer Science, Purdue University, West Lafayette, Indiana, USA; Maosen Zhang, ByteDance, Beijing, China; Willem-Jan van Hoeve, Tepper School of Business, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA; Yexiang Xue, Department of Computer Science, Purdue University, West Lafayette, Indiana, USA.
Pseudocode Yes Algorithm 1: Iterative algorithm for searching optimal performance of Core-Sp.
Open Source Code Yes The code for all the experiments is available at GitHub. [Footnote 3: Code summary: https://jiangnanhugo.github.io/CORE-SP/]
Open Datasets Yes Our experiments are on a data set consisting of 29 cities in Bavaria. [Footnote 4: Instance bays29.tsp from TSPLIB: http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/tsp/] The data sets for this experiment are crawled from the IFTTT and Zapier websites. [Footnote 5: IFTTT data set is collected from https://ifttt.com/; Footnote 6: Zapier data set is collected from https://zapier.com/] We conduct experiments on the large-scale WikiSQL data set (Zhong et al., 2017), which contains 80,654 examples of questions and SQL queries distributed across 24,241 tables from Wikipedia.
Dataset Splits Yes
| Dataset | #train set | #val set | #test set | #quadruple | #vocabulary |
| IFTTT | 66761 | 4148 | 2640 | (111, 443, 88, 161) | 4000 |
| Zapier | 24454 | 4809 | 2576 | (1353, 1755, 1333, 1466) | 3782 |
Hardware Specification No No specific hardware details (GPU models, CPU types, memory amounts, or cloud platform specifications) are mentioned in the paper.
Software Dependencies No The implementation is based on SQLova. We use the BERT-base model (Devlin et al., 2019) for the word embeddings. The entire model takes up to 3 days to train for 50 epochs. No specific software versions (e.g., Python 3.x, PyTorch 1.x, CUDA x.x) are provided for the implementations described.
Experiment Setup Yes The generator G uses an encoder to learn a representation vector for the input and uses a sequential decoder to generate the schedule: h_j = LSTM(x, h_{j-1}). ... The discriminator D is trained ... It uses the following LSTM structure: s_j = LSTM(q_j, s_{j-1}). ... The loss function L is: min_G max_D E_{x,y}[log D(y, x)] + E_{z,x,y}[log(1 - D(G(x, z), y))]. The Latent Attention model is a bidirectional LSTM with residual connections, followed by a self-attention mechanism. ... During training, we use cross-entropy loss as the loss function L, minimizing the difference between the ground-truth prediction and the probabilities p_ts, p_tf, p_as, p_af produced by Core-Sp. ... SQLova has a sequence-to-sequence architecture. It first encodes a natural language sentence and the table headers into a high-dimensional vector. Then the decoder of SQLova decodes the hidden representation into predictions of the various entities in the SQL query. ... The entire model takes up to 3 days to train for 50 epochs. We choose the model that achieves the best execution accuracy on the validation data set for both the baseline and Core-Sp and calculate the corresponding statistics reported in Table 3.
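The adversarial objective quoted in the setup above, min_G max_D E_{x,y}[log D(y, x)] + E_{z,x,y}[log(1 - D(G(x, z), y))], can be made concrete with a minimal pure-Python sketch. This is not the paper's implementation; the function names are illustrative, and D is assumed to output a probability in (0, 1) for a single sample rather than an expectation over a data set.

```python
import math

def discriminator_objective(d_real, d_fake):
    """Value the discriminator maximizes for one (real, fake) sample pair.

    d_real: D(y, x), the discriminator's score on a ground-truth schedule.
    d_fake: D(G(x, z), y), its score on a generated schedule.
    """
    return math.log(d_real) + math.log(1.0 - d_fake)

def generator_objective(d_fake):
    """Term the generator minimizes; G only influences the fake-sample term."""
    return math.log(1.0 - d_fake)

# A confident discriminator (real ~1, fake ~0) attains a value near 0, the
# maximum of log(d) + log(1 - d') over d, d' in (0, 1); an undecided one
# (both scores 0.5) attains 2 * log(0.5).
print(discriminator_objective(0.99, 0.01))  # close to 0
print(discriminator_objective(0.5, 0.5))    # 2 * log(0.5), about -1.386
```

In actual training the two objectives are optimized in alternation, with the expectations estimated over mini-batches rather than single sample pairs.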
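The report states that structures generated with Core-Sp satisfy 100% of the constraints when using exact decision diagrams. A toy sketch of that general idea, not the paper's implementation: during sequential decoding, a decision-diagram state tracks which partial sequences remain feasible, and infeasible next tokens are masked out. Here the assumed constraint is "visit each of n locations exactly once" (as in a dispatching tour), and the state is simply the set of locations already visited; all names are hypothetical.

```python
def feasible_next(state, n):
    """Transition function of an exact decision diagram for a permutation
    constraint: any not-yet-visited location is a feasible next choice."""
    return [v for v in range(n) if v not in state]

def constrained_decode(scores, n):
    """Greedy decoding with decision-diagram masking.

    scores[j][v] is the model's (unnormalized) preference for emitting
    location v at step j; infeasible locations are never considered,
    so every output is a valid permutation by construction.
    """
    state, sequence = frozenset(), []
    for j in range(n):
        allowed = feasible_next(state, n)
        v = max(allowed, key=lambda cand: scores[j][cand])
        sequence.append(v)
        state = state | {v}
    return sequence

# Even if the model prefers location 0 at every step, the mask forces
# a valid permutation:
scores = [[3, 2, 1], [3, 2, 1], [3, 2, 1]]
print(constrained_decode(scores, 3))  # [0, 1, 2]
```

An exact diagram guarantees feasibility for every decoded structure; a relaxed diagram would merely prune many infeasible choices, trading the 100% guarantee for a smaller state space.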