reproducibilityindex.ai

Automated Data Extraction Using Predictive Program Synthesis

Authors: Mohammad Raza, Sumit Gulwani

AAAI 2017

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We describe concrete instantiations of such DSLs and the synthesis algorithm in the two practical application domains of text extraction and web extraction, and present an evaluation of our technique on a range of extraction tasks encountered in practice. The average execution time per task was 4.2 seconds, although 16 tasks were completed in under 2 seconds. Our system extracted 5.85 fields per page on average.
Researcher Affiliation	Industry	Mohammad Raza, Microsoft Corporation, One Microsoft Way, Redmond, Washington, 98052, moraza@microsoft.com; Sumit Gulwani, Microsoft Corporation, One Microsoft Way, Redmond, Washington, 98052, sumitg@microsoft.com
Pseudocode	Yes	Figure 6: Program synthesis algorithm; Figure 7: Generic lifting function for operator rules; Figure 8: Lifting function for the ID filter operator in Lw
Open Source Code	No	We have implemented our generic predictive synthesis algorithm as a new learning strategy in the PROSE framework (Polozov and Gulwani 2015), which is a library of program synthesis algorithms that allows the user to simply provide a DSL and other domain-specific parameters to get a PBE tool for free. The paper states their algorithm is implemented within an existing framework but does not explicitly state that their specific implementation or code is open-source or publicly available.
Open Datasets	No	For evaluation in the text domain, we collected a set of 20 benchmark cases from product teams, help forums, as well as real users in our organization who provided us with data sets on which they would like to perform extraction. In the case of web extraction, we evaluated our system on a collection of 20 webpages that contain tabular data not represented using explicit HTML table tags. The paper mentions using custom datasets collected from internal sources or specific webpages but does not provide any links, DOIs, repositories, or formal citations for public access to these datasets.
Dataset Splits	No	The paper does not explicitly provide information about training, validation, or test dataset splits. It mentions evaluating on "input examples" and "benchmark cases" but not in terms of distinct data partitions for reproduction.
Hardware Specification	No	The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. It only discusses the execution time.
Software Dependencies	No	We have implemented our generic predictive synthesis algorithm as a new learning strategy in the PROSE framework (Polozov and Gulwani 2015), which is a library of program synthesis algorithms... with a user interface implemented in a Microsoft Excel add-in. The paper mentions using the PROSE framework and Microsoft Excel but does not provide specific version numbers for either of these, or any other critical software dependencies.
Experiment Setup	No	The paper discusses the algorithm's parameters like "Max Depth configuration parameter" and general principles, but it does not provide specific experimental setup details such as hyperparameter values (e.g., learning rates, batch sizes, optimizer settings) or other system-level training configurations used for their evaluations.