Automated Data Extraction Using Predictive Program Synthesis

Authors: Mohammad Raza, Sumit Gulwani

AAAI 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We describe concrete instantiations of such DSLs and the synthesis algorithm in the two practical application domains of text extraction and web extraction, and present an evaluation of our technique on a range of extraction tasks encountered in practice. The average execution time per task was 4.2 seconds, although 16 tasks were completed in under 2 seconds. Our system extracted 5.85 fields per page on average.
Researcher Affiliation | Industry | Mohammad Raza, Microsoft Corporation, One Microsoft Way, Redmond, Washington, 98052, moraza@microsoft.com; Sumit Gulwani, Microsoft Corporation, One Microsoft Way, Redmond, Washington, 98052, sumitg@microsoft.com
Pseudocode | Yes | Figure 6: Program synthesis algorithm. Figure 7: Generic lifting function for operator rules. Figure 8: Lifting function for the ID filter operator in Lw.
Open Source Code | No | We have implemented our generic predictive synthesis algorithm as a new learning strategy in the PROSE framework (Polozov and Gulwani 2015), which is a library of program synthesis algorithms that allows the user to simply provide a DSL and other domain-specific parameters to get a PBE tool for free. The paper states their algorithm is implemented within an existing framework but does not explicitly state that their specific implementation or code is open-source or publicly available.
Open Datasets | No | For evaluation in the text domain, we collected a set of 20 benchmark cases from product teams, help forums, as well as real users in our organization who provided us with data sets on which they would like to perform extraction. In the case of web extraction, we evaluated our system on a collection of 20 webpages that contain tabular data not represented using explicit HTML table tags. The paper mentions using custom datasets collected from internal sources or specific webpages but does not provide any links, DOIs, repositories, or formal citations for public access to these datasets.
Dataset Splits | No | The paper does not explicitly provide information about training, validation, or test dataset splits. It mentions evaluating on "input examples" and "benchmark cases" but not in terms of distinct data partitions for reproduction.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. It only discusses the execution time.
Software Dependencies | No | We have implemented our generic predictive synthesis algorithm as a new learning strategy in the PROSE framework (Polozov and Gulwani 2015), which is a library of program synthesis algorithms... with a user interface implemented in a Microsoft Excel add-in. The paper mentions using the PROSE framework and Microsoft Excel but does not provide specific version numbers for either of these, or any other critical software dependencies.
Experiment Setup | No | The paper discusses the algorithm's parameters like "Max Depth configuration parameter" and general principles, but it does not provide specific experimental setup details such as hyperparameter values (e.g., learning rates, batch sizes, optimizer settings) or other system-level training configurations used for their evaluations.