Automated Data Extraction Using Predictive Program Synthesis
Authors: Mohammad Raza, Sumit Gulwani
AAAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We describe concrete instantiations of such DSLs and the synthesis algorithm in the two practical application domains of text extraction and web extraction, and present an evaluation of our technique on a range of extraction tasks encountered in practice. The average execution time per task was 4.2 seconds, although 16 tasks were completed in under 2 seconds. Our system extracted 5.85 fields per page on average. |
| Researcher Affiliation | Industry | Mohammad Raza, Microsoft Corporation, One Microsoft Way, Redmond, Washington, 98052, moraza@microsoft.com; Sumit Gulwani, Microsoft Corporation, One Microsoft Way, Redmond, Washington, 98052, sumitg@microsoft.com |
| Pseudocode | Yes | Figure 6: Program synthesis algorithm; Figure 7: Generic lifting function for operator rules; Figure 8: Lifting function for the ID filter operator in Lw |
| Open Source Code | No | We have implemented our generic predictive synthesis algorithm as a new learning strategy in the PROSE framework (Polozov and Gulwani 2015), which is a library of program synthesis algorithms that allows the user to simply provide a DSL and other domain-specific parameters to get a PBE tool for free. The paper states their algorithm is implemented within an existing framework but does not explicitly state that their specific implementation or code is open-source or publicly available. |
| Open Datasets | No | For evaluation in the text domain, we collected a set of 20 benchmark cases from product teams, help forums, as well as real users in our organization who provided us with data sets on which they would like to perform extraction. In the case of web extraction, we evaluated our system on a collection of 20 webpages that contain tabular data not represented using explicit HTML table tags. The paper mentions using custom datasets collected from internal sources or specific webpages but does not provide any links, DOIs, repositories, or formal citations for public access to these datasets. |
| Dataset Splits | No | The paper does not explicitly provide information about training, validation, or test dataset splits. It mentions evaluating on "input examples" and "benchmark cases" but not in terms of distinct data partitions for reproduction. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. It only discusses the execution time. |
| Software Dependencies | No | We have implemented our generic predictive synthesis algorithm as a new learning strategy in the PROSE framework (Polozov and Gulwani 2015), which is a library of program synthesis algorithms... with a user interface implemented in a Microsoft Excel add-in. The paper mentions using the PROSE framework and Microsoft Excel but does not provide specific version numbers for either of these, or any other critical software dependencies. |
| Experiment Setup | No | The paper discusses the algorithm's parameters like "Max Depth configuration parameter" and general principles, but it does not provide specific experimental setup details such as hyperparameter values (e.g., learning rates, batch sizes, optimizer settings) or other system-level training configurations used for their evaluations. |