A Hybrid Probabilistic Approach for Table Understanding

Authors: Kexuan Sun, Harsha Rayudu, Jay Pujara (pp. 4366-4374)

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The evaluation results show that our system achieves the state-of-the-art performance on cell type classification, block identification, and relationship prediction, improving over prior efforts by up to 7% of macro F1 score. In this section, we present the experimental evaluation of the proposed system based on the four datasets.
Researcher Affiliation | Academia | Kexuan Sun, Harsha Rayudu, Jay Pujara; University of Southern California, Information Sciences Institute; kexuansu@usc.edu, hrayudu@usc.edu, jpujara@isi.edu
Pseudocode | Yes | Algorithm 1: Candidate Block Generation
Open Source Code | Yes | https://github.com/kianasun/table-understanding-system
Open Datasets | Yes | Most existing benchmark datasets (such as DeEx (Eberius et al. 2013), SAUS (Chen and Cafarella 2013) and CIUS (Ghasemi-Gol, Pujara, and Szekely 2019)) consist of only Excel files, are from narrow domains and cover only cell functional types. We introduce a new benchmark dataset comprised of 431 tables downloaded from the U.S. Government's open data portal (https://www.data.gov/).
Dataset Splits | Yes | In all experiments, we perform 5-fold cross-validation on the rest of the tables: for each dataset, we randomly split the tables into 5 folds, train/validate a model using 4 folds and test on 1 fold. For the 4 training folds, we randomly split the tables with a 9:1 ratio into training and validation sets. (See the split sketch after the table.)
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions software libraries such as the scikit-learn library (Buitinck et al. 2013), the GridCRF class from the pystruct library (Müller and Behnke 2014), and the pytorch library (Paszke et al. 2019). While these citations indicate when the software was published, explicit version numbers (e.g., 'PyTorch 1.9') are not provided. (A version-logging snippet follows the table.)
Experiment Setup | Yes | We select n_estimators among [100, 300], max_depth among [5, 50, None], min_samples_split among [2, 10] and min_samples_leaf among [1, 10]. We use the bootstrap mode with balanced sub-sampling. We set the batch size to 32, the learning rate to 0.0001, and the number of epochs to 50. We use cross-entropy loss. (See the configuration sketch after the table.)
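
The split protocol in the Dataset Splits row can be made concrete with a short sketch. The snippet below is a minimal illustration, assuming the tables of a dataset are held in a Python list; the 5-fold outer loop and the 9:1 train/validation split mirror the quoted description, while the random seed, function name, and variable names are assumptions, not taken from the released code.

```python
from sklearn.model_selection import KFold, train_test_split

def five_fold_splits(tables, seed=0):
    """Yield (train, validation, test) table lists following the quoted protocol:
    4 folds are further split 9:1 into train/validation, the remaining fold is
    the test set. The seed and helper structure are assumptions."""
    kfold = KFold(n_splits=5, shuffle=True, random_state=seed)
    for train_val_idx, test_idx in kfold.split(tables):
        train_val = [tables[i] for i in train_val_idx]
        test = [tables[i] for i in test_idx]
        # 9:1 split of the 4 remaining folds into training and validation sets
        train, val = train_test_split(train_val, test_size=0.1, random_state=seed)
        yield train, val, test
```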
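Because the paper cites scikit-learn, pystruct, and pytorch without version numbers, anyone reproducing the results has to record their own environment. The snippet below is one hypothetical way to log installed versions; it does not recover the versions the authors actually used.

```python
# Print the versions of the libraries cited in the paper
# (scikit-learn, pystruct, pytorch). The authors' exact versions are unknown.
import sklearn
import pystruct
import torch

for name, module in [("scikit-learn", sklearn), ("pystruct", pystruct), ("pytorch", torch)]:
    print(f"{name}: {getattr(module, '__version__', 'unknown')}")
```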
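The Experiment Setup row mixes two configurations: a random-forest hyperparameter grid and a neural-network training setup. The sketch below restates both as code under stated assumptions: GridSearchCV, the 3-fold inner search, the placeholder linear model, and the Adam optimizer are illustrative choices; only the grid values, the balanced-subsample bootstrap, and the batch size, learning rate, epoch count, and cross-entropy loss come from the quoted text.

```python
import torch
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Random-forest grid from the quoted setup; the 3-fold inner search is an assumption.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, 50, None],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 10],
}
forest = RandomForestClassifier(bootstrap=True, class_weight="balanced_subsample")
forest_search = GridSearchCV(forest, param_grid, cv=3)
# forest_search.fit(train_features, train_labels)  # features/labels come from the pipeline

# Neural-network settings from the quoted setup. The linear model and the Adam
# optimizer are placeholders/assumptions; only batch size, learning rate, epoch
# count, and the cross-entropy loss are taken from the paper.
batch_size, learning_rate, num_epochs = 32, 1e-4, 50
model = torch.nn.Linear(16, 4)            # placeholder architecture, not the authors'
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
loss_fn = torch.nn.CrossEntropyLoss()
```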