HyTrel: Hypergraph-enhanced Tabular Data Representation Learning

Authors: Pei Chen, Soumajyoti Sarkar, Leonard Lausen, Balasubramaniam Srinivasan, Sheng Zha, Ruihong Huang, George Karypis

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results demonstrate that HYTREL consistently outperforms other competitive baselines on four downstream tasks with minimal pretraining, illustrating the advantages of incorporating the inductive biases associated with tabular data into the representations.
Researcher Affiliation | Collaboration | 1Texas A&M University, 2Amazon Web Services; {chenpei,huangrh}@tamu.edu, {soumajs,lausen,srbalasu,zhasheng,gkarypis}@amazon.com
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Code is available at: https://github.com/awslabs/hypergraph-tabular-lm
Open Datasets | Yes | In line with previous TaLMs [Yin et al., 2020, Iida et al., 2021], we use tables from Wikipedia and Common Crawl for pretraining. We utilize preprocessing tools provided by Yin et al. [2020] and collect a total of 27 million tables (1% are sampled and used for validation).
Dataset Splits | Yes | We utilize preprocessing tools provided by Yin et al. [2020] and collect a total of 27 million tables (1% are sampled and used for validation). For the TURL-CTA dataset... 13,025 (13,391) columns from 4,764 (4,844) tables for testing (validation). For the PMC dataset... report the macro-average of the 5-fold cross-validation performance. (See the split sketch after this table.)
Hardware Specification | Yes | Below in Table 3 are the experimental results of using different maximum row/column limitations with HYTREL (ELECTRA), and we fine-tune on the dataset with an NVIDIA A10 Tensor Core GPU (24GB). All experiments are conducted on a single A10 GPU, and the inference batch sizes are all chosen to be 8 for all models and all datasets. We use the validation sets of CTA, CPA, and TTD for experiments.
Software Dependencies | No | The paper mentions software components such as "DeepSpeed" and the "Adam" optimizer but does not specify their versions or other crucial software dependencies needed for reproducibility.
Experiment Setup | Yes | With the ELECTRA pretraining objective, we randomly replace 15% of the cells or headers of an input table with values that are sampled from all the pretraining tables based on their frequency, as recommended by Iida et al. [2021]. With the contrastive pretraining objective, we corrupt 30% of the connections between nodes and hyperedges for each table to create one augmented view. The temperature τ is set to 0.007. For both objectives, we pretrain the HYTREL models for 5 epochs. More details can be found in Appendix C.1. (See the corruption sketch after this table.)
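
The Dataset Splits row combines a small held-out validation fraction for the pretraining tables with 5-fold cross-validation (macro-averaged) for the PMC dataset. Below is a minimal Python sketch of both steps, assuming the tables and labeled examples are already loaded into lists; the names `holdout_split`, `five_fold_macro_average`, and `evaluate_fold` are hypothetical placeholders, not identifiers from the released code.

```python
import random
from statistics import mean

def holdout_split(tables, val_fraction=0.01, seed=42):
    """Hold out ~1% of the pretraining tables for validation (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(tables)
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]  # (train, validation)

def five_fold_macro_average(examples, evaluate_fold, k=5, seed=42):
    """Macro-average a user-supplied metric over k folds (as reported for PMC)."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        scores.append(evaluate_fold(train, test))  # train on k-1 folds, score the held-out fold
    return mean(scores)
```

Given a suitable `evaluate_fold` routine, a call like `five_fold_macro_average(pmc_examples, evaluate_fold)` would yield the macro-averaged cross-validation score in the style of the PMC result.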
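
The Experiment Setup row describes two corruption procedures: ELECTRA-style replacement of 15% of cells/headers with values sampled by corpus frequency, and corruption of 30% of node-hyperedge connections to create a contrastive view. The sketch below shows one plausible reading of those steps under simple assumptions (cells as strings, the hypergraph as a list of (node, hyperedge) pairs, and corruption interpreted as random dropping); the function names are illustrative and this is not the authors' implementation.

```python
import random

def electra_replace(cells, corpus_frequency, replace_prob=0.15, seed=0):
    """Replace ~15% of cell/header strings with values sampled by corpus frequency."""
    rng = random.Random(seed)
    values = list(corpus_frequency.keys())
    weights = list(corpus_frequency.values())
    corrupted, is_replaced = [], []
    for cell in cells:
        if rng.random() < replace_prob:
            corrupted.append(rng.choices(values, weights=weights, k=1)[0])
            is_replaced.append(1)  # ELECTRA target: replaced
        else:
            corrupted.append(cell)
            is_replaced.append(0)  # ELECTRA target: original
    return corrupted, is_replaced

def corrupt_incidence(connections, corruption_rate=0.30, seed=0):
    """Drop ~30% of (node, hyperedge) connections to form one augmented view
    (dropping is an assumption about how connections are corrupted)."""
    rng = random.Random(seed)
    n_drop = int(len(connections) * corruption_rate)
    dropped = set(rng.sample(range(len(connections)), n_drop))
    return [c for i, c in enumerate(connections) if i not in dropped]
```

The 15% and 30% rates and the frequency-based sampling come directly from the quoted setup; everything else (data layout, helper names) is assumed for illustration.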