Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
TabFact: A Large-scale Dataset for Table-based Fact Verification
Authors: Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, William Yang Wang
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform extensive experiments to investigate their performances: the best-achieved accuracy of both models are reasonable, but far below human performance. |
| Researcher Affiliation | Collaboration | University of California, Santa Barbara, CA, USA Tencent AI Lab, Bellevue, WA, USA |
| Pseudocode | Yes | Algorithm 1 Latent Program Search with Comments |
| Open Source Code | Yes | The data and code of the dataset are provided in https://github.com/wenhuchen/Table-Fact-Checking. |
| Open Datasets | Yes | To this end, we construct a large-scale dataset called TabFact with 16k Wikipedia tables as the evidence for 118k human-annotated natural language statements, which are labeled as either ENTAILED or REFUTED. The data and code of the dataset are provided in https://github.com/wenhuchen/Table-Fact-Checking. |
| Dataset Splits | Yes | We split the whole data roughly with 8:1:1 into train, validation, and test splits and shows their statistics in Table 1. Table 1: ... Val 12,792 |
| Hardware Specification | Yes | We finetune the model on a single TITAN X GPU with a mini-batch size of 6. ... We run the latent program search in a distributed fashion on three 64-core machines |
| Software Dependencies | No | The paper mentions using "open-source implementation of BERT" and "Transformer-based two-way encoder" but does not provide specific version numbers for these or other software libraries/frameworks (e.g., PyTorch, TensorFlow, Python version). |
| Experiment Setup | Yes | We finetune the model on a single TITAN X GPU with a mini-batch size of 6. The best performance is reached after about 3 hours of training (around 10K steps). ... For the discriminator model, we design two transformer-based encoders (3 layers, 128-dimension hidden embedding, and 4 heads at each layer) to encode the programs and statements, respectively. |
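
The Dataset Splits row quotes a rough 8:1:1 train/validation/test split. As a minimal sketch of what such a split can look like, the snippet below shuffles a list of items with a fixed seed and cuts it at the 80% and 90% marks. The function name `split_8_1_1`, the seed, and the use of placeholder integer IDs are assumptions made for illustration; this is not the authors' splitting code, and the paper's actual split files are the ones distributed in the linked repository.

```python
import random


def split_8_1_1(items, seed=0):
    """Shuffle items and cut them into roughly 8:1:1 train/val/test parts.

    Hypothetical helper for illustration only; not taken from the TabFact code.
    """
    items = list(items)
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(items)

    n_train = len(items) * 8 // 10  # about 80% for training
    n_val = len(items) // 10        # about 10% for validation

    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder, about 10%
    return train, val, test


# Placeholder IDs stand in for real examples.
train, val, test = split_8_1_1(range(1000))
print(len(train), len(val), len(test))
```

Fixing the seed makes the partition repeatable across runs, which is the property a reproducibility check on dataset splits looks for. Whether the real split is drawn per table or per statement is not stated in the quoted text, so the sketch makes no claim about that.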