Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Accurate and Regret-Aware Numerical Problem Solver for Tabular Question Answering

Authors: Yuxiang Wang, Jianzhong Qi, Junhao Gan

AAAI 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results on two benchmark datasets show that TabLaP is substantially more accurate than the state-of-the-art models, improving the answer accuracy by 5.7% and 5.8% on the two datasets, respectively.
Researcher Affiliation	Academia	Yuxiang Wang, Jianzhong Qi*, Junhao Gan; School of Computing and Information Systems, The University of Melbourne
Pseudocode	Yes	Algorithm 1: Table Question Answering with TabLaP
Open Source Code	Yes	Code: https://github.com/yxw-11/TabLaP
Open Datasets	Yes	We conduct experiments to test the effectiveness of TabLaP on WikiTableQuestions (Pasupat and Liang 2015) and FTQ. WikiTableQuestions is a public dataset, while FTQ is adapted by us from the FeTaQA dataset (Nan et al. 2022) by removing answer tokens non-directly relevant to the questions.
Dataset Splits	Yes	Dataset (# QA pairs training / testing; # numerical questions training / testing): WTQ (11,321 / 4,344; 5,461 / 2,148); FTQ (2,000 / 1,245; 417 / 182); TabFact small (92,283 / 2,024; 16,956 / 368)
Hardware Specification	Yes	All experiments are run with two NVIDIA A100 80 GB GPUs on a cloud GPU server.
Software Dependencies	No	The paper mentions using a "Python interpreter", "GPT-3.5 Turbo as the backbone model of NumSolver", and "Llama3-8B-Instruct as AnsSelector", but does not specify version numbers for Python or any libraries used in the implementation of TabLaP.
Experiment Setup	Yes	We fine-tune the AnsSelector and TwEvaluator LLMs with the AdamW optimizer (Loshchilov and Hutter 2019) using a learning rate of 0.0002 and a weight decay 0.001. The maximum number of input tokens is 5,000, and the maximum number of epochs is 20.
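
The fine-tuning settings quoted in the Experiment Setup row can be collected in a small configuration object, which may help when trying to reproduce the run. This is only a sketch under stated assumptions: the names FineTuneConfig and optimizer_kwargs are hypothetical and do not come from the paper or its repository, and settings the row does not report (batch size, learning-rate schedule, and so on) are deliberately left out.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as quoted in the Experiment Setup row (hypothetical container)."""
    optimizer: str = "AdamW"
    learning_rate: float = 2e-4   # reported as 0.0002
    weight_decay: float = 1e-3    # reported as 0.001
    max_input_tokens: int = 5000  # reported as 5,000
    max_epochs: int = 20


def optimizer_kwargs(cfg: FineTuneConfig) -> dict:
    """Return keyword arguments named as an AdamW implementation such as
    torch.optim.AdamW expects them (lr, weight_decay)."""
    return {"lr": cfg.learning_rate, "weight_decay": cfg.weight_decay}


config = FineTuneConfig()
print(asdict(config))
print(optimizer_kwargs(config))
```

The dictionary returned by optimizer_kwargs could then be unpacked into an AdamW constructor together with the model parameters; the token and epoch limits would be enforced by whatever training loop is used.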