Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Accurate and Regret-Aware Numerical Problem Solver for Tabular Question Answering
Authors: Yuxiang Wang, Jianzhong Qi, Junhao Gan
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on two benchmark datasets show that TabLaP is substantially more accurate than the state-of-the-art models, improving the answer accuracy by 5.7% and 5.8% on the two datasets, respectively. |
| Researcher Affiliation | Academia | Yuxiang Wang, Jianzhong Qi*, Junhao Gan School of Computing and Information Systems, The University of Melbourne EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Table Question Answering with TabLaP |
| Open Source Code | Yes | Code https://github.com/yxw-11/TabLaP |
| Open Datasets | Yes | We conduct experiments to test the effectiveness of TabLaP on WikiTableQuestions (Pasupat and Liang 2015) and FTQ. WikiTableQuestions is a public dataset, while FTQ is adapted by us from the FeTaQA dataset (Nan et al. 2022) by removing answer tokens non-directly relevant to the questions. |
| Dataset Splits | Yes | QA pairs (training / testing): WTQ 11,321 / 4,344; FTQ 2,000 / 1,245; TabFact small 92,283 / 2,024. Numerical questions (training / testing): WTQ 5,461 / 2,148; FTQ 417 / 182; TabFact small 16,956 / 368. |
| Hardware Specification | Yes | All experiments are run with two NVIDIA A100 80 GB GPUs on a cloud GPU server. |
| Software Dependencies | No | The paper mentions using a "Python interpreter" and "GPT-3.5 Turbo as the backbone model of NumSolver", and "Llama3-8B-Instruct as AnsSelector", but does not specify version numbers for Python or any libraries used in the implementation of TabLaP. |
| Experiment Setup | Yes | We fine-tune the AnsSelector and TwEvaluator LLMs with the AdamW optimizer (Loshchilov and Hutter 2019) using a learning rate of 0.0002 and a weight decay 0.001. The maximum number of input tokens is 5,000, and the maximum number of epochs is 20. |
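
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object. The following is a minimal sketch, not the authors' code: the class name `FineTuneConfig`, its field names, and the `optimizer_kwargs` helper are hypothetical and chosen only for illustration. The values themselves are taken from the row above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the fine-tuning values quoted above."""

    learning_rate: float = 2e-4    # "learning rate of 0.0002"
    weight_decay: float = 1e-3     # "weight decay 0.001"
    max_input_tokens: int = 5000   # "maximum number of input tokens is 5,000"
    max_epochs: int = 20           # "maximum number of epochs is 20"

    def optimizer_kwargs(self) -> dict:
        # Assumption: the optimizer is built with an AdamW implementation that
        # takes `lr` and `weight_decay` keyword arguments (as torch.optim.AdamW does).
        return {"lr": self.learning_rate, "weight_decay": self.weight_decay}


config = FineTuneConfig()
print(config.optimizer_kwargs())
```

The report does not state which training framework was used, so the sketch stops short of constructing an optimizer; the returned dictionary would be passed as keyword arguments to whichever AdamW implementation the training script relies on.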