Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Multimodal Tabular Reasoning with Privileged Structured Information

Authors: Jun-Peng Jiang, Yu Xia, Hai-Long Sun, Shiyin Lu, Qingguo Chen, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye

NeurIPS 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that, with limited (9k) data, TURBO achieves state-of-the-art performance (+7.2% vs. previous SOTA) across multiple datasets. |
| Researcher Affiliation | Collaboration | 1 School of Artificial Intelligence, Nanjing University, China; 2 National Key Laboratory for Novel Software Technology, Nanjing University, China; 3 AI Business, Alibaba Group. |
| Pseudocode | No | The paper describes methods and processes in narrative text and figures, but does not include a clearly labeled pseudocode block or algorithm. |
| Open Source Code | Yes | The data is available. The code is easy to follow according to existing repositories. |
| Open Datasets | Yes | Specifically, for Tabular Question Answering (TQA) tasks, we use five representative datasets: TABMWP [41], WTQ [52], HiTab [12], and TAT-QA [94]... For Table Fact Verification (TFV) tasks, we include TabFact [10] and InfoTabs [23]... Additionally, we conduct experiments on MMMU [86]. |
| Dataset Splits | Yes | We randomly sample 100 examples from each dataset for testing... We randomly extract 2,000 instances, resulting in a combined dataset of 10,000 training instances... After reject sampling, we obtain approximately 9k high-quality examples that serve as reliable supervision for enhancing multimodal tabular reasoning. |
| Hardware Specification | Yes | During the SFT stage, we train our model using 4 A100 GPUs for 1 hour... In the RL stage, we further enhance the model's reasoning ability by applying GRPO with 8 A100 GPUs for 24 hours... 4 A100-80G GPUs... 8 A100-80G GPUs. |
| Software Dependencies | No | The paper mentions using 'Ovis2 [42]' as the base model but does not specify software dependencies like Python, PyTorch, or CUDA versions. |
| Experiment Setup | Yes | During the SFT stage, we train our model using 4 A100 GPUs for 1 hour with a batch size of 128 and a learning rate of 2e-6... We use a smaller learning rate of 5e-7 with the same batch size of 128. For each question, the model generates 16 candidate responses, and we apply a CLIP range of 0.2 to stabilize reward scaling. The sequence lengths are extended to 2000 (text) and 4596 (multimodal). We retain the AdamW optimizer and a warmup ratio of 0.1 across both stages, with no weight decay. |
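
The two-stage setup quoted in the Experiment Setup and Hardware Specification rows can be collected into a small configuration sketch. This is only an illustration: the class name, field names, and the `gpu_hours` helper are hypothetical and are not taken from the paper or its code. Only the values come from the rows above. Settings the rows do not state for a stage are left as `None`, the extended sequence lengths are assumed to apply to the RL stage, and the "CLIP range" is assumed to be a PPO-style clipping range. The rows do not say whether the batch size of 128 is global or per device.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical container: names are illustrative, values are as reported."""
    name: str
    num_gpus: int                 # A100-80G GPUs, per the hardware row
    wall_clock_hours: float       # reported training duration
    learning_rate: float
    batch_size: int = 128         # same in both stages
    optimizer: str = "AdamW"      # retained across both stages
    warmup_ratio: float = 0.1     # retained across both stages
    weight_decay: float = 0.0     # "no weight decay"
    # Settings reported only for the RL stage; None means "not stated".
    num_candidates: Optional[int] = None
    clip_range: Optional[float] = None
    max_text_len: Optional[int] = None
    max_multimodal_len: Optional[int] = None


SFT = StageConfig(name="sft", num_gpus=4, wall_clock_hours=1, learning_rate=2e-6)

RL = StageConfig(
    name="grpo",
    num_gpus=8,
    wall_clock_hours=24,
    learning_rate=5e-7,
    num_candidates=16,        # candidate responses generated per question
    clip_range=0.2,           # assumed to be a PPO-style clipping range
    max_text_len=2000,        # assumed to apply to the RL stage
    max_multimodal_len=4596,  # assumed to apply to the RL stage
)


def gpu_hours(stage: StageConfig) -> float:
    """GPU-hours, assuming every GPU is busy for the whole reported duration."""
    return stage.num_gpus * stage.wall_clock_hours


if __name__ == "__main__":
    for stage in (SFT, RL):
        print(stage.name, gpu_hours(stage), "GPU-hours")
```

Under these figures, the two stages add up to 4 × 1 + 8 × 24 = 196 A100 GPU-hours, assuming the reported durations are wall-clock times with all GPUs in use.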