Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Table as a Modality for Large Language Models

Authors: Liyao Li, Chao Ye, Wentao Ye, Yifei Sun, Zhe Jiang, Haobo Wang, Jiaming Tian, Yiming Zhang, Ningtao Wang, Xing Fu, Gang Chen, Junbo Zhao

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical results on various benchmarking datasets, including HiTab, WikiTQ, WikiSQL, FeTaQA, and StructQA, have demonstrated significant improvements on generalization with an average relative gain of 42.65%. In this section, we will demonstrate the advantages of treating tables as an independent modality. Section 3.1 introduces our novel benchmark, StructQA, designed to evaluate LLMs' understanding of table structures and their robustness. Section 3.3 presents the performance gains of our approach across mainstream datasets and fine-tuning methods.
Researcher Affiliation	Collaboration	1 Zhejiang University, 2 Ant Group, 3 University of Michigan
Pseudocode	No	The paper describes the methodology using equations (Eq. 1 and Eq. 2) and descriptive text, detailing components like multiset functions and set attention blocks, but does not include a dedicated pseudocode or algorithm block.
Open Source Code	Yes	Footnote 3: Code and datasets are on https://github.com/liyaooi/TAMO. We also release code and datasets at https://github.com/liyaooi/TAMO.
Open Datasets	Yes	Benchmark: We introduce StructQA, the first open-source benchmark on robust tabular structure understanding. Our findings reveal that current LLMs struggle with this human-friendly task. Position: Our research represents a pioneering step in integrating tables as an independent modality into LLMs. Methodology: We explore the semantic alignment of tabular structures in LLMs' embedding space via ... Footnote 3: Code and datasets are on https://github.com/liyaooi/TAMO. Footnote 4: HiTab [Cheng et al., 2022], WikiTQ [Pasupat and Liang, 2015], WikiSQL [Zhong et al., 2017], and FeTaQA [Nan et al., 2022] and our proposed StructQA benchmark (Section 3.1).
Dataset Splits	Yes	We split the data into training, validation, and test sets with a ratio of 60%, 20%, and 20%, respectively. To establish the performance upper bounds of TAMO for different tasks independently, we trained separate instances from scratch on each task-specific training set and evaluated them on the corresponding test sets.
Hardware Specification	Yes	Experiments are conducted using 2 NVIDIA A100-80G GPUs.
Software Dependencies	No	The paper mentions using specific LLMs like 'Llama2-7b' and 'Mistral-7B' as backbones and 'AdamW' as the optimizer. However, it does not provide specific version numbers for these software components, nor for programming languages or other libraries.
Experiment Setup	Yes	The table encoder. We set the hidden dimension of the table encoder to 768 and use a 3-layer hypergraph transformer to model global table structure... In fine-tuning the LLM with LoRA, the lora_r parameter (dimension for LoRA update matrices) is set to 8, and the lora_alpha (scaling factor) is set to 16. The dropout rate is set to 0.05. In prompt tuning, the LLM is configured with 8 virtual tokens. The maximum text length is 1024. The number of max new tokens, i.e., the maximum number of tokens to generate, is 128... Optimization. We use the AdamW optimizer. We set the initial learning rate at 1e-5, with a weight decay of 0.05. The learning rate decays with a half-cycle cosine decay after the warm-up period. The batch size is 8, and the number of epochs is 10.
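
For readers who want to check the reported settings against the released code, the Dataset Splits and Experiment Setup rows can be gathered into a small, standard-library-only sketch. This is not taken from the TAMO repository: the names SetupConfig, lr_at, and split_indices are hypothetical, the warm-up length and its linear shape are assumptions (the excerpt does not state them), and the decay floor of zero and the shuffled split with a fixed seed are likewise assumed.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SetupConfig:
    # Values quoted in the Experiment Setup row.
    encoder_hidden_dim: int = 768
    encoder_layers: int = 3
    lora_r: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.05
    num_virtual_tokens: int = 8
    max_text_length: int = 1024
    max_new_tokens: int = 128
    learning_rate: float = 1e-5
    weight_decay: float = 0.05
    batch_size: int = 8
    num_epochs: int = 10
    # ASSUMPTION: the warm-up length is not stated in the excerpt.
    warmup_epochs: float = 1.0


def lr_at(epoch: float, cfg: SetupConfig) -> float:
    """Learning rate at a (fractional) epoch.

    ASSUMPTION: linear warm-up from zero, then half-cycle cosine decay
    from the initial rate down to zero at the final epoch.
    """
    if epoch < cfg.warmup_epochs:
        return cfg.learning_rate * epoch / cfg.warmup_epochs
    progress = (epoch - cfg.warmup_epochs) / (cfg.num_epochs - cfg.warmup_epochs)
    return cfg.learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


def split_indices(n: int, seed: int = 0):
    """Shuffle range(n) and cut it 60/20/20 into train/validation/test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # ASSUMPTION: random split, fixed seed
    n_train = n * 6 // 10
    n_val = n * 2 // 10
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


if __name__ == "__main__":
    cfg = SetupConfig()
    for epoch in range(cfg.num_epochs + 1):
        print(epoch, lr_at(epoch, cfg))
    train, val, test = split_indices(1000)
    print(len(train), len(val), len(test))
```

Keeping the schedule as a pure function of the epoch makes it easy to compare against whatever scheduler the released code actually uses; only the warm-up parameters would need adjusting.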