Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables

Authors: Lanrui Wang, Mingyu Zheng, Hongyin Tang, Zheng Lin, Yanan Cao, Jingang Wang, Xunliang Cai, Weiping Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our comprehensive evaluation of various LLMs and multimodal LLMs reveals a substantial performance gap between popular downstream tabular tasks and the simpler NIAT task... With the help of NIAT benchmark, we evaluate a wide spectrum of LLMs and MLLMs including mainstream open-source models and close-sourced GPT-4o.
Researcher Affiliation	Collaboration	Lanrui Wang (1), Mingyu Zheng (1,2), Hongyin Tang (3), Zheng Lin (1,2), Yanan Cao (1,2), Jingang Wang (3), Xunliang Cai (3), Weiping Wang (1). (1) Institute of Information Engineering, Chinese Academy of Sciences; (2) School of Cyber Security, University of Chinese Academy of Sciences; (3) Meituan
Pseudocode	No	The paper describes methods in narrative text and figures (e.g., Figure 2 for benchmark construction pipeline, Figure 3 for attention patterns) but does not include any structured pseudocode or algorithm blocks.
Open Source Code	Yes	Our code and data are available at: https://github.com/wlr737/NeedleInATable
Open Datasets	Yes	To fill the gap in existing benchmarks and provide the community with the first long-context benchmark towards structured tables, it is most cost-effective to build such a benchmark based on tables from publicly available datasets... (1) Flat tables: we randomly select 250 tables from WTQ benchmark [19]... (2) Hierarchical tables: we randomly select 220 tables and 30 tables from HiTab [40] and AIT-QA [41] benchmarks respectively... We evaluate model performance on 4 downstream benchmarks. WTQ [19] and HiTab [40]... TabFact [20]... TabMWP [54].
Dataset Splits	Yes	Our benchmark contains 750 tables and up to 287K questions in total... we randomly select 60% table cells from each table as target cells to create NIAT queries... Finally, we obtain 142K cell-lookup queries and 145K cell-locating queries... In the end, 6K instruction-tuning data were constructed for cell-locating and cell-lookup tasks, respectively... For the Data Mixing Strategy, we randomly sample 8,400 NIAT training instances and 6,497 table-lookup question-answer pairs, combining them into the training dataset.
Hardware Specification	Yes	For evaluating all foundation models and the further fine-tuned models, we adopt the original generation configurations of the models and utilize the vLLM framework for deployment on a machine equipped with 8 NVIDIA A100-80G GPUs, as the input sequences can be extremely long.
Software Dependencies	No	To efficiently fine-tune Qwen2.5-7B-Instruct and Llama3.1-8B-Instruct on our synthetic data, we employ the NVIDIA Megatron framework... For evaluating all foundation models and the further fine-tuned models, we adopt the original generation configurations of the models and utilize the vLLM framework for deployment on a machine equipped with 8 NVIDIA A100-80G GPUs...
Experiment Setup	Yes	For all baselines, we adopt the zero-shot setting during evaluation... We add JSON output requirements in prompts of two tasks to minimize errors during answer parsing... Each model is fully fine-tuned on the mixed data for 2 epochs... To avoid out-of-memory errors caused by long-context input sequences, we configure tensor parallelism (tp) to 8 and model pipeline (pp) to 4.
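
The Dataset Splits row quotes the paper as randomly selecting 60% of the cells in each table as target cells and deriving cell-lookup and cell-locating queries from them. The following is a minimal sketch of that sampling step, not the authors' code. The function names, the query dictionary fields, the fixed seed, and the rule of one query of each type per target cell are all assumptions made for illustration; the reported totals (142K versus 145K) suggest the real pipeline does not pair them one-to-one.

```python
import random


def sample_target_cells(table, fraction=0.6, seed=0):
    """Randomly pick a fraction of (row, col) positions from a table.

    `table` is a list of rows, each row a list of cell strings.
    The seed and the rounding rule are illustrative assumptions.
    """
    positions = [(r, c) for r, row in enumerate(table) for c in range(len(row))]
    k = round(len(positions) * fraction)
    rng = random.Random(seed)
    return sorted(rng.sample(positions, k))


def build_queries(table, fraction=0.6, seed=0):
    """Build one hypothetical query of each type per sampled target cell."""
    queries = []
    for r, c in sample_target_cells(table, fraction, seed):
        # Assumed form of cell-lookup: given a position, return the content.
        queries.append(
            {"task": "cell-lookup", "row": r, "col": c, "answer": table[r][c]}
        )
        # Assumed form of cell-locating: given the content, return the position.
        queries.append(
            {
                "task": "cell-locating",
                "content": table[r][c],
                "answer": {"row": r, "col": c},
            }
        )
    return queries


if __name__ == "__main__":
    demo = [[f"r{r}c{c}" for c in range(2)] for r in range(5)]
    for query in build_queries(demo)[:4]:
        print(query)
```

Seeding a local `random.Random` instance keeps the sampled cells stable across runs without touching the global random state, which is convenient when a benchmark has to be regenerated.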