Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
TableRAG: Million-Token Table Understanding with Language Models
Authors: Si-An Chen, Lesly Miculicich, Julian Eisenschlos, Zifeng Wang, Zilong Wang, Yanfei Chen, Yasuhisa Fujii, Hsuan-Tien Lin, Chen-Yu Lee, Tomas Pfister
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Empirical Studies |
| Researcher Affiliation | Collaboration | ¹National Taiwan University, ²Google Cloud AI Research, ³Google DeepMind, ⁴UC San Diego |
| Pseudocode | Yes | The pseudocode and an answering example on ArcadeQA can be found in Alg. 1 and Fig. 8 respectively. Algorithm 1: TableRAG Algorithm |
| Open Source Code | Yes | The implementation and dataset will be available at https://github.com/google-research/google-research/tree/master/table_rag. |
| Open Datasets | Yes | We build two new million-token benchmarks sourced from the real-world Arcade [26] and BIRD-SQL [7] datasets. Additionally, to assess performance across various scales, we generated synthetic data expanding tables from the TabFact dataset to larger sizes, while maintaining consistent questions and key table content for evaluation. |
| Dataset Splits | No | The paper doesn't explicitly provide training/validation/test dataset splits with percentages or counts in the main text. It mentions using 'evaluation' and 'test' but not specific 'validation' splits. |
| Hardware Specification | No | Our experiments employ GPT-3.5-turbo [1], Gemini-1.0-Pro [19] and Mistral-Nemo-Instruct-2407 as LM solvers. In ablation study, we use GPT-3.5-turbo if not specified. We use OpenAI's text-embedding-3-large as the encoder for dense retrieval. |
| Software Dependencies | No | Our experiments employ GPT-3.5-turbo [1], Gemini-1.0-Pro [19] and Mistral-Nemo-Instruct-2407 as LM solvers. In ablation study, we use GPT-3.5-turbo if not specified. We use OpenAI's text-embedding-3-large as the encoder for dense retrieval. For TableRAG, we set the cell encoding budget B = 10,000 and the retrieval limit K = 5. For RandRowSampling and RowColRetrieval, we increase the retrieval limit to K = 30. |
| Experiment Setup | Yes | Our experiments employ GPT-3.5-turbo [1], Gemini-1.0-Pro [19] and Mistral-Nemo-Instruct-2407 as LM solvers. In ablation study, we use GPT-3.5-turbo if not specified. We use OpenAI's text-embedding-3-large as the encoder for dense retrieval. For TableRAG, we set the cell encoding budget B = 10,000 and the retrieval limit K = 5. For RandRowSampling and RowColRetrieval, we increase the retrieval limit to K = 30. Each experiment is conducted 10 times and evaluated by majority-voting to ensure the stability and consistency. The evaluation metric is the exact-match accuracy if not specified. |
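
The evaluation protocol quoted in the Experiment Setup row (repeated runs, majority voting, exact-match accuracy) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the case and whitespace normalization, and the tie-breaking rule are all assumptions, since the quoted text does not specify them.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(answer.strip().lower().split())


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent normalized answer across repeated runs.

    Assumption: ties resolve to the answer that appeared first.
    """
    if not answers:
        raise ValueError("need at least one answer to vote on")
    counts = Counter(normalize(a) for a in answers)
    return counts.most_common(1)[0][0]


def exact_match_accuracy(runs_per_question: list[list[str]], gold: list[str]) -> float:
    """Fraction of questions whose majority-voted answer equals the gold answer."""
    if not gold or len(runs_per_question) != len(gold):
        raise ValueError("predictions and gold answers must be non-empty and aligned")
    correct = sum(
        majority_vote(runs) == normalize(g)
        for runs, g in zip(runs_per_question, gold)
    )
    return correct / len(gold)


if __name__ == "__main__":
    # Hypothetical data: two questions, a few runs each (the paper uses 10 runs).
    runs = [["42", "42", "41", "42 "], ["Paris", "london", "London"]]
    gold = ["42", "Paris"]
    print(exact_match_accuracy(runs, gold))
```

In this sketch each inner list holds the answers one question received across repeated runs; voting happens per question before the exact-match comparison, so a single noisy run does not change the score.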