Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding
Authors: Zilong Wang, Hao Zhang, Chun-Liang Li, Julian Martin Eisenschlos, Vincent Perot, Zifeng Wang, Lesly Miculicich, Yasuhisa Fujii, Jingbo Shang, Chen-Yu Lee, Tomas Pfister
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed CHAIN-OF-TABLE on three public table understanding benchmarks: WikiTQ (Pasupat & Liang, 2015), FeTaQA (Nan et al., 2022), and TabFact (Chen et al., 2019). We conduct our experiments using PaLM 2 (Anil et al., 2023) and GPT-3.5 (Brown et al., 2020; OpenAI, 2023) as the backbone LLMs. |
| Researcher Affiliation | Collaboration | University of California, San Diego; Google Cloud AI Research; Google Research |
| Pseudocode | Yes | Algorithm 1: CHAIN-OF-TABLE Prompting |
| Open Source Code | No | The paper states: 'We run Text-to-SQL and Binder using the official open-sourced code and prompts in https://github.com/HKUNLP/Binder. We run Dater using the official open-sourced code and prompts in https://github.com/AlibabaResearch/DAMO-ConvAI.' This refers to the code for baseline methods, not the code for the CHAIN-OF-TABLE framework itself. |
| Open Datasets | Yes | We evaluate the proposed CHAIN-OF-TABLE on three public table understanding benchmarks: WikiTQ (Pasupat & Liang, 2015), FeTaQA (Nan et al., 2022), and TabFact (Chen et al., 2019). |
| Dataset Splits | Yes | We evaluate the proposed CHAIN-OF-TABLE on three public table understanding benchmarks: WikiTQ (Pasupat & Liang, 2015), FeTaQA (Nan et al., 2022), and TabFact (Chen et al., 2019). We incorporate few-shot demo samples from the training set into the prompts to perform in-context learning. We guarantee that all demo samples are from the training set so they are unseen during testing. |
| Hardware Specification | No | The paper mentions using 'PaLM 2' and 'GPT-3.5' as backbone LLMs but does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'PaLM 2-S' and 'GPT 3.5 (turbo-16k-0613)' as backbone LLMs, which are specific models, but it does not list other software dependencies (e.g., programming languages, libraries, frameworks) with version numbers. |
| Experiment Setup | Yes | We report the parameters and demo sample numbers we used in CHAIN-OF-TABLE in Table 7, 8 and 9. Overall, we annotate 29 samples and use them across different datasets. There are a large overlapping between the usage on different functions. For example, we use the same demo sample to introduce how to use f_add_column in the function DynamicPlan across different datasets. We guarantee that all demo samples are from the training set so they are unseen during testing. |