Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Relational Programming with Foundation Models
Authors: Ziyang Li, Jiani Huang, Jason Liu, Felix Zhu, Eric Zhao, William Dodds, Neelay Velingker, Rajeev Alur, Mayur Naik
AAAI 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate VIEIRA on 9 challenging tasks that span language, vision, and structured and vector databases. Our evaluation shows that programs in VIEIRA are concise, can incorporate modern foundation models, and have comparable or better accuracy than competitive baselines. |
| Researcher Affiliation | Academia | University of Pennsylvania |
| Pseudocode | No | The paper provides code snippets demonstrating the VIEIRA language and its foreign interface, but it does not include formal pseudocode blocks or algorithms for its internal workings or experimental procedures. |
| Open Source Code | Yes | Our framework, plugin library, and evaluations are open-source and available at https://github.com/scalloplang/scallop. |
| Open Datasets | Yes | Table 1 lists the datasets used, many of which are well-known public benchmarks with cited sources: Hotpot QA (Yang et al. 2018), CLUTRR (Sinha et al. 2019), GSM8K (Cobbe et al. 2021), Amazon ESCI (Reddy et al. 2022), GQA (Hudson and Manning 2019), CLEVR (Johnson et al. 2016). |
| Dataset Splits | No | The paper does not explicitly provide specific train/validation/test dataset splits (e.g., percentages, sample counts, or references to predefined splits) to reproduce the partitioning of the data. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU/GPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions software components and models (e.g., Python, GPT, CLIP) but does not provide specific version numbers for these software dependencies required for reproducibility. |
| Experiment Setup | Yes | Our solution leverages GPT-4 (5-shot) for extracting 3 relations: mentioned dates, duration between date labels, and the target date label. [...] Our solution for tracking shuffled objects relies on GPT-4 (1-shot) to extract 3 relations: initial possessions, swaps, and the target person whose final possessed object is expected as the answer. [...] Our solution to this task prompts GPT-4 (2-shot) to produce step-by-step expressions, which can contain constants, variables, and simple arithmetic operations. |