Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Bootstrapping Self-Improvement of Language Model Programs for Zero-Shot Schema Matching

Authors: Nabeel Seedat, Mihaela Van Der Schaar

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches, highlighting its potential to accelerate data integration and interoperability of ML-ready data. We conduct experiments on the MIMIC-OMOP and Synthea-OMOP datasets, which are the standard benchmark datasets used in prior schema matching works (Sheetrit et al., 2024; Zhang et al., 2023b; Narayan et al., 2022; Zhang et al., 2023a; 2021). These datasets are real-world healthcare schema matching datasets and have been widely adopted due to their complexity and their reflection of real-world schema matching challenges.
Researcher Affiliation	Collaboration	Nabeel Seedat 1 2 Mihaela van der Schaar 1 1Department of Applied Mathematics and Theoretical Physics, University of Cambridge 2Foundational Machine Learning Research, Thomson Reuters. Correspondence to: Nabeel Seedat <EMAIL>.
Pseudocode	Yes	Algorithm 1 Optimize LM program L 0: Input: Set of evaluation queries Deval = e1, e2, . . . , en 0: Output: Set of top n demonstrations Ddemo ... Algorithm 3 Matchmaker: Schema Matching with Self-Improving Compositional Language Model Programs Require: Source schema Ss, Target schema St Ensure: Schema matches M
Open Source Code	Yes	2https://github.com/seedatnabeel/Matchmaker or https://github.com/vanderschaarlab/Matchmaker
Open Datasets	Yes	We conduct experiments on the MIMIC-OMOP and Synthea-OMOP datasets, which are the standard benchmark datasets used in prior schema matching works (Sheetrit et al., 2024; Zhang et al., 2023b; Narayan et al., 2022; Zhang et al., 2023a; 2021). ... Open-source data: https://github.com/meni Data1/MIMIC_2_OMOP ... Open-source data: https://github.com/JZCS2018/SMAT/tree/main/datasets/omap/
Dataset Splits	Yes	Note there is no specific train-test sets used as in supervised learning. As we perform the schema matching task in a zero-shot manner. ... In our experiments, we assess two variants given that labeled training data for schema matching is hard to access: (i) 20-80: 20% train and 80% test and (ii) 50-50: 50% train and 50% test.
Hardware Specification	Yes	All experiments are run on a single Nvidia A4000 GPU with 20 GB of vram.
Software Dependencies	Yes	The model version used as the LLM was GPT-4-1106, with the following settings: ... We use Colbert-V2 (Santhanam et al., 2022) as the embedding model ... All LLM baselines use GPT-4 (0613) (Open AI, 2023) as the backbone for fair comparison to the original works and to isolate the gains of the system not tied to the LLM.
Experiment Setup	Yes	GPT-4 Hyper-parameters. The model version used as the LLM was GPT-4-1106, with the following settings: { temperature : 0.5, max_tokens : 1024, top_p : 1, frequency_penalty : 0, presence_penalty : 0, n : 1, }