Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

FLAME: A Fast Large-scale Almost Matching Exactly Approach to Causal Inference

Authors: Tianyu Wang, Marco Morucci, M. Usaid Awan, Yameng Liu, Sudeepa Roy, Cynthia Rudin, Alexander Volfovsky

JMLR 2021

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we study the quality and scalability of FLAME on synthetic and real data. The real datasets we use are the US Census 1990 dataset from the UCI Machine Learning Repository (Lichman, 2013) and the US 2010 Natality data (National Center for Health Statistics, NCHS, 2010). The bit-vector and SQL implementations are referred to as FLAME-bit and FLAME-db respectively. FLAME-bit was implemented using Python 2.7.13 and FLAME-db using Python, SQL, and Microsoft SQL Server 2016. We compared FLAME with several other (matching and non-matching) methods including: (1) one-to-one Propensity Score Nearest Neighbor Matching (1-PSNNM) (Ross et al., 2015), (2) 1-PSNNM with oracle variable selection, (3) Genetic Matching (Gen Match) (Diamond and Sekhon, 2013), (4) Causal Forest (Wager and Athey, 2018), (5) Mahalanobis Matching, (6) double linear regression, (7) BART (Chipman et al., 2010) and (8) CTMLE (Van Der Laan and Rubin, 2006). ... The computation time results are in Table 2a. The experiments were conducted on a Windows 10 machine with Intel(R) Core(TM) i7-6700 CPU processor (4 cores, 3.40GHz, 8M) and 32GB RAM.
Researcher Affiliation	Academia	Tianyu Wang, Marco Morucci, M. Usaid Awan, Yameng Liu, Sudeepa Roy, Cynthia Rudin, Alexander Volfovsky; Duke University
Pseudocode	Yes	Algorithm 1: FLAME Algorithm. Inputs: Input data $S^{ma} = (X, Y, T)$ for matching; training set $S^{tr} = (X^{tr}, Y^{tr}, T^{tr})$; model classes $F_1, F_2, \ldots, F_d$; stopping threshold $\epsilon$; tradeoff parameter $C$. Outputs: A sequence of selection indicators $\theta^0, \ldots, \theta^d$, and a set of matched groups $\{MG(\theta^l, S^l)\}_{l \ge 1}$. ▷ $S^l$ is defined in the algorithm. 1: Initialize $S^0 = S^{ma} = (X, Y, T)$, $\theta^0 = 1_{d \times 1}$, $l = 1$, run = True. ▷ $l$ is the index for iterations. 2: Compute exact matched groups $MG(\theta^0, S^0)$ as defined in (1). ▷ The detailed implementation is in Section 4. 3: while run = True do 4: Compute $\theta^l$ using (6) on training set $S^{tr}$, using $F_{d-l}$ and tradeoff parameter $C$. ▷ Determine which covariates to match on for this iteration. 5: Compute matched groups $MG(\theta^{l-1}, S^{l-1})$ as defined in (1). ▷ The detailed implementation is in Section 4. 6: $S^l = S^{l-1} \setminus MG(\theta^{l-1}, S^{l-1})$. ▷ These matched units are done. 7: if $\hat{PE}_{F_{d-l}}(\theta^l, S^{tr}) > \hat{PE}_{F_d}(1_{d \times 1}, S^{tr}) + \epsilon$ OR $S^l = \emptyset$ then 8: run = False ▷ Prediction error is too high to continue matching. 10: Output $\{\theta^l, MG(\theta^l, S^l)\}_{l \ge 1}$.
Open Source Code	Yes	The code for FLAME is available at https://cran.r-project.org/web/packages/FLAME/index.html (in R), and https://www.github.com/almost-matching-exactly/ (in Python). An introduction to the project with links to the code is also found at https://almost-matching-exactly.github.io/. Gupta et al. (2021) provides a short overview of the FLAME-DAME software package.
Open Datasets	Yes	The real datasets we use are the US Census 1990 dataset from the UCI Machine Learning Repository (Lichman, 2013) and the US 2010 Natality data (National Center for Health Statistics, NCHS, 2010).
Dataset Splits	Yes	We consider 20,000 units (10,000 control and 10,000 treated) generated with (9)... After eliminating units whose outcomes and/or treatment indicators are missing, there were 2.1M units, among which 75K units are treated units. ... We randomly sampled 10% of these units (122,089 units) as the training set.
Hardware Specification	Yes	The experiments were conducted on a Windows 10 machine with Intel(R) Core(TM) i7-6700 CPU processor (4 cores, 3.40GHz, 8M) and 32GB RAM.
Software Dependencies	Yes	FLAME-bit was implemented using Python 2.7.13 and FLAME-db using Python, SQL, and Microsoft SQL Server 2016. ... The other implementation leverages database management systems (e.g., PostgreSQL, 2016)
Experiment Setup	Yes	Four of the simulated experiments use data generated from special cases of the following (treatment $T \in \{0, 1\}$): $y = \sum_{i} \alpha_i x_i + T \sum_{i} \beta_i x_i + T \cdot U \sum_{1 \le i < \gamma \le 5} x_i x_\gamma + \epsilon$ (9). Here, $\alpha_i \sim N(10s, 1)$ with $s \sim \text{Uniform}\{-1, 1\}$, $\beta_i \sim N(1.5, 0.15)$, $U$ is a constant and $\epsilon \sim N(0, 0.1)$. This contains linear baseline effects and treatment effects, and a quadratic treatment effect term. ... We consider 20,000 units (10,000 control and 10,000 treated) generated with (9) where $U = 1$. ... In this experiment, Figure 1a is generated by stopping when the PE drops (starting from values within [-2, -1]) to below -20, resulting in more than 15,000 matches out of 20,000 units.
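
Illustrative sketch (not the authors' code): the Experiment Setup and Pseudocode rows quote a generative model, Equation (9), and an exact-matching step, MG(θ, S). The Python below shows roughly how such data could be simulated and how units could be grouped on a selected covariate subset. Several details are assumptions that the quoted text does not fix: covariates are drawn as Bernoulli(0.5), there are 10 of them, the second parameter of each Normal is read as a standard deviation, and a matched group is taken to be valid only when it holds at least one treated and one control unit. The function names `simulate_units` and `exact_matched_groups` are hypothetical and do not come from the FLAME packages.

```python
import random
from collections import defaultdict
from itertools import combinations


def simulate_units(n_control, n_treated, n_covariates=10, U=1.0, seed=0):
    """Draw (x, t, y) units from an outcome model shaped like Equation (9).

    Assumed here, not stated in the quoted text: Bernoulli(0.5) covariates,
    n_covariates=10, and Normal second parameters used as standard deviations.
    """
    rng = random.Random(seed)
    # alpha_i ~ N(10 s, 1) with s drawn uniformly from {-1, 1}
    alpha = [rng.gauss(10 * rng.choice((-1, 1)), 1.0) for _ in range(n_covariates)]
    # beta_i ~ N(1.5, 0.15)
    beta = [rng.gauss(1.5, 0.15) for _ in range(n_covariates)]
    n_quadratic = min(5, n_covariates)
    units = []
    for t in [0] * n_control + [1] * n_treated:
        x = tuple(rng.randint(0, 1) for _ in range(n_covariates))
        baseline = sum(a * xi for a, xi in zip(alpha, x))
        linear_effect = sum(b * xi for b, xi in zip(beta, x))
        # Quadratic treatment effect over pairs 1 <= i < gamma <= 5.
        quadratic_effect = U * sum(
            x[i] * x[g] for i, g in combinations(range(n_quadratic), 2)
        )
        y = baseline + t * (linear_effect + quadratic_effect) + rng.gauss(0, 0.1)
        units.append((x, t, y))
    return units


def exact_matched_groups(units, theta):
    """Group unit indices that agree on every covariate selected by theta.

    theta is a 0/1 list; a group is kept only if it contains both a treated
    and a control unit (an assumption about the paper's definition (1)).
    """
    buckets = defaultdict(list)
    for idx, (x, _t, _y) in enumerate(units):
        key = tuple(xi for xi, keep in zip(x, theta) if keep)
        buckets[key].append(idx)
    return {
        key: members
        for key, members in buckets.items()
        if {units[i][1] for i in members} == {0, 1}
    }


units = simulate_units(n_control=1000, n_treated=1000)
groups = exact_matched_groups(units, theta=[1] * 5 + [0] * 5)
matched = sum(len(members) for members in groups.values())
print(f"{len(groups)} groups cover {matched} of {len(units)} units")
```

This covers only data generation and a single matching pass. The covariate-selection step that computes each θ from the prediction error on the training set (step 4 of the quoted algorithm) is deliberately left out, because Equation (6) is not reproduced on this page.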