Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

LeapFactual: Reliable Visual Counterfactual Explanation Using Conditional Flow Matching

Authors: Zhuo Cao, Xuan Zhao, Lena Krieger, Hanno Scharr, Ira Assent

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We provide extensive experiments on benchmark and real-world datasets highlighting that LEAPFACTUAL generates accurate and in-distribution counterfactual explanations that offer actionable insights. We observe, for instance, that our reliable counterfactual samples with labels aligning to ground truth can be beneficially used as new training data to enhance the model. The proposed method is diversely applicable and enhances scientific knowledge discovery as well as non-expert interpretability. The code is available on https://github.com/caicairay/LeapFactual.
Researcher Affiliation	Academia	1 IAS-8, Forschungszentrum Jülich, Germany; 2 Munich Center for Machine Learning (MCML), LMU Munich, Germany; 3 Aarhus University, Denmark
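Pseudocode	Yes	Algorithm 1 LEAP. Input: flow matching model $v_\psi$ trained with CE-CFM objective, source point $z_{\text{source}}$ (at $t = 1$), current label $y_c$, target label $\hat{y}_c$, step sizes $\gamma_{\text{lift}}$, $\gamma_{\text{land}}$. Step 1: Lift from $Z_1$ to $Z_0$: $z_{y_c}(t) = z_{\text{source}} + \int_1^t \gamma_{\text{lift}}\, v_\psi(\tau, z_{y_c}(\tau), y_c)\, d\tau$, $t \in [0, 1]$; $z_{\text{lift}} \leftarrow z_{y_c}(0)$. Step 2: Land from $Z_0$ to $Z_1$: $z_{\hat{y}_c}(t) = z_{\text{lift}} + \int_0^t \gamma_{\text{land}}\, v_\psi(\tau, z_{\hat{y}_c}(\tau), \hat{y}_c)\, d\tau$, $t \in [0, 1]$; $z_{\text{land}} \leftarrow z_{\hat{y}_c}(1)$. Output: transported point $z_{\text{land}}$. Algorithm 2 LEAPFACTUAL. Input: source point $x$, target label $\hat{y}_c$, classifier $f_\theta$, generative model $g_\phi$, flow matching model $v_\psi$ trained with CE-CFM objective. Hyperparameters: number of blending leaps $N_b$, blending step size $\gamma_b$, number of injection leaps $N_i$, injection step sizes $\gamma_{i,\text{lift}} < \gamma_{i,\text{land}}$. Step 1: Preprocessing: $z \leftarrow g_\phi^{-1}(x)$ (acquiring $z$ via VAE encoder or GAN inversion); $y_c \leftarrow f_\theta(g_\phi(z))$ (determining the current label). Step 2: Information Blending (generating CE): for $j = 0$ to $N_b - 1$ do: $z \leftarrow \text{LEAP}(v_\psi, z, y_c, \hat{y}_c, \gamma_b, \gamma_b)$ (blending the source and target class information); $y_c \leftarrow f_\theta(g_\phi(z))$; end for. Step 3: (Optional) Information Injection (generating reliable CE): for $j = 0$ to $N_i - 1$ do: $z \leftarrow \text{LEAP}(v_\psi, z, y_c, \hat{y}_c, \gamma_{i,\text{lift}}, \gamma_{i,\text{land}})$ (injecting the target class information); $y_c \leftarrow f_\theta(g_\phi(z))$; end for. Step 4: Postprocessing: $x_{\text{CE}} = g_\phi(z)$. Output: transported point $x_{\text{CE}}$.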
Open Source Code	Yes	The code is available on https://github.com/caicairay/LeapFactual.
Open Datasets	Yes	We provide extensive experiments on benchmark and real-world datasets highlighting that LEAPFACTUAL generates accurate and in-distribution counterfactual explanations that offer actionable insights. We observe, for instance, that our reliable counterfactual samples with labels aligning to ground truth can be beneficially used as new training data to enhance the model. The proposed method is diversely applicable and enhances scientific knowledge discovery as well as non-expert interpretability. The code is available on https://github.com/caicairay/LeapFactual. 4.1 Quantitative Assessment: In this section, we compare the performance of our method against two competitors: the Opt-based and CGM-based methods. Morpho-MNIST [48] provides a modified version of MNIST specifically designed to benchmark representation learning. 4.2 Model Improvement: We show the advantages of reliable CEs in model training on the Galaxy10 DECaLS dataset [49, 50], a 10-class galaxy morphology classification task. 4.3 Generalization: Table 4: Quantitative results for FFHQ using 1,024 samples. Last row reports results for randomly paired images. Appendix B.1.3 Model Improvement: This experiment deals with model improvement. Dataset: The Galaxy10 DECaLS dataset is publicly available: Galaxy10 DECaLS. It includes around 18,000 colored images and deals with a 10-class galaxy morphology classification task. The dataset is split into training and test sets with a fraction of 90% and 10%. The pre-processing includes random rotation augmentation, cropping to the center 150 × 150 pixels, and resizing to 128 × 128.
Dataset Splits	Yes	Dataset: The Galaxy10 DECaLS dataset is publicly available: Galaxy10 DECaLS. It includes around 18,000 colored images and deals with a 10-class galaxy morphology classification task. The dataset is split into training and test sets with a fraction of 90% and 10%. The pre-processing includes random rotation augmentation, cropping to the center 150 × 150 pixels, and resizing to 128 × 128.
Hardware Specification	Yes	We performed our experiments on a single node of a GPU server, which includes one NVIDIA A100 with 80GB of VRAM, and an AMD EPYC 7742 with 1TB RAM shared with the other nodes of the server.
Software Dependencies	No	The paper mentions several software components like Adam optimizer, VAE, GAN, PyTorch (implicitly through torchmetrics), CLIP model, StyleGAN3, and U-Net. However, it does not specify explicit version numbers for these software dependencies, which is required for a 'Yes' answer. For example, it mentions 'torchmetrics' but not its version.
Experiment Setup	Yes	Opt-based Method: For each batch of input, we optimize Equation (1) using the Adam optimizer with a learning rate of 0.2 for 1,000 epochs. The hyperparameter $\lambda$ is set to 0.0006 to mitigate gradient vanishing. LeapFactual: The architecture of the flow network is defined as follows: Linear(32 + 1 + 10, 64) → SiLU → Linear(64, 64) → SiLU → Linear(64, 64) → SiLU → Linear(64, 32), where the input dimensions 32, 1, and 10 correspond to the latent vector from the VAE, a time conditioning variable, and a one-hot encoded class label, respectively. The flow matching noise parameter $\sigma$ is set to 0. The model is trained using the Adam optimizer with a learning rate of 0.005 and a batch size of 256 for 30 epochs, taking approximately 80 seconds to complete. For this experiment, the hyperparameters of LEAPFACTUAL are set as follows: $\gamma_b = 0.1$ and $N_b = 15$ for blending, and $\gamma_{i,\text{lift}} = 0$, $\gamma_{i,\text{land}} = 0.1$, and $N_i = 5$ for injection. We train a weak classifier on 20% of the dataset and use a second VGG model architecture trained on 100% as baseline. We generate standard and reliable CEs (depicted in Figure 6) regarding all classes other than the original for each image in the training subset, resulting in two auxiliary datasets assuming CE target labels as ground truth. We then blend varying fractions of these auxiliary datasets with the original training subset, used to train the weak classifier (20%), to evaluate their impact on the model's classification performance. We train a 1D U-Net [56] for 120 epochs. A total of 20K images are randomly sampled from StyleGAN3 and projected into the w-space [51], which serves as input. The corresponding predicted labels from the CLIP model are used as conditional inputs. To train the model, we first sample 20,000 random latent vectors from a Gaussian distribution and map them to the w-space as the input. Then we use the classifier output of corresponding images as the condition of the U-Net. We use the Adam optimizer to train the model for 120 epochs with a batch size of 32, a learning rate of $2 \times 10^{-4}$, and a weight decay of $1 \times 10^{-5}$. Training is performed on a single NVIDIA A100 GPU and takes approximately 22 hours. The hyperparameters for blending are $\gamma_b = 0.8$ and $N_b = 10$. For information injection, we set $\gamma_{i,\text{lift}} = 0.8$, $\gamma_{i,\text{land}} = 0.83$, and $N_i = 5$. The flow matching $\sigma$ is set to $10^{-4}$.
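
Illustrative sketch (not taken from the paper's repository): the lift and land steps of Algorithm 1 and the blending and injection loops of Algorithm 2, as quoted in the Pseudocode row, can be approximated with simple Euler integration. Everything below is an assumption made for illustration: the 2-D class centers, the `toy_velocity` field standing in for the trained flow model, the nearest-center `toy_classifier` standing in for the classifier applied to the decoded sample, and the fixed number of Euler steps. Only the step sizes and leap counts (0.1 and 15 for blending; 0, 0.1, and 5 for injection) are taken from the Experiment Setup row.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Velocity = Callable[[float, Sequence[float], int], Vector]

# Hypothetical 2-D class centers standing in for learned latent classes.
CENTERS: Dict[int, Vector] = {0: [-1.0, 0.0], 1: [1.0, 0.0]}


def toy_velocity(t: float, z: Sequence[float], label: int) -> Vector:
    """Toy stand-in for the flow model: pulls z toward the center of `label`."""
    return [c - zi for c, zi in zip(CENTERS[label], z)]


def toy_classifier(z: Sequence[float]) -> int:
    """Toy stand-in for the classifier on the decoded sample: nearest center."""
    return min(CENTERS, key=lambda k: math.dist(CENTERS[k], z))


def integrate(v: Velocity, z: Sequence[float], label: int,
              t_start: float, t_end: float, gamma: float,
              n_steps: int = 50) -> Vector:
    """Euler-integrate dz/dt = gamma * v(t, z, label) from t_start to t_end."""
    dt = (t_end - t_start) / n_steps  # negative when integrating backward
    z = list(z)
    t = t_start
    for _ in range(n_steps):
        vel = v(t, z, label)
        z = [zi + dt * gamma * vi for zi, vi in zip(z, vel)]
        t += dt
    return z


def leap(v: Velocity, z: Sequence[float], y_cur: int, y_tgt: int,
         gamma_lift: float, gamma_land: float) -> Vector:
    """One leap: lift from t=1 to t=0 under y_cur, then land under y_tgt."""
    z_lift = integrate(v, z, y_cur, 1.0, 0.0, gamma_lift)
    return integrate(v, z_lift, y_tgt, 0.0, 1.0, gamma_land)


def leapfactual_latent(z: Sequence[float], y_tgt: int, n_b: int,
                       gamma_b: float, n_i: int, gamma_i_lift: float,
                       gamma_i_land: float) -> Tuple[Vector, Vector]:
    """Blending then injection in latent space; encoder and decoder omitted."""
    z = list(z)
    y_cur = toy_classifier(z)
    for _ in range(n_b):  # information blending
        z = leap(toy_velocity, z, y_cur, y_tgt, gamma_b, gamma_b)
        y_cur = toy_classifier(z)
    z_blend = z
    for _ in range(n_i):  # optional information injection
        z = leap(toy_velocity, z, y_cur, y_tgt, gamma_i_lift, gamma_i_land)
        y_cur = toy_classifier(z)
    return z_blend, z


z_source = [-1.0, 0.0]
z_blend, z_inject = leapfactual_latent(
    z_source, y_tgt=1, n_b=15, gamma_b=0.1,
    n_i=5, gamma_i_lift=0.0, gamma_i_land=0.1,
)
print("after blending:", z_blend, "after injection:", z_inject)
```

In this toy setting, blending moves the point across the decision boundary and then largely stalls once the predicted label matches the target, while injection with a smaller lift than land step moves it further toward the target class center. The behaviour of the actual trained model may differ.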