Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Towards counterfactual fairness through auxiliary variables

Authors: Bowei Tian, Ziyao Wang, Shwai He, Wanghao Ye, Guoheng Sun, Yucong Dai, Yongkai Wu, Ang Li

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our evaluation, conducted on synthetic and real-world datasets, validates EXOC's superiority, showing that it outperforms state-of-the-art approaches in achieving counterfactual fairness. Our code is available at https://github.com/CASE-Lab-UMD/counterfactual_fairness_2025. 4 EXPERIMENTS 4.1 EXPERIMENT SETTINGS Baselines: To investigate the effectiveness of our framework in learning counterfactually fair predictors, we compare the proposed framework with multiple state-of-the-art methods. Evaluation Metrics: Generally speaking, the evaluation metrics consider two different aspects: prediction performance and counterfactual fairness. To measure the model prediction performance, we employ the widely used metrics Root Mean Square Error (RMSE) (Chai et al., 2014) and Mean Absolute Error (MAE) (Yuan, 2022) for regression tasks and accuracy for classification tasks. 4.2 BASELINE STUDY Baselines on synthetic datasets: For a better measurement of counterfactual fairness, we generate a synthetic dataset for each real-world dataset. Table 1: The comparison on synthetic datasets among Constant, Full, Unaware, Fair-K (Kusner et al., 2017), CLAIRE (Ma et al., 2023) and EXOC (Ours) on Law school (Krueger et al., 2021) and Adult (Becker & Kohavi, 1996) dataset. 4.3 ABLATION STUDY We perform an ablation study on γ, shown in Tab. 3. This experiment evaluates the effect of controlling fairness-accuracy balance, running on synthetic datasets.
Researcher Affiliation Academia Bowei Tian1, Ziyao Wang1, Shwai He1, Wanghao Ye1, Guoheng Sun1, Yucong Dai2, Yongkai Wu2, Ang Li1 1University of Maryland, College Park, 2Clemson University
Pseudocode No The paper describes methods and formulas but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code Yes Our code is available at https://github.com/CASE-Lab-UMD/counterfactual_fairness_2025.
Open Datasets Yes Table 1: The comparison on synthetic datasets among Constant, Full, Unaware, Fair-K (Kusner et al., 2017), CLAIRE (Ma et al., 2023) and EXOC (Ours) on Law school (Krueger et al., 2021) and Adult (Becker & Kohavi, 1996) dataset. Table 2: The comparison on real-world datasets among Constant, Full, Unaware, Fair-K (Kusner et al., 2017), CLAIRE (Ma et al., 2023) and EXOC (Ours) on Law school (Krueger et al., 2021) and Adult (Becker & Kohavi, 1996) dataset.
Dataset Splits No The paper mentions generating synthetic datasets and details about counterfactual generation are in Appendix C, but it does not specify explicit training, validation, or test dataset splits for the experiments in the main text. Appendix C is not provided.
Hardware Specification No The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies No The paper mentions "Details about implementations, including datasets, environments, and hyper-parameters, are in Appendix B." and cites PyTorch in its references, but it does not specify particular software dependencies with version numbers for reproducibility in the main text. Appendix B is not provided.
Experiment Setup Yes For baselines Full, Unaware, and Counterfactual Fairness Predictors, we use linear regression for regression and logistic regression for classification. We perform an ablation study on γ, shown in Tab. 3. This experiment evaluates the effect of controlling fairness-accuracy balance, running on synthetic datasets. The results show that as γ increases from 1 to 2, the performance gradually decreases, but the counterfactual fairness gradually increases. Also, we observe a better fairness-accuracy tradeoff, i.e., increased fairness without sacrificing much accuracy. We attribute this to the introduction of the auxiliary node S′, which serves as intrinsic information capable of deducing S. The result aligns with our theoretical analysis in Section 3.3, where the S′ node and the custom loss Lc(S′, S) can control the fairness-accuracy tradeoff. We observe that when γ = 1.2, there is generally an excellent balance between accuracy and fairness. Therefore, we set γ = 1.2 in our experiments.
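The prediction-performance metrics quoted above (RMSE and MAE) are standard and can be sketched in a few lines. This is a generic NumPy implementation for reference, not code from the paper's repository:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error: sqrt of the mean squared residual."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(y_true, y_pred):
    """Mean Absolute Error: mean of the absolute residuals."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))
```

RMSE penalizes large residuals more heavily than MAE, which is why papers typically report both for regression tasks.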
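The ablation describes γ as a knob trading prediction performance against the custom counterfactual-fairness loss. A minimal sketch of that weighting, assuming a simple additive objective (the function and argument names are illustrative, not taken from the paper's code):

```python
def total_loss(pred_loss, fairness_loss, gamma=1.2):
    """Hypothetical combined objective: prediction loss plus a
    gamma-weighted fairness term. gamma = 1.2 is the value the
    paper reports as a good accuracy/fairness balance; larger
    gamma emphasizes fairness at some cost to accuracy."""
    return pred_loss + gamma * fairness_loss
```

Under this reading, sweeping γ from 1 to 2, as in the ablation, shifts the optimum toward fairer but slightly less accurate predictors.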