Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Towards counterfactual fairness through auxiliary variables

Authors: Bowei Tian, Ziyao Wang, Shwai He, Wanghao Ye, Guoheng Sun, Yucong Dai, Yongkai Wu, Ang Li

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our evaluation, conducted on synthetic and real-world datasets, validates EXOC's superiority, showing that it outperforms state-of-the-art approaches in achieving counterfactual fairness. Our code is available at https://github.com/CASE-Lab-UMD/counterfactual_fairness_2025. 4 EXPERIMENTS 4.1 EXPERIMENT SETTINGS Baselines: To investigate the effectiveness of our framework in learning counterfactually fair predictors, we compare the proposed framework with multiple state-of-the-art methods. Evaluation Metrics: Generally speaking, the evaluation metrics consider two different aspects: prediction performance and counterfactual fairness. To measure the model prediction performance, we employ the widely used metrics Root Mean Square Error (RMSE) (Chai et al., 2014) and Mean Absolute Error (MAE) (Yuan, 2022) for regression tasks and accuracy for classification tasks. 4.2 BASELINE STUDY Baselines on synthetic datasets: For a better measurement of counterfactual fairness, we generate a synthetic dataset for each real-world dataset. Table 1: The comparison on synthetic datasets among Constant, Full, Unaware, Fair-K (Kusner et al., 2017), CLAIRE (Ma et al., 2023) and EXOC (Ours) on Law school (Krueger et al., 2021) and Adult (Becker & Kohavi, 1996) dataset. 4.3 ABLATION STUDY We perform an ablation study on γ, shown in Tab. 3. This experiment evaluates the effect of controlling fairness-accuracy balance, running on synthetic datasets.
Researcher Affiliation Academia Bowei Tian1, Ziyao Wang1, Shwai He1, Wanghao Ye1, Guoheng Sun1, Yucong Dai2, Yongkai Wu2, Ang Li1 1University of Maryland, College Park, 2Clemson University
Pseudocode No The paper describes methods and formulas but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code Yes Our code is available at https://github.com/CASE-Lab-UMD/counterfactual_fairness_2025.
Open Datasets Yes Table 1: The comparison on synthetic datasets among Constant, Full, Unaware, Fair-K (Kusner et al., 2017), CLAIRE (Ma et al., 2023) and EXOC (Ours) on Law school (Krueger et al., 2021) and Adult (Becker & Kohavi, 1996) dataset. Table 2: The comparison on real-world datasets among Constant, Full, Unaware, Fair-K (Kusner et al., 2017), CLAIRE (Ma et al., 2023) and EXOC (Ours) on Law school (Krueger et al., 2021) and Adult (Becker & Kohavi, 1996) dataset.
Dataset Splits No The paper mentions generating synthetic datasets and details about counterfactual generation are in Appendix C, but it does not specify explicit training, validation, or test dataset splits for the experiments in the main text. Appendix C is not provided.
Hardware Specification No The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies No The paper mentions "Details about implementations, including datasets, environments, and hyper-parameters, are in Appendix B." and cites PyTorch in its references, but it does not specify particular software dependencies with version numbers for reproducibility in the main text. Appendix B is not provided.
Experiment Setup Yes For baselines Full, Unaware, and Counterfactual Fairness Predictors, we use linear regression for regression and logistic regression for classification. We perform an ablation study on γ, shown in Tab. 3. This experiment evaluates the effect of controlling fairness-accuracy balance, running on synthetic datasets. The results show that as γ increases from 1 to 2, the performance gradually decreases, but the counterfactual fairness gradually increases. Also, we observe a better fairness-accuracy tradeoff, i.e., increased fairness without sacrificing much accuracy. We attribute this to the introduction of the auxiliary node S′, which serves as intrinsic information capable of deducing S. The result aligns with our theoretical analysis in Section 3.3, where the S′ node and the custom loss Lc(S′, S) can control the fairness-accuracy tradeoff. We observe that when γ = 1.2, there is generally an excellent balance between accuracy and fairness. Therefore, we set γ = 1.2 in our experiments.
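The prediction-performance metrics quoted above (RMSE and MAE) are standard and can be sketched in a few lines. This is a generic NumPy implementation for reference, not code from the paper's repository:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error: sqrt of the mean squared residual."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(y_true, y_pred):
    """Mean Absolute Error: mean of the absolute residuals."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))
```

RMSE penalizes large residuals more heavily than MAE, which is why papers typically report both for regression tasks.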
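The ablation describes γ as a knob trading prediction performance against the custom counterfactual-fairness loss. A minimal sketch of that weighting, assuming a simple additive objective (the function and argument names are illustrative, not taken from the paper's code):

```python
def total_loss(pred_loss, fairness_loss, gamma=1.2):
    """Hypothetical combined objective: prediction loss plus a
    gamma-weighted fairness term. gamma = 1.2 is the value the
    paper reports as a good accuracy/fairness balance; larger
    gamma emphasizes fairness at some cost to accuracy."""
    return pred_loss + gamma * fairness_loss
```

Under this reading, sweeping γ from 1 to 2, as in the ablation, shifts the optimum toward fairer but slightly less accurate predictors.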