Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Fair Data Representation for Machine Learning at the Pareto Frontier
Authors: Shizhou Xu, Thomas Strohmer
JMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical simulations underscore the advantages: (1) the pre-processing step is compositive with arbitrary conditional expectation estimation supervised learning methods and unseen data; (2) the fair representation protects the sensitive information by limiting the inference capability of the remaining data with respect to the sensitive data; (3) the optimal affine maps are computationally efficient even for high-dimensional data. |
| Researcher Affiliation | Academia | Shizhou Xu EMAIL Department of Mathematics University of California Davis Davis, CA 95616-5270, USA Thomas Strohmer EMAIL Department of Mathematics Center of Data Science and Artificial Intelligence Research University of California Davis Davis, CA 95616-5270, USA |
| Pseudocode | Yes | Algorithm 1: Pseudo-Barycenter Geodesics for Independent Variable Algorithm 2: Dependent (or Post-processing) Pseudo-Barycenter Geodesics |
| Open Source Code | Yes | The code for the results of our experiments is available online at: github.com/xushizhou/fair_data_representation |
| Open Datasets | Yes | We adopt the following metrics of accuracy and discrimination that are frequently used in fair machine learning experiments on various data sets: (1) For fair classification, the prediction accuracy and statistical disparity are quantified respectively by AUC (area under the Receiver Operating Characteristic curve) and Discrimination = max_{z,z' ∈ Z} \|P(Ŷ_z = 1) − P(Ŷ_{z'} = 1)\| as defined in [13]. (2) For univariate supervised learning, the prediction error and statistical disparity are quantified respectively by MSE (mean squared error, equivalent to the squared L2 norm on sample probability space) and the KS (Kolmogorov-Smirnov) distance as in [18], so that readers can compare the proposed methods indirectly with other methods tested in [13, 18, 44] and their references. (3) For univariate and multivariate supervised learning, the prediction error and statistical disparity are quantified respectively by the L2 and W2 (Wasserstein) distances, which are the quantifications the current work adopts to prove the Pareto frontier in the above sections. In addition, we perform tests on four benchmark data sets: CRIME, LSAC, Adult, COMPAS, which are also frequently used in fair learning experiments. |
| Dataset Splits | Yes | For all the test results, we apply 5-fold cross-validation with 50% training and 50% testing split, except for 90% training and 10% testing split in the linear regression test on LSAC due to the high computational cost of the post-processing Wasserstein barycenter method [18]. |
| Hardware Specification | Yes | As shown in the table above, the computational cost of the pseudo-barycenter method is significantly lower than the cost of the known post-processing methods: on average 7836 times faster on LSAC and 21 times faster on CRIME in a single train-test cycle for a single supervised learning model. Furthermore, in model selection or composition, the pre-processing time is a fixed one-time cost while the post-processing time is additive. |
| Software Dependencies | No | The paper mentions software components like "logistic regression", "random forest", "linear regression", and "artificial neural networks (ANN)" but does not provide specific version numbers for these or other libraries/frameworks used. |
| Experiment Setup | Yes | The two supervised learning methods we use are linear regression and artificial neural networks (ANN with 4 linearly stacked layers, where each of the first three layers has 32 units with ReLU activation while the last has 1 unit with linear activation). |
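
The pseudo-barycenter algorithms quoted in the Pseudocode row rest on two standard optimal-transport computations. The sketch below is an illustrative reconstruction under Gaussian assumptions, not the authors' released code: the fixed-point iteration for the Wasserstein-barycenter covariance (due to Alvarez-Esteban et al., reference [4] in the paper) and the optimal affine map between two Gaussians. Function names, the initialization, and the iteration count are our own choices.

```python
# Illustrative sketch (not the paper's Algorithm 1/2): Gaussian Wasserstein
# barycenter via the fixed-point iteration of Alvarez-Esteban et al., plus
# the optimal affine (Brenier) map between two Gaussians.
import numpy as np
from scipy.linalg import sqrtm

def barycenter_cov(covs, weights, iters=100):
    """Fixed-point iteration S <- S^{-1/2} (sum_z w_z (S^{1/2} C_z S^{1/2})^{1/2})^2 S^{-1/2}."""
    s = np.mean(covs, axis=0)  # any positive-definite initialization works
    for _ in range(iters):
        root = sqrtm(s)
        inner = sum(w * sqrtm(root @ c @ root) for w, c in zip(weights, covs))
        inv_root = np.linalg.inv(root)
        s = inv_root @ inner @ inner @ inv_root
    return np.real(s)

def affine_transport(x, m_src, c_src, m_dst, c_dst):
    """Optimal affine map sending N(m_src, c_src) to N(m_dst, c_dst)."""
    root = sqrtm(c_src)
    inv_root = np.linalg.inv(root)
    a = np.real(inv_root @ sqrtm(root @ c_dst @ root) @ inv_root)
    return (x - m_src) @ a.T + m_dst

# Example: barycenter covariance of two groups with equal weights.
covs = [np.array([[2.0, 0.3], [0.3, 1.0]]), np.array([[1.0, -0.2], [-0.2, 2.0]])]
sigma_bar = barycenter_cov(covs, [0.5, 0.5])
```

As we read the quoted algorithms, mapping each group's data through its affine map toward the barycenter yields the fair representation, with the "pseudo" in pseudo-barycenter referring to the use of sample means and covariances in place of exact ones.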
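The Open Datasets row quantifies disparity with a discrimination gap and a KS distance. Below is a minimal sketch of those two metrics on synthetic placeholder data; the arrays, threshold, and seed are hypothetical and not taken from the paper.

```python
# Sketch of the classification and regression disparity metrics quoted above:
# AUC, Discrimination = max_{z,z'} |P(Yhat_z = 1) - P(Yhat_z' = 1)|, and the
# Kolmogorov-Smirnov distance between per-group prediction distributions.
import numpy as np
from sklearn.metrics import roc_auc_score
from scipy.stats import ks_2samp

def discrimination(y_pred, z):
    """Largest gap in positive-prediction rates across sensitive groups."""
    rates = [np.mean(y_pred[z == g]) for g in np.unique(z)]
    return max(rates) - min(rates)

def ks_disparity(y_hat, z):
    """KS distance between the two groups' prediction distributions (binary z)."""
    g0, g1 = np.unique(z)[:2]
    return ks_2samp(y_hat[z == g0], y_hat[z == g1]).statistic

rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=1000)                 # hypothetical group labels
y_true = rng.integers(0, 2, size=1000)            # hypothetical ground truth
y_score = rng.random(1000) + 0.1 * z              # mildly group-dependent scores
y_pred = (y_score > 0.5).astype(int)
print(roc_auc_score(y_true, y_score), discrimination(y_pred, z), ks_disparity(y_score, z))
```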
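The Dataset Splits row describes "5-fold cross-validation with 50% training and 50% testing split". One plausible reading, sketched below, is five repeated random 50/50 splits rather than a classical 5-fold partition; the splitter choice, placeholder data, and seed are assumptions.

```python
# One possible interpretation of the quoted split protocol: five repeated
# random 50/50 train/test splits. X is a hypothetical feature matrix.
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.random.rand(200, 5)  # placeholder data
splitter = ShuffleSplit(n_splits=5, train_size=0.5, test_size=0.5, random_state=0)
for train_idx, test_idx in splitter.split(X):
    pass  # fit on X[train_idx], evaluate on X[test_idx]
```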
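The Experiment Setup row pins down the ANN architecture exactly. Here is a minimal PyTorch sketch of that network; the input width `d` and all training details (optimizer, loss, epochs) are unspecified in the excerpt and left as the caller's choice.

```python
# Sketch of the ANN from the Experiment Setup row: four stacked fully
# connected layers, the first three with 32 ReLU units, the last a single
# unit with linear activation.
import torch.nn as nn

def make_ann(d: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(d, 32), nn.ReLU(),
        nn.Linear(32, 32), nn.ReLU(),
        nn.Linear(32, 32), nn.ReLU(),
        nn.Linear(32, 1),  # linear output unit
    )
```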