Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Fairwashing: the risk of rationalization

Authors: Ulrich Aivodji, Hiromi Arai, Olivier Fortineau, Sébastien Gambs, Satoshi Hara, Alain Tapp

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we describe the experimental setting used to evaluate our rationalization algorithm as well as the results obtained.
Researcher Affiliation | Academia | 1 Université du Québec à Montréal, 2 RIKEN Center for Advanced Intelligence Project, 3 JST PRESTO, 4 ENSTA ParisTech, 5 Osaka University, 6 UdeM, 7 MILA.
Pseudocode | Yes | Algorithm 1 LaundryML
Open Source Code | Yes | All our experiments can be reproduced using the code provided in https://github.com/aivodji/LaundryML.
Open Datasets | Yes | We conduct our experiments on two real-world datasets that have been extensively used in the fairness literature due to their biased nature, namely Adult Income (Frank & Asuncion, 2010) and the ProPublica Recidivism (Angwin et al., 2016) datasets.
Dataset Splits | No | The paper states “We first split each dataset into three subsets, namely the training set, the suing group and the test set”, but does not explicitly mention a “validation set” or “validation split” as part of the dataset partitioning.
Hardware Specification | Yes | Experiments were conducted on an Intel Core i7 (2.90 GHz, 16GB of RAM).
Software Dependencies | No | The paper mentions implementation languages like “C++” and “Python”, and references external algorithms like “CORELS” and “Lasso enumeration algorithm”, but it does not provide specific version numbers for these languages or any key libraries, solvers, or packages used.
Experiment Setup | Yes | For the scenario (S1), we use regularization parameters with values within the following ranges λ = {0.005, 0.01} and β = {0.0, 0.1, 0.2, 0.5, 0.7, 0.9} for both datasets, yielding 12 experiments per dataset. For each of these experiments, we enumerate 50 models. For the scenario (S2), we use the regularization parameters λ = 0.005 and β = {0.1, 0.3, 0.5, 0.7, 0.9} for both datasets. ... we apply the k-nearest neighbour algorithm with k set to 10% of the size of the suing group.
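The experiment counts quoted in the Experiment Setup row can be checked with a few lines of Python. This is a minimal sketch, not the authors' code: the function names `experiment_grid` and `neighbourhood_size` are hypothetical, and rounding k down (with a floor of 1) is an assumption, since the quoted text only says that k is 10% of the suing group size.

```python
from itertools import product

# Regularization values quoted in the Experiment Setup row.
S1_LAMBDAS = [0.005, 0.01]
S1_BETAS = [0.0, 0.1, 0.2, 0.5, 0.7, 0.9]
S2_LAMBDAS = [0.005]
S2_BETAS = [0.1, 0.3, 0.5, 0.7, 0.9]
MODELS_PER_S1_EXPERIMENT = 50  # models enumerated per (lambda, beta) pair in S1


def experiment_grid(lambdas, betas):
    """Return every (lambda, beta) combination as a list of tuples."""
    return list(product(lambdas, betas))


def neighbourhood_size(suing_group_size, percent=10):
    """Return k as a percentage of the suing group size.

    Assumption: integer division (rounding down) with a minimum of 1;
    the source does not state how fractional values are handled.
    """
    return max(1, suing_group_size * percent // 100)


s1_grid = experiment_grid(S1_LAMBDAS, S1_BETAS)
s2_grid = experiment_grid(S2_LAMBDAS, S2_BETAS)

print(len(s1_grid))                             # 12 experiments per dataset in S1
print(len(s1_grid) * MODELS_PER_S1_EXPERIMENT)  # 600 enumerated models per dataset in S1
print(len(s2_grid))                             # 5 experiments per dataset in S2
```

The S1 grid is the Cartesian product of two λ values and six β values, which is where the stated 12 experiments per dataset come from; at 50 enumerated models each, that is 600 models per dataset.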