Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Fairness without Demographics through Knowledge Distillation
Authors: Junyi Chai, Taeuk Jang, Xiaoqian Wang
NeurIPS 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on three datasets show that our method outperforms state-of-the-art alternatives, with notable improvements in group fairness and with relatively small decrease in accuracy. |
| Researcher Affiliation | Academia | Junyi Chai, Taeuk Jang, Xiaoqian Wang Elmore Family School of Electrical and Computer Engineering Purdue University West Lafayette, IN 47906 EMAIL |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] |
| Open Datasets | Yes | New Adult: The Adult reconstruction dataset (Ding et al., 2021) contains 49,531 samples with 14 attributes. COMPAS: The COMPAS dataset (Larson et al., 2016) contains 7,215 samples with 11 attributes. Following previous works on fairness (Zafar et al., 2017), we only select black and white defendants in COMPAS dataset, and the modified dataset contains 6,150 samples. The goal is to predict whether a defendant reoffends within two years, and we choose sex and race as sensitive attributes. CelebA: The CelebA dataset (Liu et al., 2015) contains 202,599 face images, each of resolution 178 x 218, with 40 binary attributes. |
| Dataset Splits | Yes | To avoid large discrepancies in testing data, before each repetition, we randomly split data into 50% training data, 10% validation data and 40% test data. |
| Hardware Specification | Yes | We implement our method in PyTorch 1.10.1 with one NVIDIA RTX-3090 GPU. |
| Software Dependencies | Yes | We implement our method in PyTorch 1.10.1 with one NVIDIA RTX-3090 GPU. |
| Experiment Setup | Yes | We build the teacher model using ResNet-152 (He et al., 2016) and student model using ResNet-18 (He et al., 2016). For student model trained on softmax label, the temperature is tuned to find the best validation accuracy. The hyperparameters of comparing methods are tuned with binary search to find global minimum, as suggested in the original paper (Hashimoto et al., 2018). |
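The 50%/10%/40% random split quoted under "Dataset Splits" can be sketched as follows. This is a minimal illustration, not the authors' code; the function name `random_split` and the fixed seed are assumptions for the example.

```python
import numpy as np

def random_split(n, seed=0, fracs=(0.5, 0.1, 0.4)):
    # Shuffle all sample indices, then carve out
    # 50% train / 10% validation / 40% test, as in the paper.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(fracs[0] * n)
    n_val = int(fracs[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

Using a fresh seed before each repetition reproduces the paper's per-repetition resampling while keeping each run deterministic.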
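The "Experiment Setup" row mentions training the student on temperature-softened softmax labels from the teacher. A generic sketch of the standard temperature-scaled distillation loss (Hinton et al., 2015) is shown below in numpy for self-containment; it is not the authors' fairness-specific objective, and the function names are assumptions for the example.

```python
import numpy as np

def softened_probs(logits, T):
    # Temperature-scaled softmax: higher T flattens the distribution.
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    # KL(teacher || student) on softened distributions, averaged over
    # the batch and scaled by T^2 so gradients stay comparable across T.
    p = softened_probs(teacher_logits, T)
    q = softened_probs(student_logits, T)
    return T * T * np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1))
```

Tuning T on validation accuracy, as described in the row above, amounts to sweeping this single scalar and keeping the value with the best held-out performance.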