Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A reproducibility study of “User-item fairness tradeoffs in recommendations”

Authors: Sander Honig, Elyanne Oey, Lisanne Wallaard, Sharanda Suttorp, Clara Rus

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This reproducibility study focuses on examining the reproducibility of Greenwood et al. (2024) by replicating the experiments using their published code on the same arXiv dataset and Semantic Scholar data. Additionally, we extend their research in two ways: (i) verifying the generalizability of their findings on a different dataset (Amazon books reviews), and (ii) analyzing the tradeoffs when recommending multiple items to a user instead of a single item.
Researcher Affiliation | Academia | Sander Honig EMAIL University of Amsterdam; Elyanne Oey EMAIL University of Amsterdam; Lisanne Wallaard EMAIL University of Amsterdam; Sharanda Suttorp EMAIL University of Amsterdam; Clara Rus EMAIL University of Amsterdam
Pseudocode | No | The paper describes mathematical equations for the theoretical framework (Equations 1–5) but does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | The GitHub repository containing the code discussed in this paper can be found at the following link: https://github.com/sanderhonig/RE-User-item-fairness-tradeoffs-in-recommendations.
Open Datasets | Yes | In this paper, we use the arXiv dataset to reproduce and extend the work of Greenwood et al. (2024). Furthermore, we use the Amazon books reviews dataset to investigate how their findings generalize to a different domain, namely book e-commerce. The arXiv dataset is accessible via https://www.kaggle.com/datasets/Cornell-University/arxiv; the Amazon books reviews dataset via https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews.
Dataset Splits | Yes | The train and test sets are created based on a paper's year of publication: all papers published before 2020 are selected for the train set, and all papers from 2020 are selected for the test set, resulting in 255,138 and 65,948 papers respectively. For the training set, we considered books published before 2007. This training set was used to create embeddings for the reviewers, who represent the users in the recommendation system. It contains 30,506 books and 475,382 reviews, contributed by 237,310 unique reviewers. The test set included books published from 2007 until the end of 2011, and contained 14,548 books. ...we significantly reduce the test set available for logistic regression down to 1/12th of our initial test set by sampling proportionally to the existing subcategories, visualized in Appendix A.1 Figure 6. This resulted in successfully gathering additional information for 2,188 papers...
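The year-based split quoted above can be sketched in a few lines. This is a hedged illustration, not the authors' code: the record layout, the `year` field name, and the toy papers are assumptions.

```python
# Hypothetical sketch of the year-based train/test split described in the quote:
# papers published before 2020 go to train, papers from 2020 go to test.
papers = [
    {"id": "p1", "year": 2018},  # illustrative records, not the arXiv dataset
    {"id": "p2", "year": 2019},
    {"id": "p3", "year": 2020},
]

train = [p for p in papers if p["year"] < 2020]   # published before 2020
test = [p for p in papers if p["year"] == 2020]   # published in 2020

print(len(train), len(test))  # 2 1
```

On the real dataset the same filter would yield the 255,138 / 65,948 counts quoted above.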
Hardware Specification | Yes | For the experiments described in Section 3.3, we used a node containing nine cores of the Intel Xeon Platinum 8360Y, an NVIDIA A100 GPU, and 60GB of DRAM.
Software Dependencies | No | The paper mentions using "scikit-learn's TFIDF vectorizer" and "Python's statsmodels module" but does not provide specific version numbers for these software components or Python itself.
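Because the paper reports no version numbers, anyone re-running it can at least record the environment they actually used. A minimal stdlib-only sketch (the helper `record_versions` is hypothetical; the package names are the two the paper mentions, and the printed versions reflect the local machine, not the authors' setup):

```python
import platform
import importlib.metadata as md

def record_versions(packages):
    """Map each package name to its installed version, or None if not installed."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = md.version(name)
        except md.PackageNotFoundError:
            versions[name] = None  # package absent in this environment
    return versions

# Packages named in the paper; values depend on the local environment.
print(record_versions(["scikit-learn", "statsmodels"]))
```

Shipping such a record (or a pinned requirements file) alongside the code would close the gap flagged in this row.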
Experiment Setup | Yes | For heterogeneous users, we sampled 40 authors and 20 papers out of the entire test set. Then, for 50 values of γ between 0 and 1, U is calculated and plotted. In total 10 curves are calculated, after which the mean and (two) standard deviations are plotted. For homogeneous users, all authors are first grouped into 25 clusters. ...The second experiment examines the difference in user-item fairness tradeoff between users for whom preference data is present and cold start users. The latter category is constructed by treating 10% of the sampled users as cold start users by removing their embedding. ...For robustness, we performed logistic regression three times with random seeds 42, 999, and 123. ...we tested 50 values of γ between 0 and 1, clustered the 1,000 users into 25 clusters using the k-means algorithm for the homogeneous population for the first experiment, and again treated 10% of the population as misestimated for the second experiment. ...In this experiment, k is set to three and five...
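The γ grid and the cold-start construction described above can be sketched as follows. This is an illustrative, pure-Python sketch under stated assumptions: the user ids, the 4-dimensional random embeddings, and the seed are invented for the example, not taken from the paper.

```python
import random

# 50 evenly spaced values of gamma between 0 and 1, as in the quoted setup.
gammas = [i / 49 for i in range(50)]

# Hypothetical cold-start construction: treat 10% of the 40 sampled authors
# as cold-start users by removing their embedding.
random.seed(42)  # illustrative seed, not one of the paper's (42, 999, 123 were for logistic regression)
users = [f"u{i}" for i in range(40)]
embeddings = {u: [random.random() for _ in range(4)] for u in users}

n_cold = int(0.10 * len(users))            # 10% of the population
cold_start = set(random.sample(users, n_cold))
for u in cold_start:
    embeddings[u] = None                   # embedding removed -> cold start

print(len(gammas), len(cold_start))  # 50 4
```

A full reproduction would then, for each γ, compute the utility U per user group and average over the 10 sampled curves, as the quoted setup describes.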