Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Counterfactual Implicit Feedback Modeling

Authors: Chuan Zhou, Lina Yao, Haoxuan Li, Mingming Gong

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments are conducted on publicly available datasets to demonstrate the effectiveness of our approach. The code is available at https://github.com/zhouchuanCN/NeurIPS25-Counter-IF. 4 Experiments 4.1 Experimental Setup Datasets. To evaluate the performance of unbiased recommendations, we utilize two real-world datasets: Coat and Yahoo! R3. Each dataset includes both biased training data and an unbiased test set. 4.2 Performance Comparison We evaluate the performance of our proposed method against several baseline approaches on multiple datasets, as shown in Table 2. 4.3 Ablation Study We conduct ablation studies to validate the effectiveness of our model by examining the impact of the Wasserstein distance (Wass), pairwise loss (Pair), and pointwise loss (Point). Table 3 shows performance metrics (NDCG@K, Recall@K, MAP@K) on Yahoo and Coat datasets. 4.4 Sensitivity Analysis Threshold and proportion. Figure 3 shows the sensitivity of the model’s performance on the proportion of the HE group α and the proportion of the HU group β for samples in D0.
Researcher Affiliation	Academia	Chuan Zhou (1, 2), Lina Yao (3, 4), Haoxuan Li (5, 2), Mingming Gong (1, 2). Affiliations: (1) The University of Melbourne; (2) Mohamed bin Zayed University of Artificial Intelligence; (3) The University of New South Wales; (4) CSIRO’s Data61; (5) Peking University.
Pseudocode	No	The paper describes the methodology in detailed prose within Section 3 "Methodology" and its subsections, but it does not include a clearly labeled "Pseudocode" or "Algorithm" block, nor does it present the steps in a structured, code-like format.
Open Source Code	Yes	Extensive experiments are conducted on publicly available datasets to demonstrate the effectiveness of our approach. The code is available at https://github.com/zhouchuanCN/NeurIPS25-Counter-IF.
Open Datasets	Yes	Extensive experiments are conducted on publicly available datasets to demonstrate the effectiveness of our approach. ... To evaluate the performance of unbiased recommendations, we utilize two real-world datasets: Coat and Yahoo! R3. Each dataset includes both biased training data and an unbiased test set. ... We employed the preprocessing steps following previous studies [23, 36], which can be seen in the Appendix B.
Dataset Splits	Yes	For each dataset, the data was divided into training and test sets. A portion of 10% from the training set was randomly selected to serve as the validation set for hyperparameter tuning.
Hardware Specification	Yes	We conduct all experiments on a server with 112-core Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz. The server is equipped with a 512GB random access memory (RAM).
Software Dependencies	No	The paper does not specify any particular software versions (e.g., Python, PyTorch, TensorFlow, or specific libraries) used for implementation or experimentation. It describes the methods but not the software environment with version numbers.
Experiment Setup	Yes	Hyperparameter Tuning. For each dataset, the data was divided into training and test sets. A portion of 10% from the training set was randomly selected to serve as the validation set for hyperparameter tuning. Several key parameters were adjusted during this phase. The latent factor dimensions, representing user-item interactions, were explored within the range of 100 to 300, while the L2 regularization term was fine-tuned within [10^-7, 10^-3] for all models, and λ_point and λ_pair were tuned in {0.01, 0.1, 1, 10, 100}.
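
The validation split and search space quoted in the Dataset Splits and Experiment Setup rows can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the seed, and the specific grid points for the latent dimension and the L2 term are assumptions (the paper gives only ranges), and the L2 range is read here as 10^-7 to 10^-3.

```python
import itertools
import random


def split_train_validation(train_rows, val_fraction=0.1, seed=0):
    """Randomly hold out a fraction of the training rows for validation.

    The 10% fraction comes from the quoted setup; the seed is an assumption.
    """
    rng = random.Random(seed)
    indices = list(range(len(train_rows)))
    rng.shuffle(indices)
    n_val = int(round(val_fraction * len(train_rows)))
    val_idx = set(indices[:n_val])
    train = [row for i, row in enumerate(train_rows) if i not in val_idx]
    val = [row for i, row in enumerate(train_rows) if i in val_idx]
    return train, val


# Hypothetical discretization of the reported ranges. Only the lambda
# values are listed explicitly in the quoted text; the latent_dim and
# l2_reg grid points below are illustrative choices inside those ranges.
SEARCH_SPACE = {
    "latent_dim": [100, 200, 300],
    "l2_reg": [1e-7, 1e-6, 1e-5, 1e-4, 1e-3],
    "lambda_point": [0.01, 0.1, 1, 10, 100],
    "lambda_pair": [0.01, 0.1, 1, 10, 100],
}


def iter_configs(space):
    """Yield every combination in the search space as a dict."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))
```

Each configuration yielded by `iter_configs(SEARCH_SPACE)` would be trained on the reduced training set and scored on the held-out validation rows; whether the authors used an exhaustive grid or another search strategy is not stated in the quoted text.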