Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
STAIR: Manipulating Collaborative and Multimodal Information for E-Commerce Recommendation
Authors: Cong Xu, Yunhang He, Jun Wang, Wei Zhang
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments validate the superiority of STAIR in recommendation accuracy and efficiency. Note that although STAIR achieves state-of-the-art performance in e-commerce multimodal recommendation, it may not fully mine the raw multimodal features in content-driven scenarios such as news and video recommendation (Wu et al. 2020). |
| Researcher Affiliation | Academia | East China Normal University, Shanghai, China |
| Pseudocode | Yes | Algorithm 1: STAIR training procedures. |
| Open Source Code | Yes | Code: https://github.com/yhhe2004/STAIR |
| Open Datasets | Yes | We consider in this paper three commonly used e-commerce datasets obtained from Amazon reviews, including Baby, Sports, and Electronics. As suggested by (Zhang et al. 2021; Zhou and Shen 2023), we filter out users and items with fewer than 5 interactions, and Table 1 presents the dataset statistics after preprocessing. Each dataset contains item thumbnails and text descriptions (e.g., title, brand). Following (Zhou et al. 2023), the 4,096-dimensional visual features published in (Ni, Li, and McAuley 2019) and the 384-dimensional sentence embeddings published in (Zhou 2023) are used for experiments. |
| Dataset Splits | No | The paper mentions evaluating on a 'test set' and using 'validation NDCG@20 metric' for hyperparameter tuning, implying the existence of these splits. However, it does not provide explicit percentages, sample counts, or a detailed methodology for how the training, validation, and test sets were created or partitioned. |
| Hardware Specification | Yes | When dealing with larger datasets such as Electronics, the computational and memory requirements make MMSSL impossible to implement in real-world recommendations. [...] indicates that the method cannot be performed with an RTX 3090 GPU. |
| Software Dependencies | No | AdamW is employed as the optimizer for training STAIR, whose learning rate is searched from {1e-4, 5e-4, 1e-3, 5e-3} and the weight decay in the range of [0, 1]. While AdamW is mentioned, no specific version numbers for any software, libraries, or programming languages are provided. |
| Experiment Setup | Yes | For fairness (Zhang et al. 2021; Zhou and Shen 2023), we fix the embedding dimension to 64 and the number of convolutional layers to 3 for both FSC and BSC processes. As suggested in (Xu et al. 2024), AdamW is employed as the optimizer for training STAIR, whose learning rate is searched from {1e-4, 5e-4, 1e-3, 5e-3} and the weight decay in the range of [0, 1]. The exponent γ that controls the changing rate of the layer weights can be chosen from {0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5}. In addition, we adjust the number of neighbors km from {1, 3, 5, 10, 20} for each modality m ∈ M separately. For all methods, we report the results on the best checkpoint identified by the validation NDCG@20 metric over 500 epochs. |
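The preprocessing quoted under Open Datasets (filtering out users and items with fewer than 5 interactions) is the standard iterative 5-core filter used across the cited works. The sketch below is a minimal, generic illustration of that step, not the authors' code; the function name and input format are assumptions.

```python
from collections import Counter

def k_core_filter(interactions, k=5):
    """Iteratively drop users and items with fewer than k interactions.

    interactions: list of (user_id, item_id) pairs.
    The filter repeats until every remaining user and item has at least
    k interactions, because removing entries on one side can push
    counts on the other side below the threshold again.
    """
    pairs = list(interactions)
    while True:
        user_counts = Counter(u for u, _ in pairs)
        item_counts = Counter(i for _, i in pairs)
        kept = [(u, i) for u, i in pairs
                if user_counts[u] >= k and item_counts[i] >= k]
        if len(kept) == len(pairs):  # fixed point reached
            return kept
        pairs = kept
```

For example, on a dense block of 5 users × 5 items plus one user with only 2 interactions, the extra user is removed on the first pass and the dense block survives unchanged.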