On the Generalizability and Predictability of Recommender Systems
Authors: Duncan McElfresh, Sujay Khandagale, Jonathan Valverde, John Dickerson, Colin White
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we start by giving the first large-scale study of recommender system approaches by comparing 24 algorithms and 100 sets of hyperparameters across 85 datasets and 315 metrics. We run a large-scale study of recommender systems, showing that the best algorithm and hyper-parameters are highly dependent on the dataset and user-defined performance metric. In this section, we present a large-scale empirical study of rec-sys algorithms across a large, diverse set of datasets and metrics. |
| Researcher Affiliation | Collaboration | Duncan McElfresh (1), Sujay Khandagale (1), Jonathan Valverde (1,3), John P. Dickerson (2,3), Colin White (1); 1: Abacus.AI, 2: Arthur AI, 3: University of Maryland |
| Pseudocode | No | The paper describes the 'RecZilla Algorithm Selection Pipeline' in text, but does not provide a formal pseudocode block or algorithm figure (a hedged sketch of such a pipeline appears after the table). |
| Open Source Code | Yes | We not only release our code and pretrained RecZilla models, but also all of our raw experimental results, so that practitioners can train a RecZilla model for their desired performance metric: https://github.com/naszilla/reczilla. |
| Open Datasets | Yes | We run the algorithms on 85 datasets from 19 dataset families: Amazon [71], Anime [16], Book-Crossing [87], CiaoDVD [45, 59], Dating (Libimseti.cz) [59, 60], Epinions [67, 68], FilmTrust [44], Frappe [8], Gowalla [15], Jester2 [40], LastFM [11], MarketBias-Electronics and MarketBias-ModCloth [80], MovieTweetings [33], MovieLens [49], Netflix Prize [9], Recipes [66], WikiLens [37], and Yahoo [34]. |
| Dataset Splits | Yes | Each dataset's train, validation, and test split is based on leave-last-k-out (and our repository also includes splits based on global timestamp). (An illustrative leave-last-k-out split is sketched after the table.) |
| Hardware Specification | Yes | Each algorithm is allocated a 10 hour limit for each dataset split; we train and test the algorithm with at most 100 hyperparameter sets on an n1-highmem-2 Google Cloud instance, until the time limit is reached. Each neural network method is trained on each dataset using the default hyperparameters used in its respective paper, with a time limit of 15 hours on an NVIDIA Tesla T4 GPU. |
| Software Dependencies | No | The paper mentions using implementations from the codebase of Dacrema et al. [28] and cites the Surprise Python library [53], but it does not specify version numbers for these or any other software dependencies used in their experimental setup. |
| Experiment Setup | Yes | We use a random hyperparameter search for all methods, with the exception of neural network based methods. Since neural networks require far more resources to train (longer training time, and requiring GPUs), we use only the default hyperparameters for neural algorithms. For each non-neural algorithm, we expose several hyperparameters and give ranges based on common values. For each dataset, we run each algorithm on a random sample of up to 100 hyperparameter sets. Each algorithm is allocated a 10 hour limit for each dataset split... All neural network methods are trained with batch size 64, for up to 100 epochs; early stopping occurs if loss does not improve in 5 epochs. (An illustrative random-search loop reflecting these budgets is sketched after the table.) |
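
The Pseudocode row notes that the RecZilla algorithm-selection pipeline is described only in prose. For orientation, the sketch below shows the general shape of such a meta-learning pipeline: a meta-model maps dataset meta-features to predicted per-configuration performance, and the best-predicted configuration is selected for a new dataset. The function names and the choice of `RandomForestRegressor` as meta-learner are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def train_meta_model(meta_features, performance_matrix):
    """Fit a meta-model that maps dataset meta-features to the performance of
    each (algorithm, hyperparameter) configuration.

    meta_features:       shape (n_datasets, n_meta_features)
    performance_matrix:  shape (n_datasets, n_configurations), holding the
                         user-chosen performance metric for each configuration.
    """
    # RandomForestRegressor handles multi-output regression natively; it is an
    # illustrative choice of meta-learner, not necessarily the paper's.
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(meta_features, performance_matrix)
    return model


def select_configuration(model, new_meta_features, configuration_names):
    """Predict per-configuration performance on a new dataset and return the
    configuration with the highest predicted score."""
    predictions = model.predict(np.asarray(new_meta_features).reshape(1, -1))[0]
    best = int(np.argmax(predictions))
    return configuration_names[best], float(predictions[best])
```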
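
The Dataset Splits row references leave-last-k-out splitting. Below is a minimal sketch of one common way to implement it, assuming a pandas interaction log with `user_id` and `timestamp` columns; these column names, and the value of k, are assumptions for illustration rather than details taken from the released repository.

```python
import pandas as pd


def leave_last_k_out_split(interactions: pd.DataFrame, k: int = 1):
    """Split a user-item interaction log into train/val/test by holding out
    each user's most recent interactions (leave-last-k-out).

    The last k interactions per user go to test, the k before those go to
    validation, and everything earlier goes to train.
    """
    # Sort so the most recent interactions appear last within each user.
    df = interactions.sort_values(["user_id", "timestamp"])
    # Rank interactions from most recent (1) to oldest within each user.
    rank_from_end = df.groupby("user_id").cumcount(ascending=False) + 1
    test = df[rank_from_end <= k]
    val = df[(rank_from_end > k) & (rank_from_end <= 2 * k)]
    train = df[rank_from_end > 2 * k]
    return train, val, test
```

Calling `leave_last_k_out_split(df, k=1)` reproduces a leave-last-one-out protocol; larger k holds out more of each user's recent history.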
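
The Experiment Setup row reports a random hyperparameter search over at most 100 sampled configurations under a 10-hour wall-clock budget per dataset split. The sketch below mirrors that budgeted loop; `sample_hyperparameters` and `train_and_evaluate` are hypothetical placeholders, not functions from the released reczilla code.

```python
import random
import time

TIME_LIMIT_SECONDS = 10 * 60 * 60   # 10-hour budget per dataset split (from the paper)
MAX_HP_SETS = 100                   # at most 100 hyperparameter sets per algorithm/dataset


def sample_hyperparameters(space):
    """Draw one random configuration from a dict of {name: list of candidate values}."""
    return {name: random.choice(values) for name, values in space.items()}


def random_search(train_and_evaluate, dataset_split, space):
    """Evaluate randomly sampled hyperparameter sets until either the sample
    budget or the wall-clock budget is exhausted.

    `train_and_evaluate(dataset_split, hp_set)` is a hypothetical callable
    standing in for the actual training and evaluation code.
    """
    start = time.time()
    results = []
    for _ in range(MAX_HP_SETS):
        if time.time() - start > TIME_LIMIT_SECONDS:
            break  # respect the per-split time limit
        hp_set = sample_hyperparameters(space)
        results.append((hp_set, train_and_evaluate(dataset_split, hp_set)))
    return results
```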