On the Generalizability and Predictability of Recommender Systems
Authors: Duncan McElfresh, Sujay Khandagale, Jonathan Valverde, John Dickerson, Colin White
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we start by giving the first large-scale study of recommender system approaches by comparing 24 algorithms and 100 sets of hyperparameters across 85 datasets and 315 metrics. We run a large-scale study of recommender systems, showing that the best algorithm and hyper-parameters are highly dependent on the dataset and user-defined performance metric. In this section, we present a large-scale empirical study of rec-sys algorithms across a large, diverse set of datasets and metrics. |
| Researcher Affiliation | Collaboration | Duncan McElfresh (1), Sujay Khandagale (1), Jonathan Valverde (1,3), John P. Dickerson (2,3), Colin White (1); 1: Abacus.AI, 2: Arthur AI, 3: University of Maryland |
| Pseudocode | No | The paper describes the 'RecZilla Algorithm Selection Pipeline' in text, but does not provide a formal pseudocode block or algorithm figure (a hedged sketch of such a pipeline appears after the table). |
| Open Source Code | Yes | We not only release our code and pretrained RecZilla models, but also all of our raw experimental results, so that practitioners can train a RecZilla model for their desired performance metric: https://github.com/naszilla/reczilla. |
| Open Datasets | Yes | We run the algorithms on 85 datasets from 19 dataset families: Amazon [71], Anime [16], Book-Crossing [87], CiaoDVD [45, 59], Dating (Libimseti.cz) [59, 60], Epinions [67, 68], FilmTrust [44], Frappe [8], Gowalla [15], Jester2 [40], LastFM [11], MarketBias-Electronics and MarketBias-ModCloth [80], MovieTweetings [33], MovieLens [49], Netflix Prize [9], Recipes [66], WikiLens [37], and Yahoo [34]. |
| Dataset Splits | Yes | Each dataset's train, validation, and test split is based on leave-last-k-out (and our repository also includes splits based on global timestamp). (An illustrative leave-last-k-out split is sketched after the table.) |
| Hardware Specification | Yes | Each algorithm is allocated a 10 hour limit for each dataset split; we train and test the algorithm with at most 100 hyperparameter sets on an n1-highmem-2 Google Cloud instance, until the time limit is reached. Each neural network method is trained on each dataset using the default hyperparameters used in its respective paper, with a time limit of 15 hours on an NVIDIA Tesla T4 GPU. |
| Software Dependencies | No | The paper mentions using implementations from the codebase of Dacrema et al. [28] and cites the Surprise Python library [53], but it does not specify version numbers for these or any other software dependencies used in their experimental setup. |
| Experiment Setup | Yes | We use a random hyperparameter search for all methods, with the exception of neural network based methods. Since neural networks require far more resources to train (longer training time, and requiring GPUs), we use only the default hyperparameters for neural algorithms. For each non-neural algorithm, we expose several hyperparameters and give ranges based on common values. For each dataset, we run each algorithm on a random sample of up to 100 hyperparameter sets. Each algorithm is allocated a 10 hour limit for each dataset split... All neural network methods are trained with batch size 64, for up to 100 epochs; early stopping occurs if loss does not improve in 5 epochs. (An illustrative random-search loop reflecting these budgets is sketched after the table.) |
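
The Pseudocode row notes that the RecZilla algorithm-selection pipeline is described only in prose. For orientation, the sketch below shows the general shape of such a meta-learning pipeline: a meta-model maps dataset meta-features to predicted per-configuration performance, and the best-predicted configuration is selected for a new dataset. The function names and the choice of `RandomForestRegressor` as meta-learner are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def train_meta_model(meta_features, performance_matrix):
    """Fit a meta-model that maps dataset meta-features to the performance of
    each (algorithm, hyperparameter) configuration.

    meta_features:       shape (n_datasets, n_meta_features)
    performance_matrix:  shape (n_datasets, n_configurations), holding the
                         user-chosen performance metric for each configuration.
    """
    # RandomForestRegressor handles multi-output regression natively; it is an
    # illustrative choice of meta-learner, not necessarily the paper's.
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(meta_features, performance_matrix)
    return model


def select_configuration(model, new_meta_features, configuration_names):
    """Predict per-configuration performance on a new dataset and return the
    configuration with the highest predicted score."""
    predictions = model.predict(np.asarray(new_meta_features).reshape(1, -1))[0]
    best = int(np.argmax(predictions))
    return configuration_names[best], float(predictions[best])
```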
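
The Dataset Splits row references leave-last-k-out splitting. Below is a minimal sketch of one common way to implement it, assuming a pandas interaction log with `user_id` and `timestamp` columns; these column names, and the value of k, are assumptions for illustration rather than details taken from the released repository.

```python
import pandas as pd


def leave_last_k_out_split(interactions: pd.DataFrame, k: int = 1):
    """Split a user-item interaction log into train/val/test by holding out
    each user's most recent interactions (leave-last-k-out).

    The last k interactions per user go to test, the k before those go to
    validation, and everything earlier goes to train.
    """
    # Sort so the most recent interactions appear last within each user.
    df = interactions.sort_values(["user_id", "timestamp"])
    # Rank interactions from most recent (1) to oldest within each user.
    rank_from_end = df.groupby("user_id").cumcount(ascending=False) + 1
    test = df[rank_from_end <= k]
    val = df[(rank_from_end > k) & (rank_from_end <= 2 * k)]
    train = df[rank_from_end > 2 * k]
    return train, val, test
```

Calling `leave_last_k_out_split(df, k=1)` reproduces a leave-last-one-out protocol; larger k holds out more of each user's recent history.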
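
The Experiment Setup row reports a random hyperparameter search over at most 100 sampled configurations under a 10-hour wall-clock budget per dataset split. The sketch below mirrors that budgeted loop; `sample_hyperparameters` and `train_and_evaluate` are hypothetical placeholders, not functions from the released reczilla code.

```python
import random
import time

TIME_LIMIT_SECONDS = 10 * 60 * 60   # 10-hour budget per dataset split (from the paper)
MAX_HP_SETS = 100                   # at most 100 hyperparameter sets per algorithm/dataset


def sample_hyperparameters(space):
    """Draw one random configuration from a dict of {name: list of candidate values}."""
    return {name: random.choice(values) for name, values in space.items()}


def random_search(train_and_evaluate, dataset_split, space):
    """Evaluate randomly sampled hyperparameter sets until either the sample
    budget or the wall-clock budget is exhausted.

    `train_and_evaluate(dataset_split, hp_set)` is a hypothetical callable
    standing in for the actual training and evaluation code.
    """
    start = time.time()
    results = []
    for _ in range(MAX_HP_SETS):
        if time.time() - start > TIME_LIMIT_SECONDS:
            break  # respect the per-split time limit
        hp_set = sample_hyperparameters(space)
        results.append((hp_set, train_and_evaluate(dataset_split, hp_set)))
    return results
```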