Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares

Authors: Trevor Hastie, Rahul Mazumder, Jason D. Lee, Reza Zadeh

JMLR 2015 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | In this section we run some timing experiments on simulated and real datasets, and show performance results on the Netflix and MovieLens data. Figure 1 shows timing results on four datasets. The first three are simulation datasets of increasing size, and the last is the publicly available MovieLens 100K data. These experiments were all run in R using the softImpute package; see Section 7. |
| Researcher Affiliation | Collaboration | Trevor Hastie, Department of Statistics, Stanford University, CA 94305, USA; Rahul Mazumder, Department of Statistics, Columbia University, New York, NY 10027, USA; Jason D. Lee, Institute for Computational and Mathematical Engineering, Stanford University, CA 94305, USA; Reza Zadeh, Databricks, 2030 Addison Street, Suite 610, Berkeley, CA 94704, USA |
| Pseudocode | Yes | Algorithm 2.1: Rank-Restricted Soft SVD; Algorithm 3.1: Rank-Restricted Efficient Maximum-Margin Matrix Factorization: softImpute-ALS; Algorithm 5.1: softImpute-ALS; Algorithm 5.2: Alternating least squares (ALS) |
| Open Source Code | Yes | We have developed an R package softImpute for fitting these models (Hastie and Mazumder, 2013), which is available on CRAN. The implementation is available online at http://git.io/sparkfastals with documentation, in Scala. |
| Open Datasets | Yes | Figure 1 shows timing results on four datasets. The first three are simulation datasets of increasing size, and the last is the publicly available MovieLens 100K data. We used our softImpute package in R to fit a sequence of models on the Netflix competition data. |
| Dataset Splits | Yes | Here there are 480,189 users, 17,770 movies and a total of 100,480,507 ratings, making the resulting matrix 98.8% missing. There is a designated test set (the probe set), a subset of 1,408,395 of these ratings, leaving 99,072,112 for training. |
| Hardware Specification | Yes | We report iteration times using an Amazon EC2 cluster with 10 slaves and one master, of instance type c3.4xlarge. Each machine has 16 CPU cores and 30 GB of RAM. |
| Software Dependencies | No | We have developed an R package softImpute for fitting these models (Hastie and Mazumder, 2013), which is available on CRAN. We provide a distributed version of softImpute-ALS (given in Algorithm 5.1), built upon the Spark cluster programming framework. Where possible, hardware acceleration was used for local linear algebraic operations, via breeze and BLAS. |
| Experiment Setup | Yes | Each subplot in Figure 6.1 is labeled according to the size of the problem, the fraction missing, the value of λ used, the operating rank of the algorithms r, and the rank of the solution obtained. The sequence of twenty models took under six hours of computing on a Linux cluster with 300Gb of ram (with a fairly liberal relative convergence criterion of 0.001). |
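
As a quick sanity check on the figures quoted under Dataset Splits, the following minimal Python sketch recomputes the training-set size and the fraction of missing entries. Only the four counts come from the quoted text; the variable names are illustrative assumptions and do not appear in the paper.

```python
# Counts quoted in the Dataset Splits entry (Netflix competition data).
# Variable names are hypothetical; only the numbers come from the source.
n_users = 480_189
n_movies = 17_770
n_ratings = 100_480_507
n_probe = 1_408_395  # designated test set (the probe set)

# Ratings left for training once the probe set is held out.
n_train = n_ratings - n_probe

# Fraction of the users-by-movies matrix that has no rating.
n_cells = n_users * n_movies
frac_missing = 1 - n_ratings / n_cells

print(n_train)                               # 99072112
print(f"{100 * frac_missing:.1f}% missing")  # 98.8% missing
```

Both results agree with the quoted values of 99,072,112 training ratings and a matrix that is 98.8% missing.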