Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares
Authors: Trevor Hastie, Rahul Mazumder, Jason D. Lee, Reza Zadeh
JMLR 2015 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we run some timing experiments on simulated and real datasets, and show performance results on the Netflix and MovieLens data. Figure 1 shows timing results on four datasets. The first three are simulation datasets of increasing size, and the last is the publicly available MovieLens 100K data. These experiments were all run in R using the softImpute package; see Section 7. |
| Researcher Affiliation | Collaboration | Trevor Hastie, Department of Statistics, Stanford University, CA 94305, USA; Rahul Mazumder, Department of Statistics, Columbia University, New York, NY 10027, USA; Jason D. Lee, Institute for Computational and Mathematical Engineering, Stanford University, CA 94305, USA; Reza Zadeh, Databricks, 2030 Addison Street, Suite 610, Berkeley, CA 94704, USA |
| Pseudocode | Yes | Algorithm 2.1: Rank-Restricted Soft SVD; Algorithm 3.1: Rank-Restricted Efficient Maximum-Margin Matrix Factorization: softImpute-ALS; Algorithm 5.1: softImpute-ALS; Algorithm 5.2: Alternating least squares (ALS) |
| Open Source Code | Yes | We have developed an R package softImpute for fitting these models (Hastie and Mazumder, 2013), which is available on CRAN. The implementation is available online at http://git.io/sparkfastals with documentation, in Scala. |
| Open Datasets | Yes | Figure 1 shows timing results on four datasets. The first three are simulation datasets of increasing size, and the last is the publicly available MovieLens 100K data. We used our softImpute package in R to fit a sequence of models on the Netflix competition data. |
| Dataset Splits | Yes | Here there are 480,189 users, 17,770 movies and a total of 100,480,507 ratings, making the resulting matrix 98.8% missing. There is a designated test set (the probe set), a subset of 1,408,395 of these ratings, leaving 99,072,112 for training. |
| Hardware Specification | Yes | We report iteration times using an Amazon EC2 cluster with 10 slaves and one master, of instance type c3.4xlarge. Each machine has 16 CPU cores and 30 GB of RAM. |
| Software Dependencies | No | We have developed an R package softImpute for fitting these models (Hastie and Mazumder, 2013), which is available on CRAN. We provide a distributed version of softImpute-ALS (given in Algorithm 5.1), built upon the Spark cluster programming framework. Where possible, hardware acceleration was used for local linear algebraic operations, via breeze and BLAS. |
| Experiment Setup | Yes | Each subplot in Figure 6.1 is labeled according to the size of the problem, the fraction missing, the value of λ used, the operating rank of the algorithms r, and the rank of the solution obtained. The sequence of twenty models took under six hours of computing on a Linux cluster with 300 GB of RAM (with a fairly liberal relative convergence criterion of 0.001). |
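
The figures quoted in the Dataset Splits row can be checked with simple arithmetic. The following is a minimal sketch in Python; the variable names are illustrative only and are not taken from the paper or its software.

```python
# Counts as quoted in the Dataset Splits row (Netflix competition data).
n_users = 480_189
n_movies = 17_770
n_ratings = 100_480_507
n_probe = 1_408_395  # designated test set (the probe set)

# Ratings left for training once the probe set is held out.
n_train = n_ratings - n_probe

# Fraction of the users-by-movies matrix with no observed rating.
missing_fraction = 1 - n_ratings / (n_users * n_movies)

print(f"training ratings: {n_train:,}")
print(f"missing: {missing_fraction:.1%}")
```

Both values agree with the quoted text: 99,072,112 training ratings and a matrix that is about 98.8% missing.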