Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares

Authors: Trevor Hastie, Rahul Mazumder, Jason D. Lee, Reza Zadeh

JMLR 2015 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section we run some timing experiments on simulated and real datasets, and show performance results on the Netflix and MovieLens data. Figure 1 shows timing results on four datasets. The first three are simulation datasets of increasing size, and the last is the publicly available MovieLens 100K data. These experiments were all run in R using the softImpute package; see Section 7.
Researcher Affiliation	Collaboration	Trevor Hastie EMAIL Department of Statistics Stanford University, CA 94305, USA Rahul Mazumder EMAIL Department of Statistics Columbia University New York, NY 10027, USA Jason D. Lee EMAIL Institute for Computational and Mathematical Engineering Stanford University, CA 94305, USA Reza Zadeh EMAIL Databricks 2030 Addison Street, Suite 610 Berkeley, CA 94704, USA
Pseudocode	Yes	Algorithm 2.1 Rank-Restricted Soft SVD; Algorithm 3.1 Rank-Restricted Efficient Maximum-Margin Matrix Factorization: softImpute-ALS; Algorithm 5.1 softImpute-ALS; Algorithm 5.2 Alternating least squares (ALS)
Open Source Code	Yes	We have developed an R package softImpute for fitting these models (Hastie and Mazumder, 2013), which is available on CRAN. The implementation is available online at http://git.io/sparkfastals with documentation, in Scala.
Open Datasets	Yes	Figure 1 shows timing results on four datasets. The first three are simulation datasets of increasing size, and the last is the publicly available MovieLens 100K data. We used our softImpute package in R to fit a sequence of models on the Netflix competition data.
Dataset Splits	Yes	Here there are 480,189 users, 17,770 movies and a total of 100,480,507 ratings, making the resulting matrix 98.8% missing. There is a designated test set (the probe set), a subset of 1,408,395 of these ratings, leaving 99,072,112 for training.
Hardware Specification	Yes	We report iteration times using an Amazon EC2 cluster with 10 slaves and one master, of instance type c3.4xlarge. Each machine has 16 CPU cores and 30 GB of RAM.
Software Dependencies	No	We have developed an R package softImpute for fitting these models (Hastie and Mazumder, 2013), which is available on CRAN. We provide a distributed version of softimpute-ALS (given in Algorithm 5.1), built upon the Spark cluster programming framework. Where possible, hardware acceleration was used for local linear algebraic operations, via breeze and BLAS.
Experiment Setup	Yes	Each subplot in Figure 6.1 is labeled according to the size of the problem, the fraction missing, the value of λ used, the operating rank of the algorithms r, and the rank of the solution obtained. The sequence of twenty models took under six hours of computing on a Linux cluster with 300Gb of ram (with a fairly liberal relative convergence criterion of 0.001).
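
The figures quoted in the Dataset Splits row can be cross-checked with a few lines of arithmetic. The sketch below is only an illustrative consistency check, not code from the paper or from the softImpute package; the function name split_summary and the constant names are hypothetical, and the only inputs are the counts quoted in the table above.

```python
# Counts quoted in the "Dataset Splits" row (Netflix competition data).
# All names below are hypothetical and chosen for illustration only.
N_USERS = 480_189
N_MOVIES = 17_770
N_RATINGS = 100_480_507
N_PROBE = 1_408_395  # designated test set (the probe set)


def split_summary(n_users, n_movies, n_ratings, n_test):
    """Derive the training count and sparsity of a users-by-movies matrix."""
    n_cells = n_users * n_movies  # total entries in the full matrix
    return {
        "n_train": n_ratings - n_test,                  # ratings left for training
        "missing_fraction": 1.0 - n_ratings / n_cells,  # share of unobserved cells
        "test_fraction": n_test / n_ratings,            # probe share of all ratings
    }


summary = split_summary(N_USERS, N_MOVIES, N_RATINGS, N_PROBE)
print(f"training ratings: {summary['n_train']:,}")
print(f"missing: {summary['missing_fraction']:.1%}")
print(f"probe share of ratings: {summary['test_fraction']:.2%}")
```

Under these inputs the training count and the missing fraction agree with the quoted 99,072,112 and 98.8%.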