Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Semi-Supervised Eigenvectors for Large-Scale Locally-Biased Learning

Authors: Toke J. Hansen, Michael W. Mahoney

JMLR 2014 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In Section 5, we present an empirical analysis, including both toy data to illustrate how the knobs of our method work, as well as applications to realistic machine learning and data analysis problems." (Section 5, Empirical Results)
Researcher Affiliation | Academia | Toke J. Hansen, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Richard Petersens Plads, 2800 Lyngby, Denmark; Michael W. Mahoney, International Computer Science Institute and Dept. of Statistics, University of California, Berkeley, CA 94720-1776, USA
Pseudocode | Yes | Algorithm 1: Main algorithm to compute semi-supervised eigenvectors
Require: L_G, D_G, s, κ = [κ1, ..., κk]^T, ε, such that s^T D_G 1 = 0, s^T D_G s = 1, and κ^T 1 ≤ 1
 1: X = [1]
 2: for t = 1 to k do
 3:   F F^T ← I − D_G X (X^T D_G D_G X)^{-1} X^T D_G
 4:   λ2 ← the smallest eigenvalue satisfying F F^T L_G F F^T v2 = λ2 F F^T D_G F F^T v2
 5:   γ⁻ ← −vol(G), γ⁺ ← λ2
 6:   repeat
 7:     γt ← (γ⁻ + γ⁺)/2 (binary search over γt)
 8:     xt ← (F F^T (L_G − γt D_G) F F^T)^+ F F^T D_G s
 9:     Normalize xt such that xt^T D_G xt = 1
10:     if (xt^T D_G s)^2 > κt then γ⁻ ← γt else γ⁺ ← γt end if
11:   until |(xt^T D_G s)^2 − κt| ≤ ε or |(γ⁻ + γ⁺)/2 − γt| ≤ ε
12:   Augment X with xt by letting X = [X, xt]
13: end for
Open Source Code | No | The paper does not provide an explicit statement of open-source code release for the methodology described, nor does it include a link to a code repository. The mention of "our software distribution" is ambiguous and does not confirm public availability of the specific research code.
Open Datasets | Yes | Congressional voting data: "In Section 5.2, we consider roll call voting data from the United States Congress that are based on (Poole and Rosenthal, 1991)." Handwritten image data: "In Section 5.3, we consider data from the MNIST digit data set (Lecun and Cortes)." Large-scale network data: "These improvements are demonstrated on data sets from the DIMACS implementation challenge, as well as on large web-crawls with more than 3 billion non-zeros in the adjacency matrix (Boldi et al., 2004, 2011; Boldi and Vigna, 2004)."
Dataset Splits | Yes | "For each Congress we perform 5-fold cross validation based on 80 samples and leave out the remaining 20 samples to estimate an unbiased test error." (Section 5.2) "Figure 11 shows results based on a k-nearest neighbor graph constructed from 5% and 10% of the training data, where in both cases we used 10% for the test data." (Section 5.3.4)
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts, or cloud instances) used for running the experiments are provided in the paper.
Software Dependencies | No | The paper refers to computational methods and algorithms such as the conjugate gradient method, the Spectral Graph Transducer (SGT) of Joachims (2003), and the Push algorithm of Andersen et al. (2006), but it does not specify any software libraries, frameworks, or solvers with explicit version numbers.
Experiment Setup | Yes | "Furthermore, we fix the regularization parameter of the SGT to c = 3200, and for simplicity we fix γ = 0 for all semi-supervised eigenvectors, implicitly defining the effective κ = [κ1, ..., κk]^T." (Section 5.3) "...we compare with a standard conjugate gradient implementation using a tolerance of 1e-6..." (Section 5.4)
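The Algorithm 1 pseudocode extracted above can be sketched in NumPy. The following is an illustrative reimplementation under our own assumptions, not the authors' released code: the toy graph, the seed construction, the helper name `semi_supervised_eigenvector`, and the restriction to the first outer iteration (t = 1) are all ours. It projects out the all-ones direction, brackets γt between −vol(G) and λ2, and bisects until the seed correlation (x^T D_G s)^2 matches the target κ.

```python
import numpy as np

def semi_supervised_eigenvector(A, seed, kappa, eps=1e-6, max_iter=60):
    """Sketch of one outer iteration (t = 1) of Algorithm 1: binary-search
    gamma_t so the solution's seed correlation (x^T D s)^2 matches kappa."""
    n = A.shape[0]
    d = A.sum(axis=1).astype(float)
    D = np.diag(d)
    L = D - A                                # combinatorial graph Laplacian
    vol = d.sum()
    one = np.ones(n)

    # Seed vector: D-orthogonalize an indicator against 1 and D-normalize,
    # so that s^T D 1 = 0 and s^T D s = 1, as the Require clause demands.
    s = np.zeros(n)
    s[seed] = 1.0
    s -= (s @ D @ one) / (one @ D @ one) * one
    s /= np.sqrt(s @ D @ s)

    # Projection F F^T = I - D X (X^T D^2 X)^{-1} X^T D, with X = [1] initially.
    X = one[:, None]
    P = np.eye(n) - D @ X @ np.linalg.inv(X.T @ D @ D @ X) @ X.T @ D

    # lambda_2: smallest generalized eigenvalue of (L, D) restricted to range(P).
    U, sv, _ = np.linalg.svd(P)
    B = U[:, sv > 1e-10]                     # orthonormal basis of range(P)
    Lr, Dr = B.T @ L @ B, B.T @ D @ B
    w, V = np.linalg.eigh(Dr)
    Dri = V @ np.diag(w ** -0.5) @ V.T       # Dr^{-1/2}
    lam2 = np.linalg.eigvalsh(Dri @ Lr @ Dri)[0]

    lo, hi = -vol, lam2                      # binary-search interval for gamma_t
    for _ in range(max_iter):
        gamma = 0.5 * (lo + hi)
        x = np.linalg.pinv(P @ (L - gamma * D) @ P) @ (P @ D @ s)
        x /= np.sqrt(x @ D @ x)              # normalize so x^T D x = 1
        corr = (x @ D @ s) ** 2
        if abs(corr - kappa) <= eps:
            break
        if corr > kappa:                     # too seed-correlated: raise gamma
            lo = gamma
        else:
            hi = gamma
    return x, gamma, corr

# Toy graph (our choice): two triangles joined by a bridge, seed in one triangle.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1.0
x, gamma, corr = semi_supervised_eigenvector(A, seed=0, kappa=0.5)
```

Note the pseudo-inverse in the inner loop mirrors step 8 of the pseudocode; a practical large-scale implementation would instead solve the projected system iteratively (e.g., with conjugate gradients, as the paper's Section 5.4 comparison suggests).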