Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Clustering from General Pairwise Observations with Applications to Time-varying Graphs

Authors: Shiau Hong Lim, Yudong Chen, Huan Xu

JMLR 2017

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We provide empirical results on both synthetic and real data that corroborate with our theoretical findings. |
| Researcher Affiliation | Collaboration | Shiau Hong Lim, EMAIL, IBM Research, 10 Marina Boulevard, Singapore 018983; Yudong Chen, EMAIL, School of Operations Research and Information Engineering, Cornell University, Ithaca, NY 14853, USA; Huan Xu, EMAIL, School of Industrial and Systems Engineering, Georgia Institute of Technology, 755 Ferst Drive, NW, Atlanta, GA 30332, USA |
| Pseudocode | Yes | We provide the pseudocode for a complete implementation of the programs (2) and (4) in Algorithms 1 and 2 below. Algorithm 1: ADMM solver for Program (2). Algorithm 2: ADMM solver for Program (4). |
| Open Source Code | No | The paper does not provide an explicit statement about releasing code for the described methodology, nor does it include a link to a code repository. |
| Open Datasets | Yes | We consider three real-world data sets for which reliable ground truth is available. All three sets involve interactions among different social groups, with different clustering structures and different temporal granularity. The Reality Mining data set (Eagle and Pentland, 2006) contains individuals from two main groups, the MIT Media Lab and the Sloan Business School, which we use as the ground-truth clusters. ... The next two data sets, Workplace and Primary-school, are from Génois et al. (2015) and Stehlé et al. (2011), respectively. |
| Dataset Splits | No | For Reality Mining: "In each trial, the in/cross-cluster label distributions are estimated from a fraction of randomly selected pairwise interaction data." For Workplace: "We split this data set (over 10 days) into 10 daily snapshots". For Primary-school: "We split this data set (over 2 days) into 18 hourly snapshots". While the paper describes how data was structured for analysis (e.g., number of snapshots, sub-dataset sizes), it does not provide specific train/test/validation splits (e.g., percentages, sample counts for each split) for model evaluation. |
| Hardware Specification | No | "Figures 6 and 7 plot the average CPU time needed to solve program (2) with Algorithm 1 on a typical quad-core desktop machine in Matlab." This description is too general and lacks specific hardware details like CPU model, GPU type, or memory. |
| Software Dependencies | No | The paper mentions "Alternating Direction Method of Multipliers (ADMM)" and "Matlab" but does not specify version numbers for either, which is required for reproducibility. |
| Experiment Setup | Yes | We find that in practice, using the tuning parameter $\eta = 2n$ for the program (4) works well. The criterion for convergence is specified by the threshold $\epsilon > 0$, and using $\epsilon = 10^{-4}$ provides a good tradeoff between the convergence time and the quality of the solution. ... if $\lVert X^{k+1} - Y^{k+1} \rVert_F > \tau\rho \lVert Y^{k+1} - Y^k \rVert_F$, then set $\rho \leftarrow 2\rho$ and $Q^{k+1} \leftarrow Q^{k+1}/2$. On the other hand, if $\tau \lVert X^{k+1} - Y^{k+1} \rVert_F < \rho \lVert Y^{k+1} - Y^k \rVert_F$, then set $\rho \leftarrow \rho/2$ and $Q^{k+1} \leftarrow 2Q^{k+1}$. Typically $\tau = 10$ is a stable choice, which we use in all our experiments. |
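The penalty-update rule quoted under Experiment Setup can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Matlab code: the function name `update_penalty` and its argument names are hypothetical. The missing minus signs inside the norms are assumed, reading the rule as the standard ADMM residual-balancing heuristic: double ρ when the primal residual ‖X − Y‖_F exceeds τ times the dual residual ρ‖Y_new − Y_old‖_F, and halve ρ in the opposite case.

```python
import numpy as np


def update_penalty(X_next, Y_next, Y_prev, Q_next, rho, tau=10.0):
    """Residual-balancing update for the ADMM penalty rho (hypothetical helper).

    Returns the possibly rescaled pair (rho, Q_next).
    """
    # Primal residual: how far the two copies of the variable disagree.
    primal = np.linalg.norm(X_next - Y_next, ord="fro")
    # Dual residual: how much Y moved in the last step, scaled by rho.
    dual = rho * np.linalg.norm(Y_next - Y_prev, ord="fro")

    if primal > tau * dual:
        # Primal residual dominates: increase rho, shrink Q to match.
        return 2.0 * rho, Q_next / 2.0
    if tau * primal < dual:
        # Dual residual dominates: decrease rho, grow Q to match.
        return rho / 2.0, 2.0 * Q_next
    # Residuals are within a factor tau of each other: leave both unchanged.
    return rho, Q_next
```

Here `Q_next` is rescaled inversely to ρ, which is what one would expect if Q is a scaled dual variable; that interpretation is an assumption, since the excerpt does not define Q. The default `tau=10.0` follows the value the paper reports using in all experiments.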