Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Semi-supervised Vertex Hunting, with Applications in Network and Text Analysis
Authors: Yicong Jiang, Zheng Tracy Ke
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the performance of our method, compare it with the unsupervised SP algorithm, and apply our method to the problems in Section 4. By default, we choose α in our algorithm using the first approach there. We use the loss $\min_P \lVert P\widehat{V} - V \rVert_F / K$, where the minimum is over row permutations. Simulations: We fix n = 1000 and generate b by first sampling its K entries independently from Uniform(0.9, 1.1) and then normalizing it so that $\lVert b \rVert = 1$. The diagonal elements of V are 1, and the off-diagonal entries are independently generated from Uniform(0, 1/K) (we make the off-diagonal elements of V less than 1/K to guarantee $\lambda_K^{-1}(V) = O(1)$ when K is large). We consider a total of 5 experiments by varying the label ratio |S|/n, noise level σ, and dimension K, and 2 additional experiments comparing with unsupervised VH and studying the runtime. |
| Researcher Affiliation | Academia | Yicong Jiang Department of Statistics Harvard University EMAIL Zheng Tracy Ke Department of Statistics Harvard University EMAIL |
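| Pseudocode | Yes | Algorithm 1: Semi-supervised Vertex Hunting (SSVH). Input: K, $X_S$, and $\Pi_S$. 1. Compute $\alpha \in \mathbb{R}^N$ from $\Pi_S$ using the closed-form solution of either (9) or (10). 2. Construct $\widehat{M}(\alpha) = \Pi_S^\top \mathrm{diag}(H\alpha) X_S X_S^\top \mathrm{diag}(H\alpha) \Pi_S$, where H is as in (3). Let $\hat{b}$ be the eigenvector of $\widehat{M}(\alpha)$ corresponding to the smallest eigenvalue. 3. Obtain $\hat{w}_i = (\hat{b} \circ \pi_i) / \lVert \hat{b} \circ \pi_i \rVert_1$, and let $\widehat{W}_S$ be the matrix that stacks the $\hat{w}_i$ for $i \in S$. Compute $\widehat{V} = (\widehat{W}_S^\top \widehat{W}_S)^{-1} \widehat{W}_S^\top X_S$. Output: $\widehat{V}$ (its rows are the estimated vertices). Algorithm 2: Semi-supervised Mixed Membership Estimation. Input: K, A, and $\Pi_S$. Algorithm parameters: a matrix $U \in \mathbb{R}^{n \times K}$ and a vector $\eta \in \mathbb{R}^K$. 1. Compute $x_i = U^\top A e_i / (\eta^\top U^\top A e_i)$ for $1 \le i \le n$. Let $X = [x_1, \ldots, x_n]^\top \in \mathbb{R}^{n \times K}$ and let $X_S \in \mathbb{R}^{N \times K}$ be the matrix that stacks the $x_i$ for $i \in S$. 2. (SSVH) Apply Algorithm 1 to $(K, X_S, \Pi_S)$ to obtain $\widehat{V}$ and the intermediate quantity $\hat{b}$. 3. Let $\widehat{B} = \mathrm{diag}(\hat{b}) \widehat{V}$. For each $i \notin S$, compute $\tilde{\pi}_i = e_i^\top A U \widehat{B}^\top (\widehat{B} \widehat{B}^\top)^{-1}$. Let $\hat{\pi}_i$ be the vector obtained by setting the negative entries of $\tilde{\pi}_i$ to zero and renormalizing to have unit $\ell_1$-norm. Output: $\hat{\pi}_i$ for $i \notin S$. Algorithm 3: Semi-supervised Topic Modeling. Input: K, D, and $A_S$. Algorithm parameters: a matrix $U \in \mathbb{R}^{p \times K}$ and a vector $\eta \in \mathbb{R}^K$. 1. Compute $x_j = U^\top D^\top e_j / (\eta^\top U^\top D^\top e_j)$ for $1 \le j \le p$. Let $X = [x_1, \ldots, x_p]^\top \in \mathbb{R}^{p \times K}$ and let $X_S \in \mathbb{R}^{N \times K}$ be the matrix that stacks the $x_j$ for $j \in S$. 2. (SSVH) Apply Algorithm 1 to $(K, X_S, A_S)$ to obtain $\widehat{V}$ and the intermediate quantity $\hat{b}$. 3. Let $\widehat{B} = \mathrm{diag}(\hat{b}) \widehat{V}$. Estimate A by $\widehat{A} = D U \widehat{B}^\top (\widehat{B} \widehat{B}^\top)^{-1}$. Output: $\widehat{A}$ (its columns are the estimated topic vectors). |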
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The data and code for our experiments are available in the supplementary materials. |
| Open Datasets | Yes | We use a co-authorship network for statisticians [18], with 2831 nodes and 71432 edges. ... We use the academic abstracts in MADStat [23]. The processed word count provided by the authors uses a vocabulary of 2106 words. |
| Dataset Splits | No | The paper primarily discusses simulations and semi-synthetic experiments where the 'label ratio' (|S|/n) is varied (e.g., from 1% to 5%, or fixed at 0.03, or |S| = 4K). This refers to the proportion of data points for which labels are known in a semi-supervised setting, not a traditional train/validation/test split of a dataset for model evaluation. |
| Hardware Specification | No | Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: Our experiments do not require any computer resources. All the experiments in our paper can be implemented on a personal computer. |
| Software Dependencies | No | The paper does not explicitly state specific version numbers for software libraries or environments used, such as Python, PyTorch, or CUDA versions. |
| Experiment Setup | Yes | Simulations: We fix n = 1000 and generate b by first sampling its K entries independently from Uniform(0.9, 1.1) and then normalizing it so that $\lVert b \rVert = 1$. The diagonal elements of V are 1, and the off-diagonal entries are independently generated from Uniform(0, 1/K) (we make the off-diagonal elements of V less than 1/K to guarantee $\lambda_K^{-1}(V) = O(1)$ when K is large). We consider a total of 5 experiments by varying the label ratio |S|/n, noise level σ, and dimension K, and 2 additional experiments comparing with unsupervised VH and studying the runtime. ... In experiments 3 and 4, we fix (K, |S|/n) = (3, 0.03) and vary the noise level σ. ... We set the label ratio N/n = 0.05 and compare our algorithm with two unsupervised algorithms, SP and a de-noised variant of SP called SVS [21] (it has a tuning parameter L, which is set to L = 10 K). |
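
The simulation setup quoted in the Experiment Setup row can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `simulate_b_and_V` and `permutation_loss` are hypothetical, b is assumed to be normalized in the Euclidean norm (the extracted text does not show which norm is used), V is assumed to be K × K, and the loss is assumed to be the Frobenius norm of the difference between the row-permuted estimate and V, minimized over row permutations and divided by K.

```python
import itertools
import math
import random


def simulate_b_and_V(K, rng):
    """Draw b and V following the quoted simulation description."""
    # K entries of b, independently Uniform(0.9, 1.1).
    b = [rng.uniform(0.9, 1.1) for _ in range(K)]
    # ASSUMPTION: unit Euclidean norm; the source does not show the norm type.
    norm = math.sqrt(sum(x * x for x in b))
    b = [x / norm for x in b]
    # Diagonal of V is 1; off-diagonal entries are Uniform(0, 1/K).
    V = [
        [1.0 if i == j else rng.uniform(0.0, 1.0 / K) for j in range(K)]
        for i in range(K)
    ]
    return b, V


def permutation_loss(V_hat, V):
    """ASSUMED loss: min over row permutations of ||P V_hat - V||_F / K.

    Brute force over all K! permutations, so only suitable for small K.
    """
    K = len(V)
    d = len(V[0])
    best = math.inf
    for perm in itertools.permutations(range(K)):
        # Row i of the permuted estimate is row perm[i] of V_hat.
        sq = sum(
            (V_hat[perm[i]][j] - V[i][j]) ** 2
            for i in range(K)
            for j in range(d)
        )
        best = min(best, math.sqrt(sq))
    return best / K


rng = random.Random(0)
b, V = simulate_b_and_V(3, rng)
V_shuffled = [V[2], V[0], V[1]]  # same vertices, rows reordered
print(permutation_loss(V_shuffled, V))
```

Because the loss minimizes over row orderings, an estimate that recovers the vertices in a different order is not penalized. For larger K, the brute-force search would normally be replaced by an assignment solver; that substitution is a design choice, not something the quoted text specifies.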
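
Two of the elementwise steps in the quoted pseudocode are simple enough to illustrate directly: the weight vector in step 3 of Algorithm 1 and the final clean-up in step 3 of Algorithm 2. The sketch below is illustrative only. The helper names `membership_to_weight` and `clip_and_renormalize` are hypothetical, and the product of the estimated vector b and the membership vector is assumed to be entrywise, which is our reading of the extracted formula.

```python
def membership_to_weight(b_hat, pi):
    """Rescale a membership vector entrywise by b_hat, then divide by the l1 norm.

    ASSUMPTION: the product in step 3 of Algorithm 1 is entrywise (Hadamard).
    """
    prod = [bk * pk for bk, pk in zip(b_hat, pi)]
    l1 = sum(abs(x) for x in prod)
    if l1 == 0.0:
        raise ValueError("entrywise product is the zero vector")
    return [x / l1 for x in prod]


def clip_and_renormalize(pi_tilde):
    """Set negative entries to zero, then rescale to unit l1 norm."""
    clipped = [max(x, 0.0) for x in pi_tilde]
    total = sum(clipped)
    if total == 0.0:
        raise ValueError("no positive entries to renormalize")
    return [x / total for x in clipped]


# A node with membership (0.2, 0.8); equal entries of b_hat leave it unchanged.
print(membership_to_weight([0.5, 0.5], [0.2, 0.8]))
# A raw estimate with one negative entry is mapped back to the simplex.
print(clip_and_renormalize([-0.2, 0.3, 0.9]))
```

Clipping followed by renormalization is not the Euclidean projection onto the simplex; it is simply the operation the quoted pseudocode describes.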