Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Balanced Ranking with Relative Centrality: A multi-core periphery perspective
Authors: Chandra Sekhar Mukherjee, Jiapeng Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide theoretical and extensive simulation support for our approach towards resolving the unbalancedness in MCPC. Finally, we consider graph embeddings of 11 single-cell datasets. We observe that the top-ranked points as per existing centrality measures are better separable into the ground-truth communities. However, due to the unbalanced ranking, the top nodes often do not contain points from some communities. Here, our relative-centrality-based approach generates a ranking that provides a similar improvement in clusterability while providing significantly higher balancedness. |
| Researcher Affiliation | Academia | Chandra Sekhar Mukherjee and Jiapeng Zhang, Thomas Lord Department of Computer Science, University of Southern California |
| Pseudocode | Yes | Algorithm 1: Neighbor Rank (N-Rank) with t-step initialization; Algorithm 2: a meta generalization, Meta-Relative-Rank(t, y, z) |
| Open Source Code | Yes | We have shared our code for the simulation and real-world data in the supplementary material. The simulation experiments can be run using the simulation.ipynb file, which is self-contained (the needed modules are provided in the zip). Due to the large size of the real-world vector datasets, we are unable to share them, but we have shared the code used to run the experiments. |
| Open Datasets | Yes | We use the 7 datasets from a recent database (Abdelaal et al., 2019), the popular Zheng8eq dataset (Duò et al., 2018), two more large datasets (Smith et al., 2019), and a T-cell dataset (Savas et al., 2018) of cancer patients. All of these datasets have annotated labels available for their corresponding cell types, which form the underlying communities. |
| Dataset Splits | No | The paper describes selecting a 'c-fraction' of top-ranked points (e.g., c=0.2) and applying clustering to the induced subgraph. This is a selection process for analysis rather than a traditional train/test/validation split for a machine learning model. |
| Hardware Specification | No | The paper does not mention specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper states that 'needed modules are provided in the zip' for simulation experiments but does not list specific software libraries or tools with their version numbers (e.g., 'Python 3.8', 'PyTorch 1.9'). |
| Experiment Setup | Yes | For each dataset, we first log-normalize it and then apply PCA dimensionality reduction to 50 dimensions, which is a standard pipeline in the single-cell analysis literature (Duò et al., 2018). Then, we obtain its 20-NN graph embedding, which we denote as G0. We set c = 0.2 (the results are robust to the choice of the cutoff point). In our experiments, we set t = 1 for the graphs generated by the MCPC block model and t = log \|V\| for both the concentric GMM and the real-world experiments. |
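The preprocessing pipeline quoted in the Experiment Setup row (log-normalize, PCA to 50 dimensions, 20-NN graph) can be sketched as follows. This is a minimal illustration, not the authors' released code: the function name `build_knn_graph` and the use of scikit-learn's `PCA` and `kneighbors_graph` are our assumptions, since the paper does not name its software stack.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import kneighbors_graph

def build_knn_graph(counts, n_pcs=50, k=20):
    """Sketch of the quoted pipeline: log-normalize a cells x genes count
    matrix, reduce to n_pcs principal components, and return the k-NN
    graph as a sparse (CSR) connectivity matrix."""
    X = np.log1p(counts)                        # log-normalization
    n_pcs = min(n_pcs, min(X.shape) - 1)        # guard for small toy inputs
    Z = PCA(n_components=n_pcs).fit_transform(X)
    # Each row of the result has exactly k ones (self excluded),
    # i.e. the graph the paper denotes G0.
    return kneighbors_graph(Z, n_neighbors=k, mode="connectivity")

# Per the quoted setup, downstream analysis keeps the top c = 0.2 fraction
# of ranked points, with t = 1 for MCPC block-model graphs and
# t = log|V| for concentric GMM and real-world data.
```

The `min(X.shape) - 1` guard only matters for toy matrices smaller than 50 in either dimension; on real single-cell data the full 50 components are used as described.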