Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling

Authors: Bryan Wong, Jongwoo Kim, Huazhu Fu, Mun Yi

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on TCGA breast, lung, and kidney cancer datasets demonstrate that HiVE-MIL consistently outperforms both traditional MIL and recent VLM-based MIL approaches, achieving gains of up to 4.1% in macro F1 under 16-shot settings.
Researcher Affiliation	Academia	1 KAIST; 2 IHPC, A*STAR
Pseudocode	Yes	D. Text-Guided Dynamic Filtering Pseudocode. Algorithm 1: Text-Guided Dynamic Filtering (TGDF)
Open Source Code	Yes	The code is available at https://github.com/bryanwong17/HiVE-MIL.
Open Datasets	Yes	4.1 Experimental Settings. Datasets. We utilize three publicly available WSI datasets: TCGA-NSCLC (lung), TCGA-BRCA (breast), and TCGA-RCC (kidney), obtained from The Cancer Genome Atlas (TCGA).
Dataset Splits	Yes	Following [48], each dataset is split into training, validation, and test sets using a fixed 4:3:3 ratio. For the few-shot setting, we randomly sample 4, 8, and 16 WSIs per class from the training set.
Hardware Specification	Yes	All experiments are run using PyTorch [42] on a workstation with two NVIDIA RTX A100 GPUs.
Software Dependencies	No	All experiments are run using PyTorch [42] on a workstation with two NVIDIA RTX A100 GPUs. Our graph-based modules are implemented using PyTorch Geometric [13]. All experiments are conducted on Ubuntu 20.04.6 using a workstation equipped with two NVIDIA A100 GPUs (40 GB each); however, only one GPU is used for training each model. Complete package versions and dependencies are listed in the requirements.txt file available in our GitHub repository (linked in the Abstract).
Experiment Setup	Yes	HiVE-MIL operates on 5× and 20× patches, using GPT-4o [1] to generate O = 4 coarse-level texts and K = 3 fine-level substructures per class. We use L = 16 learnable context tokens (Eq. 1) and apply the TGDF threshold α = 0.5 (Eqs. 2, 3). The HHG consists of two layers and a 2-head in MSA. HTCL is used with λ = 0.5 (Eq. 9). We train using Adam optimizer [31] (learning rate: 1e-4, weight decay: 1e-5), batch size 1, for up to 50 epochs with early stopping (patience 10).
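
The Dataset Splits and Experiment Setup rows above describe a fixed 4:3:3 train/validation/test split, random sampling of 4, 8, or 16 WSIs per class from the training set, and a small set of training hyperparameters. The following is a minimal sketch of that protocol, not the authors' implementation. The function names (split_4_3_3, sample_few_shot), the per-class stratification of the split, the seeding scheme, and the TRAIN_CONFIG dictionary layout are all assumptions made for illustration; only the ratios, shot counts, and hyperparameter values come from the table.

```python
import random
from collections import defaultdict

# Hyperparameter values quoted in the Experiment Setup row.
# The dictionary layout and key names are hypothetical.
TRAIN_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "weight_decay": 1e-5,
    "batch_size": 1,
    "max_epochs": 50,
    "early_stopping_patience": 10,
}


def _group_by_class(slides):
    """Group (slide_id, label) pairs into {label: [slide_id, ...]}."""
    by_class = defaultdict(list)
    for slide_id, label in slides:
        by_class[label].append(slide_id)
    return by_class


def split_4_3_3(slides, seed=0):
    """Split (slide_id, label) pairs into train/val/test at a 4:3:3 ratio.

    Assumption: the split is stratified per class. The source only
    states the ratio, not how classes are balanced across the splits.
    """
    rng = random.Random(seed)
    train, val, test = [], [], []
    by_class = _group_by_class(slides)
    for label in sorted(by_class):
        ids = list(by_class[label])
        rng.shuffle(ids)
        n_train = (4 * len(ids)) // 10
        n_val = (3 * len(ids)) // 10
        train += [(s, label) for s in ids[:n_train]]
        val += [(s, label) for s in ids[n_train:n_train + n_val]]
        test += [(s, label) for s in ids[n_train + n_val:]]
    return train, val, test


def sample_few_shot(train, shots, seed=0):
    """Randomly pick `shots` slides per class from the training set."""
    rng = random.Random(seed)
    subset = []
    by_class = _group_by_class(train)
    for label in sorted(by_class):
        ids = by_class[label]
        if len(ids) < shots:
            raise ValueError(f"class {label!r} has only {len(ids)} slides")
        subset += [(s, label) for s in rng.sample(ids, shots)]
    return subset


if __name__ == "__main__":
    # Toy data: 100 placeholder slide IDs across two classes.
    slides = [(f"slide_{i:03d}", i % 2) for i in range(100)]
    train, val, test = split_4_3_3(slides, seed=0)
    for shots in (4, 8, 16):
        few = sample_few_shot(train, shots, seed=0)
        print(shots, len(few))
```

Fixing the seed keeps the few-shot subsets repeatable across runs, which matters here because the table reports results for several shot counts drawn from the same training split.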