Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Bridging the gap to real-world language-grounded visual concept learning
Authors: Whie Jung, Semin Kim, Junee Kim, Seunghoon Hong
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our framework on complex and unstructured real-world data, where each image contains a diverse set of conceptual axes that is infeasible to manually predefine these axes to cover all possible variations within the data. To this end, we first conduct experiments on a subset of the ImageNet dataset. ... We report quantitative comparison of our method to the baselines in Table 1. Our methods consistently outperform all baselines on all of the datasets by a clear margin. High CLIP and BLIP scores demonstrate the effectiveness of our method in capturing image-related visual concepts. A human evaluation in Table 2 provides a more direct assessment of reflecting subtle visual nuances. ... We conduct an ablation study on VLM choices, architectural design choices, and objective functions to examine the robustness and effectiveness of our choices. |
| Researcher Affiliation | Academia | Whie Jung, Semin Kim, Junee Kim, Seunghoon Hong; School of Computing, KAIST; EMAIL |
| Pseudocode | No | The paper describes the methodology in narrative text and uses diagrams (e.g., Figure 1) to illustrate the framework, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/whieya/Language-grounded-VCL. |
| Open Datasets | Yes | We validate our framework on complex and unstructured real-world data, where each image contains a diverse set of conceptual axes that is infeasible to manually predefine these axes to cover all possible variations within the data. To this end, we first conduct experiments on a subset of the ImageNet dataset. ... using relatively controlled datasets with diverse concept axes, such as CelebA-HQ [13], AFHQ-Dog, and AFHQ-Cat [5]. |
| Dataset Splits | Yes | For training and validation, we use the following splits: 28k/0.6k images for ImageNet-S20, 27k/3k for CelebA-HQ, and around 5k/0.5k for AFHQ-Dog and AFHQ-Cat. |
| Hardware Specification | Yes | All of our experiments are conducted on a GPU Server that consists of an Intel Xeon Gold 6230 CPU, 256GB RAM, and 8 NVIDIA RTX 6000 GPUs (with 48GB VRAM). |
| Software Dependencies | No | The paper mentions several models/frameworks used like InternVL [4], DINO-v2 [26], and Stable Diffusion-based T2I decoder [30], but it does not specify versions for core software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Table 7: Hyperparameters used in our experiments. General: Batch Size 32; Training Steps 100k; Learning Rate 0.00003. Concept Encoder: Layers 4; Hidden Dim 768; Number of Heads 8. Regression Network: Layers 768; Input Dimension 768; Hidden Dimension 768; Activation Function ReLU. |
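
For readers who want to reuse the reported settings, the Table 7 values quoted above could be collected into a single configuration object. The following is a minimal sketch, not the authors' code: the class and field names are hypothetical, the values are transcribed as listed in the quoted table, and the per-head dimension assumes the concept encoder uses standard multi-head attention that splits the hidden dimension evenly across heads.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GeneralConfig:
    # Values from the "General" group of the quoted Table 7.
    batch_size: int = 32
    training_steps: int = 100_000  # listed as "100k"
    learning_rate: float = 0.00003


@dataclass(frozen=True)
class ConceptEncoderConfig:
    # Values from the "Concept Encoder" group.
    layers: int = 4
    hidden_dim: int = 768
    num_heads: int = 8

    def head_dim(self) -> int:
        # Assumption: hidden_dim is split evenly across attention heads.
        if self.hidden_dim % self.num_heads != 0:
            raise ValueError("hidden_dim must be divisible by num_heads")
        return self.hidden_dim // self.num_heads


@dataclass(frozen=True)
class RegressionNetworkConfig:
    # Values from the "Regression Network" group, transcribed as listed.
    layers: int = 768
    input_dim: int = 768
    hidden_dim: int = 768
    activation: str = "relu"


@dataclass(frozen=True)
class ExperimentConfig:
    # Hypothetical container grouping the three blocks of the table.
    general: GeneralConfig = field(default_factory=GeneralConfig)
    concept_encoder: ConceptEncoderConfig = field(default_factory=ConceptEncoderConfig)
    regression_network: RegressionNetworkConfig = field(default_factory=RegressionNetworkConfig)


config = ExperimentConfig()
print(config.general)
print("per-head dimension:", config.concept_encoder.head_dim())
```

Freezing the dataclasses keeps the transcribed values from being modified accidentally; any real run should take its settings from the linked repository rather than from this sketch.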