Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Conjugated Semantic Pool Improves OOD Detection with Pre-trained Vision-Language Models
Authors: Mengyuan Chen, Junyu Gao, Changsheng Xu
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments and related analysis on multiple OOD detection benchmarks with state-of-the-art performances (Section 5), which demonstrate the effectiveness of our method. |
| Researcher Affiliation | Academia | Mengyuan Chen: MAIS, Institute of Automation, CAS; School of Artificial Intelligence, UCAS. Junyu Gao: MAIS, Institute of Automation, CAS; School of Artificial Intelligence, UCAS. Changsheng Xu: MAIS, Institute of Automation, CAS; School of Artificial Intelligence, UCAS; Pengcheng Laboratory. |
| Pseudocode | No | The paper includes mathematical equations and derivations but does not present any pseudocode or algorithm blocks explicitly labeled as such or formatted like an algorithm. |
| Open Source Code | Yes | Codes are available in https://github.com/MengyuanChen21/NeurIPS2024-CSP. |
| Open Datasets | Yes | We mainly evaluate our method on the widely-used ImageNet-1k OOD detection benchmark [26]. This benchmark utilizes the large-scale ImageNet-1k dataset as the ID data, and selects samples from iNaturalist [60], SUN [66], Places [73], and Textures [8] as the OOD data. The categories of the OOD data have been manually selected to prevent overlap with ImageNet-1k. |
| Dataset Splits | Yes | ImageNet-1k, also referred to as ILSVRC 2012, is a subset of the larger ImageNet dataset [9]. This dataset encompasses 1,000 object classes and includes 1,281,167 images for training, 50,000 images for validation, and 100,000 images for testing. |
| Hardware Specification | Yes | All experiments are conducted using GeForce RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions key software components like 'CLIP ViT-B/16 model' and 'WordNet' but does not specify their version numbers to ensure reproducibility of software dependencies. |
| Experiment Setup | Yes | Unless otherwise specified, we employ the CLIP ViT-B/16 model as the pre-trained VLM and WordNet as the lexicon. The superclass set for constructing the conjugated semantic pool is {area, creature, environment, item, landscape, object, pattern, place, scene, space, structure, thing, view, vista}, which nearly encompasses all real-world objects. The ablation in Appendix C.5 shows that numerous alternative selections can also yield significant performance improvements. All hyper-parameters are directly inherited from [29] without any modification, including the ratio r which is set to 15%. |
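The setup row above describes pairing modifiers drawn from a lexicon (WordNet in the paper) with the listed superclass nouns to form conjugated OOD label candidates. A minimal sketch of that combination step follows; the `build_csp` helper and the modifier list are illustrative placeholders, not the paper's actual implementation or lexicon:

```python
# Sketch: build conjugated label candidates by pairing modifier words with
# the superclass set quoted in the Experiment Setup row. In the paper the
# modifiers come from WordNet; here a tiny made-up list stands in for it.

SUPERCLASSES = [
    "area", "creature", "environment", "item", "landscape", "object",
    "pattern", "place", "scene", "space", "structure", "thing",
    "view", "vista",
]

def build_csp(modifiers, superclasses=SUPERCLASSES):
    """Return all modifier-superclass combinations, e.g. 'striped creature'."""
    return [f"{m} {s}" for m in modifiers for s in superclasses]

if __name__ == "__main__":
    pool = build_csp(["striped", "metallic"])
    print(len(pool))   # 2 modifiers x 14 superclasses = 28 candidates
    print(pool[:2])
```

In the actual pipeline these candidate strings would be wrapped in CLIP text prompts and encoded alongside the ID class names; the sketch only shows the combinatorial construction of the pool itself.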