Conjugated Semantic Pool Improves OOD Detection with Pre-trained Vision-Language Models

Authors: Mengyuan Chen, Junyu Gao, Changsheng Xu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments and related analysis on multiple OOD detection benchmarks with state-of-the-art performances (Section 5), which demonstrate the effectiveness of our method.
Researcher Affiliation | Academia | Mengyuan Chen (MAIS, Institute of Automation, CAS; School of Artificial Intelligence, UCAS; chenmengyuan2021@ia.ac.cn); Junyu Gao (MAIS, Institute of Automation, CAS; School of Artificial Intelligence, UCAS; junyu.gao@nlpr.ia.ac.cn); Changsheng Xu (MAIS, Institute of Automation, CAS; School of Artificial Intelligence, UCAS; Pengcheng Laboratory; csxu@nlpr.ia.ac.cn)
Pseudocode | No | The paper includes mathematical equations and derivations but does not present any pseudocode or explicitly labeled algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/MengyuanChen21/NeurIPS2024-CSP.
Open Datasets | Yes | We mainly evaluate our method on the widely-used ImageNet-1k OOD detection benchmark [26]. This benchmark utilizes the large-scale ImageNet-1k dataset as the ID data and selects samples from iNaturalist [60], SUN [66], Places [73], and Textures [8] as the OOD data. The categories of the OOD data have been manually selected to prevent overlap with ImageNet-1k. (A minimal sketch of the metrics conventionally reported on this benchmark follows the table.)
Dataset Splits | Yes | ImageNet-1k, also referred to as ILSVRC 2012, is a subset of the larger ImageNet dataset [9]. This dataset encompasses 1,000 object classes and includes 1,281,167 images for training, 50,000 images for validation, and 100,000 images for testing.
Hardware Specification | Yes | All experiments are conducted using GeForce RTX 3090 GPUs.
Software Dependencies | No | The paper mentions key software components such as the 'CLIP ViT-B/16 model' and 'WordNet' but does not specify version numbers, leaving the software environment underspecified for exact reproduction.
Experiment Setup | Yes | Unless otherwise specified, we employ the CLIP ViT-B/16 model as the pre-trained VLM and WordNet as the lexicon. The superclass set for constructing the conjugated semantic pool is {area, creature, environment, item, landscape, object, pattern, place, scene, space, structure, thing, view, vista}, which nearly encompasses all real-world objects. The ablation in Appendix C.5 shows that numerous alternative selections can also yield significant performance improvements. All hyper-parameters are directly inherited from [29] without modification, including the ratio r, which is set to 15%. (A pool-construction sketch follows the table.)
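
For the Open Datasets row above: OOD detection on the ImageNet-1k benchmark is conventionally reported with AUROC and FPR95 (false-positive rate at 95% true-positive rate) over per-sample scores. The sketch below is a minimal, generic metric computation and not the authors' released code; the random scores in the usage example are placeholders for real detector outputs, and the convention that higher scores indicate ID samples is an assumption.

```python
# Minimal sketch of the standard OOD metrics (AUROC, FPR95), assuming
# higher scores indicate in-distribution (ID) samples. Not the authors' code.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def ood_metrics(id_scores: np.ndarray, ood_scores: np.ndarray):
    """Return (AUROC, FPR at 95% TPR), treating ID samples as positives."""
    labels = np.concatenate([np.ones_like(id_scores), np.zeros_like(ood_scores)])
    scores = np.concatenate([id_scores, ood_scores])
    auroc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    # FPR95: false-positive rate at the first threshold reaching 95% TPR
    # (tpr from roc_curve is non-decreasing, so searchsorted is valid).
    fpr95 = fpr[np.searchsorted(tpr, 0.95)]
    return auroc, fpr95

# Hypothetical usage with random scores standing in for detector outputs:
rng = np.random.default_rng(0)
auroc, fpr95 = ood_metrics(rng.normal(1.0, 1.0, 5000), rng.normal(0.0, 1.0, 5000))
print(f"AUROC={auroc:.3f}, FPR95={fpr95:.3f}")
```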
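
For the Experiment Setup row above: a minimal sketch of how a conjugated semantic pool could be assembled under the stated setup, pairing adjectives mined from WordNet with the paper's superclass nouns and encoding the phrases with CLIP ViT-B/16. The NLTK WordNet interface, OpenAI's `clip` package, the "a photo of a" prompt template, and the 1,000-phrase subsample are assumptions for illustration; the selection of negative labels at ratio r = 15%, inherited from NegLabel [29], is omitted.

```python
# Illustrative sketch (assumptions, not the released code): build candidate
# OOD label phrases by conjugating WordNet adjectives with superclass nouns,
# then encode them with CLIP's text encoder.
import clip                      # pip install git+https://github.com/openai/CLIP.git
import torch
from nltk.corpus import wordnet  # requires: nltk.download('wordnet')

SUPERCLASSES = ["area", "creature", "environment", "item", "landscape",
                "object", "pattern", "place", "scene", "space",
                "structure", "thing", "view", "vista"]

# Collect unique adjective lemmas from WordNet ('a' = adjective, 's' = satellite).
adjectives = sorted({lemma.name().replace("_", " ")
                     for pos in ("a", "s")
                     for syn in wordnet.all_synsets(pos)
                     for lemma in syn.lemmas()})

# Conjugated semantic pool: every (adjective, superclass) phrase.
pool = [f"{adj} {noun}" for adj in adjectives for noun in SUPERCLASSES]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)
with torch.no_grad():
    # Encode a subsample (first 1,000 phrases) to keep the demo lightweight.
    tokens = clip.tokenize([f"a photo of a {p}" for p in pool[:1000]]).to(device)
    text_feats = model.encode_text(tokens)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
```

In the full method, similarities between image features and such negative text features contribute to the OOD score alongside the ID class labels, following the NegLabel scoring scheme that the paper inherits its hyper-parameters from.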