Conjugated Semantic Pool Improves OOD Detection with Pre-trained Vision-Language Models
Authors: Mengyuan Chen, Junyu Gao, Changsheng Xu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments and analyses on multiple OOD detection benchmarks achieve state-of-the-art performance (Section 5), demonstrating the effectiveness of the method. |
| Researcher Affiliation | Academia | Mengyuan Chen (MAIS, Institute of Automation, CAS; School of Artificial Intelligence, UCAS; chenmengyuan2021@ia.ac.cn); Junyu Gao (MAIS, Institute of Automation, CAS; School of Artificial Intelligence, UCAS; junyu.gao@nlpr.ia.ac.cn); Changsheng Xu (MAIS, Institute of Automation, CAS; School of Artificial Intelligence, UCAS; Pengcheng Laboratory; csxu@nlpr.ia.ac.cn) |
| Pseudocode | No | The paper includes mathematical equations and derivations but does not present any pseudocode or algorithm blocks explicitly labeled as such or formatted like an algorithm. |
| Open Source Code | Yes | Codes are available in https://github.com/MengyuanChen21/NeurIPS2024-CSP. |
| Open Datasets | Yes | We mainly evaluate our method on the widely-used ImageNet-1k OOD detection benchmark [26]. This benchmark utilizes the large-scale ImageNet-1k dataset as the ID data, and selects samples from iNaturalist [60], SUN [66], Places [73], and Textures [8] as the OOD data. The categories of the OOD data have been manually selected to prevent overlap with ImageNet-1k. |
| Dataset Splits | Yes | ImageNet-1k, also referred to as ILSVRC 2012, is a subset of the larger ImageNet dataset [9]. This dataset encompasses 1,000 object classes and includes 1,281,167 images for training, 50,000 images for validation, and 100,000 images for testing. |
| Hardware Specification | Yes | All experiments are conducted using GeForce RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions key software components like the 'CLIP ViT-B/16 model' and 'WordNet' but does not specify their version numbers to ensure reproducibility of software dependencies. |
| Experiment Setup | Yes | Unless otherwise specified, we employ the CLIP ViT-B/16 model as the pre-trained VLM and WordNet as the lexicon. The superclass set for constructing the conjugated semantic pool is {area, creature, environment, item, landscape, object, pattern, place, scene, space, structure, thing, view, vista}, which nearly encompasses all real-world objects. The ablation in Appendix C.5 shows that numerous alternative selections can also yield significant performance improvements. All hyper-parameters are directly inherited from [29] without any modification, including the ratio r, which is set to 15%. |
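
The setup row above lends itself to a brief illustration. Below is a minimal sketch of how a conjugated semantic pool could be assembled by pairing WordNet adjectives with the superclass nouns quoted from the paper. It uses NLTK's WordNet interface; the function and variable names (`build_conjugated_semantic_pool`, `SUPERCLASSES`) are illustrative assumptions, not the authors' released code (see the linked repository for the actual implementation).

```python
# Minimal sketch (not the authors' code): build a conjugated semantic pool
# by pairing WordNet adjectives with the paper's superclass nouns.
# Assumes NLTK is installed and the WordNet corpus has been fetched
# via nltk.download("wordnet").
from nltk.corpus import wordnet as wn

# Superclass set quoted from the paper's experiment setup.
SUPERCLASSES = [
    "area", "creature", "environment", "item", "landscape", "object",
    "pattern", "place", "scene", "space", "structure", "thing", "view", "vista",
]

def build_conjugated_semantic_pool() -> list[str]:
    """Combine every WordNet adjective lemma with each superclass noun."""
    adjectives = {
        lemma.name().replace("_", " ")
        for synset in wn.all_synsets(pos=wn.ADJ)  # adjective synsets only
        for lemma in synset.lemmas()
    }
    # Each "adjective + superclass" phrase is a candidate negative label,
    # e.g. "frozen landscape" or "translucent object".
    return [f"{adj} {noun}" for adj in sorted(adjectives) for noun in SUPERCLASSES]

if __name__ == "__main__":
    pool = build_conjugated_semantic_pool()
    print(f"{len(pool)} candidate labels, e.g. {pool[:3]}")
```

In the paper's pipeline, these candidates would then pass through the negative-label selection inherited from [29] (with ratio r = 15%) before being encoded as text prompts by CLIP; that selection step is omitted from this sketch.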