Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey

Authors: Atsuyuki Miyai, Jingkang Yang, Jingyang Zhang, Yifei Ming, Yueqian Lin, Qing Yu, Go Irie, Shafiq Joty, Yixuan Li, Hai Helen Li, Ziwei Liu, Toshihiko Yamasaki, Kiyoharu Aizawa

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we compare several representative VLM-based OOD detection methods. ... Table 3: Comparison of OOD detection methods across ImageNet, ImageNet-20, and ImageNet-X. We use AUROC for the evaluation of OOD detection."
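For context on the evaluation metric quoted above: AUROC for OOD detection can be read as the probability that a randomly chosen ID sample receives a higher detection score than a randomly chosen OOD sample. A minimal NumPy sketch of that computation (the function name `auroc` and the score convention "higher means more ID-like" are assumptions of this sketch, not the paper's code):

```python
import numpy as np

def auroc(id_scores, ood_scores):
    """AUROC for OOD detection: the probability that a randomly chosen
    ID sample scores higher than a randomly chosen OOD sample,
    counting ties as one half. Higher scores are assumed ID-like."""
    id_s = np.asarray(id_scores, dtype=float)
    ood_s = np.asarray(ood_scores, dtype=float)
    # Pairwise comparison of every (ID, OOD) score pair; O(n*m),
    # which is fine for a sketch.
    greater = (id_s[:, None] > ood_s[None, :]).sum()
    ties = (id_s[:, None] == ood_s[None, :]).sum()
    return (greater + 0.5 * ties) / (id_s.size * ood_s.size)

# Example: one OOD sample outscores all ID samples -> 6 of 9 pairs correct.
print(auroc([0.9, 0.8, 0.7], [0.1, 0.2, 0.95]))  # ≈ 0.667
```

An AUROC of 0.5 corresponds to a detector that cannot separate ID from OOD at all, and 1.0 to perfect separation.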
Researcher Affiliation | Collaboration | Atsuyuki Miyai (The University of Tokyo), Jingkang Yang (S-Lab, Nanyang Technological University), Jingyang Zhang (Duke University), Yifei Ming (Salesforce AI Research)
Pseudocode | No | The paper describes methodologies in natural language and does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The resource is available at https://github.com/AtsuMiyai/Awesome-OOD-VLM.
Open Datasets | Yes | "In this section, we report results on the widely used ImageNet OOD benchmark (Huang & Li, 2021) and two ImageNet-based hard OOD benchmarks. ... The MVTec-AD dataset (Bergmann et al., 2019) and VisA dataset (Zou et al., 2022) are commonly used."
Dataset Splits | Yes | "In the ImageNet OOD benchmark, ImageNet is used as the ID dataset, while datasets such as iNaturalist (Van Horn et al., 2018) serve as OOD datasets. ... ImageNet-20 is used as the ID dataset, and ImageNet-10, which has no overlapping categories, is used as the OOD dataset. ... Both ID and OOD sets consist of 500 classes each. The ID and OOD subsets of ImageNet-X contain 25,000 images respectively."
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions pre-trained models and frameworks such as CLIP, GPT-4V, LLaVA, Grounding DINO, and SAM, but does not provide version numbers for them or for underlying software dependencies (e.g., Python, PyTorch).
Experiment Setup | Yes | "For CoOp and LoCoOp, we follow the hyperparameter settings from previous studies and train with 16 shots. On the other hand, since IDPrompt requires higher training costs, we follow the original implementation (Bai et al., 2024a) and conduct training with only 1 shot."
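For readers unfamiliar with the "16 shots" / "1 shot" terminology in the quote above: an N-shot setup samples N labeled training images per ID class. A minimal sketch of that subset construction (the `sample_few_shot` helper and the `(image_path, label)` tuple format are illustrative assumptions, not the paper's code):

```python
import random
from collections import defaultdict

def sample_few_shot(dataset, num_shots, seed=0):
    """Build a few-shot training subset with `num_shots` examples per class.

    `dataset` is a list of (image_path, class_label) pairs. Sampling is
    seeded so the same subset can be reproduced across runs.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in dataset:
        by_class[label].append((path, label))
    subset = []
    for label in sorted(by_class):
        items = by_class[label]
        # Take at most num_shots examples; classes smaller than
        # num_shots contribute everything they have.
        subset.extend(rng.sample(items, min(num_shots, len(items))))
    return subset

# Toy dataset: 3 classes with 20 images each.
toy = [(f"img_{i}.jpg", i % 3) for i in range(60)]
print(len(sample_few_shot(toy, 16)))  # 48 (16 shots x 3 classes)
print(len(sample_few_shot(toy, 1)))   # 3  (1 shot x 3 classes)
```

Fixing the seed matters for reproducibility here: few-shot results can vary noticeably with which examples are drawn.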