DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based Point-Level Consistency

Authors: Yike Yuan, Xinghe Fu, Yunlong Yu, Xi Li

IJCAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 Experiment: We evaluate our method on both classification and semantic segmentation tasks. Table 1: Comparison (%) with other methods. 4.3 Ablation Study
Researcher Affiliation | Academia | Yike Yuan1, Xinghe Fu1, Yunlong Yu2 and Xi Li1,3; 1College of Computer Science and Technology, Zhejiang University; 2College of Information Science and Electronic Engineering, Zhejiang University; 3Zhejiang-Singapore Innovation and AI Joint Research Lab, Hangzhou; {yuanyike, xinghefu, yuyunlong, xilizju}@zju.edu.cn
Pseudocode | No | The paper does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code or a link to a code repository for the described methodology.
Open Datasets | Yes | We choose ViT [Dosovitskiy et al., 2021] as the backbone and ImageNet [Russakovsky et al., 2015] as the training dataset. ... For semantic segmentation, we adopt linear probing protocol following [Ziegler and Asano, 2022]. We train a 1×1 convolutional layer on the frozen patch tokens on Pascal VOC 2012 [Everingham et al., 2010] train + aug split and report mIoU on the valid split.
Dataset Splits | Yes | We train a 1×1 convolutional layer on the frozen patch tokens on Pascal VOC 2012 [Everingham et al., 2010] train + aug split and report mIoU on the valid split.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for the experiments.
Software Dependencies | No | The paper mentions various models and frameworks (e.g., ViT, DINO, Leopart) but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch versions).
Experiment Setup | Yes | We train ViT-Small with 4 views and 4 reference points in each pair of views for 300 epochs for the performance comparison in main results, and ViT-Tiny with 6 views and 4 reference points for 100 epochs for the ablation study. The loss weight α is set as 0.5. ... All models in the experiment are with patch size 16 and trained from scratch unless specified otherwise. Other training parameters are kept the same with the setting of DINO.
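
To make the Experiment Setup row concrete, the quoted pretraining configuration can be written out as a minimal sketch. Only the values taken directly from the paper (backbone, patch size, number of views, reference points per pair of views, epochs, and the loss weight α = 0.5) are grounded; the dictionary keys and the way the DINO-inherited settings are expressed here are assumptions for illustration, not the authors' released configuration.

```python
# Minimal sketch of the DenseDINO pretraining settings quoted above.
# Key names are assumptions; only the values come from the paper.
main_run = {
    "backbone": "vit_small",    # ViT-Small, trained from scratch
    "patch_size": 16,
    "num_views": 4,             # 4 views per image
    "ref_points_per_pair": 4,   # 4 reference points in each pair of views
    "epochs": 300,
    "alpha": 0.5,               # weight of the point-level consistency loss
    # all other optimizer / augmentation settings follow the DINO recipe
}

# Ablation configuration: ViT-Tiny, 6 views, 4 reference points, 100 epochs.
ablation_run = {**main_run, "backbone": "vit_tiny", "num_views": 6, "epochs": 100}
```

Any field not listed here would be expected to fall back to the corresponding DINO default, per the paper's note that the remaining training parameters are kept the same as DINO.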
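The segmentation evaluation quoted under Open Datasets and Dataset Splits (a 1×1 convolution trained on frozen patch tokens, reported as mIoU on Pascal VOC 2012) can be sketched in the same spirit. The 1×1 head, the frozen backbone, and patch size 16 come from the paper; the PyTorch module structure, the assumed (B, N, C) patch-token layout, and the 21 VOC classes are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearProbeSeg(nn.Module):
    """Hypothetical linear-probing head: a single 1x1 conv on frozen patch tokens."""

    def __init__(self, backbone: nn.Module, embed_dim: int, num_classes: int = 21):
        super().__init__()
        self.backbone = backbone              # pretrained ViT, kept frozen
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.head = nn.Conv2d(embed_dim, num_classes, kernel_size=1)  # the 1x1 probe

    def forward(self, x):
        b, _, h, w = x.shape
        with torch.no_grad():
            tokens = self.backbone(x)         # assumed to return patch tokens (B, N, C)
        gh, gw = h // 16, w // 16             # patch size 16, as stated in the paper
        feat = tokens.transpose(1, 2).reshape(b, -1, gh, gw)  # (B, C, gh, gw) grid
        logits = self.head(feat)              # per-patch class logits
        return F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)
```

Under this protocol only self.head would be optimized (for example with per-pixel cross-entropy on the VOC train + aug split), and mIoU would then be computed on the validation split.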