Label-Focused Inductive Bias over Latent Object Features in Visual Classification

Authors: Ilmin Kang, HyounYoung Bae, Kangil Kim

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on an image classification task show that LLB improves performance in both quantitative and qualitative analyses.
Researcher Affiliation | Academia | AI Graduate School, GIST, Republic of Korea; {kangilmin0325, bonheur606060, kangilkim}@gmail.com
Pseudocode | No | The paper describes the steps of the LLB method in narrative text and with diagrams (Figure 3), but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | The codes are available at https://github.com/GIST-IRR/LLB
Open Datasets | Yes | We use standard ImageNet (IN1K) (Deng et al., 2009), which consists of 1.28M training images with 1000 classes. We also use additional benchmarks including the reassessed-labels dataset ImageNet-ReaL (IN-Real) (Beyer et al., 2020), the scene recognition dataset Places365-Standard (Places) (López-Cifuentes et al., 2020), and the fine-grained, long-tailed iNaturalist2018 (iNat18) (Van Horn et al., 2018) dataset.
Dataset Splits | Yes | We use standard ImageNet (IN1K) (Deng et al., 2009), which consists of 1.28M training images with 1000 classes. We also use additional benchmarks including the reassessed-labels dataset ImageNet-ReaL (IN-Real) (Beyer et al., 2020), the scene recognition dataset Places365-Standard (Places) (López-Cifuentes et al., 2020), and the fine-grained, long-tailed iNaturalist2018 (iNat18) (Van Horn et al., 2018) dataset. For baselines, we first followed (Dosovitskiy et al., 2020; Steiner et al., 2021) to get a vanilla ViT pre-trained on ImageNet-21K (IN21K) (Ridnik et al., 2021).
Hardware Specification | Yes | Our experiments run on 8 A100 GPUs with an additional 4 A6000 GPUs, both for reproducing baselines and for training LLB.
Software Dependencies | No | The paper mentions 'Optimizer Adam (Kingma & Ba, 2014)' and 'RandAugment (Cubuk et al., 2020)' but does not provide specific version numbers for these or any other software libraries or dependencies.
Experiment Setup | Yes | Our LLB is built upon pre-trained classical visual feature backbones. We extract hidden vectors from the backbone while keeping the backbone parameters frozen. For the backbone, we use ViT (Dosovitskiy et al., 2020) networks. We use the l_V-th layer and consider its outputs as visual features V_{l_V} = [v^0_{l_V}; v^1_{l_V}; ...; v^i_{l_V}]. We found that extraction from the l_V = L_V - 1 layer showed the best performance (Figure 7a in Appendix). LLB takes V_{l_V} and clusters them into O latent objects. Based on our experiments (Figure 7b in Appendix), we use O = 2048. We report the results for different α in Figure 7d in Appendix and selected the best one among them. Additional model settings are summarized in Table 3. We train our model with cross-entropy loss plus the object diversity regularization term in Equation (2). See Table 4 in Appendix for the detailed hyper-parameters we used.
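
For context, the sketch below is a minimal, hypothetical PyTorch rendering of the pipeline the quoted setup describes: a frozen ViT backbone, patch tokens taken from layer L_V - 1, soft clustering onto O = 2048 learnable latent-object prototypes, and training with cross-entropy plus an α-weighted diversity term (the Adam optimizer mentioned above). The names LatentObjectHead and diversity_loss, the entropy-style regularizer, and the timm get_intermediate_layers extraction call are assumptions for illustration only and are not taken from the authors' implementation at https://github.com/GIST-IRR/LLB.

```python
# Hypothetical sketch of an LLB-style head on a frozen ViT backbone.
# This is NOT the official implementation; see https://github.com/GIST-IRR/LLB.
import torch
import torch.nn as nn
import torch.nn.functional as F
import timm

class LatentObjectHead(nn.Module):
    def __init__(self, dim, num_objects=2048, num_classes=1000):
        super().__init__()
        # Learnable latent-object prototypes that patch tokens are clustered onto.
        self.prototypes = nn.Parameter(torch.randn(num_objects, dim) * 0.02)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, tokens):                      # tokens: (B, N, D) from layer L_V - 1
        assign = torch.einsum("bnd,od->bno", tokens, self.prototypes)
        assign = assign.softmax(dim=-1)             # soft clustering of tokens to objects
        objects = torch.einsum("bno,bnd->bod", assign, tokens)  # per-object pooled features
        pooled = objects.mean(dim=1)                # aggregate latent objects
        return self.classifier(pooled), assign

backbone = timm.create_model("vit_base_patch16_224", pretrained=True)
for p in backbone.parameters():                     # backbone stays frozen
    p.requires_grad_(False)

head = LatentObjectHead(dim=768)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)
alpha = 0.1                                         # placeholder; the paper tunes alpha (Fig. 7d)

def diversity_loss(assign):
    # Encourage tokens to spread over many latent objects; one plausible reading
    # of the "object diversity regularization" in the paper's Equation (2).
    usage = assign.mean(dim=(0, 1))                  # (O,) average assignment mass per object
    return (usage * usage.clamp_min(1e-8).log()).sum()   # negative entropy of object usage

def training_step(images, labels):
    with torch.no_grad():
        # get_intermediate_layers(n=2)[0] approximates features from layer L_V - 1;
        # the exact extraction API depends on the backbone library and version.
        tokens = backbone.get_intermediate_layers(images, n=2)[0]
    logits, assign = head(tokens)
    loss = F.cross_entropy(logits, labels) + alpha * diversity_loss(assign)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```

In this sketch the soft-assignment matrix serves double duty: it pools tokens into latent-object features and is the quantity the α-weighted diversity term regularizes, so many latent objects stay in use rather than collapsing to a few.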