MOFI: Learning Image Representations from Noisy Entity Annotated Images

Authors: Wentao Wu, Aleksei Timofeev, Chen Chen, Bowen Zhang, Kun Duan, Shuangning Liu, Yantao Zheng, Jonathon Shlens, Xianzhi Du, Yinfei Yang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that supervised pre-training with large-scale fine-grained entity labels is highly effective for image retrieval tasks, and multi-task training further improves the performance. The final MOFI model achieves 86.66% mAP on the challenging GPR1200 dataset, surpassing the previous state-of-the-art performance of 72.19% from OpenAI's CLIP model. Further experiments on zero-shot and linear-probe image classification also show that MOFI outperforms a CLIP model trained on the original image-text data, demonstrating the effectiveness of the I2E dataset in learning strong image representations.
Researcher Affiliation | Industry | Apple AI/ML, {wentaowu,atimofeev,xianzhi,yinfeiy}@apple.com
Pseudocode | No | The paper describes methods in text and provides mathematical equations, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We release our code and model weights at https://github.com/apple/ml-mofi.
Open Datasets | Yes | We first evaluate the models on image retrieval tasks on GPR1200 (Schall et al., 2021) and ImageNet-1K (Russakovsky et al., 2015).
Dataset Splits | Yes | For ImageNet, we modify its validation set to use as an image retrieval evaluation set. (A sketch of this kind of evaluation follows below.)
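The quoted evidence does not spell out how the validation set is repurposed for retrieval. A common protocol, shown here as a hypothetical sketch (function names and the precision@1 metric are illustrative simplifications; MOFI reports mAP), is to treat each labeled image as a query and count retrieved images sharing its class label as correct:

```python
# Hypothetical sketch of class-label-based retrieval evaluation on a
# labeled validation set. Not the authors' protocol; precision@1 is a
# simplified stand-in for the mAP metric used in the paper.
import numpy as np

def retrieval_precision_at_1(embeddings: np.ndarray, labels: np.ndarray) -> float:
    # L2-normalize so the dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # a query must not retrieve itself
    nearest = sims.argmax(axis=1)    # index of the top-1 retrieved image
    return float((labels[nearest] == labels).mean())
```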
Hardware Specification | No | The paper mentions using '256 machines' for data annotation, but it does not provide specific hardware details (e.g., CPU/GPU models, memory) for the actual model training or experiments.
Software Dependencies | No | The paper mentions using the 'OPT tokenizer' and 'AdamW optimizer' and states that 'All experiments are conducted using AXLearn,' but it does not provide specific version numbers for these software components or libraries.
Experiment Setup | Yes | All models are trained with 224x224 input image size using the AdamW optimizer (Loshchilov & Hutter, 2017) with weight decay 0.1 and learning rate 0.0008, except that MOFI-L/14 uses a learning rate of 0.0006. The learning rate is warmed up over the first 10,000 steps, and cosine decay is applied until the last training step. Due to the computation limit, we train the CLIP models for 600k steps with global batch size 32,768, and train the other models for 1.2M steps with global batch size 16,384, so all the models have seen the same number of training examples. The number of entities N used in classification for each batch is set to 512k.
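For concreteness, here is a minimal sketch of the optimization recipe quoted above. It is not the authors' AXLearn code; the model is a placeholder, and only the hyperparameters stated in the paper (AdamW, weight decay 0.1, peak learning rate 8e-4, 10,000-step warmup, cosine decay to the final step) are taken from the source:

```python
# Minimal sketch, assuming PyTorch; mirrors the hyperparameters quoted
# above, not the authors' actual AXLearn training configuration.
import math
import torch

TOTAL_STEPS = 1_200_000   # non-CLIP models; the CLIP baselines use 600k steps
WARMUP_STEPS = 10_000
PEAK_LR = 8e-4            # MOFI-L/14 uses 6e-4 instead

model = torch.nn.Linear(768, 512)  # placeholder for the actual image encoder

optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR, weight_decay=0.1)

def lr_scale(step: int) -> float:
    """Linear warmup over the first 10k steps, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# Call scheduler.step() once per training step so the lambda's argument
# counts optimizer steps rather than epochs.
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_scale)
```

The 512k entities sampled per batch for the classification head are not reflected in this sketch, as the paper excerpt does not describe the sampling mechanism.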