LoTLIP: Improving Language-Image Pre-training for Long Text Understanding

Authors: Wei Wu, Kecheng Zheng, Shuailei Ma, Fan Lu, Yuxin Guo, Yifei Zhang, Wei Chen, Qingpei Guo, Yujun Shen, Zheng-Jun Zha

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5 Experiments
Researcher Affiliation | Collaboration | Wei Wu (1), Kecheng Zheng (2,3), Shuailei Ma (4), Fan Lu (1), Yuxin Guo (5), Yifei Zhang (6), Wei Chen (3), Qingpei Guo (2), Yujun Shen (2), Zheng-Jun Zha (1); affiliations: 1 University of Science and Technology of China, 2 Ant Group, 3 Zhejiang University, 4 Northeastern University, China, 5 Institute of Automation, Chinese Academy of Sciences, 6 Shanghai Jiao Tong University
Pseudocode | No | The paper does not contain a pseudocode block or algorithm figure.
Open Source Code | No | The code for this paper requires approval before it can be made open source, hence it is not provided in this submission. However, the code, models, and datasets of this paper will be made publicly accessible after this submission to ensure the reproducibility of the experiments and to foster research progress within the community.
Open Datasets | Yes | To construct long text-image pairs for language-image pre-training, we re-captioned 100 million images with long texts. Specifically, we collected the images from the CC3M [27], CC12M [27], YFCC15M [29], LAION [26], and COYO [2] datasets.
Dataset Splits | No | We conduct ablation studies to validate our model on the 3M-scale pre-training data. The performance of LoTLIP pre-trained with the 12M- and 30M-scale datasets is shown in the Supplementary Material. ... For short image-text retrieval, we evaluate on MSCOCO [18] and Flickr30k Caption [35] and report the Recall at 1/5 (R@1/5) metric for comparison. For image classification, we evaluate on ImageNet-1k [9] and report top-1 accuracy (Acc@1). (A hedged sketch of the R@K computation appears after the table.)
Hardware Specification | Yes | To obtain our LoTLIP trained on the 100M-scale dataset, we use A100 GPUs with 80 GB of memory for training, which costs about 133 GPU days.
Software Dependencies | No | The paper mentions using BERT [10] as the text encoder and a vision transformer pre-trained on ImageNet-21K as the image encoder, specifically ViT-B/16. It also mentions InstructBLIP [7], LLaVA [20], and ShareGPT4V [6] for caption generation. However, it does not specify software versions for programming languages or libraries (e.g., Python, PyTorch, TensorFlow). A hedged sketch of the dual-encoder setup these components imply follows the table.
Experiment Setup | Yes | The maximum text token length is set to 128 unless otherwise specified. Three consecutive sub-captions are randomly selected to form the long texts used as text input (see the construction sketch below). We train for 10 epochs on the 3M- and 100M-scale datasets. For the 3M dataset, the batch size is set to 2560, while that of the 100M dataset is set to 16384. The other pre-training hyperparameters, e.g., learning rate, warmup steps, and weight decay, are kept under the same setting.
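
The long-text input described in the experiment setup is formed by sampling three consecutive sub-captions. Below is a minimal sketch of that construction, assuming the sub-captions of an image are available as a plain Python list; the function name and the tokenizer call in the trailing comment are illustrative assumptions, not the authors' code.

```python
# Hedged sketch of the long-text construction described in the setup:
# three consecutive sub-captions are sampled and joined, and the result
# is later tokenized with a 128-token cap.
import random
from typing import List


def build_long_text(sub_captions: List[str], num_subs: int = 3) -> str:
    """Randomly pick `num_subs` consecutive sub-captions and join them."""
    if len(sub_captions) <= num_subs:
        return " ".join(sub_captions)
    start = random.randrange(len(sub_captions) - num_subs + 1)
    return " ".join(sub_captions[start:start + num_subs])


subs = [
    "A brown dog runs across a grassy park.",
    "Trees line the path in the background.",
    "A child in a red coat throws a ball.",
    "The sky is overcast with light clouds.",
]
long_text = build_long_text(subs)
# In pre-training the joined text would then be tokenized and truncated to
# the 128-token maximum, e.g. (hypothetical call):
# tokens = tokenizer(long_text, truncation=True, max_length=128)
print(long_text)
```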
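
For context on the dual-encoder components listed under Software Dependencies (BERT text encoder, ImageNet-21K pre-trained ViT-B/16 image encoder), the following is a hedged PyTorch sketch of the symmetric contrastive (InfoNCE) objective that CLIP-style language-image pre-training typically optimizes. Random tensors stand in for the encoder outputs; this illustrates the general technique rather than the authors' implementation or LoTLIP's full training objective.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss; encoder outputs
# are simulated with random features (the paper uses BERT for text and an
# ImageNet-21K pre-trained ViT-B/16 for images).
import torch
import torch.nn.functional as F


def contrastive_loss(image_feats: torch.Tensor,
                     text_feats: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature    # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)        # text -> image direction
    return (loss_i2t + loss_t2i) / 2


# Example with random features standing in for encoder outputs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(contrastive_loss(img, txt).item())
```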
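
The R@1/5 numbers reported for MSCOCO and Flickr30k retrieval come from ranking candidates by image-text similarity. The sketch below assumes, for simplicity, one ground-truth caption per image, so it illustrates how Recall@K is computed rather than reproducing the exact benchmark protocol (which pairs each image with multiple captions).

```python
# Hedged sketch of Recall@K for image-to-text retrieval, assuming one
# ground-truth text per image at the matching index.
import torch


def recall_at_k(image_feats: torch.Tensor,
                text_feats: torch.Tensor,
                k: int) -> float:
    """Fraction of images whose paired text is among the top-k retrieved texts."""
    sims = image_feats @ text_feats.t()                  # similarity scores (cosine if L2-normalized)
    topk = sims.topk(k, dim=-1).indices                  # indices of top-k texts per image
    targets = torch.arange(sims.size(0)).unsqueeze(-1)   # ground-truth text index per image
    hits = (topk == targets).any(dim=-1).float()
    return hits.mean().item()


feats_i = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)
feats_t = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)
print(recall_at_k(feats_i, feats_t, k=5))  # R@5 for image-to-text retrieval
```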