Frozen Transformers in Language Models Are Effective Visual Encoder Layers

Authors: Ziqi Pang, Ziyang Xie, Yunze Man, Yu-Xiong Wang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that our approach consistently enhances performance across a diverse range of tasks, encompassing purely 2D and 3D visual recognition tasks (e.g., image and point cloud classification), temporal modeling tasks (e.g., action recognition), non-semantic tasks (e.g., motion forecasting), and multi-modal tasks (e.g., 2D/3D visual question answering and image-text retrieval). Such improvements are a general phenomenon, applicable to various types of LLMs (e.g., LLaMA and OPT) and different LLM transformer blocks.
Researcher Affiliation | Academia | Ziqi Pang, Ziyang Xie, Yunze Man, Yu-Xiong Wang, University of Illinois Urbana-Champaign, {ziqip2,ziyang8,yunzem2,yxw}@illinois.edu
Pseudocode | Yes | (b) Pseudo-Code (a runnable sketch of this pipeline follows the table)
    def __init__(self, *args, **kwargs):
        # Encoder
        self.ViT = Encoder(args, kwargs)
        self.classifier = Decoder(args, kwargs)
        # Language Transformer
        self.L1 = nn.Linear(ViT.hidden_dim, LM.hidden_dim)
        self.LM = LM_Transformer(args, kwargs)
        self.L2 = nn.Linear(LM.hidden_dim, ViT.hidden_dim)
        # Freezing
        for param in self.LM.parameters():
            param.requires_grad = False

    def forward(self, img):
        z = self.ViT(img)
        z = self.L1(z)
        z = self.LM(z)
        z = self.L2(z)
        y = self.classifier(z)
        return y
Open Source Code | Yes | https://github.com/ziqipang/LM4VisualEncoding
Open Datasets | Yes | We conduct experiments on ImageNet-1k (Deng et al., 2009)... Our experiments cover two common datasets: ScanObjectNN (Uy et al., 2019) and ModelNet40 (Goyal et al., 2021)... We choose the benchmark of Something-Something-v2 (SSv2) (Goyal et al., 2017b)... We conduct experiments on Argoverse (Chang et al., 2019)... We experiment with 2D vision-language (VL) tasks, including visual question answering (VQA) on VQAv2 (Goyal et al., 2017c) and zero-shot image retrieval on Flickr30k (Plummer et al., 2015)... We conduct our experiments on the SQA3D (Ma et al., 2023) dataset...
Dataset Splits | Yes | We train both the baseline ViT models and ViT+LLaMA from scratch following the same configuration of DeiT (Touvron et al., 2021). ... we strictly follow the same hyper-parameters and configurations on VQAv2 (Goyal et al., 2017c) and Flickr30k (Plummer et al., 2015) provided by METER.
Hardware Specification | Yes | The total time for training lasts 4-6 days on 4 A100 GPUs. ... The training takes around 6-10 hours on 4 A100 GPUs. ... Finally, the ViT-S-LLaMA and ViT-B-LLaMA experiments take around 3-4 days on 4 A100 GPUs. ... The training time for both models is around 2 days on one A100 GPU. ... Finally, each training on VQAv2 and Flickr30k lasts for 10 epochs and around 1 day on 4 A100 GPUs. ... The model is trained on a single A100 GPU.
Software Dependencies | No | The paper mentions 'PyTorch-style pseudo-code' and refers to an optimizer (AdamW) and a learning-rate schedule (cosine annealing), but it does not specify software dependencies with version numbers (e.g., PyTorch version, specific library versions).
Experiment Setup | Yes | The most important configurations include a total of 300 epochs, a base learning rate of 5e-4, a cosine annealing learning rate schedule (Loshchilov & Hutter, 2016), and an AdamW optimizer (Kingma & Ba, 2014; Loshchilov & Hutter, 2017). The only change we adopt is the warm-up length of 20 epochs, compared with the original warm-up of 10 epochs in DeiT.
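
The quoted pseudo-code compresses the method's whole information path: visual tokens from a ViT, a linear projection up to the LLM hidden width, a single frozen LLM transformer block, a projection back down, and a task head. The following self-contained PyTorch sketch makes that path concrete; the toy patch-embedding layer, the chosen dimensions, and the nn.TransformerEncoderLayer standing in for a pretrained LLaMA block are illustrative assumptions, not the authors' released implementation (see the repository above for that).

import torch
from torch import nn

class FrozenLMEncoder(nn.Module):
    def __init__(self, vit_dim=384, lm_dim=4096, num_classes=1000):
        super().__init__()
        # Toy stand-ins for the visual encoder (ViT) and the classification head
        self.vit = nn.Linear(3 * 16 * 16, vit_dim)
        self.classifier = nn.Linear(vit_dim, num_classes)
        # Linear adapters around the language-model transformer block
        self.l1 = nn.Linear(vit_dim, lm_dim)
        self.lm = nn.TransformerEncoderLayer(d_model=lm_dim, nhead=32, batch_first=True)
        self.l2 = nn.Linear(lm_dim, vit_dim)
        # Freeze the LM block, as in the pseudo-code's requires_grad = False loop
        for param in self.lm.parameters():
            param.requires_grad = False

    def forward(self, patches):
        z = self.vit(patches)                  # (B, N, vit_dim) visual tokens
        z = self.l2(self.lm(self.l1(z)))       # project up, frozen LM block, project back
        return self.classifier(z.mean(dim=1))  # pool the tokens and classify

tokens = torch.randn(2, 196, 3 * 16 * 16)      # 2 images, 14x14 patches of 16x16x3 pixels
print(FrozenLMEncoder()(tokens).shape)         # torch.Size([2, 1000])

Because the LM block stays frozen, the only trainable parameters added on top of the baseline visual encoder are the two linear adapters and the head, which is what makes the comparison against the plain ViT baseline meaningful.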
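
For the DeiT-style recipe quoted in the Experiment Setup row (AdamW, base learning rate 5e-4, cosine annealing, 20 warm-up epochs, 300 epochs in total), a minimal scheduling sketch might look as follows. The dummy model, the weight-decay value, and the use of SequentialLR to implement the warm-up are assumptions for illustration, not the paper's exact configuration.

from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

EPOCHS, WARMUP_EPOCHS, BASE_LR = 300, 20, 5e-4

model = nn.Linear(768, 1000)                   # placeholder for the ViT(+frozen LLM) network
optimizer = AdamW(model.parameters(), lr=BASE_LR, weight_decay=0.05)  # weight decay assumed

warmup = LinearLR(optimizer, start_factor=1e-3, total_iters=WARMUP_EPOCHS)
cosine = CosineAnnealingLR(optimizer, T_max=EPOCHS - WARMUP_EPOCHS)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[WARMUP_EPOCHS])

for epoch in range(EPOCHS):
    # ... the per-batch forward/backward passes over ImageNet-1k would run here ...
    optimizer.step()                           # stand-in for the inner update loop
    scheduler.step()                           # epoch-level learning-rate update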