Frozen Transformers in Language Models Are Effective Visual Encoder Layers
Authors: Ziqi Pang, Ziyang Xie, Yunze Man, Yu-Xiong Wang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that our approach consistently enhances performance across a diverse range of tasks, encompassing purely 2D and 3D visual recognition tasks (e.g., image and point cloud classification), temporal modeling tasks (e.g., action recognition), non-semantic tasks (e.g., motion forecasting), and multi-modal tasks (e.g., 2D/3D visual question answering and image-text retrieval). Such improvements are a general phenomenon, applicable to various types of LLMs (e.g., LLaMA and OPT) and different LLM transformer blocks. |
| Researcher Affiliation | Academia | Ziqi Pang Ziyang Xie Yunze Man Yu-Xiong Wang University of Illinois Urbana-Champaign {ziqip2,ziyang8,yunzem2,yxw}@illinois.edu |
| Pseudocode | Yes | (b) Pseudo-Code def __init__(self, *args, **kwargs): # Encoder self.ViT = Encoder(args, kwargs) self.classifier = Decoder(args, kwargs) # Language Transformer self.L1 = nn.Linear(ViT.hidden_dim, LM.hidden_dim) self.LM = LM_Transformer(args, kwargs) self.L2 = nn.Linear(LM.hidden_dim, ViT.hidden_dim) # Freezing for param in self.LM.parameters(): param.requires_grad = False def forward(self, img): z = self.ViT(img) z = self.L1(z) z = self.LM(z) z = self.L2(z) y = self.classifier(z) return y (a runnable version of this pseudo-code appears after the table) |
| Open Source Code | Yes | https://github.com/ziqipang/LM4VisualEncoding |
| Open Datasets | Yes | We conduct experiments on ImageNet-1k (Deng et al., 2009)... Our experiments cover two common datasets: ScanObjectNN (Uy et al., 2019) and ModelNet40 (Goyal et al., 2021)... We choose the benchmark of Something-something-v2 (SSv2) (Goyal et al., 2017b)... We conduct experiments on Argoverse (Chang et al., 2019)... We experiment with 2D vision-language (VL) tasks, including visual question answering (VQA) on VQAv2 (Goyal et al., 2017c) and zero-shot image retrieval on Flickr30k (Plummer et al., 2015)... We conduct our experiments on the SQA3D (Ma et al., 2023) dataset... |
| Dataset Splits | Yes | We train both the baseline ViT models and ViT+LLaMA from scratch following the same configuration of DeiT (Touvron et al., 2021). ... we strictly follow the same hyper-parameters and configurations on VQAv2 (Goyal et al., 2017c) and Flickr30k (Plummer et al., 2015) provided by METER. |
| Hardware Specification | Yes | The total time for training lasts 4-6 days on 4 A100 GPUs. ... The training takes around 6-10 hours on 4 A100 GPUs. ... Finally, the ViT-S-LLaMA and ViT-B-LLaMA experiments take around 3-4 days on 4 A100 GPUs. ... The training time for both models is around 2 days on one A100 GPU. ... Finally, each training on VQAv2 and Flickr30k lasts for 10 epochs and around 1 day on 4 A100 GPUs. ... The model is trained on a single A100 GPU. |
| Software Dependencies | No | The paper mentions 'PyTorch-style pseudo-code' and refers to an optimizer (AdamW) and a learning-rate schedule (cosine annealing), but it does not specify software dependencies with version numbers (e.g., PyTorch version, specific library versions). |
| Experiment Setup | Yes | The most important configurations include a total of 300 epochs, a base learning rate of 5e-4, a cosine annealing learning rate schedule (Loshchilov & Hutter, 2016), and an AdamW optimizer (Kingma & Ba, 2014; Loshchilov & Hutter, 2017). The only change we adopt is the warm-up length of 20 epochs, compared with the original warm-up of 10 epochs in DeiT. (see the optimizer/schedule sketch after the table) |
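The pseudo-code row above is a flattened quote of the paper's PyTorch-style sketch. Below is a minimal runnable rendering of the same idea; the class name `FrozenLMVisualEncoder`, the constructor arguments, and the smoke-test dimensions are illustrative assumptions, not the authors' actual implementation, which inserts a pretrained, frozen LLaMA/OPT transformer block between two trainable linear projections on top of a ViT.

```python
import torch
import torch.nn as nn


class FrozenLMVisualEncoder(nn.Module):
    """ViT tokens -> linear projection -> frozen LLM block -> linear projection -> task head."""

    def __init__(self, vit, llm_block, classifier, vit_dim, llm_dim):
        super().__init__()
        self.vit = vit                         # visual encoder (trainable)
        self.l1 = nn.Linear(vit_dim, llm_dim)  # project visual tokens to LLM width (trainable)
        self.llm_block = llm_block             # pretrained LLM transformer block
        self.l2 = nn.Linear(llm_dim, vit_dim)  # project back to visual width (trainable)
        self.classifier = classifier           # task head (trainable)

        # Freeze the LLM block, as in the paper's pseudo-code.
        for param in self.llm_block.parameters():
            param.requires_grad = False

    def forward(self, img):
        z = self.vit(img)          # (B, N, vit_dim) visual tokens
        z = self.l1(z)             # (B, N, llm_dim)
        z = self.llm_block(z)      # frozen LLM transformer block
        z = self.l2(z)             # (B, N, vit_dim)
        return self.classifier(z)  # task prediction


if __name__ == "__main__":
    # Smoke test with small stand-in modules (the real model uses a ViT and a LLaMA block).
    vit = nn.Linear(32, 192)  # pretend patch features -> ViT token features
    llm_block = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
    head = nn.Linear(192, 1000)
    model = FrozenLMVisualEncoder(vit, llm_block, head, vit_dim=192, llm_dim=512)
    print(model(torch.randn(2, 197, 32)).shape)  # torch.Size([2, 197, 1000])
```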
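The "Experiment Setup" row quotes a DeiT-style recipe (300 epochs, base learning rate 5e-4, AdamW, cosine annealing, 20-epoch warm-up). The snippet below sketches that optimizer and schedule in plain PyTorch; the weight-decay value, the warm-up start factor, and per-epoch stepping are assumptions rather than values stated in the paper.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

EPOCHS = 300        # total epochs, as quoted above
WARMUP_EPOCHS = 20  # warm-up length, as quoted above
BASE_LR = 5e-4      # base learning rate, as quoted above


def build_optimizer_and_schedule(model: torch.nn.Module):
    # Only trainable parameters are updated; the frozen LLM block is excluded automatically.
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = AdamW(params, lr=BASE_LR, weight_decay=0.05)  # weight decay assumed (DeiT default)

    # Linear warm-up for the first 20 epochs, then cosine annealing; step once per epoch.
    warmup = LinearLR(optimizer, start_factor=1e-3, total_iters=WARMUP_EPOCHS)
    cosine = CosineAnnealingLR(optimizer, T_max=EPOCHS - WARMUP_EPOCHS)
    schedule = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[WARMUP_EPOCHS])
    return optimizer, schedule
```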