Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network

Authors: Yehao Li, Yingwei Pan, Ting Yao, Jingwen Chen, Tao Mei (pp. 8518-8526)

AAAI 2021

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments demonstrate the compelling generalizability of our pretrained encoder-decoder by fine-tuning on four VL understanding and generation downstream tasks. Through an extensive set of experiments on four VL understanding and generation downstream tasks, we demonstrate that our pre-trained TDEN achieves new state-of-the-art performances for each task.
Researcher Affiliation Collaboration 1 JD AI Research, Beijing, China 2 Sun Yat-sen University, Guangzhou, China
Pseudocode No The paper describes the architecture and processes, but does not include any explicitly labeled pseudocode blocks or algorithms formatted as code.
Open Source Code Yes Source code is available at https://github.com/YehLi/TDEN.
Open Datasets Yes We conduct the experiments for pretraining over the large-scale image captioning benchmark Conceptual Captions (Sharma et al. 2018). VQA 2.0 (Antol et al. 2015) is adopted for finetuning our TDEN, which consists of 1.1 million questions about images in COCO (Chen et al. 2015). We utilize Flickr30k (Plummer et al. 2015) in this task and each image is equipped with five human-annotated sentences. The Visual Commonsense Reasoning (VCR) benchmark (Zellers et al. 2019) is utilized for evaluation. COCO (Chen et al. 2015) is utilized for fine-tuning and evaluating TDEN.
Dataset Splits Yes During finetuning, we follow the official split (Anderson et al. 2018) and formulate this task as a multi-label classification problem. We follow the commonly adopted split in (Lee et al. 2018) and formulate this task as a ranking problem that sorts images according to the image-sentence similarities, which are measured as in ISM. We utilize the widely adopted Karpathy split (Karpathy and Fei-Fei 2015; Yao et al. 2017b, 2018, 2019) for evaluation.
Hardware Specification Yes We implement the whole architecture with PyTorch (Paszke et al. 2019), optimized with Adam (Kingma and Ba 2015) on 16 Tesla P40 GPUs.
Software Dependencies No The paper mentions 'PyTorch' but does not specify a version number. Other software or libraries are mentioned without versions.
Experiment Setup Yes During pretraining, ... The mini-batch size is 1,024 and the learning rate is set as 0.0001. The maximum iteration is 10 epochs. Finetuning Data and Details on Downstream Tasks. ... cross-entropy loss (mini-batch size: 96, learning rate: 0.00005, maximum iteration: 20 epochs). ... triplet ranking loss (mini-batch size: 512, learning rate: 0.00002, maximum iteration: 30 epochs). ... cross-entropy loss (mini-batch size: 64, learning rate: 0.00002, maximum iteration: 20 epochs). ... mini-batch size is 16 and the learning rate is 0.00003. We set the maximum iteration as 10 epochs. The learning rate is 0.000005 and the maximum iteration is 30 epochs.
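The quoted hyperparameters can be collected into a single lookup for reference. This is a minimal sketch: the numeric values come from the excerpt above, but the task keys and the mapping of each setting to a specific downstream task (VQA, image-sentence ranking, VCR, captioning) are my own inference from the order in which the paper lists its tasks, and are not stated explicitly in the quote.

```python
# Hyperparameters quoted in the paper's experiment-setup section.
# NOTE: the task names below are assumptions inferred from the paper's
# task ordering; only the numeric values are taken from the excerpt.
PRETRAIN = {"batch_size": 1024, "lr": 1e-4, "epochs": 10}

FINETUNE = {
    # cross-entropy loss (assumed: VQA)
    "vqa":        {"batch_size": 96,  "lr": 5e-5, "epochs": 20},
    # triplet ranking loss (assumed: image-sentence ranking)
    "ranking":    {"batch_size": 512, "lr": 2e-5, "epochs": 30},
    # cross-entropy loss (assumed: VCR)
    "vcr":        {"batch_size": 64,  "lr": 2e-5, "epochs": 20},
    # assumed: captioning, first (cross-entropy) training stage
    "caption_xe": {"batch_size": 16,  "lr": 3e-5, "epochs": 10},
    # assumed: captioning, second training stage
    "caption_2nd": {"batch_size": 16, "lr": 5e-6, "epochs": 30},
}

def config_for(task: str) -> dict:
    """Return the finetuning configuration for a downstream task."""
    return FINETUNE[task]
```

Keeping the settings in one table like this makes it easy to spot that every downstream task uses a smaller batch size and learning rate than pretraining.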