Perturb, Predict & Paraphrase: Semi-Supervised Learning using Noisy Student for Image Captioning

Authors: Arjit Jain, Pranay Reddy Samala, Preethi Jyothi, Deepak Mittal, Maneesh Singh

IJCAI 2021

Reproducibility assessment, listing each variable, its result, and the supporting LLM response:
Research Type: Experimental
"In this work, we provide an in-depth analysis of the noisy student SSL framework for the task of image captioning and derive state-of-the-art results. Our final results in the limited labeled data setting (1% of the MS-COCO labeled data) outperform previous state-of-the-art approaches by 2.5 on BLEU4 and 11.5 on CIDEr scores."
Researcher Affiliation: Collaboration
"1 Indian Institute of Technology Bombay, 2 Verisk Analytics. {arjit,pranayr,pjyothi}@cse.iitb.ac.in, {deepak.mittal,maneesh.singh}@verisk.com"
Pseudocode: Yes
"Algorithm 1 Noisy Student Training for Captioning. Input: N, L, UI, UC, Paraphraser P, Student model S, Teacher model T"
Open Source Code: Yes
"Code, models, and datasets will be made publicly available at https://github.com/csalt-research/perturb-predict-paraphrase."
Open Datasets: Yes
"We conduct experiments on the MSCOCO dataset [Lin et al., 2014], the standard benchmark used for image captioning. ... For unlabeled data, we use the Unlabeled COCO split from the official MSCOCO Caption challenge."
Dataset Splits: Yes
"We adopt the standard Karpathy split used in all prior work, with 113k images used in training, and 5k images each used for validation and testing."
Hardware Specification: No
No hardware details (GPU or CPU models, processor types, or memory capacity) are provided for running the experiments. The paper mentions models such as Faster-RCNN and BART, implying substantial computational resources, but lists no specific hardware.
Software Dependencies: No
The paper mentions software components such as Attention on Attention Network (AoANet), Faster-RCNN, BART, and BERT, but it specifies no version numbers for these or for underlying frameworks such as PyTorch, TensorFlow, or Python.
Experiment Setup: Yes
"Beam decoding is used for evaluation with the beam width set to 5. ... Unless specified otherwise, we use beam decoding to generate pseudo labels with a beam width of 2. For the teacher model, we use model dropout with probability p = 0.3, no object dropout and label smoothing with probability 0.1. The student model is randomly initialized, and trained from scratch. The labeled batch size is 16, with 5 captions per image, and the unlabeled batch size is 96 with 1 caption per image. The number of noisy student iterations N = 1."
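The pseudocode and setup entries above describe a teacher-student loop: train a teacher on labeled data, pseudo-label the unlabeled images with beam search, then train a noised student on the union, for N iterations (N = 1 in the paper). A minimal, purely illustrative sketch of that control flow is below; the helper names and the toy "model = integer" stand-ins are assumptions, not the authors' implementation (see the GitHub repository above for the real code).

```python
def train_on(init, data):
    # Toy stand-in for supervised captioning training:
    # a "model" here is just an integer counting the examples it was trained on.
    return init + len(data)

def pseudo_label(model, images, beam_width=2):
    # Stand-in for generating pseudo captions with beam search
    # (the paper uses beam width 2 for pseudo-labeling).
    return [(img, "caption-{}-{}".format(model, img)) for img in images]

def noisy_student(L, U, N=1):
    teacher = train_on(0, L)              # 1. train teacher on labeled data L
    student = teacher
    for _ in range(N):                    # 2. repeat for N noisy-student iterations
        pseudo = pseudo_label(teacher, U) # 3. teacher pseudo-labels unlabeled images U
        student = train_on(0, L + pseudo) # 4. train a fresh (noised) student on both
        teacher = student                 # 5. student becomes the next teacher
    return student
```

In the real system, step 4 is where the "perturb" noise enters (model dropout p = 0.3, label smoothing 0.1, a randomly initialized student), and the paraphraser P augments the pseudo captions.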