Perturb, Predict & Paraphrase: Semi-Supervised Learning using Noisy Student for Image Captioning
Authors: Arjit Jain, Pranay Reddy Samala, Preethi Jyothi, Deepak Mittal, Maneesh Singh
IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we provide an in-depth analysis of the noisy student SSL framework for the task of image captioning and derive state-of-the-art results. Our final results in the limited labeled data setting (1% of the MS-COCO labeled data) outperform previous state-of-the-art approaches by 2.5 on BLEU4 and 11.5 on CIDEr scores. |
| Researcher Affiliation | Collaboration | 1 Indian Institute of Technology Bombay, 2 Verisk Analytics; {arjit,pranayr,pjyothi}@cse.iitb.ac.in, {deepak.mittal,maneesh.singh}@verisk.com |
| Pseudocode | Yes | Algorithm 1 Noisy Student Training for Captioning. Input: N, L, UI, UC, Paraphraser P, Student model S, Teacher model T |
| Open Source Code | Yes | Code, models, and datasets will be made publicly available at https://github.com/csalt-research/perturb-predict-paraphrase. |
| Open Datasets | Yes | We conduct experiments on the MSCOCO dataset [Lin et al., 2014], the standard benchmark used for image captioning. ... For unlabeled data, we use the Unlabeled COCO split from the official MSCOCO Caption challenge. |
| Dataset Splits | Yes | We adopt the standard Karpathy split used in all prior work, with 113k images used in training, and 5k images each used for validation and testing. |
| Hardware Specification | No | No specific hardware details such as GPU or CPU models, processor types, or memory specifications are provided for running the experiments. The paper mentions compute-intensive models such as Faster-RCNN and BART, but does not list the hardware used to train or run them. |
| Software Dependencies | No | The paper mentions software components such as the Attention on Attention Network (AoANet), Faster-RCNN, BART, and BERT, but it does not specify version numbers for any of these, for underlying frameworks such as PyTorch or TensorFlow, or for Python. |
| Experiment Setup | Yes | Beam decoding is used for evaluation with the beam width set to 5. ... Unless specified otherwise, we use beam decoding to generate pseudo labels with a beam width of 2. For the teacher model, we use model dropout with probability p = 0.3, no object dropout and label smoothing with probability 0.1. The student model is randomly initialized, and trained from scratch. The labeled batch size is 16, with 5 captions per image, and the unlabeled batch size is 96 with 1 caption per image. The number of noisy student iterations N = 1. |
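
The Pseudocode row above reproduces only the header of Algorithm 1. Below is a minimal Python sketch of how such a loop could be organized, assuming the generic noisy-student recipe (train the teacher on labeled data, pseudo-label the unlabeled images with the teacher, paraphrase the unlabeled captions, train a noised student, and promote the student to teacher). The training and inference routines are passed in as callables because the paper's concrete implementations are not reproduced here; this is a sketch of the control flow, not the authors' released code.

```python
from typing import Callable, Sequence, Tuple

# Hedged sketch of Algorithm 1 ("Noisy Student Training for Captioning").
# Only the inputs (N, L, UI, UC, paraphraser P, student S, teacher T) are taken
# from the paper; the loop body follows the generic noisy-student recipe, and
# the training/inference routines are placeholders supplied by the caller.

def noisy_student_captioning(
    N: int,                               # number of noisy-student iterations (paper: N = 1)
    L: Sequence[Tuple[object, str]],      # labeled (image, caption) pairs
    UI: Sequence[object],                 # unlabeled images
    UC: Sequence[str],                    # unlabeled captions
    P: Callable[[str], str],              # paraphraser
    S: object,                            # student model (randomly initialized)
    T: object,                            # teacher model
    train_supervised: Callable,           # trains a model on labeled pairs (placeholder)
    train_noisy: Callable,                # trains a noised student on all data (placeholder)
    generate_caption: Callable,           # beam-decoding inference (placeholder)
    pseudo_beam_width: int = 2,           # paper: beam width 2 for pseudo labels
):
    # 1. Train the teacher on the labeled data only.
    T = train_supervised(T, L)

    for _ in range(N):
        # 2. Teacher predicts pseudo-captions for the unlabeled images.
        pseudo = [(img, generate_caption(T, img, pseudo_beam_width)) for img in UI]

        # 3. Paraphrase the unlabeled captions to enlarge the caption pool; how the
        #    paraphrases enter the training objective is not spelled out in the
        #    quoted excerpt, so it is left to the training routine.
        para = [P(c) for c in UC]

        # 4. Train a noised, freshly initialized student on labeled plus
        #    pseudo-labeled data (and the paraphrased captions).
        S = train_noisy(S, L, pseudo, para)

        # 5. The student becomes the teacher for the next iteration.
        T = S

    return S
```

The beam width of 2 for pseudo-labeling and N = 1 are the only loop-level values quoted from the paper; every helper above is a placeholder.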
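
The Experiment Setup row quotes the concrete hyperparameters scattered through the paper's text. For convenience they are collected below into a single configuration dictionary; the key names are ours, and the values are taken directly from the quoted excerpt.

```python
# Hyperparameters collected from the Experiment Setup row above.
# Key names are our own; values are quoted from the paper.
EXPERIMENT_CONFIG = {
    "eval_beam_width": 5,             # beam decoding for evaluation
    "pseudo_label_beam_width": 2,     # beam decoding when generating pseudo labels
    "teacher_model_dropout": 0.3,     # teacher model dropout probability p
    "teacher_object_dropout": 0.0,    # no object dropout for the teacher
    "label_smoothing": 0.1,
    "student_init": "random",         # student trained from scratch
    "labeled_batch_size": 16,
    "labeled_captions_per_image": 5,
    "unlabeled_batch_size": 96,
    "unlabeled_captions_per_image": 1,
    "noisy_student_iterations": 1,    # N = 1
}
```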