Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Perturb, Predict & Paraphrase: Semi-Supervised Learning using Noisy Student for Image Captioning
Authors: Arjit Jain, Pranay Reddy Samala, Preethi Jyothi, Deepak Mittal, Maneesh Singh
IJCAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we provide an in-depth analysis of the noisy student SSL framework for the task of image captioning and derive state-of-the-art results. Our final results in the limited labeled data setting (1% of the MS-COCO labeled data) outperform previous state-of-the-art approaches by 2.5 on BLEU4 and 11.5 on CIDEr scores. |
| Researcher Affiliation | Collaboration | (1) Indian Institute of Technology Bombay; (2) Verisk Analytics |
| Pseudocode | Yes | Algorithm 1 Noisy Student Training for Captioning. Input: N, L, UI, UC, Paraphraser P, Student model S, Teacher model T |
| Open Source Code | Yes | Code, models, and datasets will be made publicly available at https://github.com/csalt-research/perturb-predict-paraphrase. |
| Open Datasets | Yes | We conduct experiments on the MSCOCO dataset [Lin et al., 2014], the standard benchmark used for image captioning. ... For unlabeled data, we use the Unlabeled COCO split from the official MSCOCO Caption challenge. |
| Dataset Splits | Yes | We adopt the standard Karpathy split used in all prior work, with 113k images used in training, and 5k images each used for validation and testing. |
| Hardware Specification | No | No specific hardware details such as GPU or CPU models, processor types, or memory specifications are provided for running the experiments. The paper mentions models like Faster-RCNN and BART, implying computational resources were used, but no specific hardware is listed. |
| Software Dependencies | No | The paper mentions software components such as Attention on Attention Network (AoANet), Faster-RCNN, BART, and BERT, but it does not specify version numbers for any of these, for underlying frameworks such as PyTorch or TensorFlow, or for Python. |
| Experiment Setup | Yes | Beam decoding is used for evaluation with the beam width set to 5. ... Unless specified otherwise, we use beam decoding to generate pseudo labels with a beam width of 2. For the teacher model, we use model dropout with probability p = 0.3, no object dropout and label smoothing with probability 0.1. The student model is randomly initialized, and trained from scratch. The labeled batch size is 16, with 5 captions per image, and the unlabeled batch size is 96 with 1 caption per image. The number of noisy student iterations N = 1. |
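
The reported setup and the generic noisy student loop can be sketched as follows. This is a minimal illustration, not the authors' Algorithm 1: the numeric defaults come from the Experiment Setup row above, while every name (`SetupConfig`, `noisy_student`, the trainer callables) is hypothetical, the paraphraser step is omitted, and "no object dropout" is encoded as a probability of 0.0 by assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class SetupConfig:
    """Hyperparameters quoted in the Experiment Setup row (field names are hypothetical)."""
    eval_beam_width: int = 5              # beam width used for evaluation
    pseudo_label_beam_width: int = 2      # beam width used to generate pseudo labels
    teacher_model_dropout: float = 0.3    # model dropout probability for the teacher
    teacher_object_dropout: float = 0.0   # assumption: "no object dropout" encoded as 0.0
    label_smoothing: float = 0.1
    labeled_batch_size: int = 16
    captions_per_labeled_image: int = 5
    unlabeled_batch_size: int = 96
    captions_per_unlabeled_image: int = 1
    noisy_student_iterations: int = 1     # N in the paper


Pair = Tuple[str, str]                    # (image id, caption)
Captioner = Callable[[str, int], str]     # (image id, beam width) -> caption
Trainer = Callable[[List[Pair]], Captioner]


def noisy_student(
    labeled: Sequence[Pair],
    unlabeled_images: Sequence[str],
    train_teacher: Trainer,
    train_student: Trainer,
    cfg: SetupConfig,
) -> Captioner:
    """Generic noisy student loop: teacher pseudo-labels, student retrains, roles swap."""
    teacher = train_teacher(list(labeled))
    for _ in range(cfg.noisy_student_iterations):
        # The teacher captions each unlabeled image with the narrower beam.
        pseudo = [
            (img, teacher(img, cfg.pseudo_label_beam_width))
            for img in unlabeled_images
        ]
        # The student is trained from scratch on labeled plus pseudo-labeled pairs.
        student = train_student(list(labeled) + pseudo)
        teacher = student  # the student becomes the next teacher
    return teacher
```

With these defaults, each labeled batch carries 16 × 5 = 80 image-caption pairs and each unlabeled batch carries 96 × 1 = 96, so pseudo-labeled pairs slightly outnumber labeled ones in every training step.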