Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Sample-Efficient Multi-Round Generative Data Augmentation for Long-Tail Instance Segmentation
Authors: Byunghyun Kim, Minyoung Bae, Jae-Gil Lee
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our framework requires only 6% of the data generation needed by state-of-the-art methods while outperforming them. Table 1 and Table 2 show the mean average precision (mAP) on the instance segmentation and object detection tasks on the LVIS v1.0 dataset for the ResNet-50 [20] and Swin-L [27] backbones. The superscripts box and mask denote the mAP for object detection and instance segmentation, respectively. The subscripts r, c, and f signify rare, common, and frequent classes, respectively. Overall, MRCA outperforms the state-of-the-art generative augmentation methods while generating only 6% as many synthetic objects. |
| Researcher Affiliation | Academia | Byunghyun Kim, Minyoung Bae, Jae-Gil Lee KAIST EMAIL |
| Pseudocode | Yes | A Algorithm Pseudocode. Algorithm 1: Multi-round collaborative augmentation pipeline. Algorithm 2: Denoising the diffusion model with a criterion function. Algorithm 3: Applying the criterion function to the instance segmentation model. Algorithm 4: Computing a scale factor based on average precision (AP). |
| Open Source Code | Yes | The source code is publicly available at https://github.com/kaist-dmlab/MRCA. |
| Open Datasets | Yes | Instance segmentation and object detection experiments are mainly conducted on the LVIS v1.0 dataset [18] as in the relevant literature, with supporting experiments on the Pascal VOC dataset [10] and the Open Images V5 dataset [24]. |
| Dataset Splits | Yes | Instance segmentation and object detection experiments are mainly conducted on the LVIS v1.0 dataset [18] as in the relevant literature, with supporting experiments on the Pascal VOC dataset [10] and the Open Images V5 dataset [24]. ... We evaluate performance using average precision (AP) for both bounding box detection and instance segmentation while also analyzing results across different class frequencies (rare, common, and frequent classes) as defined in LVIS. ... For Open Images V5 [24], due to the large size of the dataset, we use a Pareto sampling method from previous work [32] to create its long-tailed version. |
| Hardware Specification | Yes | We use 8 NVIDIA GeForce RTX 3090 GPUs, where 4 GPUs are used for training the model, 3 GPUs for generating objects, and 1 GPU for segmentation. Only for the Swin-L experiment, we use NVIDIA A40 GPUs instead to fit the model in the GPU memory. |
| Software Dependencies | No | CenterNet2 [39], with a ResNet-50 [20] backbone, is used as our main instance segmentation model, implemented in Detectron2 [33]. The Swin-L [27] backbone is also used for comparison with baseline methods. Stable Diffusion 3 Medium [9] is used as our generation model, and BiRefNet [38] is used for segmentation on the generated objects. After generation or segmentation, CLIP [29] is used to filter out the low-quality objects whose score is smaller than 0.25. |
| Experiment Setup | Yes | Our model is trained for 10 rounds, with each round consisting of 9,000 iterations and a batch size of 16. Our multi-round collaborative augmentation serves as the sole modification to the baseline training pipeline. We evaluate performance using average precision (AP) for both bounding box detection and instance segmentation while also analyzing results across different class frequencies (rare, common, and frequent classes) as defined in LVIS. We use 8 NVIDIA GeForce RTX 3090 GPUs, where 4 GPUs are used for training the model, 3 GPUs for generating objects, and 1 GPU for segmentation. Only for the Swin-L experiment, we use NVIDIA A40 GPUs instead to fit the model in the GPU memory. To run both training and generation without waiting for another process, the number of object generations per round is empirically set to 6 × 1203 (number of classes in LVIS). Besides, γ and ω in Eq. (7) are set to 5.0 and 0.03, respectively. Further details about the experiments are provided in Appendix C. The hyperparameters of the diffusion models are set equal, with classifier guidance scale as 5.0, generation step as 30, and image resolution as 512 × 512. |
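
The figures quoted under Experiment Setup and Software Dependencies can be collected into a small configuration sketch to make the overall budget easier to check. This is an illustration only and is not taken from the MRCA repository. The class name `ExperimentSetup`, its field names, and the helper `keep_object` are hypothetical. It also assumes that the per-round generation count reads as 6 objects for each of the 1203 LVIS classes, and that generation runs in every one of the 10 rounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSetup:
    """Hypothetical container for the values quoted in the table above."""
    rounds: int = 10
    iterations_per_round: int = 9_000
    batch_size: int = 16
    objects_per_class_per_round: int = 6   # assumption: "6 x 1203" reading
    num_lvis_classes: int = 1_203
    clip_threshold: float = 0.25
    gamma: float = 5.0                     # gamma in Eq. (7)
    omega: float = 0.03                    # omega in Eq. (7)
    guidance_scale: float = 5.0
    generation_steps: int = 30
    image_resolution: tuple = (512, 512)

    @property
    def total_iterations(self) -> int:
        # Training iterations summed over all rounds.
        return self.rounds * self.iterations_per_round

    @property
    def objects_per_round(self) -> int:
        # Synthetic objects generated in a single round.
        return self.objects_per_class_per_round * self.num_lvis_classes

    @property
    def total_generated_objects(self) -> int:
        # Assumption: generation happens in every round.
        return self.rounds * self.objects_per_round


def keep_object(clip_score: float, threshold: float = 0.25) -> bool:
    """Objects scoring below the threshold are filtered out; the rest are kept."""
    return clip_score >= threshold


setup = ExperimentSetup()
print(setup.total_iterations)         # 90000
print(setup.objects_per_round)        # 7218
print(setup.total_generated_objects)  # 72180
```

Under these assumptions the run covers 90,000 training iterations and generates at most 72,180 objects before CLIP filtering. How the CLIP score itself is computed is not specified in the excerpt, so `keep_object` takes it as an input.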