Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Holistic Order Prediction in Natural Scenes
Authors: Pierre Musacchio, Hyunmin Lee, Jaesik Park
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We comprehensively benchmark and ablate our approach to highlight its effectiveness. Our code and models are open-source and available at this URL: https://github.com/SNU-VGILab/InstaOrder. |
| Researcher Affiliation | Collaboration | Pierre Musacchio Seoul National University EMAIL Hyunmin Lee LG AI Research EMAIL Jaesik Park Seoul National University EMAIL |
| Pseudocode | No | The paper describes the architecture and methodology of InstaFormer in text and diagrams (Figure 3), but it does not present any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and models are open-source and available at this URL: https://github.com/SNU-VGILab/InstaOrder. |
| Open Datasets | Yes | We run experiments on INSTAORDER [24]. We convert INSTAORDER to a VQA version, i.e., INSTAORDER-VQA, to evaluate LLaVA [28] on occlusion and depth order prediction in zero-shot and finetuned manners (denoted with in Tab. 1). We release the conversion script and the dataset with this work. We refer the reader to Appendix B for more information. |
| Dataset Splits | Yes | Converting the annotations of the 4,071 images of the validation set of INSTAORDER to INSTAORDER-VQA yields a total of 178,539 VQA ordering prompts, along with their respective ground truths. ... We start by splitting the dataset into four categories: occlusion prompts to which the answer is Yes, occlusion prompts to which the answer is No, depth prompts to which the answer is Yes, and depth prompts to which the answer is No. We then randomly shuffle the samples in each category. ... We subsample the training set of INSTAORDER-VQA to 2% of the annotations. This means that we sample 58K annotations for training in total. We evenly select those annotations from the four categories mentioned in the previous paragraph to obtain a balanced dataset. |
| Hardware Specification | Yes | For all experiments, we train on 4 NVIDIA RTX A6000 for 120,000 iterations... Finetuning. We observed that finetuning the model directly after the pre-training stage results in mode collapse... We finetune LLaVA for a single epoch on 8 NVIDIA RTX A6000 with a batch size of 16 per GPU using the finetuning scripts from the official repository. ... We benchmark the runtime and memory cost of InstaOrderNet^{o,d} and our InstaFormer^{o,d}. All measures are recorded on a single NVIDIA RTX A6000. |
| Software Dependencies | No | The paper mentions various models and optimizers like LLaVA [27], Mask2Former [9], AdamW [30], BCE, and CE losses, but does not specify versions for programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or low-level libraries (e.g., CUDA) that are typically considered software dependencies. |
| Experiment Setup | Yes | For all experiments, we train on 4 NVIDIA RTX A6000 for 120,000 iterations using AdamW [30] with learning rate 10⁻⁵ and reduce it to 10⁻⁶ and 10⁻⁷ at iterations 80,000 and 110,000 respectively, as suggested by [15]. We use a batch size of 16 and use the BCE and CE losses for occlusion and depth order respectively. We resize the input image to 1024 × 1024 and use RandomFlip during training, following [9]. For evaluation, we follow the default R-CNN baseline, which consists of resizing the smaller side of the image to 800 pixels and the longer one to 1333. We use 8 heads, 512-dimensional linear projections for all the attention layers, and 2-layered FFNs with 2,048 hidden nodes in all the transformer layers. The encoder consists of a single transformer layer that simply creates a global descriptor for each mask. On the other hand, the decoder comprises eight transformer layers. We add auxiliary losses on all transformer layers of the transformer decoder. This results in a 34M parameter geometrical ordering predictor. We initialize the entire ordering module using Xavier initialization [12]. We set the dimension of all the adapters to 64 and initialize their weights with Kaiming uniform [13] and their biases with zero initialization following [6]. |
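
The learning-rate schedule quoted in the Experiment Setup row can be written out as a small step function. The following is only a minimal sketch under stated assumptions, not the authors' training code: the names `learning_rate_at`, `BASE_LR`, `MILESTONES`, and `TOTAL_ITERATIONS` are hypothetical, and the sketch assumes iterations are counted from 0 and that each reduction takes effect exactly at the stated iteration (the quoted text does not specify the boundary behavior).

```python
# Hypothetical sketch of the quoted step schedule; all names here are
# illustrative and do not come from the paper's repository.
BASE_LR = 1e-5                                   # initial learning rate
MILESTONES = ((80_000, 1e-6), (110_000, 1e-7))   # (iteration, new learning rate)
TOTAL_ITERATIONS = 120_000                       # training length from the quoted text


def learning_rate_at(iteration: int) -> float:
    """Return the learning rate in effect at a 0-indexed training iteration.

    Assumption: a reduction applies from its milestone iteration onward.
    """
    if not 0 <= iteration < TOTAL_ITERATIONS:
        raise ValueError(f"iteration must be in [0, {TOTAL_ITERATIONS})")
    lr = BASE_LR
    for start, value in MILESTONES:
        # Milestones are sorted, so the last one reached wins.
        if iteration >= start:
            lr = value
    return lr
```

For example, `learning_rate_at(79_999)` returns the base rate, while `learning_rate_at(80_000)` returns the first reduced rate. In a PyTorch setup, a comparable effect is commonly obtained with a multi-step scheduler and a decay factor of 0.1, though whether the authors did so is not stated in the quoted text.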