Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
FORLA: Federated Object-Centric Representation Learning with Slot Attention
Authors: Guiqiu Liao, Matjaz Jogan, Eric Eaton, Daniel Hashimoto
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments in multiple real-world datasets show that our framework not only outperforms centralized baselines on object discovery but also learns a compact, universal representation that generalizes well across domains. |
| Researcher Affiliation | Academia | (1) PCASO Laboratory, Dept. of Surgery, University of Pennsylvania; (2) Dept. of Computer and Information Science, University of Pennsylvania |
| Pseudocode | Yes | Algorithm 1 Federated Two-Branch Slot Attention Training in FORLA |
| Open Source Code | Yes | Our code, data, and pretrained models are available at: https://github.com/PCASOlab/FORLA. |
| Open Datasets | Yes | Dataset. We evaluate our experiment on seven datasets, categorized into two groups: 1) Surgical vision datasets: including the abdominal surgical dataset [72], Cholec80 dataset [58], and a proprietary thoracic surgery dataset. 2) Natural vision datasets: including COCO [36], PASCAL VOC 2012 [13], YouTube-VIS [66], and YouTube-Objects [47]. In total, these datasets comprise approximately 1.4 million images. With the exception of the proprietary thoracic dataset, all data is publicly available. Further details are provided in Supplementary Material E. |
| Dataset Splits | Yes | COCO (Common Objects in Context) [36] is a widely used benchmark for object detection, segmentation, and image captioning, consisting of 80 object categories. We use the 2017 split, with 118,000 images for training and 5,000 for validation. PASCAL VOC 2012 [13] provides 11,530 images with segmentation masks for 20 object categories. Following standard protocol, we use 10,582 images for training and 1,449 for validation. YTVIS (YouTube-VIS) [66] is a benchmark for video instance segmentation, containing 8,858 videos spanning 40 object categories, with pixel-level masks across frames. We train on the 2,985-video training split from the 2021 version (78,810 frames) and evaluate on 4,210 validation frames. YTOBJ (YouTube-Objects) [47] consists of 126 YouTube videos across 10 object categories, annotated with sparse bounding boxes for object tracking. We extract 388,050 frames from 100 videos for training and evaluate on 9,000 frames from 26 held-out videos. |
| Hardware Specification | Yes | Our experiments were conducted using four NVIDIA RTX 6000 GPUs, with some GPUs assigned multiple clients. |
| Software Dependencies | No | We use Adam optimizer with learning rate of $4 \times 10^{-4}$, and weight decay of $4 \times 10^{-4}$. |
| Experiment Setup | Yes | Each client trains for 100+ epochs with early stopping after 30 stagnant epochs. FedAvg is performed globally every 100 iterations and locally (student-teacher) every 1000. We use Adam (lr = $4 \times 10^{-4}$, batch size = 16). Please see more implementation details in the Supplementary Material F. Unless noted, evaluation uses the student decoder, while teacher results appear in Supplementary G. |
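
The schedule quoted in the Experiment Setup row (global FedAvg every 100 iterations, a local student-teacher step every 1000, early stopping after 30 stagnant epochs) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `fed_avg`, `EarlyStopper`, and `sync_actions` are hypothetical, the averaging assumes equal client weights, and the early-stopping criterion assumes a validation loss where lower is better. None of these details are stated in the quoted text.

```python
from typing import Dict, List

# Intervals taken from the quoted setup; the constant names are illustrative.
GLOBAL_SYNC_EVERY = 100   # global FedAvg across clients
LOCAL_SYNC_EVERY = 1000   # local student-teacher step
PATIENCE_EPOCHS = 30      # early stopping after this many stagnant epochs


def fed_avg(client_params: List[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Element-wise mean of client parameters.

    Assumption: every client has equal weight. The quoted text does not say
    whether the average is weighted by client dataset size.
    """
    n = len(client_params)
    return {
        name: [sum(vals) / n for vals in zip(*(p[name] for p in client_params))]
        for name in client_params[0]
    }


class EarlyStopper:
    """Signals a stop once the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = PATIENCE_EPOCHS) -> None:
        self.patience = patience
        self.best = float("inf")
        self.stagnant = 0

    def step(self, val_loss: float) -> bool:
        # Assumption: lower is better; the monitored metric is not specified.
        if val_loss < self.best:
            self.best = val_loss
            self.stagnant = 0
        else:
            self.stagnant += 1
        return self.stagnant >= self.patience


def sync_actions(iteration: int) -> List[str]:
    """Return which synchronization steps are due at a given iteration (1-based)."""
    actions = []
    if iteration % GLOBAL_SYNC_EVERY == 0:
        actions.append("global_fedavg")
    if iteration % LOCAL_SYNC_EVERY == 0:
        actions.append("local_student_teacher")
    return actions
```

In a training loop, a client would call `sync_actions(it)` each iteration and `EarlyStopper.step(...)` once per epoch. The actual student-teacher update and the optimizer configuration are described in the paper's Supplementary Material F and are not reproduced here.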