Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

ESCA: Contextualizing Embodied Agents via Scene-Graph Generation

Authors: Jiani Huang, Amish Sethi, Matthew Kuo, Mayank Keoliya, Neelay Velingker, JungHo Jung, Ser Nam Lim, Ziyang Li, Mayur Naik

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through experiments on four challenging embodied environments, we demonstrate that ESCA consistently improves the performance of all evaluated MLLMs, including both open-source and proprietary models. By providing structured and grounded scene graphs, ESCA significantly reduces perception errors, laying the foundation for more reliable reasoning and planning. Our experiments are designed to address two key research questions: (1) How effectively does ESCA, together with SGClip, improve embodied agent performance through structured scene graph generation? and (2) How generalizable and adaptable is SGClip when evaluated independently on open-domain, zero-shot, and downstream transfer tasks? We now detail our experimental setup and present empirical results addressing both questions.
Researcher Affiliation | Academia | Jiani Huang University of Pennsylvania EMAIL; Amish Sethi University of Pennsylvania EMAIL; Matthew Kuo University of Pennsylvania EMAIL; Mayank Keoliya University of Pennsylvania EMAIL; Neelay Velingker University of Pennsylvania EMAIL; Jung Ho Jung University of Pennsylvania EMAIL; Ser-Nam Lim University of Central Florida EMAIL; Ziyang Li Johns Hopkins University EMAIL; Mayur Naik University of Pennsylvania EMAIL
Pseudocode | Yes | Algorithm 1: Video Mask Propagation with New Object Discovery; Algorithm 2: Bounding Box Prompt Buffer Class; Algorithm 3: Mask Generation via Frame-wise Bounding Box Grounding
Open Source Code | Yes | We release the source code for SGCLIP model training at https://github.com/video-fm/LASER and for the embodied agent at https://github.com/video-fm/ESCA.
Open Datasets | No | The ESCA-Video-87K dataset is constructed from the publicly available LLaVA-Video-178K dataset [97]... We will release the dataset and open-source the code upon paper acceptance.
Dataset Splits | Yes | We use a combined version of ActivityNet 1.2 and 1.3, which together span 200 unique action classes and approximately 20,000 untrimmed videos across training, validation, and test splits.
Hardware Specification | Yes | All our experiments are carried out on a device with (1) 128 32-core Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (2) 10 NVIDIA H100 PCIe GPUs.
Software Dependencies | No | Using word vectors from SpaCy [29]... We leverage the Scallop programming language [50].
Experiment Setup | Yes | We use a learning rate of 1e-6 and a batch size of 2. The video is sampled at a target frame rate of 1 FPS. For the semantic loss, we sample 5 negative keywords per instance and set the semantic loss weight to 0.1. In the provenance setting for Scallop, we use difftopkproofs with a top-k value of 3 for proof extraction. We fine tune from the CLIP model with a total of 3 epochs.
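
For readers attempting a reproduction, the hyperparameters quoted under Experiment Setup can be gathered into a single configuration object. The following is a minimal sketch, not the authors' code: the class name and every field name are hypothetical, and only the values are taken from the quoted text. How each value is consumed by the released training scripts is not stated here and should be checked against the repositories listed under Open Source Code.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SGClipFinetuneConfig:
    """Hypothetical container for the quoted setup.

    Field names are illustrative assumptions; the default values
    are the ones reported in the Experiment Setup row.
    """
    learning_rate: float = 1e-6              # "learning rate of 1e-6"
    batch_size: int = 2                      # "batch size of 2"
    target_fps: int = 1                      # video sampled at 1 FPS
    negative_keywords_per_instance: int = 5  # negatives for the semantic loss
    semantic_loss_weight: float = 0.1        # weight on the semantic loss
    scallop_provenance: str = "difftopkproofs"  # Scallop provenance setting
    scallop_top_k: int = 3                   # top-k for proof extraction
    epochs: int = 3                          # fine-tuning epochs from CLIP


if __name__ == "__main__":
    # Print the settings as a plain dictionary for logging or comparison.
    config = SGClipFinetuneConfig()
    for name, value in asdict(config).items():
        print(f"{name} = {value}")
```

Freezing the dataclass keeps the recorded settings immutable during a run, which makes it easier to log exactly what was used when comparing against the reported numbers.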