SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization

Authors: Wanhua Li, Zibin Meng, Jiawei Zhou, Donglai Wei, Chuang Gan, Hanspeter Pfister

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Experimental To summarize, we make the following contributions: (1) We present a simple modular framework with foundation models for social relation reasoning, which provides a strong baseline as the first zero-shot social relation recognition method. (2) To address the long prompt optimization issue associated with visual reasoning tasks, we further propose the Greedy Segment Prompt Optimization, which performs a greedy search on the segment level with gradient guidance. (3) Experiments demonstrate that our method attains very competitive and explainable zero-shot results without additional model training. With GSPO, our method significantly outperforms the state-of-the-art methods.
Researcher Affiliation Collaboration Wanhua Li (1), Zibin Meng (1,2), Jiawei Zhou (3), Donglai Wei (4), Chuang Gan (5,6), Hanspeter Pfister (1) — 1 Harvard University, 2 Tsinghua University, 3 Stony Brook University, 4 Boston College, 5 MIT-IBM Watson AI Lab, 6 UMass Amherst
Pseudocode Yes Algorithm 1 Greedy Segment Prompt Optimization
Open Source Code Yes The code is available at https://github.com/Mengzibin/SocialGPT.
Open Datasets Yes Data and Evaluation. We adopt two widely-used benchmarks for social relation reasoning: PIPA [1] and PISC [13]. The PIPA dataset categorizes 16 types of social relationships, including family bonds (like parent-child, grandparent-grandchild), personal connections (friends, lovers/spouses), educational and professional interactions (teacher-student, leader-subordinate), and group associations (band, sports team, colleagues). The PISC dataset categorizes social relationships into six types: commercial, couple, family, friends, professional, and no-relation.
Dataset Splits Yes We follow the standard train/val/test split for both datasets and report the classification accuracy on the test set.
Hardware Specification Yes One A100 GPU is used for all experiments.
Software Dependencies No The paper mentions BLIP-2, SAM, GPT-3.5 Turbo, Vicuna-7B/13B, and Llama2-7B/13B, but does not provide specific version numbers for all of them. For instance, while GPT-3.5 Turbo is named, the exact versions of BLIP-2 and SAM used are not specified.
Experiment Setup Yes Implementation Details. We use two VFM models for visual information extraction: the SAM [17] model for object segmentation, followed by BLIP-2 [41] for dense caption generation. For social story generation, we employ the GPT-3.5 [55] Turbo model that has empowered ChatGPT. We set the temperature to 0 for greedy decoding to bolster the result's reproducibility. Other generation parameters are otherwise set as default. For subsequent reasoning of social relations based on generated stories, we experiment with both GPT-3.5 and open-source LLMs, including Vicuna-7B/13B [29] and Llama2-7B/13B [34]. All decoding temperatures are set to 0, and we set the maximum context length to 4096 for Vicuna and Llama2 to accommodate our long prompt. For GSPO, we curate M = 15 candidates for each of the four segments within the complete prompt and set K = 3 for candidate selection for N = 500 iterations.
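The segment-level greedy search described above (four prompt segments, M candidates per segment, K candidates evaluated per step, N iterations) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the paper's GSPO uses gradient guidance to propose candidates, whereas here `score` is assumed to be an arbitrary black-box reward (e.g. validation accuracy of the assembled prompt), and `gspo_search` is a hypothetical name.

```python
import random

def gspo_search(segments, candidates, score, K=3, N=500, seed=0):
    """Greedy segment-level prompt search (sketch).

    segments:   list of current segment strings (one per prompt part)
    candidates: candidates[i] is the candidate pool for segment i (M entries)
    score:      callable mapping a full prompt string to a scalar reward
    """
    rng = random.Random(seed)
    best = list(segments)
    best_score = score(" ".join(best))
    for _ in range(N):
        i = rng.randrange(len(best))           # pick one segment to mutate
        trials = rng.sample(candidates[i], K)  # sample K of its M candidates
        for cand in trials:
            trial = list(best)
            trial[i] = cand                    # swap in the candidate segment
            s = score(" ".join(trial))
            if s > best_score:                 # greedy: keep only improvements
                best, best_score = trial, s
    return best, best_score
```

With the paper's settings this would be called with four segment pools of M = 15 candidates each, K = 3, and N = 500; the greedy acceptance rule means the assembled prompt's score is non-decreasing across iterations.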