SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization
Authors: Wanhua Li, Zibin Meng, Jiawei Zhou, Donglai Wei, Chuang Gan, Hanspeter Pfister
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To summarize, we make the following contributions: (1) We present a simple modular framework with foundation models for social relation reasoning, which provides a strong baseline as the first zero-shot social relation recognition method. (2) To address the long-prompt optimization issue associated with visual reasoning tasks, we further propose Greedy Segment Prompt Optimization (GSPO), which performs a greedy search at the segment level with gradient guidance. (3) Experiments demonstrate that our method attains very competitive and explainable zero-shot results without additional model training; with GSPO, our method significantly outperforms the state-of-the-art methods. (A minimal GSPO sketch follows this table.) |
| Researcher Affiliation | Collaboration | Wanhua Li¹, Zibin Meng¹,², Jiawei Zhou³, Donglai Wei⁴, Chuang Gan⁵,⁶, Hanspeter Pfister¹ (¹Harvard University, ²Tsinghua University, ³Stony Brook University, ⁴Boston College, ⁵MIT-IBM Watson AI Lab, ⁶UMass Amherst) |
| Pseudocode | Yes | Algorithm 1 Greedy Segment Prompt Optimization |
| Open Source Code | Yes | The code is available at https://github.com/Mengzibin/SocialGPT. |
| Open Datasets | Yes | Data and Evaluation. We adopt two widely-used benchmarks for social relation reasoning: PIPA [1] and PISC [13]. The PIPA dataset categorizes 16 types of social relationships, including family bonds (like parent-child, grandparent-grandchild), personal connections (friends, loves/spouses), educational and professional interactions (teacher-student, leader-subordinate), and group associations (band, sports team, colleagues). The PISC dataset categorizes social relationships into six types: commercial, couple, family, friends, professional, and no-relation. |
| Dataset Splits | Yes | We follow the standard train/val/test split for both datasets and report the classification accuracy on the test set. |
| Hardware Specification | Yes | One A100 GPU is used for all experiments. |
| Software Dependencies | No | The paper mentions BLIP-2, SAM, GPT-3.5 Turbo, Vicuna-7B/13B, and Llama2-7B/13B, but does not pin specific versions for all of them. For instance, while GPT-3.5 Turbo is named, the exact versions of BLIP-2 and SAM used are not specified. |
| Experiment Setup | Yes | Implementation Details. We use two vision foundation models for visual information extraction: the SAM [17] model for object segmentation, followed by BLIP-2 [41] for dense caption generation. For social story generation, we employ the GPT-3.5 Turbo [55] model that powers ChatGPT. We set the temperature to 0 for greedy decoding to bolster the results' reproducibility; other generation parameters are left at their defaults. For subsequent reasoning about social relations based on the generated stories, we experiment with both GPT-3.5 and open-source LLMs, including Vicuna-7B/13B [29] and Llama2-7B/13B [34]. All decoding temperatures are set to 0, and we set the maximum context length to 4096 for Vicuna and Llama2 to accommodate our long prompt. For GSPO, we curate M = 15 candidates for each of the four segments within the complete prompt and set K = 3 for candidate selection over N = 500 iterations. (See the sketches after this table.) |
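
The GSPO hyperparameters quoted above (four segments, M = 15 candidates per segment, K = 3 shortlisted candidates, N = 500 iterations) map onto the following minimal Python sketch. It is not the authors' implementation: the scoring function is a placeholder, and the random shortlist stands in for the paper's gradient-guided candidate ranking.

```python
import random

def gspo(candidates, score_fn, n_iters=500, k=3):
    """Minimal sketch of Greedy Segment Prompt Optimization (GSPO).

    candidates: one list of candidate strings per prompt segment
                (the paper uses 4 segments with M = 15 candidates each).
    score_fn:   maps a full prompt (list of chosen segments) to a scalar,
                e.g. accuracy on a small labeled set; higher is better.
    Note: the paper shortlists the top-K candidates with gradient
    guidance; random sampling below is a simplified stand-in.
    """
    current = [segs[0] for segs in candidates]  # arbitrary starting prompt
    best_score = score_fn(current)
    for _ in range(n_iters):
        seg_idx = random.randrange(len(candidates))        # pick a segment to update
        shortlist = random.sample(candidates[seg_idx], k)  # stand-in for gradient-guided top-K
        for cand in shortlist:
            trial = list(current)
            trial[seg_idx] = cand
            trial_score = score_fn(trial)
            if trial_score > best_score:                   # greedy acceptance
                best_score, current = trial_score, trial
    return current, best_score

# Toy usage with a dummy scorer (total prompt length as a stand-in metric).
segments = [[f"segment{i}-variant{j}" for j in range(15)] for i in range(4)]
best_prompt, best = gspo(segments, lambda p: sum(len(s) for s in p))
```

Because each iteration only evaluates K candidates for a single segment, the search stays tractable even though the full candidate space has M⁴ combinations.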
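
The zero-temperature decoding described in the setup could look like the sketch below, written against the current OpenAI Python SDK. The prompt text and variable names are illustrative, not the paper's actual templates.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative placeholder; in SocialGPT this would be the dense captions
# produced by SAM + BLIP-2, wrapped in the paper's story-generation prompt.
story_prompt = "Given these image captions, write a social story: ..."

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": story_prompt}],
    temperature=0,  # greedy decoding for reproducibility, as in the paper
)
story = response.choices[0].message.content
```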