Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

PoseLLaVA: Pose Centric Multimodal LLM for Fine-Grained 3D Pose Manipulation

Authors: Dong Feng, Ping Guo, Encheng Peng, Mingmin Zhu, Wenhao Yu, Peng Wang

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations across these tasks demonstrate significant improvements over existing methods, including metrics such as MPJPE and PA-MPJPE, which measure SMPL reconstruction errors, and Recall rates, which assess feature alignment across modalities. Specifically, PoseLLaVA reduces MPJPE errors by more than 20% compared to state-of-the-art methods in pose adjustment and generation tasks. Additionally, we demonstrate the feasibility of combining PoseLLaVA with generative models, such as diffusion, for pose image editing, highlighting its potential applications in language-controlled pose manipulation.
Researcher Affiliation | Collaboration | Dong Feng1*, Ping Guo2*, Encheng Peng3, Mingmin Zhu1, Wenhao Yu4, Peng Wang2 (1 inchitech; 2 Intel Labs China; 3 Nanjing University of Posts and Telecommunications; 4 Beijing Jiaotong University)
Pseudocode | No | The paper describes the architecture and training pipeline but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/ustcfd/PoseLLaVA
Open Datasets | Yes | The training dataset is constructed by converting each task-specific dataset into an instruction-following dataset, including Human3.6M (Ionescu et al. 2013), PoseScript (Delmas et al. 2022), PoseFix (Delmas et al. 2023), and the newly introduced PosePart. ... To accommodate both pose and image modalities, we used mmhuman3d (Contributors 2021) to render SMPL pose images for both the PoseFix and PosePart datasets.
Dataset Splits | Yes | Unlike traditional methods such as (Lin et al. 2023), which often rely on extensive data augmentation, our approach does not employ any data augmentation. Instead, we sampled only 300,000 instances from the original data to demonstrate the effectiveness of PoseLLaVA. Following the methodology used in ChatPose (Feng et al. 2024), we randomly selected 200 samples from the Human3.6M (Ionescu et al. 2013) and 3DPW (Von Marcard et al. 2018) test sets for evaluation. ... For the pose generation task, we used the PoseScript (Delmas et al. 2024) dataset, which includes textual descriptions for 100,000 diverse human poses sourced from the AMASS dataset. We followed the dataset splits used in PoseScript and PoseChat to create training and evaluation sets.
Hardware Specification | Yes | We employ 8 NVIDIA Tesla A100 40GB GPUs for training.
Software Dependencies | No | The paper mentions specific pre-trained weights and models such as "Llava-v1.6-mistral-7b", "CLIP-ViT", "Mistral", and "LoRA", but does not provide version numbers for general software components or programming languages such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | The LLM is tuned by LoRA (Hu et al. 2022) with a rank of 128 and an alpha of 256. The batch size per device is set to 16 with a gradient accumulation step of 4, and the training process includes 2 epochs in total.
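The MPJPE and PA-MPJPE metrics quoted in the Research Type row are standard SMPL reconstruction errors: MPJPE is the mean Euclidean distance between predicted and ground-truth 3D joints, and PA-MPJPE measures the same distance after Procrustes alignment (removing rotation, translation, and scale). A minimal NumPy sketch follows; the joint count and toy data are illustrative, not taken from the paper:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance
    between predicted and ground-truth 3D joint positions."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment: the prediction is aligned to
    the ground truth by an optimal similarity transform before scoring."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation via SVD (orthogonal Procrustes problem).
    U, s, Vt = np.linalg.svd(p.T @ g)
    R = U @ Vt
    # Guard against an improper rotation (reflection).
    if np.linalg.det(R) < 0:
        U[:, -1] *= -1
        s[-1] *= -1
        R = U @ Vt
    scale = s.sum() / (p ** 2).sum()
    aligned = scale * p @ R + mu_g
    return mpjpe(aligned, gt)

# Toy example: 17 joints in millimetres, prediction offset by 10 mm
# uniformly, so MPJPE = 10 * sqrt(3) while PA-MPJPE is ~0 after alignment.
gt = np.random.default_rng(0).normal(size=(17, 3)) * 100
pred = gt + 10.0
print(round(mpjpe(pred, gt), 3))    # 17.321
print(round(pa_mpjpe(pred, gt), 3)) # 0.0
```

A uniform translation is the simplest case where the two metrics diverge: Procrustes alignment absorbs it entirely, which is why PA-MPJPE is usually reported alongside MPJPE.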
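Taken together, the Hardware Specification and Experiment Setup rows imply an effective global batch size. A small sketch collecting the reported hyperparameters and the arithmetic they imply (the dict and its field names are illustrative, not the authors' code):

```python
# Hyperparameters as reported in the paper; field names loosely mirror
# Hugging Face peft/transformers conventions but this is a plain dict.
train_cfg = {
    "lora_rank": 128,        # LoRA rank r
    "lora_alpha": 256,       # LoRA scaling alpha
    "per_device_batch": 16,  # batch size per GPU
    "grad_accum_steps": 4,   # gradient accumulation steps
    "num_gpus": 8,           # 8x NVIDIA A100 40GB
    "epochs": 2,
}

# Effective global batch size = per-device batch x accumulation x GPUs.
effective_batch = (train_cfg["per_device_batch"]
                   * train_cfg["grad_accum_steps"]
                   * train_cfg["num_gpus"])
print(effective_batch)  # 16 * 4 * 8 = 512

# LoRA applies its low-rank update scaled by alpha / r.
print(train_cfg["lora_alpha"] / train_cfg["lora_rank"])  # 2.0
```

The alpha/rank ratio of 2 is a common LoRA choice; it fixes the effective scaling of the low-rank update independently of the rank itself.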