Action Imitation in Common Action Space for Customized Action Image Synthesis

Authors: Wang Lin, Jingyuan Chen, Jiaxin Shi, Zirun Guo, Yichen Zhu, Zehan Wang, Tao Jin, Zhou Zhao, Fei Wu, Shuicheng Yan, Hanwang Zhang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate TwinAct's superiority in generating accurate, context-independent customized actions while maintaining the identity consistency of different subjects, including animals, humans, and even customized actors.
Researcher Affiliation | Collaboration | Wang Lin (1), Jingyuan Chen (1), Jiaxin Shi (2), Zirun Guo (1), Yichen Zhu (1), Zehan Wang (1), Tao Jin (1), Zhou Zhao (1), Fei Wu (1), Shuicheng Yan (3), Hanwang Zhang (4); (1) Zhejiang University, (2) Xmax.AI, (3) Skywork AI Singapore, (4) Nanyang Technological University
Pseudocode | No | The paper describes the steps of the proposed method in prose within Section 3 but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Project page: https://twinact-official.github.io/TwinAct/.
Open Datasets | No | Since there are no publicly available customized action datasets, we introduce a novel benchmark consisting of 12 actions involving multiple body parts, such as fingers, arms, legs, and full-body motions.
Dataset Splits | No | The paper introduces a novel benchmark dataset but does not specify the train/validation/test dataset splits, percentages, or absolute sample counts for each split in the main text.
Hardware Specification | Yes | All experiments are conducted on A100 GPUs.
Software Dependencies | No | The paper mentions various software components and models (e.g., AdamW, Stable Diffusion XL, LoRA, CLIP, DDPM, GPT-4) but does not specify their exact version numbers, which is required for reproducibility.
Experiment Setup | Yes | For our method, we use the AdamW [12] optimizer with a learning rate of 2e-4. We use CLIP as a preprocessor to estimate the action of the given reference image. Unless otherwise specified, Stable Diffusion XL is selected as the default pre-trained model, and images are generated at a resolution of 1024 × 1024. All experiments are conducted on A100 GPUs. We fine-tune the text embedding along with the LoRA layer. We integrate the LoRA layer into the linear layers within all attention modules of the U-Net, using a rank of r = 8. All experiments and evaluations use the DDPM [25] sampler with 50 sampling steps and a guidance scale of 7.5 for all methods. The negative prompt used is "long body, low-res, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality".
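
To make the reported hyperparameters concrete, below is a minimal Python sketch of how such a setup could be assembled with the diffusers and peft libraries. This is not the authors' released code; the SDXL checkpoint name, the LoRA target-module names, the prompt, and the output path are illustrative assumptions, while the rank, learning rate, step count, guidance scale, resolution, and negative prompt follow the paper's stated values.

# Sketch of the reported configuration, assuming the `diffusers` and `peft`
# libraries; names marked "assumed" are not taken from the paper.
import torch
from diffusers import StableDiffusionXLPipeline, DDPMScheduler
from peft import LoraConfig

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # assumed SDXL checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# LoRA with rank r = 8 injected into the linear layers of the U-Net
# attention modules, as described in the setup.
pipe.unet.requires_grad_(False)  # freeze base weights; only LoRA trains
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # assumed module names
)
pipe.unet.add_adapter(lora_config)

# AdamW with learning rate 2e-4 over the trainable (LoRA) parameters;
# the paper optimizes the action-token text embedding jointly, which is
# omitted here for brevity.
trainable_params = [p for p in pipe.unet.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=2e-4)

# Sampling: DDPM scheduler, 50 steps, guidance scale 7.5, 1024 x 1024 output,
# and the negative prompt quoted in the paper.
pipe.scheduler = DDPMScheduler.from_config(pipe.scheduler.config)
negative_prompt = (
    "long body, low-res, bad anatomy, bad hands, missing fingers, "
    "extra digit, fewer digits, cropped, worst quality, low quality"
)
image = pipe(
    prompt="a photo of a corgi performing the customized action",  # illustrative
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    guidance_scale=7.5,
    height=1024,
    width=1024,
).images[0]
image.save("twinact_sample.png")  # assumed output path

The training loop itself is omitted; the sketch only shows how the quoted hyperparameters map onto standard library objects.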