Action Imitation in Common Action Space for Customized Action Image Synthesis
Authors: Wang Lin, Jingyuan Chen, Jiaxin Shi, Zirun Guo, Yichen Zhu, Zehan Wang, Tao Jin, Zhou Zhao, Fei Wu, Shuicheng Yan, Hanwang Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate TwinAct's superiority in generating accurate, context-independent customized actions while maintaining the identity consistency of different subjects, including animals, humans, and even customized actors. |
| Researcher Affiliation | Collaboration | Wang Lin¹, Jingyuan Chen¹, Jiaxin Shi², Zirun Guo¹, Yichen Zhu¹, Zehan Wang¹, Tao Jin¹, Zhou Zhao¹, Fei Wu¹, Shuicheng Yan³, Hanwang Zhang⁴ (¹Zhejiang University, ²Xmax.AI, ³Skywork AI Singapore, ⁴Nanyang Technological University) |
| Pseudocode | No | The paper describes the steps of the proposed method in prose within Section 3 but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project page: https://twinact-official.github.io/TwinAct/. |
| Open Datasets | No | Since there are no publicly available customized-action datasets, we introduce a novel benchmark consisting of 12 actions involving multiple body parts, such as fingers, arms, and legs, as well as full-body motions. |
| Dataset Splits | No | The paper introduces a novel benchmark dataset but does not specify the train/validation/test dataset splits, percentages, or absolute sample counts for each split in the main text. |
| Hardware Specification | Yes | All experiments are conducted on A100 GPUs. |
| Software Dependencies | No | The paper mentions various software components and models (e.g., AdamW, Stable Diffusion XL, LoRA, CLIP, DDPM, GPT-4) but does not specify their exact version numbers, which is required for reproducibility. |
| Experiment Setup | Yes | For our method, we use the AdamW [12] optimizer with a learning rate of 2e-4. We use CLIP as a preprocessor to estimate the action in the given reference image. Unless otherwise specified, Stable Diffusion XL is selected as the default pre-trained model, and images are generated at a resolution of 1024 × 1024. All experiments are conducted on A100 GPUs. We fine-tune the text embedding along with the LoRA layers. We integrate LoRA into the linear layers within all attention modules of the U-Net, using a rank of r = 8. All experiments and evaluations use DDPM [25] with 50 sampling steps and a guidance scale of 7.5 for all methods. The negative prompt used is "long body, low-res, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality". |
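
The reported setup maps cleanly onto the standard Hugging Face diffusers/peft stack: rank-8 LoRA on the U-Net attention projections, AdamW at 2e-4, and DDPM sampling with 50 steps at guidance scale 7.5 and a 1024 × 1024 resolution. The sketch below is an illustrative reconstruction under those reported hyperparameters, not the authors' released code; the checkpoint name, LoRA target-module names, `lora_alpha`, and the prompt are assumptions.

```python
# Illustrative sketch of the reported setup using Hugging Face diffusers/peft.
# NOT the authors' code: the checkpoint name, target modules, lora_alpha, and
# the prompt are assumptions made for this example.
import torch
from diffusers import StableDiffusionXLPipeline, DDPMScheduler
from peft import LoraConfig

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0"  # assumed SDXL checkpoint
).to("cuda")

# Rank-8 LoRA on the linear attention projections of the U-Net, per the paper.
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,  # alpha is not reported; set equal to the rank here
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
pipe.unet.add_adapter(lora_config)

# AdamW at lr 2e-4 over the trainable (LoRA) parameters, as reported.
optimizer = torch.optim.AdamW(
    (p for p in pipe.unet.parameters() if p.requires_grad), lr=2e-4
)

# Sampling: DDPM with 50 steps, guidance scale 7.5, 1024 x 1024 output,
# and the negative prompt quoted in the paper.
pipe.scheduler = DDPMScheduler.from_config(pipe.scheduler.config)
negative_prompt = (
    "long body, low-res, bad anatomy, bad hands, missing fingers, "
    "extra digit, fewer digits, cropped, worst quality, low quality"
)
image = pipe(
    prompt="a photo of a corgi performing <action>",  # placeholder prompt
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    guidance_scale=7.5,
    height=1024,
    width=1024,
).images[0]
```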