Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
FaceA-Net: Facial Attribute-Driven ID Preserving Image Generation Network
Authors: Jiayu Wang, Yue Yu, Jingjing Chen, Qi Dai, Yu-Gang Jiang
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we demonstrate the effectiveness of the proposed method and showcase its capabilities across various applications. ... We have conducted experiments to demonstrate that our method can effectively improve the ID fidelity and the context quality of the generated results. ... (Experiment / Experiment Settings / Implementation Details) We employ a test-time fine-tuning paradigm for ID preserving image generation... (Evaluation metrics) To ensure a fair experiment, we constructed an evaluation dataset... (Quantitative Evaluation / Objective Metrics) Table 1 presents the results of our experiment, comparing our method with existing popular approaches... (Ablation Studies / Ablation on Learnable Embedding) We conducted comparisons with two ablated models on the evaluation dataset, and the results are presented in Table 2. |
| Researcher Affiliation | Collaboration | Jiayu Wang*1,2, Yue Yu*1,2, Jingjing Chen1,2, Qi Dai3, Yu-Gang Jiang1,2 1Shanghai Key Lab of Intell. Info. Processing, School of Computer Science, Fudan University 2Shanghai Collaborative Innovation Center of Intelligent Visual Computing 3Microsoft Research Asia EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes its methodology using textual descriptions and mathematical equations (e.g., Equation 1 and 2), but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code, nor does it include a link to a code repository. Phrases like 'We release our code' or 'Code is available at' are not present. |
| Open Datasets | No | To ensure a fair experiment, we constructed an evaluation dataset, consisting of 28 IDs, each with 3 images, and used 40 prompts to generate images. When calculating CLIP-I, DINO, and Face Sim., we used non-reference images to compute these scores with the generated images for fairness. These non-reference images were not used for fine-tuning or for extracting facial features. The paper mentions constructing its own evaluation dataset but does not provide access information (link, DOI, or citation to a public repository) for this dataset or any other dataset used in its experiments. |
| Dataset Splits | No | The fine-tuning process is conducted on a small set consisting of 3 images of the same human ID. To ensure a fair experiment, we constructed an evaluation dataset, consisting of 28 IDs, each with 3 images, and used 40 prompts to generate images. The paper describes the number of images used for fine-tuning and evaluation but does not provide specific training/test/validation splits, percentages, or absolute sample counts for data partitioning to reproduce experiments. |
| Hardware Specification | Yes | We conduct the fine-tuning on an RTX-4090 GPU with a batch size of 4 for 1000 steps. |
| Software Dependencies | No | Our work is based on Stable Diffusion (Rombach et al. 2022) for both the generation of human IDs and contextual content... SDXL (Podell et al. 2023) is used as the context generation model... We utilize Grounding-DINO (Liu et al. 2023) to detect the face and its attribute... The paper mentions several software components and models (Stable Diffusion, SDXL, Grounding-DINO, U-Net LoRA, CLIP text encoder, Face Net) but does not provide specific version numbers for these, which is necessary for reproducible software dependencies. |
| Experiment Setup | Yes | The fine-tuning process is conducted on a small set consisting of 3 images of the same human ID. Before fine-tuning, the images are cropped and resized to 512 × 512 pixels. A U-Net LoRA with rank of 48 is employed for tuning the inpainting model. Moreover, we apply 3 attribute-driven feature embeddings for males and 4 for females to achieve optimal performance. Different learning rates are used to tune different parts. Specifically, we set 2e-3 for the attribute-driven feature embeddings and 1.2e-4 for LoRA. We conduct the fine-tuning on an RTX-4090 GPU with a batch size of 4 for 1000 steps. Generally, training for 600 steps during test-time fine-tuning is an appropriate choice, as it achieves both good controllability and satisfactory ID fidelity. Accordingly, we employ 600 fine-tuning steps for stylized generation. |
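The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. This is a hypothetical illustration assembled from the reported values, not the authors' code; the dictionary keys and the `param_groups` helper are illustrative names of my own, and the two-group optimizer layout merely mirrors the paper's statement that embeddings and LoRA weights use different learning rates.

```python
# Hypothetical sketch of the test-time fine-tuning setup reported in the
# paper. All names here are assumptions; only the numeric values come
# from the quoted Experiment Setup text.
FINETUNE_CONFIG = {
    "image_size": (512, 512),        # images cropped and resized to 512 x 512
    "images_per_id": 3,              # fine-tuning set: 3 images per human ID
    "lora_rank": 48,                 # U-Net LoRA rank used for the inpainting model
    "attribute_embeddings": {"male": 3, "female": 4},
    "lr_embeddings": 2e-3,           # attribute-driven feature embeddings
    "lr_lora": 1.2e-4,               # U-Net LoRA parameters
    "batch_size": 4,                 # on a single RTX-4090
    "steps_default": 1000,           # reported fine-tuning budget
    "steps_stylized": 600,           # shorter schedule for stylized generation
}

def param_groups(embedding_params, lora_params, cfg=FINETUNE_CONFIG):
    """Build optimizer parameter groups with the two reported learning rates."""
    return [
        {"params": list(embedding_params), "lr": cfg["lr_embeddings"]},
        {"params": list(lora_params), "lr": cfg["lr_lora"]},
    ]
```

In a PyTorch-style setup, such groups would typically be passed straight to an optimizer constructor (e.g. `AdamW(param_groups(emb, lora))`), which is one common way to realize per-module learning rates; the paper does not state which optimizer was used.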