Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Vec2Face: Scaling Face Dataset Generation with Loosely Constrained Vectors

Authors: Haiyu Wu, Jaskirat Singh, Sicong Tian, Liang Zheng, Kevin Bowyer

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental As for performance, FR models trained with the generated HSFace datasets, from 10k to 300k identities, achieve state-of-the-art accuracy, from 92% to 93.52%, on five real-world test sets (i.e., LFW, CFP-FP, AgeDB-30, CALFW, and CPLFW).
Researcher Affiliation Academia 1University of Notre Dame, 2The Australian National University, 3Indiana University South Bend {EMAIL, EMAIL, EMAIL} {EMAIL, EMAIL}
Pseudocode Yes Algorithm 1: AttrOP
Open Source Code No The paper does not contain any explicit statements about releasing source code for the methodology described, nor does it provide a link to a code repository.
Open Datasets Yes The training data consists of 1M images and their features from 50K randomly sampled identities in WebFace4M (Zhu et al., 2023), where the image features are extracted by an ArcFace-R100 model pretrained on Glint360K (An et al., 2021). Unless otherwise specified, this model is used for feature extraction throughout this paper. There are eleven test sets used to compare the synthetic and real datasets on FR model training. LFW (Huang et al., 2008) tests the FR model in a general case. CFP-FP (Sengupta et al., 2016) and CPLFW (Zheng & Deng, 2018) test the FR model on pose variation. AgeDB (Moschoglou et al., 2017) and CALFW (Zheng et al., 2017) challenge the FR model with large age gaps. ... In addition, SLLFW (Deng et al., 2017) and DoppelVer (Thom et al., 2023) are used to evaluate the identity definition used in the existing works (including ours). Lastly, IJB-B (Whitelam et al., 2017) and IJB-C (Maze et al., 2018) are more challenging test sets that are closer to the real-world scenario.
Dataset Splits No The paper uses various publicly available test sets like LFW, CFP-FP, AgeDB, CALFW, CPLFW, IJB-B, and IJB-C, which typically have predefined protocols or splits. For its own generated datasets (e.g., HSFace10K), it specifies total images and identities (e.g., "0.5M images from 10K identities"), but it does not detail how these generated datasets are further split into training, validation, or testing sets for the downstream FR model training. It mentions "50 perturbed vectors are sampled for each identity, where 40% from N(0, 0.3), 40% from N(0, 0.5), and 20% from N(0, 0.7)" for image creation, but this is a generation strategy, not a dataset split for evaluation.
Hardware Specification Yes We use approximately 10 RTX6000 GPUs for each training run. ... both models are tested on a single Titan-Xp
Software Dependencies No The paper mentions several models, frameworks, and optimizers (e.g., "ViT-Base", "AdamW", "ArcFace"), but it does not provide specific version numbers for any software libraries, programming languages, or solvers used in the experiments.
Experiment Setup Yes We used the default ViT-Base as the backbone to form the fMAE. ... The optimizer is AdamW (Loshchilov & Hutter, 2017), with a learning rate of 4e-5 and a batch size of 32 per GPU. ... Hyperparameter τ is selected such that identities are well-separated, i.e., high inter-class variation, although it does not act as a strong filter in practice. The selection of hyperparameter σ should enable the generated images for an identity to have large enough variance while staying on the same identity, i.e., high intra-class variation. ... 50 perturbed vectors are sampled for each identity, where 40% from N(0, 0.3), 40% from N(0, 0.5), and 20% from N(0, 0.7). ... the target pose P of 20 images is 60 and 10 images is 85. Meanwhile, we add the image quality control, Q = 27, to mitigate the quality degradation during pose adjustment. ... T is set to 5. ... The standard training dataset size is 0.5M images from 10K identities. Unless otherwise specified, the backbone is SE-IR50 (He et al., 2016), the recognition loss is ArcFace (Deng et al., 2019)
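The per-identity sampling mixture quoted above (50 perturbed vectors, 40% from N(0, 0.3), 40% from N(0, 0.5), 20% from N(0, 0.7)) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the additive-noise interpretation of "perturbed vectors", and the 512-dimensional feature size are assumptions.

```python
import numpy as np

def sample_perturbed_vectors(identity_vec, n_total=50, rng=None):
    """Illustrative sketch of the paper's per-identity sampling mixture:
    40% of perturbations from N(0, 0.3), 40% from N(0, 0.5), and the
    remaining 20% from N(0, 0.7), added to one identity feature vector.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_small = int(n_total * 0.4)                      # 20 of 50
    n_mid = int(n_total * 0.4)                        # 20 of 50
    n_large = n_total - n_small - n_mid               # 10 of 50
    batches = []
    for sigma, n in [(0.3, n_small), (0.5, n_mid), (0.7, n_large)]:
        noise = rng.normal(loc=0.0, scale=sigma, size=(n, identity_vec.shape[0]))
        batches.append(identity_vec + noise)
    return np.concatenate(batches, axis=0)            # shape: (n_total, dim)
```

Each returned row would then be decoded into one image of that identity; the wider-variance draws (σ = 0.5, 0.7) are what push intra-class variation up while the shared identity vector keeps the samples on the same identity.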