Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms

Authors: Miaosen Zhang, Yixuan Wei, Zhen Xing, Yifei Ma, Zuxuan Wu, Ji Li, Zheng Zhang, Qi Dai, Chong Luo, Xin Geng, Baining Guo

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that our method significantly enhances the aesthetic behaviors of the vision models under several metrics.
Researcher Affiliation | Collaboration | Southeast University, Tsinghua University, Fudan University, Microsoft
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | It is particularly costly to conduct a comprehensive code review. We plan to release the code in the future. The training data won't be released due to privacy reasons.
Open Datasets | Yes | They are trained on very large image-text pair datasets, e.g. LAION [43] and DataComp [8], rather than the traditional ImageNet [6].
Dataset Splits | No | The paper describes the construction of a training dataset (D_po) and introduces a test set (HPIR), but it does not explicitly provide standard training/validation/test splits for a single dataset.
Hardware Specification | Yes | The computational resources include 256 NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions optimizers and model architectures but does not specify software dependencies (e.g., libraries or frameworks) with version numbers.
Experiment Setup | Yes | In the alignment fine-tuning loss, the L_pt component is configured identically to the pretraining phase described in Sec. 2.1 (batch size, temperature, and data), with a weight of w_pt = 1.0. For the remaining components, each batch comprises 128 queries. The overall learning rate is fixed to lr = 5 × 10^-5. The partially ordered set D_po, as discussed in Sec. 2.3, is derived using u = v = 5 and a stride of 10.
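
To make the reported hyperparameters concrete, the Python sketch below collects the values quoted in the Experiment Setup row into a single configuration object and shows how the pretraining loss would enter the total objective with weight w_pt. This is a minimal, hypothetical sketch, not the authors' released code: the names AlignmentConfig and total_loss, and the abstraction of the remaining alignment components into a single l_align term, are assumptions; only the numeric values (w_pt = 1.0, 128 queries per batch, lr = 5 × 10^-5, u = v = 5, stride 10) come from the paper.

    from dataclasses import dataclass

    @dataclass
    class AlignmentConfig:
        # Numeric values below are the ones quoted in the paper's
        # Experiment Setup; the field names themselves are hypothetical.
        w_pt: float = 1.0             # weight of the pretraining loss L_pt
        queries_per_batch: int = 128  # queries per batch for the alignment terms
        lr: float = 5e-5              # overall learning rate (5 x 10^-5)
        u: int = 5                    # parameters used to derive the partially
        v: int = 5                    # ordered set D_po (Sec. 2.3)
        stride: int = 10              # stride used when building D_po

    def total_loss(l_pt: float, l_align: float, cfg: AlignmentConfig) -> float:
        # L_pt enters the fine-tuning objective with weight w_pt = 1.0;
        # the remaining alignment components are abstracted into l_align.
        return cfg.w_pt * l_pt + l_align

    cfg = AlignmentConfig()
    print(total_loss(l_pt=0.7, l_align=0.3, cfg=cfg))  # 0.7 * 1.0 + 0.3 = 1.0

Packing the quoted values into one object like this mirrors how such setups are typically reproduced: every number the paper reports has a single named home, which makes it easy to audit a reimplementation against the text.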