Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms
Authors: Miaosen Zhang, Yixuan Wei, Zhen Xing, Yifei Ma, Zuxuan Wu, Ji Li, Zheng Zhang, Qi Dai, Chong Luo, Xin Geng, Baining Guo
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that our method significantly enhances the aesthetic behaviors of the vision models, under several metrics. |
| Researcher Affiliation | Collaboration | Southeast University, Tsinghua University, Fudan University, Microsoft |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | It is particularly costly to conduct a comprehensive code review. We plan to release the code in the future. The training data won't be released due to privacy reasons. |
| Open Datasets | Yes | They are trained on very large image-text pair datasets, e.g. LAION [43] and DataComp [8], rather than the traditional ImageNet [6]. |
| Dataset Splits | No | The paper describes the construction of a training dataset (D_po) and introduces a test set (HPIR) but does not explicitly provide standard training, validation, and test splits for a single dataset. |
| Hardware Specification | Yes | The computational resources include 256 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions optimizers and model architectures but does not specify software dependencies (e.g., libraries or frameworks) with version numbers. |
| Experiment Setup | Yes | In the alignment fine-tuning loss, the L_pt component is configured identically to the pretraining phase described in Sec. 2.1, encompassing batch size, temperature, and data, with a weight of w_pt = 1.0. For the remaining components, each batch comprises 128 queries. The overall learning rate is fixed to lr = 5 × 10⁻⁵. The partially ordered set D_po, as discussed in Sec. 2.3, is derived using u = v = 5, and a stride of 10. |
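
For readability, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object. The following is a minimal illustrative sketch in Python: every field name is our own hypothetical naming (the authors' code is not released), and only the numeric values come from the reported setup.

```python
# Sketch of the reported alignment fine-tuning hyperparameters.
# Field names are hypothetical; only the values are taken from the paper.
from dataclasses import dataclass


@dataclass
class AlignmentFinetuneConfig:
    # Weight of the pretraining-style loss term L_pt in the combined objective.
    pretrain_loss_weight: float = 1.0      # w_pt
    # Queries per batch for the remaining alignment loss components.
    queries_per_batch: int = 128
    # Overall learning rate (5 × 10⁻⁵).
    learning_rate: float = 5e-5
    # Parameters used to derive the partially ordered set D_po (Sec. 2.3).
    u: int = 5
    v: int = 5
    stride: int = 10


if __name__ == "__main__":
    config = AlignmentFinetuneConfig()
    print(config)
```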