Language-Informed Visual Concept Learning

Authors: Sharon Lee, Yunzhi Zhang, Shangzhe Wu, Jiajun Wu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that this visual concept representation achieves better disentanglement and compositionality, compared to text-based prompting baselines, as shown in Figures 2 and 6. We conduct thorough evaluations both quantitatively and qualitatively, and demonstrate that our approach yields superior results in visual concept editing compared to prior work.
Researcher Affiliation | Academia | Sharon Lee, Yunzhi Zhang, Shangzhe Wu, Jiajun Wu (Stanford University)
Pseudocode | No | The paper describes the training pipeline and methods in prose and figures, but does not include formal pseudocode or algorithm blocks.
Open Source Code | No | Project page at https://cs.stanford.edu/~yzzhang/projects/concept-axes. (This is a project page, not a direct code repository link or an explicit statement that code for the method is released.)
Open Datasets | No | We train the concept encoders only using synthetic images generated by DeepFloyd from 5 different domains, including fruits, figurines, furniture, art, and clothing. More details of our dataset can be found in Appendix A.2. (Appendix A.2 describes the generation process and the categories used to generate the data, but not how to access the generated dataset itself, e.g., a link or repository. A minimal generation sketch is given after the table.)
Dataset Splits | No | The paper describes training data and test-time finetuning on test images, but does not specify a validation split (e.g., percentages or counts) or mention a dedicated validation set.
Hardware Specification | Yes | Training on one dataset takes approximately 12 hours on one NVIDIA GeForce RTX 3090 GPU.
Software Dependencies | No | The paper mentions software and models such as DeepFloyd, T5, BLIP-2, CLIP, and AdamW, but does not provide version numbers for these or for underlying software dependencies (e.g., Python, PyTorch).
Experiment Setup | Yes | For training, we use the AdamW (Loshchilov & Hutter, 2017) optimizer with learning rate 0.02, and randomly flip the images horizontally. For test-time finetuning, we use the AdamW optimizer with learning rate 0.001. We set λ_k = 0.0001 (Equation (3)) for the category axis and λ = 0.001 for others. We use IF-I-XL from DeepFloyd as the backbone model, with training resolution 64×64. Training on one dataset takes approximately 12 hours on one NVIDIA GeForce RTX 3090 GPU. (The reported hyperparameters are collected in the configuration sketch after the table.)
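
As noted in the Open Datasets row, the training images are generated with DeepFloyd rather than taken from a released dataset. The snippet below is a minimal sketch, assuming the Hugging Face diffusers implementation of the stage-I model (DeepFloyd/IF-I-XL-v1.0, 64×64 output); the model ID and pipeline calls are standard diffusers usage, but the domain prompts and file names are illustrative placeholders, not the paper's actual generation script.

```python
import torch
from diffusers import DiffusionPipeline

# Stage-I DeepFloyd IF model; produces 64x64 images, matching the paper's
# reported training resolution.
pipe = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

# Illustrative prompts only; the paper's actual prompt templates are described
# in its Appendix A.2 and are not reproduced here.
domain_prompts = {
    "fruits": "a photo of an apple on a white background",
    "figurines": "a photo of a small ceramic cat figurine",
}

for domain, prompt in domain_prompts.items():
    prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)
    images = pipe(
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_embeds,
        num_inference_steps=50,
    ).images  # list of 64x64 PIL images
    images[0].save(f"{domain}_sample.png")
```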
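
The Experiment Setup row lists concrete hyperparameters, but the paper releases no code. The sketch below simply wires those reported values into standard PyTorch objects; the concept-encoder module is a placeholder, since the actual architecture is not public, and the λ weights are shown as constants rather than plugged into the paper's loss.

```python
import torch
from torchvision import transforms

# Reported values from the Experiment Setup row.
TRAIN_LR = 0.02         # AdamW learning rate for training the concept encoders
FINETUNE_LR = 0.001     # AdamW learning rate for test-time finetuning
LAMBDA_CATEGORY = 1e-4  # λ_k in Equation (3), used for the category axis
LAMBDA_OTHER = 1e-3     # λ for the remaining concept axes
RESOLUTION = 64         # DeepFloyd IF-I-XL stage-I training resolution

# Augmentation: images are randomly flipped horizontally during training.
train_transform = transforms.Compose([
    transforms.Resize((RESOLUTION, RESOLUTION)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

# Placeholder for the paper's per-axis concept encoders, which are not released.
concept_encoders = torch.nn.ModuleList(
    [torch.nn.Linear(768, 768) for _ in range(4)]
)

# Training uses AdamW (Loshchilov & Hutter, 2017) at lr=0.02 ...
optimizer = torch.optim.AdamW(concept_encoders.parameters(), lr=TRAIN_LR)
# ... and test-time finetuning reuses AdamW with the smaller lr=0.001.
finetune_optimizer = torch.optim.AdamW(concept_encoders.parameters(), lr=FINETUNE_LR)
```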