Yo'LLaVA: Your Personalized Language and Vision Assistant

Authors: Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, Yong Jae Lee

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our qualitative and quantitative analyses reveal that Yo'LLaVA can learn the concept more efficiently using fewer tokens and more effectively encode the visual attributes compared to strong prompting baselines (e.g., LLaVA).
Researcher Affiliation | Academia | Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, Yong Jae Lee (University of Wisconsin-Madison)
Pseudocode | No | The paper describes the training pipeline and model architecture but does not include a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Open source: We will publicly release the training and evaluation data for the personalized concept modeling task, as well as our code and models.
Open Datasets | Yes | Open source: We will publicly release the training and evaluation data for the personalized concept modeling task, as well as our code and models. Dataset. We collect a new dataset of 40 subjects: Person (10), Pets (5), Landmarks (5), Objects (15), and Fiction Characters (5).
Dataset Splits | No | The paper states, 'The dataset is divided into train and test splits.' It does not explicitly mention a separate validation split.
Hardware Specification | Yes | All experiments are conducted on a single A6000 GPU.
Software Dependencies | No | The paper mentions 'AdamW [38]' as the optimizer and 'LLaVA-1.5-13B [10]' as the base model, but does not list specific version numbers for software such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | Training. Unless stated otherwise, we use 5 images and k = 16 tokens to learn the subject. Each conversation is single-turn (one question and one answer). We use AdamW [38] with a 0.001 learning rate and LLaVA-1.5-13B [10] as the base model. Training images include 200 negative images per subject (100 hard negatives from retrieval and 100 easy negatives randomly sampled). We train each subject for up to 15 epochs, saving the best checkpoint based on recognition accuracy on the train set.
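
As a quick sanity check on the Open Datasets row, the per-category subject counts quoted from the paper add up to the stated 40 subjects; the dictionary below is purely illustrative and not part of the released dataset code.

```python
# Subject counts quoted in the "Open Datasets" row; the dict layout is illustrative.
subject_counts = {
    "Person": 10,
    "Pets": 5,
    "Landmarks": 5,
    "Objects": 15,
    "Fiction Characters": 5,
}

# The per-category counts sum to the 40 subjects reported in the paper.
assert sum(subject_counts.values()) == 40
```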
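
The Experiment Setup row packs several hyperparameters into one quote. Below is a minimal, hedged sketch of that configuration: k = 16 learnable concept-token embeddings, an AdamW optimizer over only those parameters, and checkpoint selection by a recognition metric. The hidden size, the stand-in loss, and all names here are assumptions for illustration, not the authors' released code.

```python
# Hedged sketch of the training setup in the "Experiment Setup" row.
# The frozen LLaVA-1.5-13B forward pass is replaced by a stand-in loss so the
# snippet runs on its own; the hidden size and all names are assumptions.
import torch
import torch.nn as nn

HIDDEN_DIM = 5120          # assumed hidden size for the 13B-parameter base model
K_TOKENS = 16              # k = 16 learnable tokens per personalized subject
LEARNING_RATE = 1e-3       # AdamW learning rate reported in the paper
MAX_EPOCHS = 15            # train each subject for up to 15 epochs
N_POSITIVE_IMAGES = 5      # images of the subject
N_NEGATIVE_IMAGES = 200    # 100 hard (retrieved) + 100 easy (random) negatives

# Only the new concept-token embeddings are optimized; the base model stays frozen.
concept_tokens = nn.Parameter(0.02 * torch.randn(K_TOKENS, HIDDEN_DIM))
optimizer = torch.optim.AdamW([concept_tokens], lr=LEARNING_RATE)

def stand_in_loss(tokens: torch.Tensor) -> torch.Tensor:
    """Placeholder for the frozen model's loss on a single-turn training
    conversation that includes the concept tokens."""
    return tokens.pow(2).mean()

best_accuracy = 0.0
for epoch in range(MAX_EPOCHS):
    optimizer.zero_grad()
    loss = stand_in_loss(concept_tokens)
    loss.backward()
    optimizer.step()
    # The paper keeps the checkpoint with the best recognition accuracy on the
    # train set; a placeholder metric stands in for that evaluation here.
    accuracy = 1.0 - loss.item()
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_tokens = concept_tokens.detach().clone()
```

In the actual method, the loss comes from the frozen LLaVA-1.5-13B answering the single-turn training conversations over the positive and negative images; this sketch only mirrors the optimizer, token count, epoch budget, and checkpoint-selection logic described in the quote.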