Yo'LLaVA: Your Personalized Language and Vision Assistant
Authors: Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, Yong Jae Lee
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our qualitative and quantitative analyses reveal that Yo'LLaVA can learn the concept more efficiently using fewer tokens and more effectively encode the visual attributes compared to strong prompting baselines (e.g., LLaVA). |
| Researcher Affiliation | Academia | Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, Yong Jae Lee (University of Wisconsin-Madison) |
| Pseudocode | No | The paper describes the training pipeline and model architecture but does not include a clearly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Open source: We will publicly release the training and evaluation data for the personalized concept modeling task, as well as our code and models. |
| Open Datasets | Yes | Open source: We will publicly release the training and evaluation data for the personalized concept modeling task, as well as our code and models. Dataset. We collect a new dataset of 40 subjects: Person (10), Pets (5), Landmarks (5), Objects (15), and Fiction Characters (5). |
| Dataset Splits | No | The paper states, 'The dataset is divided into train and test splits.' It does not explicitly mention a separate validation split for the dataset. |
| Hardware Specification | Yes | All experiments are conducted on a single A6000 GPU. |
| Software Dependencies | No | The paper mentions 'AdamW [38]' as an optimizer and 'LLaVA-1.5-13B [10]' as the base model, but does not list specific version numbers for general software libraries like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Training. Unless stated otherwise, we use 5 images and k = 16 tokens to learn the subject. Each conversation is single-turn (one question and one answer). We use AdamW [38] with a 0.001 learning rate and LLaVA-1.5-13B [10] as the base model. Training images include 200 negative images per subject (100 hard negatives from retrieval and 100 easy negatives randomly sampled). We train each subject for up to 15 epochs, saving the best checkpoint based on recognition accuracy on the train set. (See the configuration sketch below the table.) |
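
For a concrete picture of the Experiment Setup row, the sketch below collects the reported hyperparameters into a minimal configuration and shows how the k = 16 learnable concept tokens could be instantiated and optimized with AdamW. This is an illustrative assumption, not the authors' released code: the class name `YoLLaVAConfig`, the initialization scale, and `hidden_dim = 5120` (the hidden size of the 13B LLaMA/Vicuna backbone) are our guesses rather than values stated in the paper.

```python
# Hypothetical sketch of the reported Yo'LLaVA training setup (not the authors' code).
# Hyperparameter values come from the paper's Experiment Setup description.
from dataclasses import dataclass

import torch


@dataclass
class YoLLaVAConfig:
    base_model: str = "llava-1.5-13b"   # reported base model
    num_train_images: int = 5           # positive images per subject
    num_soft_tokens: int = 16           # k learnable concept tokens
    learning_rate: float = 1e-3         # AdamW learning rate
    num_hard_negatives: int = 100       # retrieved hard negatives per subject
    num_easy_negatives: int = 100       # randomly sampled easy negatives per subject
    max_epochs: int = 15                # best checkpoint chosen by train recognition accuracy


cfg = YoLLaVAConfig()

# In this sketch the k concept tokens are the only trainable parameters;
# 5120 is assumed to be the backbone's hidden size for the 13B model.
hidden_dim = 5120
soft_tokens = torch.nn.Parameter(torch.randn(cfg.num_soft_tokens, hidden_dim) * 0.02)
optimizer = torch.optim.AdamW([soft_tokens], lr=cfg.learning_rate)
```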