Yo'LLaVA: Your Personalized Language and Vision Assistant

Authors: Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, Yong Jae Lee

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our qualitative and quantitative analyses reveal that Yo'LLaVA can learn the concept more efficiently using fewer tokens and more effectively encode the visual attributes compared to strong prompting baselines (e.g., LLaVA).
Researcher Affiliation | Academia | Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, Yong Jae Lee (University of Wisconsin-Madison)
Pseudocode | No | The paper describes the training pipeline and model architecture but does not include a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Open source: We will publicly release the training and evaluation data for the personalized concept modeling task, as well as our code and models.
Open Datasets | Yes | Open source: We will publicly release the training and evaluation data for the personalized concept modeling task, as well as our code and models. Dataset. We collect a new dataset of 40 subjects: Person (10), Pets (5), Landmarks (5), Objects (15), and Fiction Characters (5).
Dataset Splits | No | The paper states, 'The dataset is divided into train and test splits.' It does not explicitly mention a separate validation split.
Hardware Specification | Yes | All experiments are conducted on a single A6000 GPU.
Software Dependencies | No | The paper mentions 'AdamW [38]' as the optimizer and 'LLaVA-1.5-13B [10]' as the base model, but does not list specific version numbers for software such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | Training. Unless stated otherwise, we use 5 images and k = 16 tokens to learn the subject. Each conversation is single-turn (one question and one answer). We use AdamW [38] with a 0.001 learning rate and LLaVA-1.5-13B [10] as the base model. Training images include 200 negative images per subject (100 hard negatives from retrieval and 100 easy negatives randomly sampled). We train each subject for up to 15 epochs, saving the best checkpoint based on recognition accuracy on the train set.
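
As a quick sanity check on the Open Datasets row, the per-category subject counts quoted from the paper add up to the stated 40 subjects; the dictionary below is purely illustrative and not part of the released dataset code.

```python
# Subject counts quoted in the "Open Datasets" row; the dict layout is illustrative.
subject_counts = {
    "Person": 10,
    "Pets": 5,
    "Landmarks": 5,
    "Objects": 15,
    "Fiction Characters": 5,
}

# The per-category counts sum to the 40 subjects reported in the paper.
assert sum(subject_counts.values()) == 40
```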
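
The Experiment Setup row packs several hyperparameters into one quote. Below is a minimal, hedged sketch of that configuration: k = 16 learnable concept-token embeddings, an AdamW optimizer over only those parameters, and checkpoint selection by a recognition metric. The hidden size, the stand-in loss, and all names here are assumptions for illustration, not the authors' released code.

```python
# Hedged sketch of the training setup in the "Experiment Setup" row.
# The frozen LLaVA-1.5-13B forward pass is replaced by a stand-in loss so the
# snippet runs on its own; the hidden size and all names are assumptions.
import torch
import torch.nn as nn

HIDDEN_DIM = 5120          # assumed hidden size for the 13B-parameter base model
K_TOKENS = 16              # k = 16 learnable tokens per personalized subject
LEARNING_RATE = 1e-3       # AdamW learning rate reported in the paper
MAX_EPOCHS = 15            # train each subject for up to 15 epochs
N_POSITIVE_IMAGES = 5      # images of the subject
N_NEGATIVE_IMAGES = 200    # 100 hard (retrieved) + 100 easy (random) negatives

# Only the new concept-token embeddings are optimized; the base model stays frozen.
concept_tokens = nn.Parameter(0.02 * torch.randn(K_TOKENS, HIDDEN_DIM))
optimizer = torch.optim.AdamW([concept_tokens], lr=LEARNING_RATE)

def stand_in_loss(tokens: torch.Tensor) -> torch.Tensor:
    """Placeholder for the frozen model's loss on a single-turn training
    conversation that includes the concept tokens."""
    return tokens.pow(2).mean()

best_accuracy = 0.0
for epoch in range(MAX_EPOCHS):
    optimizer.zero_grad()
    loss = stand_in_loss(concept_tokens)
    loss.backward()
    optimizer.step()
    # The paper keeps the checkpoint with the best recognition accuracy on the
    # train set; a placeholder metric stands in for that evaluation here.
    accuracy = 1.0 - loss.item()
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_tokens = concept_tokens.detach().clone()
```

In the actual method, the loss comes from the frozen LLaVA-1.5-13B answering the single-turn training conversations over the positive and negative images; this sketch only mirrors the optimizer, token count, epoch budget, and checkpoint-selection logic described in the quote.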