ILLUME: Rationalizing Vision-Language Models through Human Interactions

Authors: Manuel Brack, Patrick Schramowski, Björn Deiseroth, Kristian Kersting

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our exhaustive experiments demonstrate that ILLUME is competitive with standard supervised fine-tuning while using significantly fewer training data and only requiring minimal feedback.
Researcher Affiliation | Collaboration | 1 German Center for Artificial Intelligence (DFKI); 2 Computer Science Department, TU Darmstadt; 3 Hessian Center for AI (hessian.AI); 4 LAION; 5 Aleph Alpha; 6 Centre for Cognitive Science, TU Darmstadt.
Pseudocode | No | The paper describes its methods in text and diagrams but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code available at: https://github.com/ml-research/ILLUME
Open Datasets | Yes | We use six diverse commonsense reasoning benchmarks to evaluate self-talk in natural language. These datasets are CSQA (Talmor et al., 2019), COPA (Gordon et al., 2012; Roemmele et al., 2011), McTaco (Zhou et al., 2019), PIQA (Bisk et al., 2020), Social IQA (Sap et al., 2019) and WinoGrande (Sakaguchi et al., 2020). For the visual reasoning task we consider three datasets, namely VQA-X, ACT-X (Park et al., 2018), and CLEVR-X (Salewski et al., 2020). For this task we used the Socio-Moral-Image-Database (SMID) (Crone et al., 2018).
Dataset Splits | Yes | Any additional hyper-parameter optimization was performed on the dedicated validation splits, with the test splits being evaluated only for reporting final scores.
Hardware Specification | Yes | The training was distributed over 8 A100 GPUs resulting in a per GPU batch size of 32.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | For all experiments, we used the AdamW optimizer and a batch size of 256. The training was distributed over 8 A100 GPUs resulting in a per GPU batch size of 32. Regarding Eq. 1, we added roughly ten times more samples without explanation X_A than X_E to regularize optimization. (A training sketch reflecting this setup follows below.)
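
The Hardware Specification and Experiment Setup rows together quote the full training configuration: AdamW, a global batch size of 256 distributed over 8 A100 GPUs (32 per GPU), and roughly ten times more answer-only samples X_A than explanation samples X_E mixed in to regularize Eq. 1. The following is a minimal, hypothetical PyTorch sketch of that configuration, not the authors' released code (linked above); the dataset objects, dummy model, and dummy loss are placeholders, and Eq. 1 itself is not reproduced here.

```python
# Minimal sketch of the reported training configuration (not the authors' code).
# Assumptions (hypothetical): placeholder datasets stand in for X_E (samples with
# explanations) and X_A (answer-only samples); the model and loss are dummies.
# Only the quantities quoted in the table are taken from the paper: AdamW, a
# global batch size of 256 split over 8 GPUs (32 per GPU), and roughly ten times
# more X_A than X_E samples in the training mix.

import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

GLOBAL_BATCH_SIZE = 256
NUM_GPUS = 8
PER_GPU_BATCH_SIZE = GLOBAL_BATCH_SIZE // NUM_GPUS  # = 32, as reported

# Placeholder data pools: ~10x more answer-only samples (X_A) than
# explanation samples (X_E), mirroring the reported ratio.
samples_with_explanations = TensorDataset(torch.randn(100, 16))   # X_E
samples_answer_only = TensorDataset(torch.randn(1000, 16))        # X_A

# Concatenating the two pools is the simplest way to realize the ~10:1 mix
# that the paper reports adding to regularize the optimization of its Eq. 1.
train_data = ConcatDataset([samples_with_explanations, samples_answer_only])
loader = DataLoader(train_data, batch_size=PER_GPU_BATCH_SIZE, shuffle=True)

# Placeholder model; in the actual setup this would be the fine-tuned
# vision-language model, wrapped in DistributedDataParallel across the 8 GPUs.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters())  # learning rate not quoted in this section

for (batch,) in loader:
    loss = model(batch).pow(2).mean()  # dummy objective in place of Eq. 1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In the released repository the same ratio could equally be realized with a weighted sampler rather than dataset concatenation; the sketch only reflects the quoted quantities, not the implementation choices of the actual code.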