ILLUME: Rationalizing Vision-Language Models through Human Interactions
Authors: Manuel Brack, Patrick Schramowski, Björn Deiseroth, Kristian Kersting
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our exhaustive experiments demonstrate that ILLUME is competitive with standard supervised fine-tuning while using significantly less training data and only requiring minimal feedback. |
| Researcher Affiliation | Collaboration | German Research Center for Artificial Intelligence (DFKI); Computer Science Department, TU Darmstadt; Hessian Center for AI (hessian.AI); LAION; Aleph Alpha; Centre for Cognitive Science, TU Darmstadt. |
| Pseudocode | No | The paper describes its methods in text and diagrams but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code available at: https://github.com/ml-research/ILLUME |
| Open Datasets | Yes | We use six diverse commonsense reasoning benchmarks to evaluate self-talk in natural language. These datasets are CSQA (Talmor et al., 2019), COPA (Gordon et al., 2012; Roemmele et al., 2011), McTaco (Zhou et al., 2019), PIQA (Bisk et al., 2020), Social IQA (Sap et al., 2019) and WinoGrande (Sakaguchi et al., 2020). For the visual reasoning task we consider three datasets, namely VQA-X, ACT-X (Park et al., 2018), and CLEVR-X (Salewski et al., 2020). For this task we used the Socio-Moral Image Database (SMID) (Crone et al., 2018). |
| Dataset Splits | Yes | Any additional hyper-parameter optimization was performed on the dedicated validation splits, with the test splits being evaluated only for reporting final scores. |
| Hardware Specification | Yes | The training was distributed over 8 A100 GPUs resulting in a per GPU batch size of 32. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For all experiments, we used the AdamW optimizer and a batch size of 256. The training was distributed over 8 A100 GPUs, resulting in a per-GPU batch size of 32. Regarding Eq. 1, we added roughly ten times more samples without explanation (X_A) than samples with explanation (X_E) to regularize optimization. |
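The quoted training configuration (AdamW, a global batch size of 256 split across 8 GPUs for 32 samples per GPU, and roughly ten times more X_A than X_E samples mixed into training) can be mirrored in a short PyTorch sketch. This is a minimal, hedged illustration assuming a standard DistributedDataParallel setup; the model, dataset objects, learning rate, and helper names are placeholders and are not taken from the paper or the ILLUME repository.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import ConcatDataset, DataLoader, DistributedSampler

GLOBAL_BATCH_SIZE = 256                               # batch size of 256, as stated
WORLD_SIZE = 8                                        # distributed over 8 A100 GPUs
PER_GPU_BATCH_SIZE = GLOBAL_BATCH_SIZE // WORLD_SIZE  # = 32 per GPU


def setup_distributed(rank: int) -> None:
    # Assumes the process-group environment (MASTER_ADDR/MASTER_PORT) is
    # provided, e.g. by launching with torchrun.
    dist.init_process_group("nccl", rank=rank, world_size=WORLD_SIZE)
    torch.cuda.set_device(rank)


def build_training(model, explained_ds, answer_only_ds, rank: int):
    """Hypothetical helper: explained_ds plays the role of X_E (samples with
    explanations) and answer_only_ds the role of X_A (samples without
    explanations, roughly 10x as many). Both are mixed into one training set
    to regularize optimization, as described for Eq. 1."""
    dataset = ConcatDataset([explained_ds, answer_only_ds])
    sampler = DistributedSampler(
        dataset, num_replicas=WORLD_SIZE, rank=rank, shuffle=True
    )
    loader = DataLoader(dataset, batch_size=PER_GPU_BATCH_SIZE, sampler=sampler)

    ddp_model = DDP(model.to(rank), device_ids=[rank])
    # AdamW as stated in the paper; the learning rate here is a placeholder.
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-5)
    return loader, ddp_model, optimizer
```

With these constants, each of the 8 ranks processes 32 samples per step, matching the reported global batch size of 256.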