ILLUME: Rationalizing Vision-Language Models through Human Interactions

Authors: Manuel Brack, Patrick Schramowski, Björn Deiseroth, Kristian Kersting

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our exhaustive experiments demonstrate that ILLUME is competitive with standard supervised fine-tuning while using significantly fewer training data and only requiring minimal feedback.
Researcher Affiliation | Collaboration | 1 German Center for Artificial Intelligence (DFKI); 2 Computer Science Department, TU Darmstadt; 3 Hessian Center for AI (hessian.AI); 4 LAION; 5 Aleph Alpha; 6 Centre for Cognitive Science, TU Darmstadt.
Pseudocode | No | The paper describes its methods in text and diagrams but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code available at: https://github.com/ml-research/ILLUME
Open Datasets | Yes | We use six diverse commonsense reasoning benchmarks to evaluate self-talk in natural language. These datasets are CSQA (Talmor et al., 2019), COPA (Gordon et al., 2012; Roemmele et al., 2011), McTaco (Zhou et al., 2019), PIQA (Bisk et al., 2020), Social IQA (Sap et al., 2019) and WinoGrande (Sakaguchi et al., 2020). For the visual reasoning task we consider three datasets, namely VQA-X, ACT-X (Park et al., 2018), and CLEVR-X (Salewski et al., 2020). For this task we used the Socio-Moral-Image-Database (SMID) (Crone et al., 2018).
Dataset Splits | Yes | Any additional hyper-parameter optimization was performed on the dedicated validation splits, with the test splits being evaluated only for reporting final scores.
Hardware Specification | Yes | The training was distributed over 8 A100 GPUs resulting in a per GPU batch size of 32.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | For all experiments, we used the AdamW optimizer and a batch size of 256. The training was distributed over 8 A100 GPUs resulting in a per GPU batch size of 32. Regarding Eq. 1, we added roughly ten times more samples without explanation X_A than X_E to regularize optimization. (A training sketch reflecting this setup follows below.)
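
The Hardware Specification and Experiment Setup rows together quote the full training configuration: AdamW, a global batch size of 256 distributed over 8 A100 GPUs (32 per GPU), and roughly ten times more answer-only samples X_A than explanation samples X_E mixed in to regularize Eq. 1. The following is a minimal, hypothetical PyTorch sketch of that configuration, not the authors' released code (linked above); the dataset objects, dummy model, and dummy loss are placeholders, and Eq. 1 itself is not reproduced here.

```python
# Minimal sketch of the reported training configuration (not the authors' code).
# Assumptions (hypothetical): placeholder datasets stand in for X_E (samples with
# explanations) and X_A (answer-only samples); the model and loss are dummies.
# Only the quantities quoted in the table are taken from the paper: AdamW, a
# global batch size of 256 split over 8 GPUs (32 per GPU), and roughly ten times
# more X_A than X_E samples in the training mix.

import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

GLOBAL_BATCH_SIZE = 256
NUM_GPUS = 8
PER_GPU_BATCH_SIZE = GLOBAL_BATCH_SIZE // NUM_GPUS  # = 32, as reported

# Placeholder data pools: ~10x more answer-only samples (X_A) than
# explanation samples (X_E), mirroring the reported ratio.
samples_with_explanations = TensorDataset(torch.randn(100, 16))   # X_E
samples_answer_only = TensorDataset(torch.randn(1000, 16))        # X_A

# Concatenating the two pools is the simplest way to realize the ~10:1 mix
# that the paper reports adding to regularize the optimization of its Eq. 1.
train_data = ConcatDataset([samples_with_explanations, samples_answer_only])
loader = DataLoader(train_data, batch_size=PER_GPU_BATCH_SIZE, shuffle=True)

# Placeholder model; in the actual setup this would be the fine-tuned
# vision-language model, wrapped in DistributedDataParallel across the 8 GPUs.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters())  # learning rate not quoted in this section

for (batch,) in loader:
    loss = model(batch).pow(2).mean()  # dummy objective in place of Eq. 1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In the released repository the same ratio could equally be realized with a weighted sampler rather than dataset concatenation; the sketch only reflects the quoted quantities, not the implementation choices of the actual code.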