Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation

Authors: Zhiwei Zhang, Yuliang Liu

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct comprehensive analyses of experimental results, focusing on re-created image quality, answer accuracy, and the model's behavior when faced with uncertainty and imperfect user queries.
Researcher Affiliation Academia Zhiwei Zhang (The Chinese University of Hong Kong; Centre for Perceptual and Interactive Intelligence); Yuliang Liu (The Chinese University of Hong Kong; Huazhong University of Science and Technology; Centre for Perceptual and Interactive Intelligence)
Pseudocode No The paper describes the model architecture and training procedure in descriptive text, but it does not include any explicitly labeled pseudocode blocks or algorithms.
Open Source Code Yes Dataset and code are available at https://matrix-alpha.github.io. All the datasets used in this study are publicly available, and we will release the source code for our annotation tool, evaluation tool, implementation of baseline models, metric calculations, and detailed instructions.
Open Datasets Yes In this paper, we address this gap by introducing two novel multimodal datasets: the synthetic CLEVR-ATVC dataset (620K) and the manually pictured Fruit-ATVC dataset (50K). ... Dataset and code are available at https://matrix-alpha.github.io.
Dataset Splits Yes CLEVR-ATVC consists of a training set with a total of 620,000 pairs and a testing set with 500 visual inputs, each accompanied by 10 queries, resulting in a total of 5,000 pairs. ... Fruit-ATVC comprises 27,503 training pairs and 1,872 pairs in the testing set.
Hardware Specification No The model is trained for approximately 900 GPU days on the CLEVR-ATVC dataset and 350 GPU days on the Fruit-ATVC dataset. No specific GPU models or other hardware details are provided.
Software Dependencies Yes All experiments are conducted with PyTorch 1.8.1.
Experiment Setup Yes The model parameters are updated through Adam (Diederik P. Kingma, 2014) with β1 = 0.9, β2 = 0.999. We first train the image reconstruction module for 200 epochs with an initial learning rate of 0.001 and a decay rate of 0.99 per epoch. ... The number of attention heads, the attention key and value dimensions, the number of layers, and the dimensions of the model are set to 8, 64, 4, and 512, respectively. ... The maximum length of the text sequence is set to 64, and the output length of the image codebook is 1024, resulting in a total sequence length of 2112 for the transformer model. The second stage is trained distributively over 200 epochs with a fixed learning rate of 0.0003.
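The reported hyperparameters can be collected into a minimal PyTorch sketch. This is an illustrative reconstruction, not the authors' code: the module names (`recon_model`, `transformer_model`) are placeholders, and the actual image-reconstruction architecture is not reproduced here. Only the values quoted above (Adam betas, learning rates, 0.99-per-epoch decay, and the 8/64/4/512 transformer shape) come from the paper.

```python
import torch

NUM_HEADS = 8      # attention heads
NUM_LAYERS = 4     # transformer layers
MODEL_DIM = 512    # model dimension (512 / 8 heads = 64-dim keys/values)
TEXT_LEN = 64      # maximum text sequence length
IMAGE_CODE_LEN = 1024  # image codebook output length

# Stage 1: image reconstruction module (placeholder architecture),
# Adam with lr 0.001 decayed by 0.99 per epoch.
recon_model = torch.nn.Linear(MODEL_DIM, MODEL_DIM)
stage1_opt = torch.optim.Adam(
    recon_model.parameters(), lr=1e-3, betas=(0.9, 0.999))
stage1_sched = torch.optim.lr_scheduler.ExponentialLR(stage1_opt, gamma=0.99)

# Stage 2: transformer with the reported shape, fixed lr 0.0003.
encoder_layer = torch.nn.TransformerEncoderLayer(
    d_model=MODEL_DIM, nhead=NUM_HEADS, batch_first=True)
transformer_model = torch.nn.TransformerEncoder(encoder_layer, NUM_LAYERS)
stage2_opt = torch.optim.Adam(
    transformer_model.parameters(), lr=3e-4, betas=(0.9, 0.999))
```

Calling `stage1_sched.step()` once per epoch reproduces the stated decay (after one epoch the stage-1 learning rate is 0.001 × 0.99 = 0.00099), while the stage-2 optimizer keeps its fixed rate.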