CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?

Authors: Ibrahim Alabdulmohsin, Xiao Wang, Andreas Peter Steiner, Priya Goyal, Alexander D'Amour, Xiaohua Zhai

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our study also explores the dynamic nature of how CLIP learns/unlearns biases. In particular, we find that fine-tuning is effective in countering representation biases, though its impact diminishes for association biases. Also, data balancing has a mixed impact on quality: it tends to improve classification but can hurt retrieval. Interestingly, data and architectural improvements seem to mitigate the negative impact of data balancing on performance; e.g. applying M4 to SigLIP-B/16 with data quality filters improves COCO image-to-text retrieval @5 from 86% (without data balancing) to 87% and ImageNet 0-shot classification from 77% to 77.5%! Finally, we conclude with recommendations for improving the efficacy of data balancing in multimodal systems.
Researcher Affiliation | Industry | Google DeepMind: Zürich, Switzerland; New York, USA; Boston, USA. {ibomohsin,xzhai}@google.com
Pseudocode | Yes | An overview of the data balancing algorithm is shown in Figure 7. It maintains two optimization variables v ∈ R^(2m(c+1)) and µ ∈ R, which are used to calculate the sample weight q by solving: ... Figure 7: Pseudo-code of the data balancing algorithm in Section 5. LEFT: Single update per example (s, y, u), where u is the example's utility. RIGHT: NumPy-like implementation of the bias vector a. (An illustrative NumPy sketch of this bias vector appears after the table.)
Open Source Code | No | The paper does not provide explicit statements about the release of its own source code (e.g., 'We release our code...'), nor does it include a direct link to a code repository for the methodology described.
Open Datasets | Yes | We evaluate the models on ImageNet-ILSVRC2012 (Deng et al., 2009), FairFace (Karkkainen & Joo, 2021), UTKFace (Zhang et al., 2017), and MIAP (Schumann et al., 2021).
Dataset Splits | No | The paper describes a two-stage training process and data sizes ('1B image-text pairs... We vary the length of Stage 1 in {0%, 10%, 90%}'). However, it does not specify explicit train/validation/test dataset splits with percentages, absolute counts, or references to predefined splits for the main training dataset.
Hardware Specification | Yes | Hardware & Software: The model is implemented using JAX (Bradbury et al., 2018), Flax (Heek et al., 2020), and Big Vision (Beyer et al., 2022). It is trained on TPU v2. Compute Requirements: Each model is trained on 8×8 TPU v2 chips on 1B seen image-text pairs.
Software Dependencies | No | The paper mentions the software used ('JAX', 'Flax', 'Big Vision', 'SentencePiece') and cites their origins, but it does not provide specific version numbers for these software dependencies (e.g., 'JAX 0.3.17').
Experiment Setup | Yes | A.4 TRAINING CONFIGURATION: batch_size = 16_384, shuffle_buffer_size = 250_000, pp = 'decode|resize(224)|value_range(-1,1)', model.temperature_init = 10.0, optax_name = 'scale_by_adafactor', grad_clip_norm = 1.0, lr = 0.001, wd = 0.0001, schedule.decay_type = 'rsqrt', schedule.timescale = 5_000, schedule.warmup_steps = 5_000. (An illustrative config sketch appears after the table.)
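
For illustration: Figure 7 itself is not reproduced in the report, but the quoted description is enough to sketch its RIGHT panel. Below is a minimal NumPy sketch, assuming s in {0, ..., m-1} indexes the sensitive group and y in {0, ..., c} indexes the label; the one-hot-plus-negation encoding is an assumption consistent with the stated 2m(c+1) dimensionality of v, not the paper's exact code.

    import numpy as np

    def bias_vector(s: int, y: int, m: int, c: int) -> np.ndarray:
        """Hypothetical bias vector a for one example, of length 2m(c+1).

        Assumes s in {0, ..., m-1} indexes the sensitive-attribute group
        and y in {0, ..., c} indexes the label (c+1 slots, e.g. one for
        "no label"). The encoding in the paper's Figure 7 may differ.
        """
        # One-hot indicator over the m * (c + 1) (group, label) cells.
        e = np.zeros(m * (c + 1))
        e[s * (c + 1) + y] = 1.0
        # Stack the indicator with its negation so a single inner product
        # v @ a can express two-sided (upper/lower) constraint violations,
        # matching the 2m(c+1) dimensionality of v quoted above.
        return np.concatenate([e, -e])

Stacking e with -e is one natural way to reach the quoted 2m(c+1) dimensionality while keeping the per-example update a single dot product against v.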
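
For illustration: the quoted hyperparameter keys follow Big Vision's config conventions, so they can be collected into an ml_collections.ConfigDict, the format that codebase consumes. This is a minimal sketch built only from the quoted values; the get_config entry point and the nesting of model and schedule are assumptions, and the rest of the training config (model definition, input pipeline, evaluators) is omitted.

    import ml_collections

    def get_config() -> ml_collections.ConfigDict:
        """Sketch of the reported A.4 training configuration."""
        config = ml_collections.ConfigDict()
        config.batch_size = 16_384
        config.shuffle_buffer_size = 250_000
        # Preprocessing pipeline: decode, resize to 224px, scale to [-1, 1].
        config.pp = 'decode|resize(224)|value_range(-1,1)'
        config.model = ml_collections.ConfigDict()
        config.model.temperature_init = 10.0
        config.optax_name = 'scale_by_adafactor'
        config.grad_clip_norm = 1.0
        config.lr = 0.001
        config.wd = 0.0001  # weight decay
        config.schedule = ml_collections.ConfigDict()
        config.schedule.decay_type = 'rsqrt'  # reciprocal-sqrt decay
        config.schedule.timescale = 5_000
        config.schedule.warmup_steps = 5_000
        return config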