Geodesic Multi-Modal Mixup for Robust Fine-Tuning
Authors: Changdae Oh, Junhyuk So, Hoyoon Byun, YongTaek Lim, Minchul Shin, Jong-June Jeon, Kyungwoo Song
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on retrieval, calibration, few- or zero-shot classification (under distribution shift), embedding arithmetic, and image captioning further show that our method provides transferable representations, enabling robust model adaptation on diverse tasks. |
| Researcher Affiliation | Academia | Changdae Oh (University of Seoul), Junhyuk So (POSTECH), Hoyoon Byun (University of Seoul), YongTaek Lim (University of Seoul), Minchul Shin (KAIST), Jong-June Jeon (University of Seoul), Kyungwoo Song (Yonsei University) |
| Pseudocode | No | The paper states that pseudocode is provided in the Supplementary Material, but the supplementary material is not included in the provided text, so it falls outside the scope of this analysis. Text: "Further details, hyperparameters selection, pseudo code, and additional results are put in Sec. A, B, and C of SM, respectively." |
| Open Source Code | Yes | Code: https://github.com/changdaeoh/multimodal-mixup |
| Open Datasets | Yes | First, we validate our method on image-text retrieval, a representative vision-language task, on Flickr30k [67] and MS COCO [70]. We consider Oxford Pets [75], SVHN [76], and CLEVR [77] for the general setting and ImageNet-1k, ImageNet V2 [78], ImageNet-A [79], ImageNet-R [80], and ImageNet-Sketch [81] for the distribution shift setting. In this section, we study whether m²-Mix can help the multi-modal representation learning for video recognition (CMU-MOSEI [83]) under modality missing. (A slerp-based sketch of geodesic mixup follows the table.) |
| Dataset Splits | No | The paper describes a "few-shot evaluation protocol: 16-shot training samples per class and inference on the entire test set" but does not specify a separate validation split or explicit proportions for validation data within the main text. Text: "Following [82, 64], we perform the tasks under a few-shot evaluation protocol: 16-shot training samples per class and inference on the entire test set." |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, memory). It only discusses software components and training settings. |
| Software Dependencies | No | The paper mentions software like "Adam optimizer", "CLIP ViT-B/32", "OpenCLIP library", "BERT [73]", and "ResNet-50 [74]", but it does not specify version numbers for these components. Text: "All methods are trained over 9 epochs with Adam optimizer (details in SM). Unless otherwise stated, we adopt CLIP ViT-B/32 as our backbone model." |
| Experiment Setup | Yes | All methods are trained over 9 epochs with Adam optimizer (details in SM). Following [82, 64], we perform the tasks under a few-shot evaluation protocol: 16-shot training samples per class and inference on the entire test set. FT (τ = 0.05). For all three methods, we train the model on MS COCO over one epoch with OpenCLIP-provided hyperparameter configuration. (A minimal sketch of this setup follows the table.) |
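
The title and the m²-Mix mentions above indicate that image and text embeddings are mixed along the geodesic of the unit hypersphere rather than by linear interpolation, but the excerpts in this report do not spell out the formula. The snippet below is therefore only a minimal sketch under that reading, using standard spherical linear interpolation (slerp) of L2-normalized embeddings; `geodesic_mixup`, the Beta-sampled ratio, and the embedding dimension are illustrative, not the authors' exact m²-Mix implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F


def geodesic_mixup(u, v, lam, eps=1e-7):
    """Slerp-style mixup of two batches of embeddings on the unit hypersphere."""
    # Project both batches onto the unit sphere, where CLIP-style embeddings
    # are compared by cosine similarity.
    u = F.normalize(u, dim=-1)
    v = F.normalize(v, dim=-1)
    # Angle between each corresponding pair of embeddings.
    cos = (u * v).sum(dim=-1, keepdim=True).clamp(-1.0 + eps, 1.0 - eps)
    theta = torch.acos(cos)
    # Spherical linear interpolation: lam = 1 returns u, lam = 0 returns v,
    # and every intermediate point stays on the sphere (i.e., on the geodesic).
    return (torch.sin(lam * theta) * u + torch.sin((1.0 - lam) * theta) * v) / torch.sin(theta)


if __name__ == "__main__":
    img = torch.randn(8, 512)  # stand-ins for image embeddings
    txt = torch.randn(8, 512)  # stand-ins for text embeddings
    lam = torch.distributions.Beta(2.0, 2.0).sample()
    mixed = geodesic_mixup(img, txt, lam)
    print(mixed.norm(dim=-1))  # ~1.0 everywhere: mixtures remain on the sphere
```

Unlike linear mixup, the slerp form keeps the mixed embeddings at unit norm, which is what makes the mixing "geodesic" on the hypersphere.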
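
The experiment-setup row reports a CLIP ViT-B/32 backbone from the OpenCLIP library, the Adam optimizer, and FT with τ = 0.05. A plain contrastive fine-tuning step under that setup could look like the sketch below; the learning rate, device handling, and `train_step` helper are placeholders rather than the hyperparameter configuration used in the paper or shipped with OpenCLIP.

```python
import torch
import torch.nn.functional as F
import open_clip

# CLIP ViT-B/32 backbone via the OpenCLIP library, as reported in the paper.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)  # placeholder learning rate


def train_step(images, captions, tau=0.05):
    """One symmetric InfoNCE step; tau = 0.05 matches the quoted FT temperature."""
    img = F.normalize(model.encode_image(images), dim=-1)
    txt = F.normalize(model.encode_text(tokenizer(captions)), dim=-1)
    logits = img @ txt.t() / tau
    labels = torch.arange(logits.size(0))
    loss = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The authors' proposed methods would modify this plain FT objective (e.g., by folding mixed embeddings into the contrastive loss); that integration is not reconstructed here.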