Visual Emotion Representation Learning via Emotion-Aware Pre-training

Authors: Yue Zhang, Wanying Ding, Ran Xu, Xiaohua Hu

IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility variables, the result recorded for each, and the supporting LLM response:
Research Type: Experimental. We conduct our pre-training on a large web dataset with noisy tags and fine-tune on smaller visual emotion classification datasets with class label supervision. Our method achieves state-of-the-art performance for visual emotion classification. In this section, we first describe the datasets for pre-training and adaptation in Section 4.1, then introduce baseline methods and experimental results in Section 4.2, and finally conduct ablation studies and summarize in Section 4.3.
Researcher Affiliation: Collaboration. Yue Zhang (Drexel University, College of Computing & Informatics, Philadelphia, PA, USA), Wanying Ding (JPMorgan Chase & Co., Palo Alto, CA, USA), Ran Xu (Salesforce Research, Palo Alto, CA, USA), and Xiaohua Hu (Drexel University, College of Computing & Informatics, Philadelphia, PA, USA). Emails: yz559@drexel.edu, wanying.alice@gmail.com, xurantju@gmail.com, xh29@drexel.edu.
Pseudocode: No. The paper describes the model architecture and training process in text and diagrams (Figure 2, Figure 3), but it does not include any explicitly labeled pseudocode or algorithm block.
Open Source Code: No. The paper provides no link to source code and no explicit statement about releasing code for the described methodology.
Open Datasets: Yes. Stock Emotion [Wei et al., 2020] is collected in several steps. The authors first search Adobe Stock with various emotion keywords and concepts, and rank the words associated with the images... Deep Emotion [You et al., 2016] contains 23,815 images labeled with eight emotion categories... Emotion6 [Peng et al., 2015] is collected by searching images related to six emotion keywords and their synonyms... UnBiased Emotion [Panda et al., 2018] contains 3,045 images with six emotion categories collected from Google. EMOTIC [Kosti et al., 2020] contains 23,571 images and 34,320 annotated people with body and face bounding boxes.
Dataset Splits: Yes. We follow the split setting in [Panda et al., 2018] and [Wei et al., 2020], randomly selecting 80% of the data for training and the remaining 20% for testing. We follow the training setup in [Yang et al., 2018a] with 80% of the data for training and 20% for testing. We use 27k images for our downstream task following [Wei et al., 2020], with 22k training images and 5k testing images. We follow the setting from [Wei et al., 2020] and [Panda et al., 2018] with 80% of the images for training and 20% for testing. We follow the original split provided by the authors, with 70% of the images for training, 10% for validation, and 20% for testing.
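
As a concrete illustration of the 80/20 protocol quoted above, here is a minimal random-split sketch. The label file name and column name are hypothetical placeholders, and the referenced papers define the authoritative protocols; EMOTIC instead keeps the authors' fixed 70/10/20 split.

    # Minimal 80/20 random split sketch. "emotion_labels.csv" and the layout of
    # its rows are assumptions for illustration, not artifacts from the paper.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    labels = pd.read_csv("emotion_labels.csv")   # one row per image
    train_df, test_df = train_test_split(labels, test_size=0.20, random_state=0)

    # EMOTIC is the exception: it uses the authors' original 70/10/20
    # train/validation/test split rather than a fresh random split.
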
Hardware Specification: Yes. We pre-train our model with 8 NVIDIA V100 GPUs and a batch size of 384 image-text pairs.
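
For context on the hardware entry, the arithmetic below shows how a global batch of 384 pairs could be sharded evenly across the 8 GPUs; the per-GPU figure is an assumption, since the paper reports only the global batch size.

    # Hypothetical even sharding of the reported global batch over the 8 V100s.
    NUM_GPUS = 8
    GLOBAL_BATCH = 384                        # image-text pairs per optimization step
    per_gpu_batch = GLOBAL_BATCH // NUM_GPUS  # 48 pairs per GPU under even sharding
    assert per_gpu_batch * NUM_GPUS == GLOBAL_BATCH
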
Software Dependencies: No. The paper mentions software components such as the BERT base model (bert-base-uncased), Mask R-CNN, and AdamW, but it does not provide specific version numbers for these dependencies.
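
A sketch of one plausible way to obtain the named components from common libraries (HuggingFace Transformers for BERT, torchvision for Mask R-CNN, PyTorch for AdamW). The authors' actual stack and versions are unreported, so every library choice below is an assumption.

    # Unversioned dependencies named in the paper, loaded from common libraries.
    # These library choices are assumptions; the paper does not state its stack.
    import torch
    import torchvision
    from transformers import BertModel, BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    text_encoder = BertModel.from_pretrained("bert-base-uncased")

    # torchvision ships a Mask R-CNN (older versions use pretrained=True instead
    # of weights="DEFAULT"); the paper's detector implementation may differ.
    detector = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

    optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=2e-5)
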
Experiment Setup: Yes. We pre-train our model with 8 NVIDIA V100 GPUs and a batch size of 384 image-text pairs. The learning rate is set to 2e-5 and we train for 30 epochs with AdamW. We set the maximum token sequence length (including both visual features and words) to 100.
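
Putting the reported hyperparameters together, here is a minimal configuration sketch (AdamW, learning rate 2e-5, 30 epochs, batch size 384, maximum joint sequence length 100). The model, data, and loss are placeholders rather than the authors' implementation.

    # Reported pre-training hyperparameters wired into a placeholder loop.
    import torch
    from torch.utils.data import DataLoader

    MAX_SEQ_LEN = 100   # visual features + word tokens combined
    BATCH_SIZE = 384    # image-text pairs (global)
    EPOCHS = 30
    LR = 2e-5

    model = torch.nn.Linear(768, 8)                            # placeholder model
    optimizer = torch.optim.AdamW(model.parameters(), lr=LR)

    def pretrain(dataset, collate_fn):
        loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True,
                            collate_fn=collate_fn)
        for _ in range(EPOCHS):
            for batch in loader:                               # batch["features"]: (B, 768)
                optimizer.zero_grad()
                loss = model(batch["features"]).pow(2).mean()  # placeholder loss
                loss.backward()
                optimizer.step()
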