Visual Emotion Representation Learning via Emotion-Aware Pre-training
Authors: Yue Zhang, Wanying Ding, Ran Xu, Xiaohua Hu
IJCAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct our pre-training on a large web dataset with noisy tags and fine-tune on smaller visual emotion classification datasets with class label supervision. Our method achieves state-of-the-art performance for visual emotion classification. In this section, we first describe the datasets for pre-training and adaptation in Section 4.1, then introduce baseline methods and experimental results in Section 4.2, and finally conduct ablation studies and summarize in Section 4.3. |
| Researcher Affiliation | Collaboration | Yue Zhang¹, Wanying Ding², Ran Xu³ and Xiaohua Hu¹ (¹Drexel University, College of Computing & Informatics, Philadelphia, PA, USA; ²JPMorgan Chase & Co., Palo Alto, CA, USA; ³Salesforce Research, Palo Alto, CA, USA). yz559@drexel.edu, wanying.alice@gmail.com, xurantju@gmail.com, xh29@drexel.edu |
| Pseudocode | No | The paper describes the model architecture and training process using text and diagrams (Figure 2, Figure 3), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not provide any specific links or explicit statements about the release of source code for the described methodology. |
| Open Datasets | Yes | Stock Emotion [Wei et al., 2020] is collected in several steps. The authors first search Adobe Stock with various emotion keywords and concepts, and rank the words associated with the images... Deep Emotion [You et al., 2016] contains 23,815 images labeled with eight emotion categories... Emotion6 [Peng et al., 2015] is collected by searching images related to six emotion keywords and their synonyms... UnBiased Emotion [Panda et al., 2018] contains 3,045 images with six emotion categories collected from Google. EMOTIC [Kosti et al., 2020] contains 23,571 images and 34,320 annotated people with body and face bounding boxes. |
| Dataset Splits | Yes | We follow the split setting in [Panda et al., 2018] and [Wei et al., 2020], randomly selecting 80% of the data for training and the remaining 20% for testing. We follow the training setup in [Yang et al., 2018a] with 80% of the data for training and 20% for testing. We use 27k images for our downstream task following [Wei et al., 2020], with 22k training images and 5k testing images. We follow the setting from [Wei et al., 2020] and [Panda et al., 2018] with 80% of images for training and 20% for testing. We follow the original split provided by the authors with 70% of images for training, 10% for validation, and 20% for testing. (A minimal split sketch is given after this table.) |
| Hardware Specification | Yes | We pre-train our model with 8 NVIDIA V100 GPUs and a batch size of 384 image-text pairs. |
| Software Dependencies | No | The paper mentions software components such as the BERT base model (bert-base-uncased), Mask-RCNN, and AdamW, but it does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We pre-train our model with 8 NVIDIA V100 GPUs and a batch size of 384 image-text pairs. The learning rate is set to 2e-5 and we train for 30 epochs with AdamW. We set a maximum token sequence length (including both visual features and words) of 100. (See the configuration sketch after this table.) |
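
The 80%/20% (and 70/10/20) random splits quoted in the Dataset Splits row are straightforward to reproduce. Below is a minimal sketch, assuming the data is available as a plain list of (image path, label) pairs and that a fixed random seed is acceptable; the helper name `split_dataset` and the example file names are hypothetical and not taken from the paper.

```python
# Minimal sketch of an 80/20 random split (hypothetical helper, not the authors' released code).
import random

def split_dataset(image_label_pairs, train_frac=0.8, seed=0):
    """Shuffle (image_path, label) pairs and split them into train/test lists."""
    rng = random.Random(seed)            # fixed seed, assumed here for repeatability
    items = list(image_label_pairs)
    rng.shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

# Usage: 80% training / 20% testing, as reported for the emotion datasets.
pairs = [("img_0001.jpg", "joy"), ("img_0002.jpg", "fear"), ("img_0003.jpg", "sadness")]
train_set, test_set = split_dataset(pairs, train_frac=0.8)
```

A 70/10/20 split (as reported for EMOTIC) would apply the same shuffle with two cut points instead of one.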
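
The hyperparameters quoted in the Hardware Specification and Experiment Setup rows (bert-base-uncased text backbone, AdamW, learning rate 2e-5, 30 epochs, batch size of 384 image-text pairs, maximum token sequence length 100) could be wired together roughly as below. This is a sketch under stated assumptions only: the linear classifier head, the text-only encoding, and the `training_step` helper are illustrative, and the paper's multimodal fusion of Mask-RCNN region features with word tokens and its 8-GPU data parallelism are not reproduced here.

```python
# Sketch of the reported fine-tuning configuration (assumptions noted inline).
import torch
from torch.optim import AdamW
from transformers import BertModel, BertTokenizerFast

MAX_LEN = 100        # maximum token sequence length (visual features + words), per the paper
BATCH_SIZE = 384     # image-text pairs per batch, reportedly spread over 8 V100 GPUs
LR = 2e-5
EPOCHS = 30

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(encoder.config.hidden_size, 8)   # assumed head, e.g. 8 emotion classes

optimizer = AdamW(list(encoder.parameters()) + list(classifier.parameters()), lr=LR)

def training_step(texts, labels):
    """One supervised step on a batch of (text, emotion label) pairs (text-only simplification)."""
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=MAX_LEN, return_tensors="pt")
    hidden = encoder(**enc).pooler_output                      # [batch, hidden]
    loss = torch.nn.functional.cross_entropy(classifier(hidden), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the reported batch of 384 pairs would be sharded across the 8 GPUs (48 per device) with standard data parallelism.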