BM-NAS: Bilevel Multimodal Neural Architecture Search
Authors: Yihang Yin, Siyu Huang, Xiang Zhang
AAAI 2022, pp. 8901-8909
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on three multimodal tasks demonstrate the effectiveness and efficiency of the proposed BM-NAS framework. BM-NAS achieves competitive performances with much less search time and fewer model parameters in comparison with the existing generalized multimodal NAS methods. Our code is available at https://github.com/Somedaywilldo/BM-NAS. |
| Researcher Affiliation | Academia | Yihang Yin1, Siyu Huang2, Xiang Zhang3 1Nanyang Technological University 2Harvard University 3The Pennsylvania State University |
| Pseudocode | Yes | Algorithm 1: Bilevel Multimodal NAS (BM-NAS). Result: the genotype of fusion networks. Initialize architecture parameters α, β, γ and model parameters w; initialize genotype based on α, β, γ, set genotype_best = genotype; construct hypernet based on genotype_best; while L not converged do: update w on training set; update (α, β, γ) on validation set; derive upper-level genotype based on α, derive lower-level genotype based on β, γ; update hypernet based on genotype; if higher validation accuracy is reached then update genotype_best using genotype; end; end; return genotype_best. (A minimal runnable sketch of this search loop appears after the table.) |
| Open Source Code | Yes | Our code is available at https://github.com/Somedaywilldo/BM-NAS. |
| Open Datasets | Yes | MM-IMDB dataset (Arevalo et al. 2017) is a multi-modal dataset collected from the Internet Movie Database, containing posters, plots, genres and other meta information of 25,959 movies. The NTU RGB-D dataset (Shahroudy et al. 2016) is a large scale multimodal action recognition dataset, containing a total of 56,880 samples with 40 subjects, 80 view points, and 60 classes of daily activities. The Ego Gesture dataset (Zhang et al. 2018) is a large scale multimodal gesture recognition dataset, containing 24,161 gesture samples of 83 classes collected from 50 distinct subjects and 6 different scenes. |
| Dataset Splits | Yes | We adopt the original split of the dataset where 15,552 movies are used for training, 2,608 for validation and 7,799 for testing. (MM-IMDB) In detail, we use subjects 1, 4, 8, 13, 15, 17, 19 for training, 2, 5, 9, 14 for validation, and the rest for test. There are 23,760, 2,519, and 16,558 samples in the training, validation, and test dataset, respectively. (NTU RGB-D) There are 14,416 samples for training, 4,768 for validation, and 4,977 for testing. (Ego Gesture) (A small helper encoding the NTU RGB-D subject split appears after the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. It only mentions 'GPU hours' as a cost metric. |
| Software Dependencies | No | The paper does not provide specific software dependency versions (e.g., library names with version numbers like PyTorch 1.9 or CUDA 11.1) needed to replicate the experiment. |
| Experiment Setup | Yes | For BM-NAS, we adopt a setting of 2 fusion Cells and 1 step/Cell. For inner step representations, we set C = 192, L = 16. (MM-IMDB section) For BM-NAS, we use 2 fusion Cells and 2 Steps/Cell. For inner step representations we set C = 128, L = 8. (NTU RGB-D section) For our BM-NAS, we use 2 fusion Cells and 3 steps/Cell, for inner step representations we set C = 128, L = 8. (Ego Gesture section) (These settings are gathered into a single dictionary after the table.) |
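
The search procedure in Algorithm 1 is a DARTS-style alternating (bilevel) optimization: model weights w are updated on the training split, the architecture parameters (α, β, γ) on the validation split, and the best derived genotype is kept. Below is a minimal, self-contained PyTorch sketch of that loop on a toy two-modality supernet. All module names, shapes, and hyperparameters here are placeholders of mine, not the authors' fusion-cell implementation, and the per-iteration hypernet re-construction step from Algorithm 1 is omitted.

```python
# Minimal sketch of the alternating bilevel search in Algorithm 1 (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


class ToyFusionSupernet(nn.Module):
    """Toy two-level supernet: beta/gamma softly select the two step inputs (lower level),
    alpha softly selects the fusion primitive (upper level)."""

    def __init__(self, dim=32, num_classes=10):
        super().__init__()
        self.proj1 = nn.Linear(dim, dim)  # stand-in feature extractor, modality 1
        self.proj2 = nn.Linear(dim, dim)  # stand-in feature extractor, modality 2
        self.fuse_ops = nn.ModuleList([   # candidate fusion primitives (upper level)
            nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU()),
            nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh()),
        ])
        self.head = nn.Linear(dim, num_classes)
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(self.fuse_ops)))  # op choice
        self.beta = nn.Parameter(1e-3 * torch.randn(2))                    # input A choice
        self.gamma = nn.Parameter(1e-3 * torch.randn(2))                   # input B choice

    def arch_parameters(self):
        return [self.alpha, self.beta, self.gamma]

    def model_parameters(self):
        arch_ids = {id(p) for p in self.arch_parameters()}
        return [p for p in self.parameters() if id(p) not in arch_ids]

    def forward(self, x1, x2):
        feats = [self.proj1(x1), self.proj2(x2)]
        a = F.softmax(self.alpha, dim=0)
        b = F.softmax(self.beta, dim=0)
        g = F.softmax(self.gamma, dim=0)
        in_a = b[0] * feats[0] + b[1] * feats[1]  # soft lower-level input selection
        in_b = g[0] * feats[0] + g[1] * feats[1]
        pair = torch.cat([in_a, in_b], dim=-1)
        fused = sum(ai * op(pair) for ai, op in zip(a, self.fuse_ops))  # soft op mixture
        return self.head(fused)

    def genotype(self):
        # Discretize the architecture: keep the argmax choice at each level.
        return {"step_inputs": (int(self.beta.argmax()), int(self.gamma.argmax())),
                "fusion_op": int(self.alpha.argmax())}


def search(model, train_loader, val_loader, epochs=3):
    """Alternate w updates (training split) and alpha/beta/gamma updates (validation
    split), keeping the genotype with the best validation accuracy, as in Algorithm 1."""
    w_opt = torch.optim.SGD(model.model_parameters(), lr=0.1, momentum=0.9)
    arch_opt = torch.optim.Adam(model.arch_parameters(), lr=3e-4)
    best_acc, best_genotype = -1.0, model.genotype()
    for _ in range(epochs):
        for (x1t, x2t, yt), (x1v, x2v, yv) in zip(train_loader, val_loader):
            w_opt.zero_grad()
            F.cross_entropy(model(x1t, x2t), yt).backward()  # update w on training set
            w_opt.step()
            arch_opt.zero_grad()
            F.cross_entropy(model(x1v, x2v), yv).backward()  # update (a, b, g) on val set
            arch_opt.step()
        with torch.no_grad():
            correct = sum((model(x1, x2).argmax(1) == y).sum().item()
                          for x1, x2, y in val_loader)
            total = sum(y.numel() for _, _, y in val_loader)
        acc = correct / max(total, 1)
        if acc > best_acc:  # keep the best genotype seen so far
            best_acc, best_genotype = acc, model.genotype()
    return best_genotype


if __name__ == "__main__":
    # Synthetic two-modality data, illustrative only.
    x1, x2 = torch.randn(256, 32), torch.randn(256, 32)
    y = torch.randint(0, 10, (256,))
    loader = DataLoader(TensorDataset(x1, x2, y), batch_size=32, shuffle=True)
    print(search(ToyFusionSupernet(), loader, loader))
```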
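
The NTU RGB-D split in the Dataset Splits row is defined by subject ids rather than file lists, so it can be written down as a tiny helper. This only transcribes the quoted subject assignment; the constant and function names are hypothetical, not from the authors' repository.

```python
# Subject-based NTU RGB-D split as quoted in the Dataset Splits row.
# Names are my own, not from the authors' code.
NTU_TRAIN_SUBJECTS = {1, 4, 8, 13, 15, 17, 19}
NTU_VAL_SUBJECTS = {2, 5, 9, 14}

def ntu_split(subject_id: int) -> str:
    """Map a subject id to its split; every remaining subject goes to the test set."""
    if subject_id in NTU_TRAIN_SUBJECTS:
        return "train"
    if subject_id in NTU_VAL_SUBJECTS:
        return "val"
    return "test"
```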
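
For quick reference, the search-space settings quoted in the Experiment Setup row can be collected into one dictionary. Only the numeric values come from the paper; the key names are illustrative.

```python
# Per-dataset BM-NAS settings from the Experiment Setup row (values from the paper).
BM_NAS_SETTINGS = {
    "MM-IMDB":     {"fusion_cells": 2, "steps_per_cell": 1, "C": 192, "L": 16},
    "NTU RGB-D":   {"fusion_cells": 2, "steps_per_cell": 2, "C": 128, "L": 8},
    "Ego Gesture": {"fusion_cells": 2, "steps_per_cell": 3, "C": 128, "L": 8},
}
```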