DDAE: Towards Deep Dynamic Vision BERT Pretraining
Authors: Honghao Chen, Xiangwen Kong, Xiangyu Zhang, Xin Zhao, Kaiqi Huang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across a variety of vision tasks including ImageNet classification, semantic segmentation on ADE20K and object detection on COCO demonstrate the effectiveness of our approach. |
| Researcher Affiliation | Collaboration | Honghao Chen1,2*, Xiangwen Kong3, Xiangyu Zhang3, Xin Zhao1,2, Kaiqi Huang1,2,4 1CRISE, Institute of Automation, Chinese Academy of Sciences 2School of Artificial Intelligence, University of Chinese Academy of Sciences 3MEGVII Technology 4CAS Center for Excellence in Brain Science and Intelligence Technology chenhonghao2021@ia.ac.cn, {kongxiangwen, zhangxiangyu}@megvii.com, {xzhao, kaiqi.huang}@nlpr.ac.cn |
| Pseudocode | Yes | Algorithm 1: Pseudocode of Dynamic Loss |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | We conduct experiments on ImageNet-1K without labels as the pretraining data for self-supervised learning. The input resolution is set as 224×224 during pretraining and partitioned into 16×16 size patches... semantic segmentation on ADE20K and object detection on COCO. |
| Dataset Splits | No | The paper mentions using ImageNet-1K, ADE20K, and COCO datasets but does not explicitly provide the training/validation/test splits or a detailed splitting methodology for its experiments, nor does it refer to specific predefined splits with citations. |
| Hardware Specification | Yes | We compare pre-training time cost and ImageNet top-1 accuracy on ViT-Base using NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper does not specify software dependencies with version numbers (e.g., 'PyTorch 1.9' or 'CUDA 11.1'). |
| Experiment Setup | Yes | The input resolution is set as 224×224 during pretraining and partitioned into 16×16 size patches. ... We use block-wise masking with a ratio of 75%. The data augmentation is only standard random cropping and horizontal flipping. All β are initialized as 0.5 by default. ... m ∈ [0, 1) is a momentum coefficient and set to 0.9999 by default. ... λ are the scale factors to tune the L2 regularization term and are set to 0.1 by default. |
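The Dynamic Loss hyper-parameters quoted above (β initialized to 0.5, momentum m = 0.9999, L2 scale λ = 0.1) can be pieced together into a rough sketch. This is a hypothetical reconstruction for illustration only, not the authors' Algorithm 1: the `DynamicLoss` class, its EMA target, and its update rule are all assumptions inferred from the quoted defaults.

```python
class DynamicLoss:
    """Hypothetical sketch of an EMA-weighted multi-term loss.

    Assumptions (not from the paper's released code): each supervised
    layer contributes one loss term; per-term weights beta start at 0.5,
    are smoothed with momentum m = 0.9999, and an L2 regularizer scaled
    by lam = 0.1 is added to the weighted sum.
    """

    def __init__(self, num_terms, beta_init=0.5, momentum=0.9999, lam=0.1):
        self.beta = [beta_init] * num_terms  # all β initialized as 0.5
        self.m = momentum                    # momentum coefficient m in [0, 1)
        self.lam = lam                       # scale factor for L2 regularization

    def __call__(self, losses):
        # losses: one scalar loss per term (e.g. per decoded layer).
        total = sum(losses)
        # EMA update of β toward the current normalized loss shares
        # (an assumed target; the paper's actual target may differ).
        targets = [l / total for l in losses]
        self.beta = [self.m * b + (1.0 - self.m) * t
                     for b, t in zip(self.beta, targets)]
        weighted = sum(b * l for b, l in zip(self.beta, losses))
        reg = self.lam * sum(b * b for b in self.beta)  # L2 term, scaled by λ
        return weighted + reg
```

With two equal losses of 1.0 the β stay at 0.5, so the first call returns 0.5 + 0.5 plus the regularizer 0.1 × (0.25 + 0.25) = 1.05.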