DDAE: Towards Deep Dynamic Vision BERT Pretraining

Authors: Honghao Chen, Xiangwen Kong, Xiangyu Zhang, Xin Zhao, Kaiqi Huang

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across a variety of vision tasks including ImageNet classification, semantic segmentation on ADE20K and object detection on COCO demonstrate the effectiveness of our approach.
Researcher Affiliation | Collaboration | Honghao Chen (1,2,*), Xiangwen Kong (3), Xiangyu Zhang (3), Xin Zhao (1,2), Kaiqi Huang (1,2,4); 1 CRISE, Institute of Automation, Chinese Academy of Sciences; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences; 3 MEGVII Technology; 4 CAS Center for Excellence in Brain Science and Intelligence Technology; chenhonghao2021@ia.ac.cn, {kongxiangwen, zhangxiangyu}@megvii.com, {xzhao, kaiqi.huang}@nlpr.ac.cn
Pseudocode | Yes | Algorithm 1: Pseudocode of Dynamic Loss (a hedged sketch of such a loss appears after the table)
Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets | Yes | We conduct experiments on ImageNet-1K without labels as the pretraining data for self-supervised learning. The input resolution is set as 224×224 during pretraining and partitioned into 16×16 size patches... semantic segmentation on ADE20K and object detection on COCO.
Dataset Splits | No | The paper mentions using ImageNet-1K, ADE20K, and COCO datasets but does not explicitly provide the training/validation/test splits or a detailed splitting methodology for its experiments, nor does it refer to specific predefined splits with citations.
Hardware Specification | Yes | We compare pre-training time cost and ImageNet top-1 accuracy on ViT-Base using NVIDIA V100 GPUs.
Software Dependencies | No | The paper does not specify software dependencies with version numbers (e.g., 'PyTorch 1.9' or 'CUDA 11.1').
Experiment Setup | Yes | The input resolution is set as 224×224 during pretraining and partitioned into 16×16 size patches. ... We use block-wise masking with a ratio of 75%. The data augmentation is only standard random cropping and horizontal flipping. All β are initialized as 0.5 by default. ... m ∈ [0, 1) is a momentum coefficient and is set to 0.9999 by default. ... λ are the scale factors to tune the L2 regularization term and are set to 0.1 by default. (Hedged sketches of the input pipeline and the dynamic loss follow the table.)
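
For concreteness, below is a minimal sketch of the pretraining input pipeline described in the excerpts above: 224×224 inputs partitioned into 16×16 patches (a 14×14 grid), block-wise masking at a 75% ratio, and only random cropping plus horizontal flipping as augmentation. It uses standard PyTorch/torchvision calls; the crop-scale range and the block-size/aspect-ratio bounds in the masking loop are assumptions, since the excerpt does not specify them.

```python
# Minimal sketch of the pretraining input pipeline, assuming standard
# torchvision transforms and a BEiT-style block-wise masking loop.
import math
import random

import torch
from torchvision import transforms

IMG_SIZE, PATCH_SIZE = 224, 16
GRID = IMG_SIZE // PATCH_SIZE      # 14x14 grid of patch tokens
MASK_RATIO = 0.75                  # 75% of patches are masked

# "Only standard random cropping and horizontal flipping" (plus tensor conversion).
augment = transforms.Compose([
    transforms.RandomResizedCrop(IMG_SIZE, scale=(0.2, 1.0)),  # crop scale is an assumption
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def blockwise_mask(grid: int = GRID, ratio: float = MASK_RATIO) -> torch.Tensor:
    """Mask rectangular blocks of patches until `ratio` of the grid is covered."""
    mask = torch.zeros(grid, grid, dtype=torch.bool)
    target = int(ratio * grid * grid)
    while mask.sum() < target:
        # Sample a block area and aspect ratio (bounds are assumptions).
        area = random.uniform(16, target - mask.sum().item() + 16)
        aspect = math.exp(random.uniform(math.log(0.3), math.log(1 / 0.3)))
        h = min(grid, max(1, round(math.sqrt(area * aspect))))
        w = min(grid, max(1, round(math.sqrt(area / aspect))))
        top = random.randint(0, grid - h)
        left = random.randint(0, grid - w)
        mask[top:top + h, left:left + w] = True
    return mask.flatten()  # boolean mask over the 196 patch positions
```

The loop may slightly overshoot the 75% target because whole blocks are masked at once; published block-wise maskers behave the same way.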
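
The excerpt cites Algorithm 1 (Pseudocode of Dynamic Loss) but does not reproduce it, so the following is a speculative sketch of one way a dynamic, depth-weighted loss could combine the quoted defaults: per-layer weights β initialized to 0.5, an exponential-moving-average update with momentum m = 0.9999, and an L2 regularization term scaled by λ = 0.1. The class name, the normalization of the per-layer losses, and the exact update rule are all assumptions, not the paper's method.

```python
# Speculative sketch of a dynamic, depth-weighted loss using the quoted
# defaults (beta init 0.5, EMA momentum m = 0.9999, L2 scale lambda = 0.1).
# The update rule below is an assumption; see Algorithm 1 in the paper.
import torch

class DynamicLoss:
    def __init__(self, num_layers: int, m: float = 0.9999, lam: float = 0.1):
        self.beta = torch.full((num_layers,), 0.5)  # per-layer weights, all init to 0.5
        self.m = m                                  # momentum coefficient, m in [0, 1)
        self.lam = lam                              # scale of the L2 regularization term

    def __call__(self, layer_losses: torch.Tensor) -> torch.Tensor:
        # EMA-update the weights from the (detached) relative loss magnitudes.
        with torch.no_grad():
            rel = layer_losses.detach() / layer_losses.detach().sum().clamp_min(1e-8)
            self.beta = self.m * self.beta + (1.0 - self.m) * rel
        # Weighted sum of per-layer losses plus an L2 penalty on the weights.
        return (self.beta * layer_losses).sum() + self.lam * (self.beta ** 2).sum()
```

A training step would then stack the reconstruction loss from each supervised depth and call `loss = dynamic_loss(per_layer_losses)` before backpropagating.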