Deeply Tensor Compressed Transformers for End-to-End Object Detection
Authors: Peining Zhen, Ziyang Gao, Tianshu Hou, Yuan Cheng, Hai-Bao Chen
AAAI 2022, pp. 4716–4724 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on the COCO dataset to validate the effectiveness of our tensor-compressed (tensorized) DETR models. The experimental results show that we can attain 3.7× full model compression with 482× feed forward network (FFN) parameter reduction and only 0.6 points accuracy drop. (See the tensor-train parameter-count sketch after this table for how factorizations of this kind yield such FFN reductions.) |
| Researcher Affiliation | Academia | Shanghai Jiao Tong University, China {zhenpn, gaoziyang, houtianshu, cyuan328, haibaochen}@sjtu.edu.cn |
| Pseudocode | No | The paper describes its methods in detail through narrative text and diagrams, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not include any explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | COCO (Lin et al. 2014) is used to validate our proposed method, which contains 118k training images and 5k validation images. ImageNet-1k (Deng et al. 2009) is leveraged to calibrate and validate our quantization-compressed backbone. |
| Dataset Splits | Yes | COCO (Lin et al. 2014) is used to validate our proposed method, which contains 118k training images and 5k validation images. ImageNet-1k (Deng et al. 2009) is leveraged to calibrate and validate our quantization-compressed backbone. The dataset consists of 1.28M training images and 50k validation images from a total of 1000 semantic categories. |
| Hardware Specification | Yes | All models are trained on 4 NVIDIA GTX 1080Ti GPUs with 2 images per GPU. All the results of speed (FPS) in our experiments are measured under one NVIDIA GTX 1080Ti GPU. |
| Software Dependencies | No | The paper mentions that the baseline model is implemented based on 'mmdetection' and refers to 'PyTorch' in the context of implementation. However, it does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | All models are trained on 4 NVIDIA GTX 1080Ti GPUs with 2 images per GPU. We train the models using the AdamW optimizer for 150 epochs in total. The learning rates of the transformer encoder-decoder and the CNN backbone are initialized to 5 × 10⁻⁵ and 5 × 10⁻⁶ respectively. The weight decay is set to 0.0001. The learning rates are divided by 10 at epoch 100. The balancing parameter ρ for the penalty function is set to 0.1. For the transformer implementation, we use 6 encoder layers and 6 decoder layers with embedding dimension 256. Each encoder and decoder layer has 8 attention heads. We apply a simple data augmentation technique by resizing the input images so that the short side ranges from 480 to 800 pixels and the long side is at most 1333 pixels. In our experiments, µ and λ are set to -0.1 and 1.1 respectively. T is selected as 0.33. We randomly sample 300 images each from the ImageNet-1k and COCO training datasets as our calibration datasets. |
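
For readers who want to map the quoted experiment setup onto code, the following is a minimal PyTorch sketch of that training configuration. The module names (`backbone`, `transformer`), the placeholder CNN, and the bare training loop are illustrative assumptions of ours; the paper does not release code, so this is not the authors' implementation.

```python
# Minimal sketch of the quoted DETR training configuration, assuming a generic
# PyTorch model with "backbone" and "transformer" submodules (names are ours).
import torch
import torch.nn as nn


class DETRLike(nn.Module):
    """Toy stand-in for the described detector: 6 encoder / 6 decoder layers,
    embedding dimension 256, 8 attention heads per layer."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 256, kernel_size=7, stride=32)  # placeholder CNN
        self.transformer = nn.Transformer(
            d_model=256,
            nhead=8,
            num_encoder_layers=6,
            num_decoder_layers=6,
        )


model = DETRLike()

# Separate learning rates: 5e-5 for the transformer encoder-decoder,
# 5e-6 for the CNN backbone; weight decay 1e-4 for both groups.
backbone_params = [p for n, p in model.named_parameters() if n.startswith("backbone")]
other_params = [p for n, p in model.named_parameters() if not n.startswith("backbone")]
optimizer = torch.optim.AdamW(
    [
        {"params": other_params, "lr": 5e-5},
        {"params": backbone_params, "lr": 5e-6},
    ],
    weight_decay=1e-4,
)

# 150 epochs in total; learning rates divided by 10 at epoch 100.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100], gamma=0.1)

for epoch in range(150):
    # ... one training epoch over COCO, 2 images per GPU ...
    scheduler.step()
```

The two parameter groups reflect the quoted 10× lower learning rate for the CNN backbone relative to the transformer encoder-decoder; everything else (augmentation, the ρ, µ, λ, T hyperparameters) would sit inside the elided training loop.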
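
The headline FFN figure in the Research Type row comes from tensor factorization of the FFN weight matrices. As a self-contained illustration of why such factorizations can shrink an FFN by two to three orders of magnitude, the snippet below counts parameters for a hypothetical tensor-train factorization of a 256 × 2048 FFN weight; the factor shapes and TT ranks are assumptions chosen for illustration, not the settings reported in the paper.

```python
# Illustrative only: parameter count of a tensor-train (TT) factorization of one
# 256 -> 2048 FFN weight matrix. Factor shapes and ranks below are assumptions.
import numpy as np


def tt_param_count(in_factors, out_factors, ranks):
    """Parameters in TT cores G_i of shape (r_{i-1}, m_i, n_i, r_i)."""
    assert len(in_factors) == len(out_factors) == len(ranks) - 1
    return sum(
        ranks[i] * m * n * ranks[i + 1]
        for i, (m, n) in enumerate(zip(in_factors, out_factors))
    )


in_factors = (4, 8, 8)      # 4 * 8 * 8   = 256  (FFN input dimension)
out_factors = (8, 16, 16)   # 8 * 16 * 16 = 2048 (FFN hidden dimension)
ranks = (1, 4, 4, 1)        # boundary TT ranks are always 1

dense = int(np.prod(in_factors) * np.prod(out_factors))  # 524288 dense parameters
tt = tt_param_count(in_factors, out_factors, ranks)      # 2688 TT parameters
print(f"dense: {dense}, TT: {tt}, compression: {dense / tt:.1f}x")
```

With these assumed shapes the factorization already gives roughly a 195× reduction for a single FFN weight; larger dimensions or lower ranks push the ratio further, which is how reductions on the order of the paper's reported 482× become plausible.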