Box2Poly: Memory-Efficient Polygon Prediction of Arbitrarily Shaped and Rotated Text

Authors: Xuyang Chen, Dong Wang, Konrad Schindler, Mingwei Sun, Yongliang Wang, Nicolo Savioli, Liqiu Meng

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The datasets involved in the experiment are SynthText 150K, Total-Text, CTW1500, ICDAR19 MLT, and Inverse-Text.
Researcher Affiliation | Collaboration | Xuyang Chen (1,2), Dong Wang (1,*), Konrad Schindler (3), Mingwei Sun (1,4), Yongliang Wang (1), Nicolo Savioli (1), Liqiu Meng (2). Affiliations: 1 Riemann Lab, Huawei; 2 Technical University of Munich; 3 ETH Zurich; 4 Wuhan University.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/Albertchen98/Box2Poly.git.
Open Datasets | Yes | The datasets involved in the experiment are SynthText 150K, Total-Text, CTW1500, ICDAR19 MLT, and Inverse-Text. As implied by its name, SynthText 150K (Liu et al. 2020) collects 150k synthesized scene-text images, consisting of 94,723 images with multi-oriented straight text and 54,327 images with curved text. Total-Text (Ch'ng and Chan 2017) contains 1,255 training images and 300 test images with highly diversified orientations and curvatures.
Dataset Splits | Yes | Total-Text (Ch'ng and Chan 2017) contains 1,255 training images and 300 test images with highly diversified orientations and curvatures. CTW1500 is another curved-text dataset, comprising 1,000 training images and 500 test images.
Hardware Specification | Yes | All models are trained on 8 NVIDIA RTX 3090 GPUs.
Software Dependencies | No | The paper references various models and frameworks (e.g., ResNet-50, Sparse R-CNN, DETR) and discusses their underlying principles, but does not provide version numbers for software dependencies such as programming languages, libraries, or compilers.
Experiment Setup | Yes | The batch size is set to 16 and all models are trained on 8 NVIDIA RTX 3090 GPUs. The final results on Total-Text and CTW1500 are reported with a training strategy similar to (Zhang et al. 2022; Ye et al. 2023a): first, the network is pretrained on a combined dataset for 180k iterations with a learning rate of 2.5 × 10^-5 that drops at the 144k and 162k steps, with a drop factor of 10. [...] The chosen proposal number N is set to 300, and for the Bezier curves the number of sampling points S is set to 8, yielding 16 vertices for each polygon proposal. The box head employs a multi-stage design with K = 3 layers, and the polygon head likewise uses a multi-stage structure with M = 3 layers.
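To make the polygon-proposal numbers in the last row concrete: sampling S = 8 points on a top and a bottom Bezier curve gives the 16 vertices per polygon proposal that the paper reports. The sketch below is a minimal NumPy illustration of that arithmetic, assuming cubic (degree-3) Bezier curves with four control points per side; the function names and control-point layout are hypothetical and not taken from the authors' code.

```python
import numpy as np

def sample_cubic_bezier(ctrl, s=8):
    """Sample s points along a cubic Bezier curve.
    ctrl: (4, 2) array of control points; returns an (s, 2) array."""
    t = np.linspace(0.0, 1.0, s)[:, None]   # (s, 1) curve parameters
    basis = np.hstack([(1 - t) ** 3,        # degree-3 Bernstein basis, (s, 4)
                       3 * t * (1 - t) ** 2,
                       3 * t ** 2 * (1 - t),
                       t ** 3])
    return basis @ ctrl                     # (s, 2) sampled points

def bezier_pair_to_polygon(top_ctrl, bottom_ctrl, s=8):
    """Turn a top/bottom Bezier pair into a closed polygon with 2*s vertices.
    The bottom curve is reversed so the vertices trace the contour in order."""
    top = sample_cubic_bezier(top_ctrl, s)
    bottom = sample_cubic_bezier(bottom_ctrl, s)
    return np.vstack([top, bottom[::-1]])   # (2*s, 2); 16 vertices for s = 8

if __name__ == "__main__":
    top = np.array([[0, 0], [2, 1], [4, 1], [6, 0]], dtype=float)
    bottom = np.array([[0, 2], [2, 3], [4, 3], [6, 2]], dtype=float)
    poly = bezier_pair_to_polygon(top, bottom, s=8)
    print(poly.shape)  # (16, 2): one 16-vertex polygon proposal
```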
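The quoted pretraining schedule (180k iterations, learning rate 2.5 × 10^-5, dropped by a factor of 10 at steps 144k and 162k) maps directly onto a standard step-based PyTorch scheduler. The sketch below only illustrates that mapping: the optimizer choice (AdamW) and the placeholder model are assumptions, since the quoted text does not name them.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Linear(256, 4)                   # placeholder, not Box2Poly
optimizer = AdamW(model.parameters(), lr=2.5e-5)  # optimizer choice assumed
# Drop the learning rate by a factor of 10 at iterations 144k and 162k.
scheduler = MultiStepLR(optimizer, milestones=[144_000, 162_000], gamma=0.1)

for step in range(180_000):
    # ... forward pass, loss, backward, and optimizer.step() elided ...
    optimizer.zero_grad()
    scheduler.step()  # stepped per iteration, not per epoch
```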