EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering

Authors: Junjue Wang, Zhuo Zheng, Zihang Chen, Ailong Ma, Yanfei Zhong

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that SOBA outperforms both advanced general and remote sensing methods. We believe this dataset and framework provide a strong benchmark for Earth vision's complex analysis.
Researcher Affiliation | Academia | Junjue Wang (1), Zhuo Zheng (2), Zihang Chen (1), Ailong Ma (1), Yanfei Zhong (1); (1) LIESMARS, Wuhan University, 430074, China; (2) Department of Computer Science, Stanford University, Stanford, CA 94305, USA
Pseudocode | No | The paper describes the proposed framework and loss function with mathematical formulations but does not contain any blocks explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | The project page is at https://Junjue-Wang.github.io/homepage/EarthVQA.
Open Datasets | Yes | The EarthVQA dataset was extended from the LoveDA dataset (Wang et al. 2021), which encompasses 18 urban and rural regions from Nanjing, Changzhou, and Wuhan.
Dataset Splits | Yes | Following the balanced division (Wang et al. 2021), the train set includes 2522 images with 88166 QAs, the val set includes 1669 images with 57202 QAs, and the test set includes 1809 images with 63225 QAs (summarized in the split sketch after this table).
Hardware Specification | Yes | All experiments were performed under the PyTorch framework using one RTX 3090 GPU.
Software Dependencies | No | The paper mentions the 'PyTorch framework' but does not specify a version number or list other software dependencies with their versions.
Experiment Setup | Yes | All VQA models were trained for 40k steps with a batch size of 16. We set the two-layer LSTM with a hidden size of 384 and ResNet50 as the defaults. ... We used the Adam solver with β1 = 0.9 and β2 = 0.999. The initial learning rate was set to 5e−5, and a poly schedule with a power of 0.9 was applied. The hidden size of the language and image features was dm = 384. The number of heads M is set to 8, and the numbers of layers in the self- and cross-attention modules are NE = ND = 3. We set α = 1 and γ = 0.5 for the ND loss.
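The split sizes reported in the Dataset Splits row can be recorded and sanity-checked with a minimal Python sketch; the dictionary name and layout below are illustrative and are not taken from the official EarthVQA release.

```python
# Reported EarthVQA splits (balanced division following Wang et al. 2021).
# EARTHVQA_SPLITS is an illustrative name, not part of the released code.
EARTHVQA_SPLITS = {
    "train": {"images": 2522, "qa_pairs": 88166},
    "val":   {"images": 1669, "qa_pairs": 57202},
    "test":  {"images": 1809, "qa_pairs": 63225},
}

total_images = sum(s["images"] for s in EARTHVQA_SPLITS.values())
total_qas = sum(s["qa_pairs"] for s in EARTHVQA_SPLITS.values())
print(f"{total_images} images, {total_qas} QA pairs")  # 6000 images, 208593 QA pairs
```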
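The Experiment Setup row gives concrete optimizer and schedule values (40k steps, batch size 16, Adam with β1 = 0.9 and β2 = 0.999, initial learning rate 5e−5, poly decay with power 0.9). The sketch below shows one plausible way to wire these up in PyTorch; the placeholder model and the use of LambdaLR for the poly schedule are assumptions, since the paper does not describe its implementation details.

```python
import torch
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

# Hyperparameters as reported: 40k steps, batch size 16, Adam(0.9, 0.999),
# initial lr 5e-5, poly learning-rate decay with power 0.9.
TOTAL_STEPS = 40_000
BATCH_SIZE = 16
BASE_LR = 5e-5
POLY_POWER = 0.9
HIDDEN = 384  # d_m, the hidden size of the language and image features

# Stand-in module: the actual SOBA framework (ResNet50 + two-layer LSTM +
# self-/cross-attention with M = 8 heads and N_E = N_D = 3 layers) is not reproduced here.
model = nn.Linear(HIDDEN, HIDDEN)

device = "cuda" if torch.cuda.is_available() else "cpu"  # the paper used one RTX 3090
model.to(device)

optimizer = Adam(model.parameters(), lr=BASE_LR, betas=(0.9, 0.999))

# Poly schedule: lr_t = base_lr * (1 - t / total_steps) ** power
scheduler = LambdaLR(
    optimizer,
    lr_lambda=lambda step: (1.0 - step / TOTAL_STEPS) ** POLY_POWER,
)

for step in range(TOTAL_STEPS):
    features = torch.randn(BATCH_SIZE, HIDDEN, device=device)  # dummy batch stand-in
    loss = model(features).pow(2).mean()                       # dummy objective stand-in
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```

With this lambda, the learning rate decays smoothly from 5e−5 at step 0 toward 0 at step 40k, matching the standard poly schedule described in the setup quote.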