EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering

Authors: Junjue Wang, Zhuo Zheng, Zihang Chen, Ailong Ma, Yanfei Zhong

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that SOBA outperforms both advanced general and remote sensing methods. We believe this dataset and framework provide a strong benchmark for Earth vision's complex analysis.
Researcher Affiliation | Academia | Junjue Wang (1), Zhuo Zheng (2), Zihang Chen (1), Ailong Ma (1), Yanfei Zhong (1); (1) LIESMARS, Wuhan University, 430074, China; (2) Department of Computer Science, Stanford University, Stanford, CA 94305, USA
Pseudocode | No | The paper describes the proposed framework and loss function with mathematical formulations but does not contain any blocks explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | The project page is at https://Junjue-Wang.github.io/homepage/EarthVQA.
Open Datasets | Yes | The EarthVQA dataset was extended from the LoveDA dataset (Wang et al. 2021), which encompasses 18 urban and rural regions from Nanjing, Changzhou, and Wuhan.
Dataset Splits | Yes | Following the balanced division (Wang et al. 2021), the train set includes 2522 images with 88166 QAs, the val set includes 1669 images with 57202 QAs, and the test set includes 1809 images with 63225 QAs (summarized in the split sketch after this table).
Hardware Specification | Yes | All experiments were performed under the PyTorch framework using one RTX 3090 GPU.
Software Dependencies | No | The paper mentions the 'PyTorch framework' but does not specify a version number or list other software dependencies with their versions.
Experiment Setup | Yes | All VQA models were trained for 40k steps with a batch size of 16. We set the two-layer LSTM with a hidden size of 384 and ResNet50 as the defaults. ... We used the Adam solver with β1 = 0.9 and β2 = 0.999. The initial learning rate was set to 5e−5, and a poly schedule with a power of 0.9 was applied. The hidden size of the language and image features was dm = 384. The number of heads M is set to 8, and the numbers of layers in the self- and cross-attention modules are NE = ND = 3. We set α = 1 and γ = 0.5 for the ND loss.
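The split sizes reported in the Dataset Splits row can be recorded and sanity-checked with a minimal Python sketch; the dictionary name and layout below are illustrative and are not taken from the official EarthVQA release.

```python
# Reported EarthVQA splits (balanced division following Wang et al. 2021).
# EARTHVQA_SPLITS is an illustrative name, not part of the released code.
EARTHVQA_SPLITS = {
    "train": {"images": 2522, "qa_pairs": 88166},
    "val":   {"images": 1669, "qa_pairs": 57202},
    "test":  {"images": 1809, "qa_pairs": 63225},
}

total_images = sum(s["images"] for s in EARTHVQA_SPLITS.values())
total_qas = sum(s["qa_pairs"] for s in EARTHVQA_SPLITS.values())
print(f"{total_images} images, {total_qas} QA pairs")  # 6000 images, 208593 QA pairs
```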
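The Experiment Setup row gives concrete optimizer and schedule values (40k steps, batch size 16, Adam with β1 = 0.9 and β2 = 0.999, initial learning rate 5e−5, poly decay with power 0.9). The sketch below shows one plausible way to wire these up in PyTorch; the placeholder model and the use of LambdaLR for the poly schedule are assumptions, since the paper does not describe its implementation details.

```python
import torch
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

# Hyperparameters as reported: 40k steps, batch size 16, Adam(0.9, 0.999),
# initial lr 5e-5, poly learning-rate decay with power 0.9.
TOTAL_STEPS = 40_000
BATCH_SIZE = 16
BASE_LR = 5e-5
POLY_POWER = 0.9
HIDDEN = 384  # d_m, the hidden size of the language and image features

# Stand-in module: the actual SOBA framework (ResNet50 + two-layer LSTM +
# self-/cross-attention with M = 8 heads and N_E = N_D = 3 layers) is not reproduced here.
model = nn.Linear(HIDDEN, HIDDEN)

device = "cuda" if torch.cuda.is_available() else "cpu"  # the paper used one RTX 3090
model.to(device)

optimizer = Adam(model.parameters(), lr=BASE_LR, betas=(0.9, 0.999))

# Poly schedule: lr_t = base_lr * (1 - t / total_steps) ** power
scheduler = LambdaLR(
    optimizer,
    lr_lambda=lambda step: (1.0 - step / TOTAL_STEPS) ** POLY_POWER,
)

for step in range(TOTAL_STEPS):
    features = torch.randn(BATCH_SIZE, HIDDEN, device=device)  # dummy batch stand-in
    loss = model(features).pow(2).mean()                       # dummy objective stand-in
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```

With this lambda, the learning rate decays smoothly from 5e−5 at step 0 toward 0 at step 40k, matching the standard poly schedule described in the setup quote.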