Geo-R1


Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning

Reasoning Models and Few-shot Benchmarks for Remote Sensing Referring Expression Understanding Tasks.

Anonymous¹ Anonymous² Anonymous¹ Anonymous²
¹Anonymous Institution ²Anonymous Institution
[Teaser figure]

Geo-R1 method overview. Geo-R1 is trained on only a few labeled samples with reinforcement learning (e.g., GRPO) and identifies target objects (bounding boxes or masks) from an input image and a text query while also exposing its reasoning process.

Abstract

Referring expression understanding in remote sensing poses unique challenges, as it requires reasoning over complex object–context relationships. While supervised fine-tuning (SFT) of multimodal large language models achieves strong performance given massive labeled datasets, it struggles in data-scarce scenarios and generalizes poorly. To address this limitation, we propose Geo-R1, a reasoning-centric reinforcement fine-tuning (RFT) paradigm for few-shot geospatial referring. Geo-R1 requires the model to first generate explicit, interpretable reasoning chains that decompose the referring expression, and then leverage these rationales to localize target objects. This "reason first, then act" process enables the model to make more effective use of limited annotations, enhances generalization, and provides interpretability. We validate Geo-R1 on three carefully designed few-shot geospatial referring benchmarks, where our model consistently and substantially outperforms SFT baselines. It also demonstrates strong cross-dataset generalization, highlighting its robustness.
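
To make the "reason first, then act" protocol concrete, a hypothetical input/output pair is sketched below. The tag names and answer schema are our illustration of the idea, not necessarily the exact format Geo-R1 uses.

    Query:  "the small vehicle parked closest to the roundabout"
    Output: <think> The roundabout sits in the lower-left quadrant of the
            image. Several vehicles line the adjacent road; the nearest is
            a small car at the roundabout's east exit. </think>
            <answer> {"bbox": [312, 488, 356, 521]} </answer>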

Referring Expression Understanding

[Figure: Referring Expression Understanding]

An overview of the Referring Expression Understanding (REU) task.

Tasks such as Referring Expression Comprehension (REC), Referring Expression Segmentation (RES), Generalized REC (GREC), Generalized RES (GRES), Visual Grounding (VG), Open-Vocabulary Detection (OVD), and Open-Vocabulary Segmentation (OVS) can all be seen as specialized forms of REU, each with a different emphasis.
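
The variants differ mainly in what they return for a query. A contrived example (not data from the paper), using a Python-dict output convention of our own choosing:

    # Hypothetical outputs for one referring expression across REU variants.
    query = "all storage tanks west of the runway"

    rec_output  = {"bbox": [120, 40, 180, 96]}             # REC/VG: exactly one box
    res_output  = {"mask": "<RLE-encoded mask>"}           # RES: one segmentation mask
    grec_output = {"bboxes": [[120, 40, 180, 96],
                              [204, 52, 260, 110]]}        # GREC: zero, one, or many boxes
    gres_output = {"masks": []}                            # GRES: possibly no target at all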

Dataset

[Table: dataset configurations and statistics]

We do not partition the dataset into base and novel classes. Instead, we treat all classes as novel and provide only a few labeled examples per class. We construct instruction-following few-shot datasets for the FS-GREC and FS-GRES tasks by deriving them from the training sets of three widely used remote sensing benchmarks: VRSBench, NWPU VHR-10, and EarthReason. Configurations and statistics are summarized in the table above. For the OVD task, we select four classes on which the baseline model (Qwen2.5-VL-3B) demonstrates decent performance; for the other tasks, we use all categories from the training set. Each low-shot dataset is a subset of the corresponding high-shot dataset. To evaluate cross-dataset generalization, we further measure zero-shot performance on the DIOR-RSVG and RRSIS-D datasets.
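
A minimal sketch of how such a per-class few-shot split can be drawn, assuming each annotated sample carries a "category" field; the field name and seeding scheme are our assumptions, not the paper's released pipeline. With a fixed seed, the construction also yields the nesting property described above (the 1-shot split is contained in the 5- and 10-shot splits).

    import random
    from collections import defaultdict

    def build_few_shot_split(samples, k, seed=0):
        """Keep at most k labeled examples per class, treating every class as novel."""
        rng = random.Random(seed)
        by_class = defaultdict(list)
        for sample in samples:
            by_class[sample["category"]].append(sample)  # "category" is an assumed field
        subset = []
        for items in by_class.values():
            rng.shuffle(items)
            subset.extend(items[:k])  # same seed: the k=1 picks are a subset of the k=10 picks
        return subset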

Reward

Following DeepSeek-R1, the reward function of Geo-R1 combines a task-agnostic format reward with a task-specific metric reward.
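
A minimal sketch of such a composite reward for the box-grounding case, assuming the policy must emit <think>...</think><answer>...</answer> text with a JSON box inside the answer; the tags, parsing, and equal weighting are our assumptions, and the segmentation tasks would swap box IoU for a mask-based metric.

    import json
    import re

    def format_reward(completion: str) -> float:
        """Task-agnostic reward: 1 if the output follows the think/answer template."""
        pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
        return 1.0 if re.match(pattern, completion.strip(), re.DOTALL) else 0.0

    def iou(a, b):
        """IoU of two [x1, y1, x2, y2] boxes."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / (union + 1e-9)

    def metric_reward(completion: str, gt_box) -> float:
        """Task-specific reward: IoU between the predicted and ground-truth box."""
        m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        if m is None:
            return 0.0
        try:
            pred_box = json.loads(m.group(1))["bbox"]  # assumed answer schema
        except (ValueError, KeyError, TypeError):
            return 0.0
        return iou(pred_box, gt_box)

    def total_reward(completion: str, gt_box) -> float:
        return format_reward(completion) + metric_reward(completion, gt_box)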

[Figure 2]

Fig 2: Reward Design

Experimental Results

We evaluate our method on several public datasets and compare it with state-of-the-art approaches. As noted in the dataset section, each low-shot split is a subset of the corresponding many-shot split.

Qualitative Results

[Qualitative result 1]

Sample Fig 1: Demo for the GRES task.

[Qualitative result 2]

Sample Fig 2: Demo for the GREC task.

Quantitative Results (FS-REC)

[Table 2]

Table 2: Comparison with state-of-the-art methods on the VRSBench dataset.

Table 2 compares models trained on the full VRSBench dataset (Full Amount Fine-tune) against few-shot models (1/5/10-shot Fine-tune). The few-shot results include both SFT-based models and our RL-tuned models. Performance numbers for the full-data baselines (except Qwen2.5-VL) are taken from the original VRSBench paper. The results reveal a clear hierarchy: RL-based post-training consistently and significantly outperforms SFT across all settings and metrics. The advantage is substantial; in the 10-shot overall setting, for example, our GRPO-based model achieves an Acc@0.5 score 12.30% higher than its SFT counterpart. Remarkably, our 10-shot GRPO model, trained on only 260 samples (0.71% of the data), surpasses all evaluated models (except Qwen2.5-VL) trained on all 36,313 samples.
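
For reference, Acc@0.5 counts a prediction as correct when its IoU with the ground-truth box is at least 0.5; a one-function sketch, reusing the iou() helper from the reward sketch above:

    def acc_at_threshold(pred_boxes, gt_boxes, thr=0.5):
        """Fraction of predicted boxes whose IoU with the ground truth meets thr."""
        hits = sum(iou(p, g) >= thr for p, g in zip(pred_boxes, gt_boxes))
        return hits / max(len(gt_boxes), 1)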

Quantitative Results (Cross Dataset)

[Table 5]

Table 5: Cross-dataset evaluation.

We further assess the cross-dataset generalization of the SFT and GRPO approaches on the FS-GREC and FS-GRES tasks. For FS-GREC, we fine-tune models on the VRSBench dataset with limited supervision (1-, 5-, and 10-shot) and then evaluate them zero-shot on the DIOR-RSVG target dataset. As shown in Table 5, GRPO consistently outperforms SFT, with advantages of 4.92%, 6.05%, and 5.52% in the 1-shot, 5-shot, and 10-shot settings, respectively. Similarly, for FS-GRES, models are tuned on the EarthReason dataset (1-, 5-, and 10-shot) and tested on the RRSIS-D dataset. Here, the GRPO-based model (Geo-R1) improves markedly over the SFT-based model (SegEarth-R1) in the few-shot settings, with a relative improvement of up to 80%. These results highlight GRPO's strong cross-dataset generalization and indicate superior transferability and robustness for Geo-R1.
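
For clarity, the relative-improvement figure is computed as (GRPO - SFT) / SFT; the numbers in the snippet below are illustrative, not scores taken from Table 5.

    def relative_improvement(grpo_score, sft_score):
        """Relative gain of GRPO over SFT, in percent."""
        return (grpo_score - sft_score) / sft_score * 100.0

    # e.g. relative_improvement(0.36, 0.20) -> 80.0 (illustrative values only)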

Citation

@inproceedings{geor1_2025,
      title={Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning},
      author={Anonymous},
      booktitle={Conference Name},
      year={2025}
}