Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning
Reasoning Models and Few-Shot Benchmarks for Remote Sensing Referring Expression Understanding Tasks
Abstract
Referring expression understanding in remote sensing poses unique challenges, as it requires reasoning over complex object–context relationships. While supervised fine-tuning (SFT) of multimodal large language models achieves strong performance given massive labeled datasets, it struggles in data-scarce scenarios, leading to poor generalization. To address this limitation, we propose Geo-R1, a reasoning-centric reinforcement fine-tuning (RFT) paradigm for few-shot geospatial referring. Geo-R1 requires the model to first generate explicit, interpretable reasoning chains that decompose the referring expression, and then leverage these rationales to localize target objects. This "reason first, then act" process enables the model to make more effective use of limited annotations, enhances generalization, and provides interpretability. We validate Geo-R1 on three carefully designed few-shot geospatial referring benchmarks, where our model consistently and substantially outperforms SFT baselines. It also demonstrates strong cross-dataset generalization, highlighting its robustness.
Referring Expression Understanding

Fig 1: An overview of the Referring Expression Understanding (REU) task.
Tasks such as Referring Expression Comprehension (REC), Referring Expression Segmentation (RES), Generalized REC (GREC), Generalized RES (GRES), Visual Grounding (VG), Open-Vocabulary Detection (OVD), and Open-Vocabulary Segmentation (OVS) can all be seen as specialized forms of REU, each with a different emphasis.
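To make the distinction concrete, here is a minimal sketch of how the expected answers differ across two of these variants. The field names and schema are our own illustration, not fixed by any of the benchmarks:

```python
# Hypothetical output schemas for two REU variants; key names are
# illustrative assumptions, not taken from any specific benchmark.

# REC: the expression refers to exactly one object -> a single box.
rec_answer = {"bbox": [412, 103, 488, 176]}  # [x1, y1, x2, y2] in pixels

# GREC: the expression may match zero, one, or several objects.
grec_answer = {
    "bboxes": [
        [412, 103, 488, 176],
        [510, 98, 579, 169],
    ]  # an empty list encodes "no matching target"
}
```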
Dataset

We do not partition the dataset into base and novel classes. Instead, we treat all classes as novel and provide only a few labeled examples per class. We construct instruction-following few-shot datasets for the FS-GREC and FS-GRES tasks by deriving them from the training sets of three widely used remote sensing benchmarks: VRSBench, NWPU VHR-10, and EarthReason. Configurations and statistics are summarized in the table above. For the OVD task, we select four classes on which the baseline model (Qwen2.5-VL-3B) demonstrates decent performance; for the other tasks, we use all categories from the training set. The low-shot dataset is a subset of the high-shot dataset. To assess cross-dataset generalization, we further evaluate zero-shot performance on the DIOR-RSVG and RRSIS-D datasets.
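One simple way to obtain nested K-shot splits, where the low-shot set is guaranteed to be a subset of the high-shot set, is to draw prefixes of a single per-class shuffle. This is a minimal sketch of such a construction under our own assumptions, not necessarily the exact procedure used to build the datasets:

```python
import random
from collections import defaultdict

def nested_few_shot_splits(samples, shots=(1, 5, 10), seed=0):
    """Build K-shot subsets per class such that smaller splits are
    contained in larger ones (low-shot is a subset of high-shot).

    `samples` is a list of dicts with at least a "class" key; this is
    an illustrative construction, not the paper's exact recipe.
    """
    by_class = defaultdict(list)
    for s in samples:
        by_class[s["class"]].append(s)

    rng = random.Random(seed)
    for group in by_class.values():
        rng.shuffle(group)  # one fixed order per class => nested prefixes

    # Taking the first k items of each shuffled class makes the 1-shot
    # split a prefix (hence a subset) of the 5- and 10-shot splits.
    return {
        k: [s for group in by_class.values() for s in group[:k]]
        for k in shots
    }
```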
Reward Design
Following DeepSeek-R1, the reward function of Geo-R1 combines a task-agnostic format reward with a task-specific metric reward.

Fig 2: Reward Design
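As a minimal sketch of this two-part reward: the format reward follows the DeepSeek-R1 `<think>`/`<answer>` template convention, and we use box IoU as a stand-in for the task-specific metric. The tag names, weights, and exact metric are our assumptions, not a definitive implementation:

```python
import re

# Output must be a <think>...</think> block followed by <answer>...</answer>.
THINK_ANSWER = re.compile(
    r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL
)

def format_reward(completion: str) -> float:
    """Task-agnostic reward: 1 if the output follows the
    reason-first-then-act template, else 0."""
    return 1.0 if THINK_ANSWER.match(completion.strip()) else 0.0

def iou(a, b) -> float:
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (
        (a[2] - a[0]) * (a[3] - a[1])
        + (b[2] - b[0]) * (b[3] - b[1])
        - inter
    )
    return inter / union if union > 0 else 0.0

def total_reward(completion, pred_box, gt_box, w_format=0.5, w_metric=1.0):
    # Weights are illustrative; the paper does not state them here.
    return w_format * format_reward(completion) + w_metric * iou(pred_box, gt_box)
```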
Experimental Results
We evaluate our method on several public datasets and compare it with state-of-the-art approaches.
Qualitative Results

Sample 1: Demo of the GRES task.

Sample 2: Demo of the GREC task.
Quantitative Results (FS-GREC)

Table 2: Comparison with state-of-the-art methods on the VRSBench dataset.
Table 2 compares models trained on the full VRSBench training set (Full Amount Fine-tune) against few-shot models (1/5/10-shot Fine-tune). The few-shot results include both SFT-based models and our RL-tuned models. Performance figures for the full-data baselines (except Qwen2.5-VL) are taken from the original VRSBench paper. The results reveal a clear performance hierarchy: RL-based post-training consistently and significantly outperforms the SFT approach across all settings and metrics. The advantage is substantial; for example, in the 10-shot overall setting, our GRPO-based model achieves an Acc@0.5 score 12.30% higher than its SFT counterpart. Remarkably, our 10-shot GRPO model, trained on only 260 samples (0.71% of the data), surpasses all evaluated models (except Qwen2.5-VL) trained on the full 36,313 samples.
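For reference, Acc@0.5 counts a prediction as correct when its IoU with the ground-truth box is at least 0.5. A minimal sketch, reusing the `iou` helper from the reward sketch above and assuming a one-to-one pairing of predictions and ground truths:

```python
def acc_at_iou(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of predictions whose IoU with the paired ground-truth
    box meets the threshold (Acc@0.5 for threshold=0.5)."""
    # iou() is defined in the reward sketch above.
    hits = sum(iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes) if gt_boxes else 0.0
```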
Quantitative Results (Cross-Dataset)

Table 5: Cross-Dataset Evaluation
We further assess the cross-dataset generalization of the SFT and GRPO approaches on the FS-GREC and FS-GRES tasks. For FS-GREC, we fine-tune models on the VRSBench dataset with limited supervision (1, 5, and 10 shots) and then evaluate them on the DIOR-RSVG target dataset in a zero-shot manner. As shown in Table 5, GRPO consistently outperforms SFT across all settings, with advantages of 4.92%, 6.05%, and 5.52% in the 1-shot, 5-shot, and 10-shot scenarios, respectively. Similarly, for FS-GRES, models are tuned on the EarthReason dataset (1, 5, and 10 shots) and tested on the RRSIS-D dataset. Here, the GRPO-based model (Geo-R1) improves markedly over the SFT-based model (SegEarth-R1) under the few-shot setting, with a relative improvement of up to 80%. These results highlight GRPO's strong cross-dataset generalization, indicating the superior transferability and robustness of Geo-R1.
Citation
@inproceedings{GeoR12025,
title={Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning},
author={Your Name and Collaborator Name and Supervisor Name},
booktitle={Conference Name},
year={2025}
}