<?xml version='1.0' encoding='utf-8'?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>RS-Paper-Hub — VLM Papers</title>
  <id>https://rspaper.top/output/feed_vlm.xml</id>
  <link href="https://rspaper.top/output/feed_vlm.xml" rel="self" type="application/atom+xml" />
  <link href="https://rspaper.top" rel="alternate" type="text/html" />
  <updated>2026-05-18T02:08:55Z</updated>
  <subtitle>Latest remote sensing papers (last 7 days) — 9 entries</subtitle>
  <author>
    <name>RS-Paper-Hub</name>
    <uri>https://rspaper.top</uri>
  </author>
  <entry>
    <title>Text-RSIR: A Text-Guided Framework for Efficient Remote Sensing Image Transmission and Reconstruction</title>
    <link href="http://arxiv.org/abs/2605.15558v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.15558v1</id>
    <published>2026-05-15T00:00:00Z</published>
    <updated>2026-05-15T00:00:00Z</updated>
    <author>
      <name>Hao Yang</name>
    </author>
    <author>
      <name>Xianping Ma</name>
    </author>
    <author>
      <name>Peifeng Ma</name>
    </author>
    <author>
      <name>Man-On Pun</name>
    </author>
    <summary type="text">High-resolution remote sensing imagery is critical for environmental monitoring, urban mapping, and land cover analysis, but its transmission is often hindered by limited bandwidth and high communication costs. Conventional pipelines transmit full-resolution pixel data, resulting in redundant and inefficient delivery. This paper proposes a text-guided remote sensing image transmission system that replaces complete high-resolution data with low-resolution images accompanied by compact textual descriptions. An onboard text generator produces spatial and semantic summaries, reducing the transmitted data volume to approximately 2% of the original size. For ground-based reconstruction, a text-conditioned image restoration model is introduced, which leverages cross-modal learning to recover fine spatial details and maintain semantic coherence. Experimental results on the Alsat-2B, UC Merced Land Use, and Aerial Image datasets demonstrate that the proposed framework achieves reconstruction PSNRs of 16.36 dB, 26.87 dB, and 27.41 dB, respectively, enabling efficient and information-preserving image transfer for remote sensing applications. The implementation will be made publicly available on GitHub at https://github.com/haoyangofficial/textrssr</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/haoyangofficial/textrssr"&gt;https://github.com/haoyangofficial/textrssr&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; 15 pages, 8 figures&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; SR&lt;/p&gt;</content>
    <category term="Image and Video Processing" />
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding</title>
    <link href="http://arxiv.org/abs/2605.14475v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.14475v1</id>
    <published>2026-05-14T00:00:00Z</published>
    <updated>2026-05-14T00:00:00Z</updated>
    <author>
      <name>Jiashun Zhu</name>
    </author>
    <author>
      <name>Ronghao Fu</name>
    </author>
    <author>
      <name>Jiasen Hu</name>
    </author>
    <author>
      <name>Nachuan Xing</name>
    </author>
    <author>
      <name>Xu Na</name>
    </author>
    <author>
      <name>Xiao Yang</name>
    </author>
    <author>
      <name>Zhiwen Lin</name>
    </author>
    <author>
      <name>Weipeng Zhang</name>
    </author>
    <author>
      <name>Lang Sun</name>
    </author>
    <author>
      <name>Zhiheng Xue</name>
    </author>
    <author>
      <name>Haoran Liu</name>
    </author>
    <author>
      <name>Weijie Zhang</name>
    </author>
    <author>
      <name>Bo Yang</name>
    </author>
    <summary type="text">Interpreting ultra-high-resolution (UHR) remote sensing images requires models to search for sparse and tiny visual evidence across large-scale scenes. Existing remote sensing vision-language models can inspect local regions with zooming and cropping tools, but most exploration strategies follow either a one-shot focus or a single sequential trajectory. Such single-path exploration can lose global context, leave scattered regions unvisited, and revisit or count the same evidence multiple times. To this end, we propose GeoVista, a planning-driven active perception framework for UHR remote sensing interpretation. Instead of committing to one zooming path, GeoVista first builds a global exploration plan, then verifies multiple candidate regions through branch-wise local inspection, while maintaining an explicit evidence state for cross-region aggregation and de-duplication. To enable this behavior, we introduce APEX-GRO, a cold-start supervised trajectory corpus that reformulates diverse UHR tasks as Global-Region-Object interactive reasoning processes with a unified, scale-invariant spatial representation. We further design an Observe-Plan-Track mechanism for global observation, adaptive region inspection, and evidence tracking, and align the model with a GRPO-based strategy using step-wise rewards for planning, localization, and final answer correctness. Experiments on RSHR-Bench, XLRS-Bench, and LRS-VQA show that GeoVista achieves state-of-the-art performance. Code and dataset are available at https://github.com/ryan6073/GeoVista</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/ryan6073/GeoVista"&gt;https://github.com/ryan6073/GeoVista&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; VQA&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>GeoViSTA: Geospatial Vision-Tabular Transformer for Multimodal Environment Representation</title>
    <link href="http://arxiv.org/abs/2605.14406v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.14406v1</id>
    <published>2026-05-14T00:00:00Z</published>
    <updated>2026-05-14T00:00:00Z</updated>
    <author>
      <name>Yuhao Liu</name>
    </author>
    <author>
      <name>Sadeer Al-Kindi</name>
    </author>
    <author>
      <name>Ashok Veeraraghavan</name>
    </author>
    <author>
      <name>Guha Balakrishnan</name>
    </author>
    <summary type="text">Large-scale pretraining on Earth observation imagery has yielded powerful representations of the natural and built environment. However, most existing geospatial foundation models do not directly model the structured socioeconomic covariates typically stored in tabular form. This modality gap limits their ability to capture the complete total environment, which is critical for reasoning about complex environmental, social, and health-related outcomes. In this work, we propose GeoViSTA (Geospatial Vision-Tabular Transformer), a vision-tabular architecture that learns unified geospatial embeddings from co-registered gridded imagery and tabular data. GeoViSTA utilizes bilateral cross-attention to exchange spatial and semantic information across modalities, guided by a geography-aware attention mechanism that aligns continuous image patches with irregular census-tract tokens. We train GeoViSTA with a self-supervised joint masked-autoencoding objective, forcing it to recover missing image patches and tabular rows using local spatial context and cross-modal cues. Empirically, GeoViSTA's unified embeddings improve linear probing performance on high-impact downstream tasks, outperforming baselines in predicting disease-specific mortality and fire hazard frequency across held-out regions. These results demonstrate that jointly modeling the physical environment alongside structured socioeconomic context yields highly transferable representations for holistic geospatial inference.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="Machine Learning" />
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>HiSem: Hierarchical Semantic Disentangling for Remote Sensing Image Change Captioning</title>
    <link href="http://arxiv.org/abs/2605.15024v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.15024v1</id>
    <published>2026-05-14T00:00:00Z</published>
    <updated>2026-05-14T00:00:00Z</updated>
    <author>
      <name>Man Wang</name>
    </author>
    <author>
      <name>Chenyang Liu</name>
    </author>
    <author>
      <name>Wenjun Li</name>
    </author>
    <author>
      <name>Feng Ni</name>
    </author>
    <author>
      <name>Bing Jia</name>
    </author>
    <author>
      <name>Baoqi Huang</name>
    </author>
    <author>
      <name>Riting Xia</name>
    </author>
    <author>
      <name>Zhenwei Shi</name>
    </author>
    <summary type="text">Remote sensing image change captioning (RSICC) aims to achieve high-level semantic understanding of genuine changes occurring between bi-temporal images. Despite notable progress, existing methods are fundamentally limited by a shared modeling assumption: changed and unchanged image pairs, which have intrinsically different semantic granularities, are processed under a unified modeling strategy. This modeling inconsistency leads to semantic entanglement between coarse-grained change existence judgment and fine-grained semantic understanding. To address the above limitation, we propose a novel hierarchical semantic disentangling network (HiSem) that explicitly disentangles semantic representations of different granularities. Specifically, we first introduce the Bidirectional Differential Attention Modulation (BDAM) module that leverages discrepancy-aware attention to enhance cross-temporal interactions, thereby amplifying true change signals while suppressing irrelevant variations. Building upon this, we design a Hierarchical Adaptive Semantic Disentanglement (HASD) module that performs adaptive routing at two hierarchical levels: a coarse-grained image-level routing mechanism distinguishes changed and unchanged image pairs, while a fine-grained token-level Mixture-of-Experts (MoE) block models diverse and heterogeneous change semantics for changed samples. Extensive experiments on two benchmark datasets demonstrate that HiSem outperforms previous methods, achieving a significant improvement of +7.52% BLEU-4 on the WHU-CDC dataset. More importantly, our approach provides a structured perspective for RSICC by explicitly aligning model design with the intrinsic semantic heterogeneity of bi-temporal scenes. The code will be available at https://github.com/Man-Wang-star/HiSem</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/Man-Wang-star/HiSem"&gt;https://github.com/Man-Wang-star/HiSem&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; IC;CD&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>ELDOR: A Dataset and Benchmark for Illegal Gold Mining in the Amazon Rainforest</title>
    <link href="http://arxiv.org/abs/2605.15397v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.15397v1</id>
    <published>2026-05-14T00:00:00Z</published>
    <updated>2026-05-14T00:00:00Z</updated>
    <author>
      <name>Kangning Cui</name>
    </author>
    <author>
      <name>Surendra Bohara</name>
    </author>
    <author>
      <name>Suraj Prasai</name>
    </author>
    <author>
      <name>Zishan Shao</name>
    </author>
    <author>
      <name>Wei Tang</name>
    </author>
    <author>
      <name>Martin Pillaca</name>
    </author>
    <author>
      <name>Edwin Flores</name>
    </author>
    <author>
      <name>Zhen Yang</name>
    </author>
    <author>
      <name>Gregory Larsen</name>
    </author>
    <author>
      <name>Evan Dethier</name>
    </author>
    <author>
      <name>David Lutz</name>
    </author>
    <author>
      <name>Jean-Michel Morel</name>
    </author>
    <author>
      <name>Miles Silman</name>
    </author>
    <author>
      <name>Victor Pauca</name>
    </author>
    <author>
      <name>Fan Yang</name>
    </author>
    <summary type="text">Illegal gold mining in the Amazon rainforest causes deforestation, water contamination, and long-term ecosystem disruption, yet remains difficult to monitor at fine spatial scales. Satellite imagery supports large-scale observation, but often misses small mining-related structures and subtle land-cover transitions, especially under frequent cloud cover. We introduce ELDOR, a large-scale UAV benchmark for monitoring environmental and landscape disturbance from illegal gold mining in the rainforest. ELDOR contains manually annotated orthomosaic imagery covering over 2,500 hectares, with pixel-level semantic labels for both mining-related activities and surrounding ecological structures. With this unified annotation source, we establish four benchmark tasks: semantic segmentation, segmentation-derived recognition, direct multi-label classification, and class-presence recognition with vision-language models. Across these tasks, we compare generic and remote-sensing-specific segmentation models, vision foundation model-related segmentation methods, direct multi-label classification methods, and vision-language models under a controlled closed-set protocol. Results show that current methods still struggle with rare small-scale mining structures and fine-grained recovery classes, suggesting the need for context-aware and multimodal modeling. To support domain analysis and practical use, we further build an interactive explorer for domain experts that provides a unified interface for data exploration and model inference.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; 70 pages, 35 figures&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Dataset&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; CLS;SEG&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents</title>
    <link href="http://arxiv.org/abs/2605.13391v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.13391v1</id>
    <published>2026-05-13T00:00:00Z</published>
    <updated>2026-05-13T00:00:00Z</updated>
    <author>
      <name>Liangtian Liu</name>
    </author>
    <author>
      <name>Zeyuan Wang</name>
    </author>
    <author>
      <name>Ziyu Li</name>
    </author>
    <author>
      <name>Kai Ouyang</name>
    </author>
    <author>
      <name>Zichao Tang</name>
    </author>
    <author>
      <name>Chengfu Liu</name>
    </author>
    <author>
      <name>Haifeng Li</name>
    </author>
    <author>
      <name>Hanwen Yu</name>
    </author>
    <author>
      <name>Wentao Yang</name>
    </author>
    <author>
      <name>Cheng Yang</name>
    </author>
    <author>
      <name>Dongyang Hou</name>
    </author>
    <summary type="text">The rise of multi-modal large language models (MLLMs) is shifting remote sensing (RS) intelligence from "see" to "action", as OpenClaw-style frameworks enable agents to autonomously operate massive RS image-processing tools for complex tasks. Existing RS agents adopt a passive selection paradigm for tool invocation, relying on either full tool registration (Flat) or retrieval-augmented generation (RAG). However, in the massive and multi-source heterogeneous RS tool ecosystem, such passive mechanisms struggle to dynamically balance "context load" and "toolset completeness" throughout task reasoning, thus exhibiting inherent limitations: full tool registration triggers context space deficits during long-horizon tasks, whereas RAG retrieval may omit critical tools in essential steps. To overcome these bottlenecks, this paper redefines tool selection by arguing that the agent should act as an active explorer within the tool space. Based on this perspective, we propose RS-Claw, a novel RS agent architecture. By leveraging Skill encapsulation technology at the tool end, this architecture hierarchically structures tool descriptions, enabling the agent to execute on-demand sequential decision-making: initially selecting relevant skill branches by reading only tool summaries, then dynamically loading detailed descriptions, and ultimately achieving precise invocation. This active paradigm not only significantly liberates the agent's context space but also effectively ensures the accurate hit rate of critical tools during long-horizon reasoning. Systematic experiments on the Earth-Bench benchmark demonstrate that RS-Claw's active exploration mechanism effectively filters semantic noise and substantially frees up reasoning space, achieving an input token compression ratio of up to 86%, and comprehensively outperforming existing Flat and RAG baselines across complex reasoning evaluations.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="Artificial Intelligence" />
  </entry>
  <entry>
    <title>GeoR-Bench: Evaluating Geoscience Visual Reasoning</title>
    <link href="http://arxiv.org/abs/2605.11541v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.11541v1</id>
    <published>2026-05-12T00:00:00Z</published>
    <updated>2026-05-12T00:00:00Z</updated>
    <author>
      <name>Yushuo Zheng</name>
    </author>
    <author>
      <name>Zicheng Zhang</name>
    </author>
    <author>
      <name>Huiyu Duan</name>
    </author>
    <author>
      <name>Chunyi Li</name>
    </author>
    <author>
      <name>Zijian Chen</name>
    </author>
    <author>
      <name>Ziheng Jia</name>
    </author>
    <author>
      <name>Yue Shi</name>
    </author>
    <author>
      <name>Ke Gu</name>
    </author>
    <author>
      <name>Xiongkuo Min</name>
    </author>
    <author>
      <name>Guangtao Zhai</name>
    </author>
    <summary type="text">Geoscience intelligence is expected to understand, reason about, and predict earth system changes to support human decision-making in critical domains such as disaster response, climate adaptation, and environmental protection. Although current research has shown promising progress on specific geoscience tasks, such as remote sensing interpretation and geographic question answering, existing benchmarks remain largely task-specific and fail to capture open-ended real-world geoscience problems. As a result, it remains unclear how far current AI systems are from achieving genuine geoscience intelligence. To address this gap, we present GeoR-Bench, a benchmark for evaluating geoscience visual reasoning through reasoning-informed visual editing tasks. GeoR-Bench contains 440 curated samples spanning 6 geoscience categories and 24 task types, covering earth observation imagery and structured scientific representations such as maps and diagrams. We evaluate outputs along three dimensions: reasoning, consistency, and quality. Benchmark results of 21 closed- and open-source multimodal models reveal that geoscience reasoning remains a critical bottleneck. The highest-performing model achieves 42.7% overall strict accuracy, while the best open-source models reach only 10.3%. Notably, the visual consistency and image quality of the outputs frequently surpass their scientific accuracy. Ultimately, these findings indicate that current models generate superficially plausible results but fail to capture underlying earth science processes.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs</title>
    <link href="http://arxiv.org/abs/2605.12237v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.12237v1</id>
    <published>2026-05-12T00:00:00Z</published>
    <updated>2026-05-12T00:00:00Z</updated>
    <author>
      <name>Shuo Ni</name>
    </author>
    <author>
      <name>Tong Wang</name>
    </author>
    <author>
      <name>Jing Zhang</name>
    </author>
    <author>
      <name>He Chen</name>
    </author>
    <author>
      <name>Haonan Guo</name>
    </author>
    <author>
      <name>Ning Zhang</name>
    </author>
    <author>
      <name>Bo Du</name>
    </author>
    <summary type="text">Vision-Language Models (VLMs) increasingly operate on ultra-high-resolution (UHR) Earth observation imagery, yet they remain vulnerable to a severe scale mismatch between large-scale scene context and micro-scale targets. We refer to this empirical gap as a "resolution illusion": higher input resolution provides the appearance of richer visual detail, but does not necessarily yield reliable perception of spatially small, task-relevant evidence. To benchmark this challenge, we introduce UHR-Micro, a benchmark comprising 11,253 instructions grounded in 1,212 UHR images, designed to evaluate VLMs at the spatial limits of native Earth observation imagery. UHR-Micro spans diverse micro-target scales, context requirements, task families, and visual conditions, and provides diagnostic annotations that support controlled evaluation and fine-grained error attribution. Experiments with representative high-resolution VLMs show substantial failures in spatial grounding and evidence parsing, despite access to high-resolution inputs. Further analysis suggests that these failures are not fully resolved by increasing model capacity, but are closely tied to insufficient guidance in locating and using task-relevant micro-evidence. Motivated by this finding, we propose Micro-evidence Active Perception (MAP), a reference agent that decomposes queries into evidence-seeking steps, actively inspects candidate regions, and grounds its answers in localized observations. MAP-Agent improves micro-level perception by making high-resolution reasoning evidence-centered rather than image-centered. Together, UHR-Micro and MAP-Agent provide a diagnostic platform for evaluating, understanding, and advancing high-resolution reasoning in Earth observation VLMs. Datasets and source code were released at https://github.com/MiliLab/UHR-Micro.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/MiliLab/UHR-Micro"&gt;https://github.com/MiliLab/UHR-Micro&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; VG&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>TAR: Text Semantic Assisted Cross-modal Image Registration Framework for Optical and SAR Images</title>
    <link href="http://arxiv.org/abs/2605.12064v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.12064v1</id>
    <published>2026-05-12T00:00:00Z</published>
    <updated>2026-05-12T00:00:00Z</updated>
    <author>
      <name>Zhuoyu Cai</name>
    </author>
    <author>
      <name>Dou Quan</name>
    </author>
    <author>
      <name>Ning Huyan</name>
    </author>
    <author>
      <name>Pei He</name>
    </author>
    <author>
      <name>Shuang Wang</name>
    </author>
    <author>
      <name>Licheng Jiao</name>
    </author>
    <summary type="text">Existing deep learning-based methods can capture shared features from optical and synthetic aperture radar (SAR) images for spatial alignment. However, optical-SAR registration remains challenging under large geometric deformations, because the model needs to simultaneously handle cross-modal appearance discrepancies and complex spatial transformations. To address this issue, this paper proposes a text semantic-assisted cross-modal image registration framework, named TAR, for optical and SAR images. TAR exploits text semantic priors from remote sensing scenes and land-cover categories to alleviate the modality gap and enhance cross-modal feature learning. TAR consists of three components: a multi-scale visual feature learning (MSFL) module, a text-assisted feature enhancement (TAFE) module, and a coarse-to-fine dense matching (CFDM) module. MSFL extracts multi-scale visual features from optical and SAR images. TAFE constructs text descriptors related to remote sensing scenes and land-cover objects, and uses a frozen RemoteCLIP text encoder to extract text features. These text features are introduced through visual-text interaction to enhance high-level visual features for more reliable coarse matching. CFDM then establishes coarse correspondences based on the enhanced high-level features and refines the matched locations using low-level features. Experimental results on cross-modal remote sensing images demonstrate the effectiveness of TAR, which achieves stronger matching performance than several state-of-the-art methods and yields significant gains under large geometric deformations.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
</feed>