<?xml version='1.0' encoding='utf-8'?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>RS-Paper-Hub — Agent Papers</title>
  <id>https://rspaper.top/output/feed_agent.xml</id>
  <link href="https://rspaper.top/output/feed_agent.xml" rel="self" type="application/atom+xml" />
  <link href="https://rspaper.top" rel="alternate" type="text/html" />
  <updated>2026-05-18T02:08:55Z</updated>
  <subtitle>Latest remote sensing papers (last 7 days) — 3 entries</subtitle>
  <author>
    <name>RS-Paper-Hub</name>
    <uri>https://rspaper.top</uri>
  </author>
  <entry>
    <title>RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents</title>
    <link href="http://arxiv.org/abs/2605.13391v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.13391v1</id>
    <published>2026-05-13T00:00:00Z</published>
    <updated>2026-05-13T00:00:00Z</updated>
    <author>
      <name>Liangtian Liu</name>
    </author>
    <author>
      <name>Zeyuan Wang</name>
    </author>
    <author>
      <name>Ziyu Li</name>
    </author>
    <author>
      <name>Kai Ouyang</name>
    </author>
    <author>
      <name>Zichao Tang</name>
    </author>
    <author>
      <name>Chengfu Liu</name>
    </author>
    <author>
      <name>Haifeng Li</name>
    </author>
    <author>
      <name>Hanwen Yu</name>
    </author>
    <author>
      <name>Wentao Yang</name>
    </author>
    <author>
      <name>Cheng Yang</name>
    </author>
    <author>
      <name>Dongyang Hou</name>
    </author>
    <summary type="text">The rise of multi-modal large language models (MLLMs) is shifting remote sensing (RS) intelligence from "seeing" to "acting", as OpenClaw-style frameworks enable agents to autonomously operate large numbers of RS image-processing tools for complex tasks. Existing RS agents adopt a passive selection paradigm for tool invocation, relying on either full tool registration (Flat) or retrieval-augmented generation (RAG). However, in the massive, multi-source, heterogeneous RS tool ecosystem, such passive mechanisms struggle to dynamically balance "context load" and "toolset completeness" throughout task reasoning, and thus exhibit inherent limitations: full tool registration depletes context space during long-horizon tasks, whereas RAG may omit critical tools at essential steps. To overcome these bottlenecks, this paper redefines tool selection by arguing that the agent should act as an active explorer within the tool space. Based on this perspective, we propose RS-Claw, a novel RS agent architecture. By leveraging Skill encapsulation technology at the tool end, this architecture hierarchically structures tool descriptions, enabling the agent to execute on-demand sequential decision-making: initially selecting relevant skill branches by reading only tool summaries, then dynamically loading detailed descriptions, and ultimately achieving precise invocation. This active paradigm not only frees much of the agent's context space but also maintains an accurate hit rate for critical tools during long-horizon reasoning. Systematic experiments on the Earth-Bench benchmark demonstrate that RS-Claw's active exploration mechanism effectively filters semantic noise and substantially frees up reasoning space, achieving an input token compression ratio of up to 86% and comprehensively outperforming existing Flat and RAG baselines across complex reasoning evaluations.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="Artificial Intelligence" />
  </entry>
  <entry>
    <title>Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations</title>
    <link href="http://arxiv.org/abs/2605.11633v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.11633v1</id>
    <published>2026-05-12T00:00:00Z</published>
    <updated>2026-05-12T00:00:00Z</updated>
    <author>
      <name>Junjue Wang</name>
    </author>
    <author>
      <name>Weihao Xuan</name>
    </author>
    <author>
      <name>Heli Qi</name>
    </author>
    <author>
      <name>Pengyu Dai</name>
    </author>
    <author>
      <name>Kunyi Liu</name>
    </author>
    <author>
      <name>Hongruixuan Chen</name>
    </author>
    <author>
      <name>Zhuo Zheng</name>
    </author>
    <author>
      <name>Junshi Xia</name>
    </author>
    <author>
      <name>Stefano Ermon</name>
    </author>
    <author>
      <name>Naoto Yokoya</name>
    </author>
    <summary type="text">Operational disaster response goes beyond damage assessment, requiring responders to integrate multi-sensor signals, reason over road networks, populations and key facilities, plan evacuations, and produce actionable reports. However, prior work largely isolates remote-sensing perception or evaluates generic tool use, leaving the end-to-end workflows of emergency operations underexplored. In this paper, we introduce the Disaster Operational Response Agent benchmark (DORA), the first agentic benchmark for end-to-end disaster response: 515 expert-authored tasks across 45 real-world disaster events spanning 10 types, paired with expert-verified, replayable gold trajectories totaling 3,500 tool-call steps. Tasks span five dimensions that cover the operational disaster-response pipeline: disaster perception, spatial relational analysis, rescue and evacuation planning, temporal evolution reasoning, and multi-modal report synthesis. Agents compose calls from a 108-tool MCP library over heterogeneous geospatial data: optical, SAR, and multi-spectral imagery across single-, bi-, and multi-temporal sequences (0.015-10 m GSD), complemented by elevation and social vector layers. We comprehensively evaluate 13 frontier LLMs on our benchmark, revealing three persistent challenges: 1) disaster-domain grounding exposes unique failure modes (damage-semantic grounding, sensor-modality mismatch, and disaster-pipeline composition); 2) agents are doubly bottlenecked by tool selection and argument grounding, where gold tool-order hints improve accuracy by only 1.08-4.40%, and alternative scaffolds yield at most a 3.24% gain; 3) compositional fragility scales with trajectory length, with the agent-to-gold gap widening from 7% to 56% on long pipelines. DORA establishes a rigorous testbed for operationally reliable disaster-response agents.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; DORA stress-tests LLM agents on real-world disaster operations that demand comprehensive orchestration of 108 specialized tools over heterogeneous geospatial data&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; VG&lt;/p&gt;</content>
    <category term="Artificial Intelligence" />
  </entry>
  <entry>
    <title>UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs</title>
    <link href="http://arxiv.org/abs/2605.12237v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.12237v1</id>
    <published>2026-05-12T00:00:00Z</published>
    <updated>2026-05-12T00:00:00Z</updated>
    <author>
      <name>Shuo Ni</name>
    </author>
    <author>
      <name>Tong Wang</name>
    </author>
    <author>
      <name>Jing Zhang</name>
    </author>
    <author>
      <name>He Chen</name>
    </author>
    <author>
      <name>Haonan Guo</name>
    </author>
    <author>
      <name>Ning Zhang</name>
    </author>
    <author>
      <name>Bo Du</name>
    </author>
    <summary type="text">Vision-Language Models (VLMs) increasingly operate on ultra-high-resolution (UHR) Earth observation imagery, yet they remain vulnerable to a severe scale mismatch between large-scale scene context and micro-scale targets. We refer to this empirical gap as a "resolution illusion": higher input resolution provides the appearance of richer visual detail, but does not necessarily yield reliable perception of spatially small, task-relevant evidence. To benchmark this challenge, we introduce UHR-Micro, a benchmark comprising 11,253 instructions grounded in 1,212 UHR images, designed to evaluate VLMs at the spatial limits of native Earth observation imagery. UHR-Micro spans diverse micro-target scales, context requirements, task families, and visual conditions, and provides diagnostic annotations that support controlled evaluation and fine-grained error attribution. Experiments with representative high-resolution VLMs show substantial failures in spatial grounding and evidence parsing, despite access to high-resolution inputs. Further analysis suggests that these failures are not fully resolved by increasing model capacity, but are closely tied to insufficient guidance in locating and using task-relevant micro-evidence. Motivated by this finding, we propose Micro-evidence Active Perception (MAP), a reference agent that decomposes queries into evidence-seeking steps, actively inspects candidate regions, and grounds its answers in localized observations. MAP-Agent improves micro-level perception by making high-resolution reasoning evidence-centered rather than image-centered. Together, UHR-Micro and MAP-Agent provide a diagnostic platform for evaluating, understanding, and advancing high-resolution reasoning in Earth observation VLMs. Datasets and source code are available at https://github.com/MiliLab/UHR-Micro.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/MiliLab/UHR-Micro"&gt;https://github.com/MiliLab/UHR-Micro&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; VG&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
</feed>