<?xml version='1.0' encoding='utf-8'?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>RS-Paper-Hub — All Papers</title>
  <id>https://rspaper.top/output/feed.xml</id>
  <link href="https://rspaper.top/output/feed.xml" rel="self" type="application/atom+xml" />
  <link href="https://rspaper.top" rel="alternate" type="text/html" />
  <updated>2026-05-18T02:08:55Z</updated>
  <subtitle>Latest remote sensing papers (last 7 days) — 25 entries</subtitle>
  <author>
    <name>RS-Paper-Hub</name>
    <uri>https://rspaper.top</uri>
  </author>
  <entry>
    <title>Energy Evolution from the Chromosphere to the Heliosphere in the 2021 October 28 Solar Eruption</title>
    <link href="http://arxiv.org/abs/2605.16111v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.16111v1</id>
    <published>2026-05-15T00:00:00Z</published>
    <updated>2026-05-15T00:00:00Z</updated>
    <author>
      <name>Katharine K. Reeves</name>
    </author>
    <author>
      <name>Daniel B. Seaton</name>
    </author>
    <author>
      <name>Cynthia Cattell</name>
    </author>
    <author>
      <name>Bin Chen</name>
    </author>
    <author>
      <name>Liam David</name>
    </author>
    <author>
      <name>Federico Fraschetti</name>
    </author>
    <author>
      <name>Joe Giacalone</name>
    </author>
    <author>
      <name>Phillip Hess</name>
    </author>
    <author>
      <name>Andryi Koval</name>
    </author>
    <author>
      <name>Dana W. Longcope</name>
    </author>
    <author>
      <name>Surajit Mondal</name>
    </author>
    <author>
      <name>Christopher S. Moore</name>
    </author>
    <author>
      <name>Sophie Musset</name>
    </author>
    <author>
      <name>Tatiana Niembro</name>
    </author>
    <author>
      <name>Daniel Pacheco</name>
    </author>
    <author>
      <name>Yeimy J. Rivera</name>
    </author>
    <author>
      <name>Soumya Roy</name>
    </author>
    <author>
      <name>Xudong Sun</name>
    </author>
    <author>
      <name>Durgesh Tripathi</name>
    </author>
    <author>
      <name>Domenico Trotta</name>
    </author>
    <author>
      <name>Matthew J. West</name>
    </author>
    <author>
      <name>Sijie Yu</name>
    </author>
    <author>
      <name>Chunming Zhu</name>
    </author>
    <summary type="text">We perform a detailed study of the energetics for a well-observed solar eruption and flare that occurred on 28 October 2021. This event included a GOES class X1.0 flare, a global EUV wave, and a coronal mass ejection that reached speeds of &gt;2000 km/s. The event was observed from a variety of spacecraft in NASA's Heliophysics System Observatory, including multiple missions near Earth, STEREO-A off the Sun-Earth line, and Solar Orbiter, near the Sun-Earth line at about 0.8 au. Using remote sensing, in situ observations, and in some cases scaling laws based on previous observations, we characterize the following quantities: free magnetic energy, energy in non-thermal electrons, energy in non-thermal ions, bolometric energy, energy deposited in the chromosphere, thermal energy radiated in the flare loops, energy dissipated by the EUV wave, CME kinetic and gravitational potential energy, CME energy flux in the heliosphere, and the energy partition in the CME shock. We find that the total energy released during the event is consistent with estimates of the pre-event stored magnetic energy, and the CME kinetic + potential energy dominates the energy partition.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; 34 pages, 32 figures&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="astro-ph.SR" />
  </entry>
  <entry>
    <title>REX-SUB: A Scalable Subsampling Strategy for Modeling Large Spatial Datasets</title>
    <link href="http://arxiv.org/abs/2605.16075v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.16075v1</id>
    <published>2026-05-15T00:00:00Z</published>
    <updated>2026-05-15T00:00:00Z</updated>
    <author>
      <name>Nicholas Rios</name>
    </author>
    <author>
      <name>Ben Seiyon Lee</name>
    </author>
    <summary type="text">Recent advances in data collection technologies have led to the emergence of massive spatial datasets, with measurements obtained at millions of spatial locations. Geostatistical models typically employ Gaussian processes (GPs) to capture spatial dependence, but standard GP fitting becomes prohibitive at such scales. A promising solution is optimal subsampling, where a subset of locations is selected that optimizes a criterion. In this study, we propose a randomized exchange algorithm for subsampling (REX-SUB) which efficiently selects small subsamples that minimize prediction errors in the fitted spatial GP models. To further improve computational efficiency, we embed a scalable Vecchia approximation to the GP's joint likelihood, which takes advantage of sparsity in the precision matrix to enable fast inference on the selected subsamples. Through a simulation study and an application to a remotely sensed precipitable water dataset, we show that REX-SUB yields lower mean squared prediction errors and interval scores compared to competing subsampling strategies.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="stat.ME" />
    <category term="stat.CO" />
  </entry>
  <entry>
    <title>ChronoEarth-492K: A Large Scale and Long Horizon Spatiotemporal Hyperspectral Earth Observation Dataset and Benchmark</title>
    <link href="http://arxiv.org/abs/2605.15666v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.15666v1</id>
    <published>2026-05-15T00:00:00Z</published>
    <updated>2026-05-15T00:00:00Z</updated>
    <author>
      <name>Haozhe Si</name>
    </author>
    <author>
      <name>Yuxuan Wan</name>
    </author>
    <author>
      <name>Yuqing Wang</name>
    </author>
    <author>
      <name>Minh Do</name>
    </author>
    <author>
      <name>Han Zhao</name>
    </author>
    <summary type="text">Hyperspectral imaging (HSI) provides dense spectral information for the Earth's surface, enabling material-level understanding of land cover and ecosystem dynamics. Despite recent progress in hyperspectral self-supervised learning (SSL), existing datasets remain temporally shallow, limiting the development of long-horizon spatiotemporal modeling. To address this gap, we introduce ChronoEarth-492K, the first large-scale, temporally calibrated hyperspectral SSL dataset built upon NASA's EO-1 Hyperion mission, the world's longest continuous hyperspectral archive to date (2001-2017). ChronoEarth-492K comprises 492,354 radiometrically harmonized patches across 185,398 global locations over 17 years, with 28,786 sites containing multi-temporal sequences (≥ 3 observations) that enable both short- and long-horizon temporal analysis. Building on this foundation, we establish the ChronoEarth-Benchmark, a unified evaluation suite spanning static, short-horizon, and long-horizon temporal tasks, constructed from six open-source geospatial products covering land cover, crop type, forest dynamics, and soil properties. We further introduce a standardized evaluation protocol and report extensive baseline results across state-of-the-art hyperspectral foundation models. Together, ChronoEarth-492K and the ChronoEarth-Benchmark provide the first large-scale, temporally grounded platform for systematic spatiotemporal hyperspectral representation learning.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Dataset&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>LDGuid: A Framework for Robust Change Detection via Latent Difference Guidance</title>
    <link href="http://arxiv.org/abs/2605.15582v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.15582v1</id>
    <published>2026-05-15T00:00:00Z</published>
    <updated>2026-05-15T00:00:00Z</updated>
    <author>
      <name>Jiaxuan Zhao</name>
    </author>
    <author>
      <name>Ali Bereyhi</name>
    </author>
    <summary type="text">Modern deep learning models for change detection (CD) often struggle to explicitly represent task-relevant semantic differences. This paper proposes the Latent Difference Guidance (LDGuid) framework that explicitly learns and injects semantic differences into CD models. LDGuid deploys adversarial autoencoding to implement a difference embedding (DE) module. The DE module is pretrained via the information bottleneck method, restricting it to learn only task-relevant differences between pre- and post-event samples. The learned latent difference is then used as an explicit guidance signal in the CD model. We validate LDGuid by integrating it into U-Net, BIT, and AERNet baselines for CD and evaluating it on LEVIR-CD, WHU-CD, SVCD, and CaBuAr datasets. Experimental results show that LDGuid enhances segmentation performance across all benchmarks, with particularly remarkable gains in challenging settings affected by spectral noise. The results further highlight the ability of LDGuid to incorporate domain knowledge, such as task-specific spectral indices. Our findings suggest that semantic difference learning can drastically enhance the robustness of CD in remote sensing.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; Accepted to IGARSS 2026. Code is available at: https://github.com/zjxyoyo/LDGuid&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; CD&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>Text-RSIR: A Text-Guided Framework for Efficient Remote Sensing Image Transmission and Reconstruction</title>
    <link href="http://arxiv.org/abs/2605.15558v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.15558v1</id>
    <published>2026-05-15T00:00:00Z</published>
    <updated>2026-05-15T00:00:00Z</updated>
    <author>
      <name>Hao Yang</name>
    </author>
    <author>
      <name>Xianping Ma</name>
    </author>
    <author>
      <name>Peifeng Ma</name>
    </author>
    <author>
      <name>Man-On Pun</name>
    </author>
    <summary type="text">High-resolution remote sensing imagery is critical for environmental monitoring, urban mapping, and land cover analysis, but its transmission is often hindered by limited bandwidth and high communication costs. Conventional pipelines transmit full-resolution pixel data, resulting in redundant and inefficient delivery. This paper proposes a text-guided remote sensing image transmission system that replaces complete high-resolution data with low-resolution images accompanied by compact textual descriptions. An onboard text generator produces spatial and semantic summaries, reducing the transmitted data volume to approximately 2% of the original size. For ground-based reconstruction, a text-conditioned image restoration model is introduced, which leverages cross-modal learning to recover fine spatial details and maintain semantic coherence. Experimental results on the Alsat-2B, UC Merced Land Use, and Aerial Image datasets demonstrate that the proposed framework achieves reconstruction PSNRs of 16.36 dB, 26.87 dB, and 27.41 dB, respectively, enabling efficient and information-preserving image transfer for remote sensing applications. The implementation will be made publicly available at https://github.com/haoyangofficial/textrssr.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/haoyangofficial/textrssr"&gt;https://github.com/haoyangofficial/textrssr&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; 15 pages, 8 figures&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; SR&lt;/p&gt;</content>
    <category term="Image and Video Processing" />
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>TERRA-CD: Multi-Temporal Framework for Multi-class and Semantic Change Detection</title>
    <link href="http://arxiv.org/abs/2605.14651v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.14651v1</id>
    <published>2026-05-14T00:00:00Z</published>
    <updated>2026-05-14T00:00:00Z</updated>
    <author>
      <name>Omkar Oak</name>
    </author>
    <author>
      <name>Rukmini Nazre</name>
    </author>
    <author>
      <name>Rujuta Budke</name>
    </author>
    <author>
      <name>Suraj Sawant</name>
    </author>
    <summary type="text">Urban vegetation monitoring plays a vital role in understanding environmental changes, yet comprehensive datasets for this purpose remain limited. To address this gap, we present the Temporal Remote-sensing Repository for Analyzing Change Detection (TERRA-CD), a benchmark dataset comprising 5,221 Sentinel-2 image pairs from 2019 and 2024, covering 232 cities across the USA and Europe. The dataset features three distinct annotation schemes: 4-class land cover mapping masks, 3-class vegetation change masks, and 13-class semantic change masks capturing all possible land cover transitions. Using various deep learning approaches including Siamese networks, STANet variants, Bi-SRNet, Changemask, Post-Classification Comparison, and HRSCD strategies, we evaluated the dataset's effectiveness for both vegetation Multi-class Change Detection and Semantic Change Detection. The proposed dataset and methods are available at https://github.com/omkarsoak/TERRA-CD.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/omkarsoak/TERRA-CD"&gt;https://github.com/omkarsoak/TERRA-CD&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; Paper presented at 11th International Congress on Information and Communication Technology (ICICT) 2026, London&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; CLS;CD&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>ArcGate: Adaptive Arctangent Gated Activation</title>
    <link href="http://arxiv.org/abs/2605.14518v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.14518v1</id>
    <published>2026-05-14T00:00:00Z</published>
    <updated>2026-05-14T00:00:00Z</updated>
    <author>
      <name>Avik Bhattacharya</name>
    </author>
    <author>
      <name>Siddhant Dnyanesh Gole</name>
    </author>
    <author>
      <name>Subhasis Chaudhuri</name>
    </author>
    <author>
      <name>Alejandro C. Frery</name>
    </author>
    <author>
      <name>Biplab Banerjee</name>
    </author>
    <summary type="text">Activation functions are central to deep networks, influencing non-linearity, feature learning, convergence, and robustness. This paper proposes the Adaptive Arctangent Gated Activation (ArcGate) function, a flexible formulation that generates a broad spectrum of activation shapes via a three-stage non-linear transformation. Unlike conventional fixed-shape activations such as ReLU, GELU, or SiLU, ArcGate uses seven learnable parameters per layer, allowing the neural network to autonomously optimize its non-linearity to the specific requirements of the feature hierarchy and data distribution. We evaluate ArcGate using ResNet-50 and Vision Transformer (ViT-B/16) architectures on three widely used remote sensing benchmarks: PatternNet, UC Merced Land Use, and the 13-band EuroSAT MSI multispectral dataset. Experimental results show that ArcGate consistently outperforms standard baselines, achieving a peak overall accuracy of 99.67% on PatternNet. Most notably, ArcGate exhibits superior structural resilience in noisy environments, maintaining a 26.65% performance lead over ReLU under moderate Gaussian noise (standard deviation 0.1). Analysis of the learned parameters reveals a depth-dependent functional evolution, where the model increases gating strength in deeper layers to enhance signal propagation. These findings suggest that ArcGate is a robust and adaptive general node activation function for high-resolution earth observation tasks.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="Computer Vision" />
    <category term="Machine Learning" />
  </entry>
  <entry>
    <title>GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding</title>
    <link href="http://arxiv.org/abs/2605.14475v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.14475v1</id>
    <published>2026-05-14T00:00:00Z</published>
    <updated>2026-05-14T00:00:00Z</updated>
    <author>
      <name>Jiashun Zhu</name>
    </author>
    <author>
      <name>Ronghao Fu</name>
    </author>
    <author>
      <name>Jiasen Hu</name>
    </author>
    <author>
      <name>Nachuan Xing</name>
    </author>
    <author>
      <name>Xu Na</name>
    </author>
    <author>
      <name>Xiao Yang</name>
    </author>
    <author>
      <name>Zhiwen Lin</name>
    </author>
    <author>
      <name>Weipeng Zhang</name>
    </author>
    <author>
      <name>Lang Sun</name>
    </author>
    <author>
      <name>Zhiheng Xue</name>
    </author>
    <author>
      <name>Haoran Liu</name>
    </author>
    <author>
      <name>Weijie Zhang</name>
    </author>
    <author>
      <name>Bo Yang</name>
    </author>
    <summary type="text">Interpreting ultra-high-resolution (UHR) remote sensing images requires models to search for sparse and tiny visual evidence across large-scale scenes. Existing remote sensing vision-language models can inspect local regions with zooming and cropping tools, but most exploration strategies follow either a one-shot focus or a single sequential trajectory. Such single-path exploration can lose global context, leave scattered regions unvisited, and revisit or count the same evidence multiple times. To this end, we propose GeoVista, a planning-driven active perception framework for UHR remote sensing interpretation. Instead of committing to one zooming path, GeoVista first builds a global exploration plan, then verifies multiple candidate regions through branch-wise local inspection, while maintaining an explicit evidence state for cross-region aggregation and de-duplication. To enable this behavior, we introduce APEX-GRO, a cold-start supervised trajectory corpus that reformulates diverse UHR tasks as Global-Region-Object interactive reasoning processes with a unified, scale-invariant spatial representation. We further design an Observe-Plan-Track mechanism for global observation, adaptive region inspection, and evidence tracking, and align the model with a GRPO-based strategy using step-wise rewards for planning, localization, and final answer correctness. Experiments on RSHR-Bench, XLRS-Bench, and LRS-VQA show that GeoVista achieves state-of-the-art performance. Code and dataset are available at https://github.com/ryan6073/GeoVista</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/ryan6073/GeoVista"&gt;https://github.com/ryan6073/GeoVista&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; VQA&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>GeoViSTA: Geospatial Vision-Tabular Transformer for Multimodal Environment Representation</title>
    <link href="http://arxiv.org/abs/2605.14406v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.14406v1</id>
    <published>2026-05-14T00:00:00Z</published>
    <updated>2026-05-14T00:00:00Z</updated>
    <author>
      <name>Yuhao Liu</name>
    </author>
    <author>
      <name>Sadeer Al-Kindi</name>
    </author>
    <author>
      <name>Ashok Veeraraghavan</name>
    </author>
    <author>
      <name>Guha Balakrishnan</name>
    </author>
    <summary type="text">Large-scale pretraining on Earth observation imagery has yielded powerful representations of the natural and built environment. However, most existing geospatial foundation models do not directly model the structured socioeconomic covariates typically stored in tabular form. This modality gap limits their ability to capture the complete total environment, which is critical for reasoning about complex environmental, social, and health-related outcomes. In this work, we propose GeoViSTA (Geospatial Vision-Tabular Transformer), a vision-tabular architecture that learns unified geospatial embeddings from co-registered gridded imagery and tabular data. GeoViSTA utilizes bilateral cross-attention to exchange spatial and semantic information across modalities, guided by a geography-aware attention mechanism that aligns continuous image patches with irregular census-tract tokens. We train GeoViSTA with a self-supervised joint masked-autoencoding objective, forcing it to recover missing image patches and tabular rows using local spatial context and cross-modal cues. Empirically, GeoViSTA's unified embeddings improve linear probing performance on high-impact downstream tasks, outperforming baselines in predicting disease-specific mortality and fire hazard frequency across held-out regions. These results demonstrate that jointly modeling the physical environment alongside structured socioeconomic context yields highly transferable representations for holistic geospatial inference.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="Machine Learning" />
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>AnyBand-Diff: A Unified Remote Sensing Image Generation and Band Repair Framework with Spectral Priors</title>
    <link href="http://arxiv.org/abs/2605.14341v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.14341v1</id>
    <published>2026-05-14T00:00:00Z</published>
    <updated>2026-05-14T00:00:00Z</updated>
    <author>
      <name>Zuopeng Zhao</name>
    </author>
    <author>
      <name>Ying Liu</name>
    </author>
    <author>
      <name>Xiaoyu Li</name>
    </author>
    <author>
      <name>Su Luo</name>
    </author>
    <author>
      <name>Lu Li</name>
    </author>
    <author>
      <name>Wenwen Liu</name>
    </author>
    <summary type="text">Existing diffusion models have made significant progress in generating realistic images. However, their direct adaptation to remote sensing imagery often disregards intrinsic physical laws. This oversight frequently leads to spectral distortion and radiometric inconsistency, severely limiting the scientific utility of generated data. To address this issue, this paper introduces AnyBand-Diff, a novel spectral-prior-guided diffusion framework tailored for robust spectral reconstruction. Specifically, we design a Masked Conditional Diffusion backbone integrated with a dual stochastic masking strategy, empowering the model to recover complete spectral information from arbitrary band subsets. Subsequently, to ensure radiometric fidelity, a Physics-Guided Sampling mechanism is proposed, leveraging gradients from a differentiable physical model to explicitly steer the denoising trajectory toward the manifold of physically plausible solutions. Furthermore, a Multi-Scale Physical Loss is formulated to enforce rigorous constraints across pixel, region, and global levels in a joint manner. Extensive experiments confirm the effectiveness of AnyBand-Diff in generating reliable imagery and achieving accurate spectral reconstruction, contributing to the advancement of physics-aware generative methods for Earth observation.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>D2-CDIG: Controlled Diffusion Remote Sensing Image Generation with Dual Priors of DEM and Cloud-Fog</title>
    <link href="http://arxiv.org/abs/2605.14326v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.14326v1</id>
    <published>2026-05-14T00:00:00Z</published>
    <updated>2026-05-14T00:00:00Z</updated>
    <author>
      <name>Zuopeng Zhao</name>
    </author>
    <author>
      <name>Ying Liu</name>
    </author>
    <author>
      <name>Kanyaphakphachsorn Pharksuwan</name>
    </author>
    <author>
      <name>Su Luo</name>
    </author>
    <author>
      <name>Xiaoyu Li</name>
    </author>
    <author>
      <name>Maocai Ning</name>
    </author>
    <summary type="text">Remote sensing image generation provides a reliable data foundation for remote sensing large models and downstream tasks. However, existing controllable remote sensing image generation methods typically rely on traditional techniques such as segmentation and edge detection, which do not fully leverage terrain or atmospheric conditions. As a result, the generated images often lack accuracy and naturalness when dealing with complex terrains and atmospheric phenomena. In this paper, we propose a novel remote sensing image generation framework, D2-CDIG, which integrates diffusion models with a dual-prior control mechanism. By incorporating both Digital Elevation Model (DEM) and cloud-fog information as dual prior knowledge, D2-CDIG precisely controls ground features and atmospheric phenomena within the generated images. Specifically, D2-CDIG decouples the terrain and atmospheric generation processes through independent control of ground and atmospheric branches. Additionally, a refined cloud-fog slider is introduced to flexibly adjust cloud thickness and distribution. During training, ground and atmospheric control signals are injected in layers to ensure a seamless transition within the images. Compared to traditional methods based on segmentation or edge detection, D2-CDIG shows significant improvements in image quality, detail richness, and realism. D2-CDIG offers a flexible and precise solution for remote sensing image generation, providing high-quality data for training large remote sensing models and downstream tasks.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; 3D&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>Quantum optical synthesis of high-dimensional ultrafast frequency-bin qudits</title>
    <link href="http://arxiv.org/abs/2605.14314v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.14314v1</id>
    <published>2026-05-14T00:00:00Z</published>
    <updated>2026-05-14T00:00:00Z</updated>
    <author>
      <name>Prasad Koviri</name>
    </author>
    <author>
      <name>Tomoya Okita</name>
    </author>
    <author>
      <name>Rina Yabumoto</name>
    </author>
    <author>
      <name>Yuta Fujihashi</name>
    </author>
    <author>
      <name>Masahiro Yabuno</name>
    </author>
    <author>
      <name>Hirotaka Terai</name>
    </author>
    <author>
      <name>Shigehito Miki</name>
    </author>
    <author>
      <name>Kali P. Nayak</name>
    </author>
    <author>
      <name>Ryosuke Shimizu</name>
    </author>
    <summary type="text">Among the photonic degrees of freedom capable of high dimensionality, frequency modes of light are one of the most promising platforms for accessing high-dimensional quantum states, enabling robust, error-tolerant, and scalable quantum optical information systems. We demonstrate engineering of precisely controlled two-photon high-dimensional states entangled in frequency through time-domain Fourier optical synthesis. We generate and convert a continuous broadband frequency-entangled state into a large range of discrete frequency bins suitable for ITU standards, with spacings ranging from 12.5 GHz to 750 GHz, and observe spectral anticorrelations over 38 frequency bins, including intra-bin pure states at a 100 GHz bin spacing. We characterize the full quantum state dimensionality via Schmidt decomposition and observe lower bounds on the frequency-binned Hilbert-space dimensionalities of at least 289, formed by two entangled qudits with dimension 17. Furthermore, we demonstrate quantum nonlocality via frequency correlations in a transmission experiment over a campus-scale two-node fiber network. This work represents a crucial step towards a versatile and relatively simple way of generating precisely controlled high-dimensional spectral qudits, with potential applications in wavelength-multiplexed quantum networks, high-dimensional information processing, communication of quantum states, and fiber-optic quantum remote sensing.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; 18 pages and 6 figures. The first two listed authors contributed equally to this work&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="quant-ph" />
    <category term="physics.optics" />
  </entry>
  <entry>
    <title>HiSem: Hierarchical Semantic Disentangling for Remote Sensing Image Change Captioning</title>
    <link href="http://arxiv.org/abs/2605.15024v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.15024v1</id>
    <published>2026-05-14T00:00:00Z</published>
    <updated>2026-05-14T00:00:00Z</updated>
    <author>
      <name>Man Wang</name>
    </author>
    <author>
      <name>Chenyang Liu</name>
    </author>
    <author>
      <name>Wenjun Li</name>
    </author>
    <author>
      <name>Feng Ni</name>
    </author>
    <author>
      <name>Bing Jia</name>
    </author>
    <author>
      <name>Baoqi Huang</name>
    </author>
    <author>
      <name>Riting Xia</name>
    </author>
    <author>
      <name>Zhenwei Shi</name>
    </author>
    <summary type="text">Remote sensing image change captioning (RSICC) aims to achieve high-level semantic understanding of genuine changes occurring between bi-temporal images. Despite notable progress, existing methods are fundamentally limited by a shared modeling assumption: changed and unchanged image pairs, which have intrinsically different semantic granularities, are processed under a unified modeling strategy. This modeling inconsistency leads to semantic entanglement between coarse-grained change existence judgment and fine-grained semantic understanding. To address the above limitation, we propose a novel hierarchical semantic disentangling network (HiSem) that explicitly disentangles semantic representations of different granularities. Specifically, we first introduce the Bidirectional Differential Attention Modulation (BDAM) module that leverages discrepancy-aware attention to enhance cross-temporal interactions, thereby amplifying true change signals while suppressing irrelevant variations. Building upon this, we design a Hierarchical Adaptive Semantic Disentanglement (HASD) module that performs adaptive routing at two hierarchical levels: a coarse-grained image-level routing mechanism distinguishes changed and unchanged image pairs, while a fine-grained token-level Mixture-of-Experts (MoE) block models diverse and heterogeneous change semantics for changed samples. Extensive experiments on two benchmark datasets demonstrate that HiSem outperforms previous methods, achieving a significant improvement of +7.52% BLEU-4 on the WHU-CDC dataset. More importantly, our approach provides a structured perspective for RSICC by explicitly aligning model design with the intrinsic semantic heterogeneity of bi-temporal scenes. The code will be available at https://github.com/Man-Wang-star/HiSem</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/Man-Wang-star/HiSem"&gt;https://github.com/Man-Wang-star/HiSem&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; IC;CD&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>ELDOR: A Dataset and Benchmark for Illegal Gold Mining in the Amazon Rainforest</title>
    <link href="http://arxiv.org/abs/2605.15397v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.15397v1</id>
    <published>2026-05-14T00:00:00Z</published>
    <updated>2026-05-14T00:00:00Z</updated>
    <author>
      <name>Kangning Cui</name>
    </author>
    <author>
      <name>Surendra Bohara</name>
    </author>
    <author>
      <name>Suraj Prasai</name>
    </author>
    <author>
      <name>Zishan Shao</name>
    </author>
    <author>
      <name>Wei Tang</name>
    </author>
    <author>
      <name>Martin Pillaca</name>
    </author>
    <author>
      <name>Edwin Flores</name>
    </author>
    <author>
      <name>Zhen Yang</name>
    </author>
    <author>
      <name>Gregory Larsen</name>
    </author>
    <author>
      <name>Evan Dethier</name>
    </author>
    <author>
      <name>David Lutz</name>
    </author>
    <author>
      <name>Jean-Michel Morel</name>
    </author>
    <author>
      <name>Miles Silman</name>
    </author>
    <author>
      <name>Victor Pauca</name>
    </author>
    <author>
      <name>Fan Yang</name>
    </author>
    <summary type="text">Illegal gold mining in the Amazon rainforest causes deforestation, water contamination, and long-term ecosystem disruption, yet remains difficult to monitor at fine spatial scales. Satellite imagery supports large-scale observation, but often misses small mining-related structures and subtle land-cover transitions, especially under frequent cloud cover. We introduce ELDOR, a large-scale UAV benchmark for monitoring environmental and landscape disturbance from illegal gold mining in the rainforest. ELDOR contains manually annotated orthomosaic imagery covering over 2,500 hectares, with pixel-level semantic labels for both mining-related activities and surrounding ecological structures. With this unified annotation source, we establish four benchmark tasks: semantic segmentation, segmentation-derived recognition, direct multi-label classification, and class-presence recognition with vision-language models. Across these tasks, we compare generic and remote-sensing-specific segmentation models, vision foundation model-related segmentation methods, direct multi-label classification methods, and vision-language models under a controlled closed-set protocol. Results show that current methods still struggle with rare small-scale mining structures and fine-grained recovery classes, suggesting the need for context-aware and multimodal modeling. To support domain analysis and practical use, we further build an interactive explorer for domain experts that provides a unified interface for data exploration and model inference.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; 70 pages, 35 figures&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Dataset&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; CLS;SEG&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>ChangeFlow -- Latent Rectified Flow for Change Detection in Remote Sensing</title>
    <link href="http://arxiv.org/abs/2605.15375v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.15375v1</id>
    <published>2026-05-14T00:00:00Z</published>
    <updated>2026-05-14T00:00:00Z</updated>
    <author>
      <name>Blaž Rolih</name>
    </author>
    <author>
      <name>Matic Fučka</name>
    </author>
    <author>
      <name>Filip Wolf</name>
    </author>
    <author>
      <name>Luka Čehovin Zajc</name>
    </author>
    <summary type="text">Remote sensing change detection (RSCD) aims to localise changes between two images of the same geographic region. In practice, change masks often follow region-level annotation conventions rather than purely local appearance differences, making them context-dependent and occasionally ambiguous. Most state-of-the-art methods utilise per-pixel discriminative classification, which produces a single prediction per input and fails to explicitly model the changed region as a coherent whole. A natural alternative is a generative formulation, which can model a distribution of plausible masks, enabling sampling to capture ambiguity and encourage global consistency. However, existing generative RSCD approaches typically lag behind strong discriminative baselines due to the high computational cost of pixel-space generation and the complexity of their conditioning mechanisms. To address the limitations of prior discriminative and generative methods, we propose ChangeFlow, a generative framework that reformulates change detection as the synthesis of a change mask in latent space via rectified flow. ChangeFlow is guided by a structured yet lightweight conditioning signal, and its stochastic design naturally supports sampling-based prediction ensembling. Namely, aggregating multiple predicted change masks improves robustness, while sample agreement provides a practical confidence estimate that highlights ambiguous regions. Across four benchmarks, ChangeFlow achieves an average F1 of 80.4%, improving by 1.3 points on average over the previous best method, while maintaining inference speed comparable to recent strong baselines. Project page: https://blaz-r.github.io/changeflow_cd</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; CLS;CD&lt;/p&gt;</content>
    <category term="Computer Vision" />
    <category term="Artificial Intelligence" />
  </entry>
  <entry>
    <title>Multimodal Object Detection Under Sparse Forest-Canopy Occlusion</title>
    <link href="http://arxiv.org/abs/2605.15326v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.15326v1</id>
    <published>2026-05-14T00:00:00Z</published>
    <updated>2026-05-14T00:00:00Z</updated>
    <author>
      <name>Nitik Jain</name>
    </author>
    <author>
      <name>Mangal Kothari</name>
    </author>
    <summary type="text">Reliable detection of humans beneath forest canopy remains a difficult remote-sensing challenge due to sparse, structured, and viewpoint-dependent occlusion. This paper presents a multimodal proof-of-concept pipeline that integrates three complementary approaches: (i) experimental evaluation of LiDAR returns through vegetation to assess the feasibility of active sensing, (ii) visible-thermal image fusion using a multi-scale transform and sparse-representation framework to enhance human saliency, and (iii) synthetic-aperture image formation via Airborne Optical Sectioning (AOS) to suppress canopy clutter. A YOLOv5 detector is fine-tuned on the Teledyne FLIR thermal dataset and evaluated on thermal and fused imagery. Results show that the tested terrestrial LiDAR configuration provides limited penetration for object-level detection, while visible-thermal fusion improves target visibility in low-contrast scenes and AOS enhances ground-plane detection in synthetic forest imagery. The fine-tuned YOLOv5 achieves a mean average precision of approximately 0.83 on the top three FLIR classes. These findings establish an initial baseline for UAV-deployable search-and-rescue and surveillance systems operating in forested environments, and motivate future work on dedicated forest datasets and real-time multimodal integration.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; OD&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>On Hallucinations in Inverse Problems: Fundamental Limits and Provable Assessment Methods</title>
    <link href="http://arxiv.org/abs/2605.13146v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.13146v1</id>
    <published>2026-05-13T00:00:00Z</published>
    <updated>2026-05-13T00:00:00Z</updated>
    <author>
      <name>David Iagaru</name>
    </author>
    <author>
      <name>Nina M. Gottschling</name>
    </author>
    <author>
      <name>Anders C. Hansen</name>
    </author>
    <author>
      <name>Josselin Garnier</name>
    </author>
    <summary type="text">Artificial intelligence (AI) has transformed imaging inverse problems, from medical diagnostics to Earth observation. Yet deep neural networks can produce hallucinations, realistic-looking but incorrect details, undermining their reliability, especially when ground truth data is unavailable. We develop a theoretical framework showing that such hallucinations are not merely artifacts of particular models, but can arise from the ill-posed nature of the inverse problem itself. We derive necessary and sufficient conditions for hallucinations, together with computable bounds on their magnitude that depend only on the forward model. Building on this theory, we introduce algorithms to: (1) estimate the minimum hallucination magnitude achievable by any reconstruction model for a given input; (2) assess the faithfulness of reconstructed details by a given reconstruction model. Experiments across three imaging tasks demonstrate that our approach applies broadly, including to modern generative models, and provides a principled way to quantify and evaluate AI hallucinations.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; 31 pages, 11 figures&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="Machine Learning (Statistics)" />
    <category term="Computer Vision; Machine Learning" />
  </entry>
  <entry>
    <title>GHGbench: A Unified Multi-Entity, Multi-Task Benchmark for Carbon Emission Prediction</title>
    <link href="http://arxiv.org/abs/2605.13743v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.13743v1</id>
    <published>2026-05-13T00:00:00Z</published>
    <updated>2026-05-13T00:00:00Z</updated>
    <author>
      <name>Yifan Duan</name>
    </author>
    <author>
      <name>Siyuan Zheng</name>
    </author>
    <author>
      <name>Lihuan Li</name>
    </author>
    <author>
      <name>Chao Xue</name>
    </author>
    <author>
      <name>Flora Salim</name>
    </author>
    <summary type="text">Open datasets and benchmarks for entity-level carbon-emission prediction remain fragmented across access, scale, granularity, and evaluation. We introduce GHGbench, an open dataset and benchmark for company- and building-level greenhouse-gas prediction. The company track contains 32,000+ company-year records from 12,000+ firms with Scope 1+2 and Scope 3 disclosures and financial/sectoral signals; the building track harmonises 491,591 building-year records from 13 open sources into a single schema across 26 metropolitan areas (10 U.S., 15 Australian, 1 Singaporean), with climate covariates and multimodal remote-sensing embeddings. GHGbench defines canonical splits with in-distribution and cross-region/city transfer as primary tasks and temporal hold-out plus short-horizon forecasting as supplementary appendix evidence; headline baselines span gradient-boosted trees, a tabular foundation model, MLP, FT-Transformer, and multimodal fusion, with an LLM panel as auxiliary, all evaluated under multi-seed paired-bootstrap tests. Three benchmark-level findings emerge: (i) building emissions are structurally harder than company emissions; (ii) the in-distribution to out-of-distribution gap dwarfs any within-model gap across both the company track and the building track, and a tabular foundation model is, to our knowledge, the first baseline to open a paired-bootstrap-significant gap over tuned trees on a multi-city building-emissions task; (iii) multimodal remote-sensing embeddings help precisely where tabular generalisation breaks. GHGbench also exposes catastrophic city transfer and the sector-factor lookup ceiling as systematic failure modes. Code and reconstruction recipes are available at GHGbench.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Dataset&lt;/p&gt;</content>
    <category term="Machine Learning" />
  </entry>
  <entry>
    <title>High-order mid-infrared nonlinear topological differentiator</title>
    <link href="http://arxiv.org/abs/2605.13541v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.13541v1</id>
    <published>2026-05-13T00:00:00Z</published>
    <updated>2026-05-13T00:00:00Z</updated>
    <author>
      <name>Jixi Zhang</name>
    </author>
    <author>
      <name>Kun Huang</name>
    </author>
    <author>
      <name>Shina Liao</name>
    </author>
    <author>
      <name>Zhuohang Wei</name>
    </author>
    <author>
      <name>Jianan Fang</name>
    </author>
    <author>
      <name>Heping Zeng</name>
    </author>
    <summary type="text">High-order edge-enhanced imaging enables precise feature localization and effective background suppression, offering a powerful tool for real-time recognition and high-contrast visualization. Extending this capability to the mid-infrared (MIR) regime is particularly valuable for applications such as biomedical diagnostics, material inspection, and remote sensing, yet remains limited by inadequate spatial-frequency modulation fidelity and low detection sensitivity. Here, we demonstrate a high-sensitivity MIR upconversion differentiator operating at 3 μm, which achieves isotropic high-order edge enhancement by optically imprinting topological complex-amplitude patterns onto MIR Fourier components via nonlinear parametric interaction. Vortex transfer functions $t(k_r, \varphi) \propto k_r^{\ell} e^{i\ell\varphi}$ are precisely encoded on a phase-only spatial light modulator to enable tunable MIR differentiation from first to fourth order, with real-time switching at up to 60 Hz. Benefiting from a low-noise upconversion process and a single-photon-sensitive silicon camera, the system achieves high-contrast edge imaging under low-light conditions. Experimental results confirm accurate edge extraction and background suppression for both amplitude and phase objects, hence underscoring its potential for noninvasive diagnostics and label-free material analysis.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="physics.optics" />
  </entry>
  <entry>
    <title>RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents</title>
    <link href="http://arxiv.org/abs/2605.13391v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.13391v1</id>
    <published>2026-05-13T00:00:00Z</published>
    <updated>2026-05-13T00:00:00Z</updated>
    <author>
      <name>Liangtian Liu</name>
    </author>
    <author>
      <name>Zeyuan Wang</name>
    </author>
    <author>
      <name>Ziyu Li</name>
    </author>
    <author>
      <name>Kai Ouyang</name>
    </author>
    <author>
      <name>Zichao Tang</name>
    </author>
    <author>
      <name>Chengfu Liu</name>
    </author>
    <author>
      <name>Haifeng Li</name>
    </author>
    <author>
      <name>Hanwen Yu</name>
    </author>
    <author>
      <name>Wentao Yang</name>
    </author>
    <author>
      <name>Cheng Yang</name>
    </author>
    <author>
      <name>Dongyang Hou</name>
    </author>
    <summary type="text">The rise of multi-modal large language models (MLLMs) is shifting remote sensing (RS) intelligence from "seeing" to "acting", as OpenClaw-style frameworks enable agents to autonomously operate massive RS image-processing tools for complex tasks. Existing RS agents adopt a passive selection paradigm for tool invocation, relying on either full tool registration (Flat) or retrieval-augmented generation (RAG). However, in the massive and multi-source heterogeneous RS tool ecosystem, such passive mechanisms struggle to dynamically balance "context load" and "toolset completeness" throughout task reasoning, thus exhibiting inherent limitations: full tool registration exhausts the context space during long-horizon tasks, whereas RAG retrieval may omit critical tools in essential steps. To overcome these bottlenecks, this paper redefines tool selection by arguing that the agent should act as an active explorer within the tool space. Based on this perspective, we propose RS-Claw, a novel RS agent architecture. By leveraging Skill encapsulation technology at the tool end, this architecture hierarchically structures tool descriptions, enabling the agent to execute on-demand sequential decision-making: initially selecting relevant skill branches by reading only tool summaries, then dynamically loading detailed descriptions, and ultimately achieving precise invocation. This active paradigm not only frees much of the agent's context space but also maintains a high hit rate for critical tools during long-horizon reasoning. Systematic experiments on the Earth-Bench benchmark demonstrate that RS-Claw's active exploration mechanism effectively filters semantic noise and substantially frees up reasoning space, achieving an input token compression ratio of up to 86%, and comprehensively outperforming existing Flat and RAG baselines across complex reasoning evaluations.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="Artificial Intelligence" />
  </entry>
  <entry>
    <title>Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations</title>
    <link href="http://arxiv.org/abs/2605.11633v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.11633v1</id>
    <published>2026-05-12T00:00:00Z</published>
    <updated>2026-05-12T00:00:00Z</updated>
    <author>
      <name>Junjue Wang</name>
    </author>
    <author>
      <name>Weihao Xuan</name>
    </author>
    <author>
      <name>Heli Qi</name>
    </author>
    <author>
      <name>Pengyu Dai</name>
    </author>
    <author>
      <name>Kunyi Liu</name>
    </author>
    <author>
      <name>Hongruixuan Chen</name>
    </author>
    <author>
      <name>Zhuo Zheng</name>
    </author>
    <author>
      <name>Junshi Xia</name>
    </author>
    <author>
      <name>Stefano Ermon</name>
    </author>
    <author>
      <name>Naoto Yokoya</name>
    </author>
    <summary type="text">Operational disaster response goes beyond damage assessment, requiring responders to integrate multi-sensor signals, reason over road networks, populations and key facilities, plan evacuations, and produce actionable reports. However, prior work largely isolates remote-sensing perception or evaluates generic tool use, leaving the end-to-end workflows of emergency operations underexplored. In this paper, we introduce Disaster Operational Response Agent benchmark (DORA), the first agentic benchmark for end-to-end disaster response: 515 expert-authored tasks across 45 real-world disaster events spanning 10 types, paired with expert-verified, replayable gold trajectories totaling 3,500 tool-call steps. Tasks span five dimensions that cover the operational disaster-response pipeline: disaster perception, spatial relational analysis, rescue and evacuation planning, temporal evolution reasoning, and multi-modal report synthesis. Agents compose calls from a 108-tool MCP library over heterogeneous geospatial data: optical, SAR, and multi-spectral imagery across single-, bi-, and multi-temporal sequences (0.015-10m GSD), complemented by elevation and social vector layers. We comprehensively evaluate 13 frontier LLMs on our benchmark, revealing three persistent challenges: 1) disaster-domain grounding exposes unique failure modes (damage-semantic grounding, sensor-modality mismatch, and disaster-pipeline composition); 2) agents are doubly bottlenecked by tool selection and argument grounding, where gold tool-order hints improve accuracy by only 1.08-4.40%, and alternative scaffolds yield at most a 3.24% gain; 3) compositional fragility scales with trajectory length, the agent-to-gold gap widening from 7% to 56% on long pipelines. DORA establishes a rigorous testbed for operationally reliable disaster-response agents.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; DORA stress-tests LLM agents on real-world disaster operations that demand comprehensive orchestration of 108 specialized tools over heterogeneous geospatial data&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; VG&lt;/p&gt;</content>
    <category term="Artificial Intelligence" />
  </entry>
  <entry>
    <title>GeoR-Bench: Evaluating Geoscience Visual Reasoning</title>
    <link href="http://arxiv.org/abs/2605.11541v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.11541v1</id>
    <published>2026-05-12T00:00:00Z</published>
    <updated>2026-05-12T00:00:00Z</updated>
    <author>
      <name>Yushuo Zheng</name>
    </author>
    <author>
      <name>Zicheng Zhang</name>
    </author>
    <author>
      <name>Huiyu Duan</name>
    </author>
    <author>
      <name>Chunyi Li</name>
    </author>
    <author>
      <name>Zijian Chen</name>
    </author>
    <author>
      <name>Ziheng Jia</name>
    </author>
    <author>
      <name>Yue Shi</name>
    </author>
    <author>
      <name>Ke Gu</name>
    </author>
    <author>
      <name>Xiongkuo Min</name>
    </author>
    <author>
      <name>Guangtao Zhai</name>
    </author>
    <summary type="text">Geoscience intelligence is expected to understand, reason about, and predict earth system changes to support human decision-making in critical domains such as disaster response, climate adaptation and environmental protection. Although current research has shown promising progress on specific geoscience tasks, such as remote sensing interpretation and geographic question answering, existing benchmarks remain largely task-specific and fail to capture open-ended, real-world geoscience problems. As a result, it remains unclear how far current AI systems are from achieving genuine geoscience intelligence. To address this gap, we present GeoR-Bench, a Benchmark for evaluating Geoscience visual Reasoning through reasoning-informed visual editing tasks. GeoR-Bench contains 440 curated samples spanning 6 geoscience categories and 24 task types, covering earth observation imagery and structured scientific representations such as maps and diagrams. We evaluate outputs along three dimensions: reasoning, consistency, and quality. Benchmark results of 21 closed- and open-source multimodal models reveal that geoscience reasoning remains a critical bottleneck. The highest-performing model achieves 42.7% overall strict accuracy, while the best open-source model reaches only 10.3%. Notably, the visual consistency and image quality of the outputs frequently surpass their scientific accuracy. Ultimately, these findings indicate that current models generate superficially plausible results but fail to capture underlying earth science processes.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs</title>
    <link href="http://arxiv.org/abs/2605.12237v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.12237v1</id>
    <published>2026-05-12T00:00:00Z</published>
    <updated>2026-05-12T00:00:00Z</updated>
    <author>
      <name>Shuo Ni</name>
    </author>
    <author>
      <name>Tong Wang</name>
    </author>
    <author>
      <name>Jing Zhang</name>
    </author>
    <author>
      <name>He Chen</name>
    </author>
    <author>
      <name>Haonan Guo</name>
    </author>
    <author>
      <name>Ning Zhang</name>
    </author>
    <author>
      <name>Bo Du</name>
    </author>
    <summary type="text">Vision-Language Models (VLMs) increasingly operate on ultra-high-resolution (UHR) Earth observation imagery, yet they remain vulnerable to a severe scale mismatch between large-scale scene context and micro-scale targets. We refer to this empirical gap as a "resolution illusion": higher input resolution provides the appearance of richer visual detail, but does not necessarily yield reliable perception of spatially small, task-relevant evidence. To benchmark this challenge, we introduce UHR-Micro, a benchmark comprising 11,253 instructions grounded in 1,212 UHR images, designed to evaluate VLMs at the spatial limits of native Earth observation imagery. UHR-Micro spans diverse micro-target scales, context requirements, task families, and visual conditions, and provides diagnostic annotations that support controlled evaluation and fine-grained error attribution. Experiments with representative high-resolution VLMs show substantial failures in spatial grounding and evidence parsing, despite access to high-resolution inputs. Further analysis suggests that these failures are not fully resolved by increasing model capacity, but are closely tied to insufficient guidance in locating and using task-relevant micro-evidence. Motivated by this finding, we propose Micro-evidence Active Perception (MAP), a reference agent that decomposes queries into evidence-seeking steps, actively inspects candidate regions, and grounds its answers in localized observations. MAP-Agent improves micro-level perception by making high-resolution reasoning evidence-centered rather than image-centered. Together, UHR-Micro and MAP-Agent provide a diagnostic platform for evaluating, understanding, and advancing high-resolution reasoning in Earth observation VLMs. Datasets and source code were released at https://github.com/MiliLab/UHR-Micro.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/MiliLab/UHR-Micro"&gt;https://github.com/MiliLab/UHR-Micro&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; VG&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>TAR: Text Semantic Assisted Cross-modal Image Registration Framework for Optical and SAR Images</title>
    <link href="http://arxiv.org/abs/2605.12064v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.12064v1</id>
    <published>2026-05-12T00:00:00Z</published>
    <updated>2026-05-12T00:00:00Z</updated>
    <author>
      <name>Zhuoyu Cai</name>
    </author>
    <author>
      <name>Dou Quan</name>
    </author>
    <author>
      <name>Ning Huyan</name>
    </author>
    <author>
      <name>Pei He</name>
    </author>
    <author>
      <name>Shuang Wang</name>
    </author>
    <author>
      <name>Licheng Jiao</name>
    </author>
    <summary type="text">Existing deep learning-based methods can capture shared features from optical and synthetic aperture radar (SAR) images for spatial alignment. However, optical-SAR registration remains challenging under large geometric deformations, because the model needs to simultaneously handle cross-modal appearance discrepancies and complex spatial transformations. To address this issue, this paper proposes a text semantic-assisted cross-modal image registration framework, named TAR, for optical and SAR images. TAR exploits text semantic priors from remote sensing scenes and land-cover categories to alleviate the modality gap and enhance cross-modal feature learning. TAR consists of three components: a multi-scale visual feature learning (MSFL) module, a text-assisted feature enhancement (TAFE) module, and a coarse-to-fine dense matching (CFDM) module. MSFL extracts multi-scale visual features from optical and SAR images. TAFE constructs text descriptors related to remote sensing scenes and land-cover objects, and uses a frozen RemoteCLIP text encoder to extract text features. These text features are introduced through visual-text interaction to enhance high-level visual features for more reliable coarse matching. CFDM then establishes coarse correspondences based on the enhanced high-level features and refines the matched locations using low-level features. Experimental results on cross-modal remote sensing images demonstrate the effectiveness of TAR, which achieves stronger matching performance than several state-of-the-art methods and yields significant gains under large geometric deformations.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>No One Knows the State of the Art in Geospatial Foundation Models</title>
    <link href="http://arxiv.org/abs/2605.12678v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2605.12678v1</id>
    <published>2026-05-12T00:00:00Z</published>
    <updated>2026-05-12T00:00:00Z</updated>
    <author>
      <name>Isaac Corley</name>
    </author>
    <author>
      <name>Nils Lehmann</name>
    </author>
    <author>
      <name>Caleb Robinson</name>
    </author>
    <author>
      <name>Gabriel Tseng</name>
    </author>
    <author>
      <name>Anthony Fuller</name>
    </author>
    <author>
      <name>Hamed Alemohammad</name>
    </author>
    <author>
      <name>Evan Shelhamer</name>
    </author>
    <author>
      <name>Jennifer Marcus</name>
    </author>
    <author>
      <name>Hannah Kerner</name>
    </author>
    <summary type="text">Geospatial foundation models (GFMs) have been proposed as generalizable backbones for disaster response, land-cover mapping, food-security monitoring, and other high-stakes Earth-observation tasks. Yet the published work about these models does not give reviewers or users enough information to tell which model fits a given task. We argue that nobody knows what the current state of the art is in geospatial foundation models. The methods may be useful, but the GFM literature does not standardize evaluations, training and testing protocols, released weights, or pretraining controls well enough for anyone to compare or rank them. In a 152-paper audit, we find 46 cross-paper disagreements of at least 10 points for the same model, benchmark, and protocol; 94/126 papers with extractable pretraining data use a configuration no other paper uses; and 39% of GFM papers release no model weights. This lack of community standards can be solved. We propose six concrete expectations: named-license weight release, shared core evaluations, copied-versus-rerun baseline annotations, variance reporting, one shared evaluation harness, and data-vs-architecture-vs-algorithm controls. These gaps are a coordination failure, not a fault of any individual lab; the authors of this paper, like many others in the GFM community, have contributed to them. Rather than just critiquing the community, we aim to provide concrete steps toward a shared understanding of how to innovate GFMs.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="Computer Vision" />
    <category term="cs.CY" />
  </entry>
</feed>