<?xml version='1.0' encoding='utf-8'?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>RS-Paper-Hub — All Papers</title>
  <id>https://rspaper.top/output/feed.xml</id>
  <link href="https://rspaper.top/output/feed.xml" rel="self" type="application/atom+xml" />
  <link href="https://rspaper.top" rel="alternate" type="text/html" />
  <updated>2026-04-20T10:19:45Z</updated>
  <subtitle>Latest remote sensing papers (last 7 days) — 19 entries</subtitle>
  <author>
    <name>RS-Paper-Hub</name>
    <uri>https://rspaper.top</uri>
  </author>
  <entry>
    <title>From Articles to Canopies: Knowledge-Driven Pseudo-Labelling for Tree Species Classification using LLM Experts</title>
    <link href="http://arxiv.org/abs/2604.16115v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.16115v1</id>
    <published>2026-04-17T00:00:00Z</published>
    <updated>2026-04-17T00:00:00Z</updated>
    <author>
      <name>Michał Romaszewski</name>
    </author>
    <author>
      <name>Dominik Kopeć</name>
    </author>
    <author>
      <name>Michał Cholewa</name>
    </author>
    <author>
      <name>Katarzyna Kołodziej</name>
    </author>
    <author>
      <name>Przemysław Głomb</name>
    </author>
    <author>
      <name>Jan Niedzielko</name>
    </author>
    <author>
      <name>Jakub Charyton</name>
    </author>
    <author>
      <name>Justyna Wylazłowska</name>
    </author>
    <author>
      <name>Anna Jarocińska</name>
    </author>
    <summary type="text">Hyperspectral tree species classification is challenging due to limited and imbalanced class labels, spectral mixing (overlapping light signatures from multiple species), and ecological heterogeneity (variability among ecological systems). Addressing these challenges requires methods that integrate biological and structural characteristics of vegetation, such as canopy architecture and interspecific interactions, rather than relying solely on spectral signatures. This paper presents a biologically informed, semi-supervised deep learning method that integrates multi-sensor Earth observation data, specifically hyperspectral imaging (HSI) and airborne laser scanning (ALS), with expert ecological knowledge. The approach relies on biologically inspired pseudo-labelling over a precomputed canopy graph, yielding accurate classification at low training cost. In addition, ecological priors on species cohabitation are automatically derived from reliable sources using large language models (LLMs) and encoded as a cohabitation matrix with likelihoods of species occurring together. These priors are incorporated into the pseudo-labelling strategy, effectively introducing expert knowledge into the model. Experiments on a real-world forest dataset demonstrate a 5.6% improvement over the best reference method. Expert evaluation of cohabitation priors reveals high accuracy, with differences no larger than 15%.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; CLS&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>Robust Multispectral Semantic Segmentation under Missing or Full Modalities via Structured Latent Projection</title>
    <link href="http://arxiv.org/abs/2604.15856v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.15856v1</id>
    <published>2026-04-17T00:00:00Z</published>
    <updated>2026-04-17T00:00:00Z</updated>
    <author>
      <name>Irem Ulku</name>
    </author>
    <author>
      <name>Erdem Akagündüz</name>
    </author>
    <author>
      <name>Ömer Özgür Tanrıöver</name>
    </author>
    <summary type="text">Multimodal remote sensing data provide complementary information for semantic segmentation, but in real-world deployments, some modalities may be unavailable due to sensor failures, acquisition issues, or challenging atmospheric conditions. Existing multimodal segmentation models typically address missing modalities by learning a shared representation across inputs. However, this approach can introduce a trade-off by compromising modality-specific complementary information and reducing performance when all modalities are available. In this paper, we tackle this limitation with CBC-SLP, a multimodal semantic segmentation model designed to preserve both modality-invariant and modality-specific information. Inspired by theoretical results on modality alignment, which state that perfectly aligned multimodal representations can lead to sub-optimal performance in downstream prediction tasks, we propose a novel structured latent projection approach as an architectural inductive bias. Rather than enforcing this strategy through a loss term, we incorporate it directly into the architecture. In particular, to use the complementary information effectively while maintaining robustness under random modality dropout, we structure the latent representations into shared and modality-specific components and adaptively transfer them to the decoder according to the random modality availability mask. Extensive experiments on three multimodal remote sensing image sets demonstrate that CBC-SLP consistently outperforms state-of-the-art multimodal models across full and missing modality scenarios. Moreover, we empirically demonstrate that the proposed strategy can recover complementary information that may not be preserved in a shared representation. The code is available at https://github.com/iremulku/Multispectral-Semantic-Segmentation-via-Structured-Latent-Projection-CBC-SLP-.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/iremulku/Multispectral-Semantic-Segmentation-via-Structured-Latent-Projection-CBC-SLP-"&gt;https://github.com/iremulku/Multispectral-Semantic-Segmentation-via-Structured-Latent-Projection-CBC-SLP-&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; 15 pages, 7 figures&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; SEG&lt;/p&gt;</content>
    <category term="Computer Vision" />
    <category term="Artificial Intelligence" />
  </entry>
  <entry>
    <title>SSFT: A Lightweight Spectral-Spatial Fusion Transformer for Generic Hyperspectral Classification</title>
    <link href="http://arxiv.org/abs/2604.15828v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.15828v1</id>
    <published>2026-04-17T00:00:00Z</published>
    <updated>2026-04-17T00:00:00Z</updated>
    <author>
      <name>Alexander Musiat</name>
    </author>
    <author>
      <name>Nikolas Ebert</name>
    </author>
    <author>
      <name>Oliver Wasenmüller</name>
    </author>
    <summary type="text">Hyperspectral imaging enables fine-grained recognition of materials by capturing rich spectral signatures, but learning robust classifiers is challenging due to high dimensionality, spectral redundancy, limited labeled data, and strong domain shifts. Beyond earth observation, labeled HSI data is often scarce and imbalanced, motivating compact models for generic hyperspectral classification across diverse acquisition regimes. We propose the lightweight Spectral-Spatial Fusion Transformer (SSFT), which factorizes representation learning into spectral and spatial pathways and integrates them via cross-attention to capture complementary wavelength-dependent and structural information. We evaluate our SSFT on the challenging HSI-Benchmark, a heterogeneous multi-dataset benchmark covering earth observation, fruit condition assessment, and fine-grained material recognition. SSFT achieves state-of-the-art overall performance, ranking first while using less than 2% of the parameters of the previous leading method. We further evaluate transfer to the substantially larger SpectralEarth benchmark under the official protocol, where SSFT remains competitive despite its compact size. Ablation studies show that both spectral and spatial pathways are crucial, with spatial modeling contributing most, and that SSFT remains robust without data augmentation.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; CLS&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation</title>
    <link href="http://arxiv.org/abs/2604.15670v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.15670v1</id>
    <published>2026-04-17T00:00:00Z</published>
    <updated>2026-04-17T00:00:00Z</updated>
    <author>
      <name>Shuyan Ke</name>
    </author>
    <author>
      <name>Yifan Mei</name>
    </author>
    <author>
      <name>Changli Wu</name>
    </author>
    <author>
      <name>Yonghan Zheng</name>
    </author>
    <author>
      <name>Jiayi Ji</name>
    </author>
    <author>
      <name>Liujuan Cao</name>
    </author>
    <author>
      <name>Rongrong Ji</name>
    </author>
    <summary type="text">Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data poses distinct challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these issues, we formally define the UAV Reasoning Segmentation task and organize its semantic requirements into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, a large-scale benchmark for UAV reasoning segmentation, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision across all three reasoning types. As a benchmark companion, we introduce PixDLM, a simple yet effective pixel-level multimodal language model that serves as a unified baseline for this task. Experiments on DRSeg establish strong baseline results and highlight the unique challenges of UAV reasoning segmentation, providing a solid foundation for future research.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; CVPR 2026&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>Towards Realistic Open-Vocabulary Remote Sensing Segmentation: Benchmark and Baseline</title>
    <link href="http://arxiv.org/abs/2604.15652v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.15652v1</id>
    <published>2026-04-17T00:00:00Z</published>
    <updated>2026-04-17T00:00:00Z</updated>
    <author>
      <name>Bingyu Li</name>
    </author>
    <author>
      <name>Tao Huo</name>
    </author>
    <author>
      <name>Haocheng Dong</name>
    </author>
    <author>
      <name>Da Zhang</name>
    </author>
    <author>
      <name>Zhiyuan Zhao</name>
    </author>
    <author>
      <name>Junyu Gao</name>
    </author>
    <author>
      <name>Xuelong Li</name>
    </author>
    <summary type="text">Open-vocabulary remote sensing image segmentation (OVRSIS) remains underexplored due to fragmented datasets, limited training diversity, and the lack of evaluation benchmarks that reflect realistic geospatial application demands. Our previous OVRSISBenchV1 established an initial cross-dataset evaluation protocol, but its limited scope is insufficient for assessing realistic open-world generalization. To address this issue, we propose OVRSISBenchV2, a large-scale and application-oriented benchmark for OVRSIS. We first construct OVRSIS95K, a balanced dataset of about 95K image-mask pairs covering 35 common semantic categories across diverse remote sensing scenes. Built upon OVRSIS95K and 10 downstream datasets, OVRSISBenchV2 contains 170K images and 128 categories, substantially expanding scene diversity, semantic coverage, and evaluation difficulty. Beyond standard open-vocabulary segmentation, it further includes downstream protocols for building extraction, road extraction, and flood detection, thereby better reflecting realistic geospatial application demands and complex deployment scenarios. We also propose Pi-Seg, a baseline for OVRSIS. Pi-Seg improves transferability through a positive-incentive noise mechanism, where learnable and semantically guided perturbations broaden the visual-text feature space during training. Extensive experiments on OVRSISBenchV1, OVRSISBenchV2, and downstream tasks show that Pi-Seg delivers strong and consistent results, particularly on the more challenging OVRSISBenchV2 benchmark. Our results highlight both the importance of realistic benchmark design and the effectiveness of perturbation-based transfer for OVRSIS. The code and datasets are available at https://github.com/LiBingyu01/RSKT-Seg/tree/Pi-Seg.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/LiBingyu01/RSKT-Seg/tree/Pi-Seg"&gt;https://github.com/LiBingyu01/RSKT-Seg/tree/Pi-Seg&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Dataset&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>Physically-Induced Atmospheric Adversarial Perturbations: Enhancing Transferability and Robustness in Remote Sensing Image Classification</title>
    <link href="http://arxiv.org/abs/2604.14643v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.14643v1</id>
    <published>2026-04-16T00:00:00Z</published>
    <updated>2026-04-16T00:00:00Z</updated>
    <author>
      <name>Weiwei Zhuang</name>
    </author>
    <author>
      <name>Wangze Xie</name>
    </author>
    <author>
      <name>Qi Zhang</name>
    </author>
    <author>
      <name>Xia Du</name>
    </author>
    <author>
      <name>Zihan Lin</name>
    </author>
    <author>
      <name>Zheng Lin</name>
    </author>
    <author>
      <name>Hanlin Cai</name>
    </author>
    <author>
      <name>Jizhe Zhou</name>
    </author>
    <author>
      <name>Zihan Fang</name>
    </author>
    <author>
      <name>Chi-man Pun</name>
    </author>
    <author>
      <name>Wei Ni</name>
    </author>
    <author>
      <name>Jun Luo</name>
    </author>
    <summary type="text">Adversarial attacks pose a severe threat to the reliability of deep learning models in remote sensing (RS) image classification. Most existing methods rely on direct pixel-wise perturbations, failing to exploit the inherent atmospheric characteristics of RS imagery or survive real-world image degradations. In this paper, we propose FogFool, a physically plausible adversarial framework that generates fog-based perturbations by iteratively optimizing atmospheric patterns based on Perlin noise. By modeling fog formations with natural, irregular structures, FogFool generates adversarial examples that are not only visually consistent with authentic RS scenes but also deceptive. By leveraging the spatial coherence and mid-to-low-frequency nature of atmospheric phenomena, FogFool embeds adversarial information into structural features shared across diverse architectures. Extensive experiments on two benchmark RS datasets demonstrate that FogFool achieves superior performance: it not only excels in white-box settings but also exhibits exceptional black-box transferability (reaching 83.74% TASR) and robustness against common preprocessing-based defenses such as JPEG compression and filtering. Detailed analyses, including confusion matrices and Class Activation Map (CAM) visualizations, reveal that our atmospheric-driven perturbations induce a universal shift in model attention. These results indicate that FogFool represents a practical, stealthy, and highly persistent threat to RS classification systems, providing a robust benchmark for evaluating model reliability in complex environments.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; 14 pages, 11 figures&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; CLS&lt;/p&gt;</content>
    <category term="Computer Vision" />
    <category term="Machine Learning" />
  </entry>
  <entry>
    <title>OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism</title>
    <link href="http://arxiv.org/abs/2604.14762v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.14762v1</id>
    <published>2026-04-16T00:00:00Z</published>
    <updated>2026-04-16T00:00:00Z</updated>
    <author>
      <name>Jordan Shipard</name>
    </author>
    <author>
      <name>Arnold Wiliem</name>
    </author>
    <author>
      <name>Kien Nguyen Thanh</name>
    </author>
    <author>
      <name>Wei Xiang</name>
    </author>
    <author>
      <name>Clinton Fookes</name>
    </author>
    <summary type="text">Generalized Category Discovery (GCD) challenges methods to identify known and novel classes using partially labeled data, mirroring human category learning. Unlike prior GCD methods, which operate within a single modality and require dataset-specific fine-tuning, we propose a modality-agnostic GCD approach inspired by the human brain's abstract category formation. Our OmniGCD leverages modality-specific encoders (e.g., vision, audio, text, remote sensing) to process inputs, followed by dimension reduction to construct a GCD latent space, which is transformed at test-time into a representation better suited for clustering using a novel synthetically trained Transformer-based model. To evaluate OmniGCD, we introduce a zero-shot GCD setting where no dataset-specific fine-tuning is allowed, enabling modality-agnostic category discovery. Trained once on synthetic data, OmniGCD performs zero-shot GCD across 16 datasets spanning four modalities, improving classification accuracy for known and novel classes over baselines (average percentage point improvements of +6.2, +17.9, +1.5, and +12.7 for vision, text, audio, and remote sensing). This highlights the importance of strong encoders while decoupling representation learning from category discovery. Improving modality-agnostic methods will propagate across modalities, enabling encoder development independent of GCD. Our work serves as a benchmark for future modality-agnostic GCD works, paving the way for scalable, human-inspired category discovery. All code is available at https://github.com/Jordan-HS/OmniGCD.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/Jordan-HS/OmniGCD"&gt;https://github.com/Jordan-HS/OmniGCD&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; CVPR 2026&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; CLS&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>Building Extraction from Remote Sensing Imagery under Hazy and Low-light Conditions: Benchmark and Baseline</title>
    <link href="http://arxiv.org/abs/2604.15088v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.15088v1</id>
    <published>2026-04-16T00:00:00Z</published>
    <updated>2026-04-16T00:00:00Z</updated>
    <author>
      <name>Feifei Sang</name>
    </author>
    <author>
      <name>Wei Lu</name>
    </author>
    <author>
      <name>Hongruixuan Chen</name>
    </author>
    <author>
      <name>Sibao Chen</name>
    </author>
    <author>
      <name>Bin Luo</name>
    </author>
    <summary type="text">Building extraction from optical Remote Sensing (RS) imagery suffers from performance degradation under real-world hazy and low-light conditions. However, existing optical methods and benchmarks focus primarily on ideal clear-weather conditions. While SAR offers all-weather sensing, its side-looking geometry causes geometric distortions. To address these challenges, we introduce HaLoBuilding, the first optical benchmark specifically designed for building extraction under hazy and low-light conditions. By leveraging a same-scene multitemporal pairing strategy, we ensure pixel-level label alignment and high fidelity even under extreme degradation. Building upon this benchmark, we propose HaLoBuild-Net, a novel end-to-end framework for building extraction in adverse RS scenarios. At its core, we develop a Spatial-Frequency Focus Module (SFFM) to effectively mitigate meteorological interference on building features by coupling large receptive field attention with frequency-aware channel reweighting guided by stable low-frequency anchors. Additionally, a Global Multi-scale Guidance Module (GMGM) provides global semantic constraints to anchor building topologies, while a Mutual-Guided Fusion Module (MGFM) implements bidirectional semantic-spatial calibration to suppress shallow noise and sharpen weather-induced blurred boundaries. Extensive experiments demonstrate that HaLoBuild-Net significantly outperforms state-of-the-art methods and conventional cascaded restoration-segmentation paradigms on the HaLoBuilding dataset, while maintaining robust generalization on WHU, INRIA, and LoveDA datasets. The source code and datasets are publicly available at: https://github.com/AeroVILab-AHU/HaLoBuilding.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/AeroVILab-AHU/HaLoBuilding"&gt;https://github.com/AeroVILab-AHU/HaLoBuilding&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; 14 pages, 12 figures&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Dataset&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing</title>
    <link href="http://arxiv.org/abs/2604.13565v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.13565v1</id>
    <published>2026-04-15T00:00:00Z</published>
    <updated>2026-04-15T00:00:00Z</updated>
    <author>
      <name>Yunkai Dang</name>
    </author>
    <author>
      <name>Minxin Dai</name>
    </author>
    <author>
      <name>Yuekun Yang</name>
    </author>
    <author>
      <name>Zhangnan Li</name>
    </author>
    <author>
      <name>Wenbin Li</name>
    </author>
    <author>
      <name>Feng Miao</name>
    </author>
    <author>
      <name>Yang Gao</name>
    </author>
    <summary type="text">Ultra-high-resolution (UHR) remote sensing imagery couples kilometer-scale context with query-critical evidence that may occupy only a few pixels. Such vast spatial scale leads to a quadratic explosion of visual tokens and hinders the extraction of information from small objects. Previous works utilize direct downsampling, dense tiling, or global top-k pruning, which either compromise query-critical image details or incur unpredictable compute. In this paper, we propose UHR-BAT, a query-guided and region-faithful token compression framework to efficiently select visual tokens under a strict context budget. Specifically, we leverage text-guided, multi-scale importance estimation for visual tokens, effectively tackling the challenge of achieving precise yet low-cost feature extraction. Furthermore, by introducing region-wise preserve and merge strategies, we mitigate visual token redundancy, further driving down the computational budget. Experimental results show that UHR-BAT achieves state-of-the-art performance across various benchmarks. Code will be available at https://github.com/Yunkaidang/UHR.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/Yunkaidang/UHR"&gt;https://github.com/Yunkaidang/UHR&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="Computer Vision" />
    <category term="Artificial Intelligence" />
  </entry>
  <entry>
    <title>Remote Sensing Image Super-Resolution for Imbalanced Textures: A Texture-Aware Diffusion Framework</title>
    <link href="http://arxiv.org/abs/2604.13994v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.13994v1</id>
    <published>2026-04-15T00:00:00Z</published>
    <updated>2026-04-15T00:00:00Z</updated>
    <author>
      <name>Enzhuo Zhang</name>
    </author>
    <author>
      <name>Sijie Zhao</name>
    </author>
    <author>
      <name>Dilxat Muhtar</name>
    </author>
    <author>
      <name>Zhenshi Li</name>
    </author>
    <author>
      <name>Xueliang Zhang</name>
    </author>
    <author>
      <name>Pengfeng Xiao</name>
    </author>
    <summary type="text">Generative diffusion priors have recently achieved state-of-the-art performance in natural image super-resolution, demonstrating a powerful capability to synthesize photorealistic details. However, their direct application to remote sensing image super-resolution (RSISR) reveals significant shortcomings. Unlike natural images, remote sensing images exhibit a unique texture distribution where ground objects are globally stochastic yet locally clustered, leading to highly imbalanced textures. This imbalance severely hinders the model's spatial perception. To address this, we propose TexADiff, a novel framework that begins by estimating a Relative Texture Density Map (RTDM) to represent the texture distribution. TexADiff then leverages this RTDM in three synergistic ways: as an explicit spatial conditioning to guide the diffusion process, as a loss modulation term to prioritize texture-rich regions, and as a dynamic adapter for the sampling schedule. These modifications are designed to endow the model with explicit texture-aware capabilities. Experiments demonstrate that TexADiff achieves superior or competitive quantitative metrics. Furthermore, qualitative results show that our model generates faithful high-frequency details while effectively suppressing texture hallucinations. This improved reconstruction quality also results in significant gains in downstream task performance. The source code of our method can be found at https://github.com/ZezFuture/TexAdiff.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/ZezFuture/TexAdiff"&gt;https://github.com/ZezFuture/TexAdiff&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; 10 pages, 5 figures&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; SR&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models</title>
    <link href="http://arxiv.org/abs/2604.14044v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.14044v1</id>
    <published>2026-04-15T00:00:00Z</published>
    <updated>2026-04-15T00:00:00Z</updated>
    <author>
      <name>Xiaohe Li</name>
    </author>
    <author>
      <name>Jiahao Li</name>
    </author>
    <author>
      <name>Kaixin Zhang</name>
    </author>
    <author>
      <name>Yuqiang Fang</name>
    </author>
    <author>
      <name>Leilei Lin</name>
    </author>
    <author>
      <name>Hong Wang</name>
    </author>
    <author>
      <name>Haohua Wu</name>
    </author>
    <author>
      <name>Zide Fan</name>
    </author>
    <summary type="text">While Multimodal Large Language Models (MLLMs) excel in general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental "temporal blindness". Existing architectures lack intrinsic mechanisms for multi-temporal contrastive reasoning and struggle with precise spatial grounding. To address this, we first introduce Delta-QA, a comprehensive benchmark comprising 180k visual question-answering samples. Delta-QA unifies pixel-level segmentation and visual question answering across bi- and tri-temporal scenarios, structuring change interpretation into four progressive cognitive dimensions. Methodologically, we propose Delta-LLaVA, a novel MLLM framework explicitly tailored for multi-temporal remote sensing interpretation. It overcomes the limitations of naive feature concatenation through three core innovations: a Change-Enhanced Attention module that systematically isolates and amplifies visual differences, a Change-SEG module utilizing Change Prior Embedding to extract differentiable difference features as input for the LLM, and Local Causal Attention to prevent cross-temporal contextual leakage. Extensive experiments demonstrate that Delta-LLaVA decisively outperforms leading generalist MLLMs and specialized segmentation models in complex change deduction and high-precision boundary localization, establishing a unified framework for earth observation intelligence.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; VQA;CD;VG&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>SatBLIP: Context Understanding and Feature Identification from Satellite Imagery with Vision-Language Learning</title>
    <link href="http://arxiv.org/abs/2604.14373v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.14373v1</id>
    <published>2026-04-15T00:00:00Z</published>
    <updated>2026-04-15T00:00:00Z</updated>
    <author>
      <name>Xue Wu</name>
    </author>
    <author>
      <name>Shengting Cao</name>
    </author>
    <author>
      <name>Jiaqi Gong</name>
    </author>
    <summary type="text">Rural environmental risks are shaped by place-based conditions (e.g., housing quality, road access, land-surface patterns), yet standard vulnerability indices are coarse and provide limited insight into risk contexts. We propose SatBLIP, a satellite-specific vision-language framework for rural context understanding and feature identification that predicts the county-level Social Vulnerability Index (SVI). SatBLIP addresses limitations of prior remote sensing pipelines (handcrafted features, manual virtual audits, and natural-image-trained VLMs) by coupling contrastive image-text alignment with bootstrapped captioning tailored to satellite semantics. We use GPT-4o to generate structured descriptions of satellite tiles (roof type/condition, house size, yard attributes, greenery, and road context), then fine-tune a satellite-adapted BLIP model to generate captions for unseen images. Captions are encoded with CLIP and fused with LLM-derived embeddings via attention for SVI estimation under spatial aggregation. Using SHAP, we identify salient attributes (e.g., roof form/condition, street width, vegetation, cars/open space) that consistently drive robust predictions, enabling interpretable mapping of rural risk environments.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; IC&lt;/p&gt;</content>
    <category term="Computer Vision" />
    <category term="Artificial Intelligence" />
  </entry>
  <entry>
    <title>GCA Framework: A Gulf-Grounded Dataset and Agentic Pipeline for Climate Decision Support</title>
    <link href="http://arxiv.org/abs/2604.12306v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.12306v1</id>
    <published>2026-04-14T00:00:00Z</published>
    <updated>2026-04-14T00:00:00Z</updated>
    <author>
      <name>Muhammad Umer Sheikh</name>
    </author>
    <author>
      <name>Khawar Shehzad</name>
    </author>
    <author>
      <name>Salman Khan</name>
    </author>
    <author>
      <name>Fahad Shahbaz Khan</name>
    </author>
    <author>
      <name>Muhammad Haris Khan</name>
    </author>
    <summary type="text">Climate decision-making in the Gulf increasingly demands systems that can translate heterogeneous scientific and policy evidence into actionable guidance, yet general-purpose large language models (LLMs) remain weak both in region-specific climate knowledge and grounded interaction with geospatial and forecasting tools. We present the GCA framework, which unifies (i) GCA-DS, a curated Gulf-focused multimodal dataset, and (ii) Gulf Climate Agent (GCA), a tool-augmented agent for climate analysis. GCA-DS comprises ~200k question-answer pairs spanning governmental policies and adaptation plans, NGO and international frameworks, academic literature, and event-driven reporting on heatwaves, dust storms, and floods, complemented with remote-sensing inputs that couple imagery with textual evidence. Building on this foundation, the GCA agent orchestrates a modular tool pipeline grounded in real-time and historical signals and geospatial processing that produces derived indices and interpretable visualizations. Finally, we benchmark open and proprietary LLMs on Gulf climate tasks and show that domain fine-tuning and tool integration substantially improve reliability over general-purpose baselines.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Dataset&lt;/p&gt;</content>
    <category term="Machine Learning" />
    <category term="Artificial Intelligence" />
  </entry>
  <entry>
    <title>GTPBD-MM: A Global Terraced Parcel and Boundary Dataset with Multi-Modality</title>
    <link href="http://arxiv.org/abs/2604.12315v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.12315v1</id>
    <published>2026-04-14T00:00:00Z</published>
    <updated>2026-04-14T00:00:00Z</updated>
    <author>
      <name>Zhiwei Zhang</name>
    </author>
    <author>
      <name>Xingyuan Zeng</name>
    </author>
    <author>
      <name>Xinkai Kong</name>
    </author>
    <author>
      <name>Kunquan Zhang</name>
    </author>
    <author>
      <name>Haoyuan Liang</name>
    </author>
    <author>
      <name>Bohan Shi</name>
    </author>
    <author>
      <name>Juepeng Zheng</name>
    </author>
    <author>
      <name>Jianxi Huang</name>
    </author>
    <author>
      <name>Yutong Lu</name>
    </author>
    <author>
      <name>Haohuan Fu</name>
    </author>
    <summary type="text">Agricultural parcel extraction plays an important role in remote sensing-based agricultural monitoring, supporting parcel surveying, precision management, and ecological assessment. However, existing public benchmarks mainly focus on regular and relatively flat farmland scenes. In contrast, terraced parcels in mountainous regions exhibit stepped terrain, pronounced elevation variation, irregular boundaries, and strong cross-regional heterogeneity, making parcel extraction a more challenging problem that jointly requires visual recognition, semantic discrimination, and terrain-aware geometric understanding. Although recent studies have advanced visual parcel benchmarks and image-text farmland understanding, a unified benchmark for complex terraced parcel extraction under aligned image-text-DEM settings remains absent. To fill this gap, we present GTPBD-MM, the first multimodal benchmark for global terraced parcel extraction. Built upon GTPBD, GTPBD-MM integrates high-resolution optical imagery, structured text descriptions, and DEM data, and supports systematic evaluation under Image-only, Image+Text, and Image+Text+DEM settings. We further propose Elevation and Text guided Terraced parcel network (ETTerra), a multimodal baseline for terraced parcel delineation. Extensive experiments demonstrate that textual semantics and terrain geometry provide complementary cues beyond visual appearance alone, yielding more accurate, coherent, and structurally consistent delineation results in complex terraced scenes.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; 15 pages, 11 figures&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Dataset&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; 3D&lt;/p&gt;</content>
    <category term="Computer Vision" />
    <category term="Multimedia" />
  </entry>
  <entry>
    <title>A Multi-Agent Feedback System for Detecting and Describing News Events in Satellite Imagery</title>
    <link href="http://arxiv.org/abs/2604.12772v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.12772v1</id>
    <published>2026-04-14T00:00:00Z</published>
    <updated>2026-04-14T00:00:00Z</updated>
    <author>
      <name>Madeline Anderson</name>
    </author>
    <author>
      <name>Mikhail Klassen</name>
    </author>
    <author>
      <name>Ash Hoover</name>
    </author>
    <author>
      <name>Kerri Cahoy</name>
    </author>
    <summary type="text">Changes in satellite imagery often occur over multiple time steps. Despite the emergence of bi-temporal change captioning datasets, there is a lack of multi-temporal event captioning datasets (at least two images per sequence) in remote sensing. This gap exists because (1) searching for visible events in satellite imagery and (2) labeling multi-temporal sequences require significant time and labor. To address these challenges, we present SkyScraper, an iterative multi-agent workflow that geocodes news articles and synthesizes captions for corresponding satellite image sequences. Our experiments show that SkyScraper successfully finds 5x more events than traditional geocoding methods, demonstrating that agentic feedback is an effective strategy for surfacing new multi-temporal events in satellite imagery. We apply our framework to a large database of global news articles, curating a new multi-temporal captioning dataset with 5,000 sequences. By automatically identifying imagery related to news events, our work also supports journalism and reporting efforts.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; IC;CD&lt;/p&gt;</content>
    <category term="Computer Vision" />
    <category term="cs.MA" />
  </entry>
  <entry>
    <title>Optimizing Earth Observation Satellite Schedules under Unknown Operational Constraints: An Active Constraint Acquisition Approach</title>
    <link href="http://arxiv.org/abs/2604.13283v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.13283v1</id>
    <published>2026-04-14T00:00:00Z</published>
    <updated>2026-04-14T00:00:00Z</updated>
    <author>
      <name>Mohamed-Bachir Belaid</name>
    </author>
    <summary type="text">Earth Observation (EO) satellite scheduling (deciding which imaging tasks to perform and when) is a well-studied combinatorial optimization problem. Existing methods typically assume that the operational constraint model is fully specified in advance. In practice, however, constraints governing separation between observations, power budgets, and thermal limits are often embedded in engineering artefacts or high-fidelity simulators rather than in explicit mathematical models. We study EO scheduling under unknown constraints: the objective is known, but feasibility must be learned interactively from a binary oracle. Working with a simplified model restricted to pairwise separation and global capacity constraints, we introduce Conservative Constraint Acquisition (CCA), a domain-specific procedure designed to identify justified constraints efficiently in practice while limiting unnecessary tightening of the learned model. Embedded in the Learn&amp;Optimize framework, CCA supports an interactive search process that alternates optimization under a learned constraint model with targeted oracle queries. On synthetic instances with up to 50 tasks and dense constraint networks, L&amp;O improves over a no-knowledge greedy baseline and uses far fewer main oracle queries than a two-phase acquire-then-solve baseline (FAO). For n ≤ 30, the average gap drops from 65-68% (Priority Greedy) to 17.7-35.8% using L&amp;O. At n = 50, where the CP-SAT reference is the best feasible solution found in 120 s, L&amp;O improves on FAO on average (17.9% vs. 20.3%) while using 21.3 main queries instead of 100 and about 5x less execution time.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="Artificial Intelligence" />
    <category term="Machine Learning" />
  </entry>
  <entry>
    <title>The Spectrascapes Dataset: Street-view imagery beyond the visible captured using a mobile platform</title>
    <link href="http://arxiv.org/abs/2604.13315v2" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.13315v2</id>
    <published>2026-04-14T00:00:00Z</published>
    <updated>2026-04-14T00:00:00Z</updated>
    <author>
      <name>Akshit Gupta</name>
    </author>
    <author>
      <name>Joris Timmermans</name>
    </author>
    <author>
      <name>Filip Biljecki</name>
    </author>
    <author>
      <name>Remko Uijlenhoet</name>
    </author>
    <summary type="text">High-resolution spatial and temporal data are imperative for developing climate-resilient cities. Current datasets for monitoring urban parameters are developed primarily using manual inspections, embedded sensing, remote sensing, or standard street-view imagery (RGB). These methods and datasets are often constrained respectively by poor scalability, inconsistent spatio-temporal resolutions, overhead views, or low spectral information. We present a novel method and its open implementation: a multi-spectral terrestrial-view dataset that circumvents these limitations. This dataset consists of 17,718 street-level multi-spectral images captured with RGB, near-infrared, and thermal imaging sensors mounted on bikes, across diverse urban morphologies (village, town, small city, and big urban area) in the Netherlands. Strict emphasis is put on data calibration and quality, and we provide details of our data collection methodology (including hardware and software). To the best of our knowledge, Spectrascapes is the first open-access dataset of its kind. Finally, we demonstrate two downstream use-cases enabled by this dataset and provide potential research directions in the machine learning, urban planning, and remote sensing domains.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; Submitted, under-review&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Dataset&lt;/p&gt;</content>
    <category term="Computer Vision" />
    <category term="Machine Learning" />
  </entry>
  <entry>
    <title>Conflated Inverse Modeling to Generate Diverse and Temperature-Change Inducing Urban Vegetation Patterns</title>
    <link href="http://arxiv.org/abs/2604.13028v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.13028v1</id>
    <published>2026-04-14T00:00:00Z</published>
    <updated>2026-04-14T00:00:00Z</updated>
    <author>
      <name>Baris Sarper Tezcan</name>
    </author>
    <author>
      <name>Hrishikesh Viswanath</name>
    </author>
    <author>
      <name>Rubab Saher</name>
    </author>
    <author>
      <name>Daniel Aliaga</name>
    </author>
    <summary type="text">Urban areas are increasingly vulnerable to thermal extremes driven by rapid urbanization and climate change. Traditionally, thermal extremes have been monitored using Earth-observing satellites and numerical modeling frameworks. For example, land surface temperature derived from Landsat or Sentinel imagery is commonly used to characterize surface heating patterns. These approaches operate as forward models, translating radiative observations or modeled boundary conditions into estimates of surface thermal states. While forward models can predict land surface temperature from vegetation and urban form, the inverse problem of determining spatial vegetation configurations that achieve a desired regional temperature shift remains largely unexplored. This task is inherently underdetermined, as multiple spatial vegetation patterns can yield similar aggregated temperature responses. Conventional regression and deterministic neural networks fail to capture this ambiguity and often produce averaged solutions, particularly under data-scarce conditions. We propose a conflated inverse modeling framework that combines a predictive forward model with a diffusion-based generative inverse model to produce diverse, physically plausible image-based vegetation patterns conditioned on specific temperature goals. Our framework maintains control over thermal outcomes while enabling diverse spatial vegetation configurations, even when such combinations are absent from training data. Altogether, this work introduces a controllable inverse modeling approach for urban climate adaptation that accounts for the inherent diversity of the problem. Code is available at the GitHub repository.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; CVPR 2026&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
</feed>