<?xml version='1.0' encoding='utf-8'?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>RS-Paper-Hub — VLM Papers</title>
  <id>https://rspaper.top/output/feed_vlm.xml</id>
  <link href="https://rspaper.top/output/feed_vlm.xml" rel="self" type="application/atom+xml" />
  <link href="https://rspaper.top" rel="alternate" type="text/html" />
  <updated>2026-04-20T10:19:45Z</updated>
  <subtitle>Latest remote sensing papers (last 7 days) — 10 entries</subtitle>
  <author>
    <name>RS-Paper-Hub</name>
    <uri>https://rspaper.top</uri>
  </author>
  <entry>
    <title>From Articles to Canopies: Knowledge-Driven Pseudo-Labelling for Tree Species Classification using LLM Experts</title>
    <link href="http://arxiv.org/abs/2604.16115v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.16115v1</id>
    <published>2026-04-17T00:00:00Z</published>
    <updated>2026-04-17T00:00:00Z</updated>
    <author>
      <name>Michał Romaszewski</name>
    </author>
    <author>
      <name>Dominik Kopeć</name>
    </author>
    <author>
      <name>Michał Cholewa</name>
    </author>
    <author>
      <name>Katarzyna Kołodziej</name>
    </author>
    <author>
      <name>Przemysław Głomb</name>
    </author>
    <author>
      <name>Jan Niedzielko</name>
    </author>
    <author>
      <name>Jakub Charyton</name>
    </author>
    <author>
      <name>Justyna Wylazłowska</name>
    </author>
    <author>
      <name>Anna Jarocińska</name>
    </author>
    <summary type="text">Hyperspectral tree species classification is challenging due to limited and imbalanced class labels, spectral mixing (overlapping light signatures from multiple species), and ecological heterogeneity (variability among ecological systems). Addressing these challenges requires methods that integrate biological and structural characteristics of vegetation, such as canopy architecture and interspecific interactions, rather than relying solely on spectral signatures. This paper presents a biologically informed, semi-supervised deep learning method that integrates multi-sensor Earth observation data, specifically hyperspectral imaging (HSI) and airborne laser scanning (ALS), with expert, ecological knowledge. The approach relies on biologically inspired pseudo-labelling over a precomputed canopy graph, yielding accurate classification at low training cost. In addition, ecological priors on species cohabitation are automatically derived from reliable sources using large language models (LLMs) and encoded as a cohabitation matrix with likelihoods of species occurring together. These priors are incorporated into the pseudo-labelling strategy, effectively introducing expert knowledge into the model. Experiments on a real-world forest dataset demonstrate 5.6% improvement over the best reference method. Expert evaluation of cohabitation priors reveals high accuracy with differences no larger than 15%.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; CLS&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation</title>
    <link href="http://arxiv.org/abs/2604.15670v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.15670v1</id>
    <published>2026-04-17T00:00:00Z</published>
    <updated>2026-04-17T00:00:00Z</updated>
    <author>
      <name>Shuyan Ke</name>
    </author>
    <author>
      <name>Yifan Mei</name>
    </author>
    <author>
      <name>Changli Wu</name>
    </author>
    <author>
      <name>Yonghan Zheng</name>
    </author>
    <author>
      <name>Jiayi Ji</name>
    </author>
    <author>
      <name>Liujuan Cao</name>
    </author>
    <author>
      <name>Rongrong Ji</name>
    </author>
    <summary type="text">Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data poses distinct challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these issues, we formally define the UAV Reasoning Segmentation task and organize its semantic requirements into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, a large-scale benchmark for UAV reasoning segmentation, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision across all three reasoning types. As a benchmark companion, we introduce PixDLM, a simple yet effective pixel-level multimodal language model that serves as a unified baseline for this task. Experiments on DRSeg establish strong baseline results and highlight the unique challenges of UAV reasoning segmentation, providing a solid foundation for future research.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; CVPR 2026&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>Towards Realistic Open-Vocabulary Remote Sensing Segmentation: Benchmark and Baseline</title>
    <link href="http://arxiv.org/abs/2604.15652v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.15652v1</id>
    <published>2026-04-17T00:00:00Z</published>
    <updated>2026-04-17T00:00:00Z</updated>
    <author>
      <name>Bingyu Li</name>
    </author>
    <author>
      <name>Tao Huo</name>
    </author>
    <author>
      <name>Haocheng Dong</name>
    </author>
    <author>
      <name>Da Zhang</name>
    </author>
    <author>
      <name>Zhiyuan Zhao</name>
    </author>
    <author>
      <name>Junyu Gao</name>
    </author>
    <author>
      <name>Xuelong Li</name>
    </author>
    <summary type="text">Open-vocabulary remote sensing image segmentation (OVRSIS) remains underexplored due to fragmented datasets, limited training diversity, and the lack of evaluation benchmarks that reflect realistic geospatial application demands. Our previous \textit{OVRSISBenchV1} established an initial cross-dataset evaluation protocol, but its limited scope is insufficient for assessing realistic open-world generalization. To address this issue, we propose \textit{OVRSISBenchV2}, a large-scale and application-oriented benchmark for OVRSIS. We first construct \textbf{OVRSIS95K}, a balanced dataset of about 95K image--mask pairs covering 35 common semantic categories across diverse remote sensing scenes. Built upon OVRSIS95K and 10 downstream datasets, OVRSISBenchV2 contains 170K images and 128 categories, substantially expanding scene diversity, semantic coverage, and evaluation difficulty. Beyond standard open-vocabulary segmentation, it further includes downstream protocols for building extraction, road extraction, and flood detection, thereby better reflecting realistic geospatial application demands and complex deployment scenarios. We also propose \textbf{Pi-Seg}, a baseline for OVRSIS. Pi-Seg improves transferability through a \textbf{positive-incentive noise} mechanism, where learnable and semantically guided perturbations broaden the visual-text feature space during training. Extensive experiments on OVRSISBenchV1, OVRSISBenchV2, and downstream tasks show that Pi-Seg delivers strong and consistent results, particularly on the more challenging OVRSISBenchV2 benchmark. Our results highlight both the importance of realistic benchmark design and the effectiveness of perturbation-based transfer for OVRSIS. The code and datasets are available at \href{https://github.com/LiBingyu01/RSKT-Seg/tree/Pi-Seg}{LiBingyu01/RSKT-Seg/tree/Pi-Seg}.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/LiBingyu01/RSKT-Seg/tree/Pi-Seg"&gt;https://github.com/LiBingyu01/RSKT-Seg/tree/Pi-Seg&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Dataset&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism</title>
    <link href="http://arxiv.org/abs/2604.14762v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.14762v1</id>
    <published>2026-04-16T00:00:00Z</published>
    <updated>2026-04-16T00:00:00Z</updated>
    <author>
      <name>Jordan Shipard</name>
    </author>
    <author>
      <name>Arnold Wiliem</name>
    </author>
    <author>
      <name>Kien Nguyen Thanh</name>
    </author>
    <author>
      <name>Wei Xiang</name>
    </author>
    <author>
      <name>Clinton Fookes</name>
    </author>
    <summary type="text">Generalized Category Discovery (GCD) challenges methods to identify known and novel classes using partially labeled data, mirroring human category learning. Unlike prior GCD methods, which operate within a single modality and require dataset-specific fine-tuning, we propose a modality-agnostic GCD approach inspired by the human brain's abstract category formation. Our $\textbf{OmniGCD}$ leverages modality-specific encoders (e.g., vision, audio, text, remote sensing) to process inputs, followed by dimension reduction to construct a $\textbf{GCD latent space}$, which is transformed at test-time into a representation better suited for clustering using a novel synthetically trained Transformer-based model. To evaluate OmniGCD, we introduce a $\textbf{zero-shot GCD setting}$ where no dataset-specific fine-tuning is allowed, enabling modality-agnostic category discovery. $\textbf{Trained once on synthetic data}$, OmniGCD performs zero-shot GCD across 16 datasets spanning four modalities, improving classification accuracy for known and novel classes over baselines (average percentage point improvement of $\textbf{+6.2}$, $\textbf{+17.9}$, $\textbf{+1.5}$ and $\textbf{+12.7}$ for vision, text, audio and remote sensing). This highlights the importance of strong encoders while decoupling representation learning from category discovery. Improving modality-agnostic methods will propagate across modalities, enabling encoder development independent of GCD. Our work serves as a benchmark for future modality-agnostic GCD works, paving the way for scalable, human-inspired category discovery. All code is available $\href{https://github.com/Jordan-HS/OmniGCD}{here}$</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/Jordan-HS/OmniGCD"&gt;https://github.com/Jordan-HS/OmniGCD&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; CVPR 2026&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; CLS&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing</title>
    <link href="http://arxiv.org/abs/2604.13565v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.13565v1</id>
    <published>2026-04-15T00:00:00Z</published>
    <updated>2026-04-15T00:00:00Z</updated>
    <author>
      <name>Yunkai Dang</name>
    </author>
    <author>
      <name>Minxin Dai</name>
    </author>
    <author>
      <name>Yuekun Yang</name>
    </author>
    <author>
      <name>Zhangnan Li</name>
    </author>
    <author>
      <name>Wenbin Li</name>
    </author>
    <author>
      <name>Feng Miao</name>
    </author>
    <author>
      <name>Yang Gao</name>
    </author>
    <summary type="text">Ultra-high-resolution (UHR) remote sensing imagery couples kilometer-scale context with query-critical evidence that may occupy only a few pixels. Such vast spatial scale leads to a quadratic explosion of visual tokens and hinders the extraction of information from small objects. Previous works utilize direct downsampling, dense tiling, or global top-k pruning, which either compromise query-critical image details or incur unpredictable compute. In this paper, we propose UHR-BAT, a query-guided and region-faithful token compression framework to efficiently select visual tokens under a strict context budget. Specifically, we leverage text-guided, multi-scale importance estimation for visual tokens, effectively tackling the challenge of achieving precise yet low-cost feature extraction. Furthermore, by introducing region-wise preserve and merge strategies, we mitigate visual token redundancy, further driving down the computational budget. Experimental results show that UHR-BAT achieves state-of-the-art performance across various benchmarks. Code will be available at https://github.com/Yunkaidang/UHR.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/Yunkaidang/UHR"&gt;https://github.com/Yunkaidang/UHR&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="Computer Vision" />
    <category term="Artificial Intelligence" />
  </entry>
  <entry>
    <title>Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models</title>
    <link href="http://arxiv.org/abs/2604.14044v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.14044v1</id>
    <published>2026-04-15T00:00:00Z</published>
    <updated>2026-04-15T00:00:00Z</updated>
    <author>
      <name>Xiaohe Li</name>
    </author>
    <author>
      <name>Jiahao Li</name>
    </author>
    <author>
      <name>Kaixin Zhang</name>
    </author>
    <author>
      <name>Yuqiang Fang</name>
    </author>
    <author>
      <name>Leilei Lin</name>
    </author>
    <author>
      <name>Hong Wang</name>
    </author>
    <author>
      <name>Haohua Wu</name>
    </author>
    <author>
      <name>Zide Fan</name>
    </author>
    <summary type="text">While Multimodal Large Language Models (MLLMs) excel in general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental "temporal blindness". Existing architectures lack intrinsic mechanisms for multi-temporal contrastive reasoning and struggle with precise spatial grounding. To address this, we first introduce Delta-QA, a comprehensive benchmark comprising 180k visual question-answering samples. Delta-QA unifies pixel-level segmentation and visual question answering across bi- and tri-temporal scenarios, structuring change interpretation into four progressive cognitive dimensions. Methodologically, we propose Delta-LLaVA, a novel MLLM framework explicitly tailored for multi-temporal remote sensing interpretation. It overcomes the limitations of naive feature concatenation through three core innovations: a Change-Enhanced Attention module that systematically isolates and amplifies visual differences, a Change-SEG module utilizing Change Prior Embedding to extract differentiable difference features as input for the LLM, and Local Causal Attention to prevent cross-temporal contextual leakage. Extensive experiments demonstrate that Delta-LLaVA decisively outperforms leading generalist MLLMs and specialized segmentation models in complex change deduction and high-precision boundary localization, establishing a unified framework for earth observation intelligence.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; VQA;CD;VG&lt;/p&gt;</content>
    <category term="Computer Vision" />
  </entry>
  <entry>
    <title>SatBLIP: Context Understanding and Feature Identification from Satellite Imagery with Vision-Language Learning</title>
    <link href="http://arxiv.org/abs/2604.14373v2" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.14373v2</id>
    <published>2026-04-15T00:00:00Z</published>
    <updated>2026-04-15T00:00:00Z</updated>
    <author>
      <name>Xue Wu</name>
    </author>
    <author>
      <name>Shengting Cao</name>
    </author>
    <author>
      <name>Shenglin Li</name>
    </author>
    <author>
      <name>Jiaqi Gong</name>
    </author>
    <summary type="text">Rural environmental risks are shaped by place-based conditions (e.g., housing quality, road access, land-surface patterns), yet standard vulnerability indices are coarse and provide limited insight into risk contexts. We propose SatBLIP, a satellite-specific vision-language framework for rural context understanding and feature identification that predicts county-level Social Vulnerability Index (SVI). SatBLIP addresses limitations of prior remote sensing pipelines-handcrafted features, manual virtual audits, and natural-image-trained VLMs-by coupling contrastive image-text alignment with bootstrapped captioning tailored to satellite semantics. We use GPT-4o to generate structured descriptions of satellite tiles (roof type/condition, house size, yard attributes, greenery, and road context), then fine-tune a satellite-adapted BLIP model to generate captions for unseen images. Captions are encoded with CLIP and fused with LLM-derived embeddings via attention for SVI estimation under spatial aggregation. Using SHAP, we identify salient attributes (e.g., roof form/condition, street width, vegetation, cars/open space) that consistently drive robust predictions, enabling interpretable mapping of rural risk environments.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; IC&lt;/p&gt;</content>
    <category term="Computer Vision" />
    <category term="Artificial Intelligence" />
  </entry>
  <entry>
    <title>GTPBD-MM: A Global Terraced Parcel and Boundary Dataset with Multi-Modality</title>
    <link href="http://arxiv.org/abs/2604.12315v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.12315v1</id>
    <published>2026-04-14T00:00:00Z</published>
    <updated>2026-04-14T00:00:00Z</updated>
    <author>
      <name>Zhiwei Zhang</name>
    </author>
    <author>
      <name>Xingyuan Zeng</name>
    </author>
    <author>
      <name>Xinkai Kong</name>
    </author>
    <author>
      <name>Kunquan Zhang</name>
    </author>
    <author>
      <name>Haoyuan Liang</name>
    </author>
    <author>
      <name>Bohan Shi</name>
    </author>
    <author>
      <name>Juepeng Zheng</name>
    </author>
    <author>
      <name>Jianxi Huang</name>
    </author>
    <author>
      <name>Yutong Lu</name>
    </author>
    <author>
      <name>Haohuan Fu</name>
    </author>
    <summary type="text">Agricultural parcel extraction plays an important role in remote sensing-based agricultural monitoring, supporting parcel surveying, precision management, and ecological assessment. However, existing public benchmarks mainly focus on regular and relatively flat farmland scenes. In contrast, terraced parcels in mountainous regions exhibit stepped terrain, pronounced elevation variation, irregular boundaries, and strong cross-regional heterogeneity, making parcel extraction a more challenging problem that jointly requires visual recognition, semantic discrimination, and terrain-aware geometric understanding. Although recent studies have advanced visual parcel benchmarks and image-text farmland understanding, a unified benchmark for complex terraced parcel extraction under aligned image-text-DEM settings remains absent. To fill this gap, we present GTPBD-MM, the first multimodal benchmark for global terraced parcel extraction. Built upon GTPBD, GTPBD-MM integrates high-resolution optical imagery, structured text descriptions, and DEM data, and supports systematic evaluation under Image-only, Image+Text, and Image+Text+DEM settings. We further propose Elevation and Text guided Terraced parcel network (ETTerra), a multimodal baseline for terraced parcel delineation. Extensive experiments demonstrate that textual semantics and terrain geometry provide complementary cues beyond visual appearance alone, yielding more accurate, coherent, and structurally consistent delineation results in complex terraced scenes.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Publication:&lt;/strong&gt; 15 pages, 11 figures&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Dataset&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; 3D&lt;/p&gt;</content>
    <category term="Computer Vision" />
    <category term="Multimedia" />
  </entry>
  <entry>
    <title>A Multi-Agent Feedback System for Detecting and Describing News Events in Satellite Imagery</title>
    <link href="http://arxiv.org/abs/2604.12772v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.12772v1</id>
    <published>2026-04-14T00:00:00Z</published>
    <updated>2026-04-14T00:00:00Z</updated>
    <author>
      <name>Madeline Anderson</name>
    </author>
    <author>
      <name>Mikhail Klassen</name>
    </author>
    <author>
      <name>Ash Hoover</name>
    </author>
    <author>
      <name>Kerri Cahoy</name>
    </author>
    <summary type="text">Changes in satellite imagery often occur over multiple time steps. Despite the emergence of bi-temporal change captioning datasets, there is a lack of multi-temporal event captioning datasets (at least two images per sequence) in remote sensing. This gap exists because (1) searching for visible events in satellite imagery and (2) labeling multi-temporal sequences require significant time and labor. To address these challenges, we present SkyScraper, an iterative multi-agent workflow that geocodes news articles and synthesizes captions for corresponding satellite image sequences. Our experiments show that SkyScraper successfully finds 5x more events than traditional geocoding methods, demonstrating that agentic feedback is an effective strategy for surfacing new multi-temporal events in satellite imagery. We apply our framework to a large database of global news articles, curating a new multi-temporal captioning dataset with 5,000 sequences. By automatically identifying imagery related to news events, our work also supports journalism and reporting efforts.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; IC;CD&lt;/p&gt;</content>
    <category term="Computer Vision" />
    <category term="cs.MA" />
  </entry>
  <entry>
    <title>Optimizing Earth Observation Satellite Schedules under Unknown Operational Constraints: An Active Constraint Acquisition Approach</title>
    <link href="http://arxiv.org/abs/2604.13283v1" rel="alternate" type="text/html" />
    <id>http://arxiv.org/abs/2604.13283v1</id>
    <published>2026-04-14T00:00:00Z</published>
    <updated>2026-04-14T00:00:00Z</updated>
    <author>
      <name>Mohamed-Bachir Belaid</name>
    </author>
    <summary type="text">Earth Observation (EO) satellite scheduling (deciding which imaging tasks to perform and when) is a well-studied combinatorial optimization problem. Existing methods typically assume that the operational constraint model is fully specified in advance. In practice, however, constraints governing separation between observations, power budgets, and thermal limits are often embedded in engineering artefacts or high-fidelity simulators rather than in explicit mathematical models. We study EO scheduling under \emph{unknown constraints}: the objective is known, but feasibility must be learned interactively from a binary oracle. Working with a simplified model restricted to pairwise separation and global capacity constraints, we introduce Conservative Constraint Acquisition~(CCA), a domain-specific procedure designed to identify justified constraints efficiently in practice while limiting unnecessary tightening of the learned model. Embedded in the \textsc{Learn\&amp;Optimize} framework, CCA supports an interactive search process that alternates optimization under a learned constraint model with targeted oracle queries. On synthetic instances with up to 50~tasks and dense constraint networks, L\&amp;O improves over a no-knowledge greedy baseline and uses far fewer main oracle queries than a two-phase acquire-then-solve baseline (FAO). For $n\leq 30$, the average gap drops from 65--68\% (Priority Greedy) to 17.7--35.8\% using L\&amp;O. At $n{=}50$, where the CP-SAT reference is the best feasible solution found in 120~s, L\&amp;O improves on FAO on average (17.9\% vs.\ 20.3\%) while using 21.3 main queries instead of 100 and about $5\times$ less execution time.</summary>
    <content type="html">&lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; Method&lt;/p&gt;</content>
    <category term="Artificial Intelligence" />
    <category term="Machine Learning" />
  </entry>
</feed>