ECCV 2026

GUIDE

GUI Unbiasing via Instructional-video Driven Expertise

Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation

Rui Xie1,2, Zhi Gao2,3,✉, Chenrui Shi2,3, Zirui Shang2,3, Lu Chen1,✉, Qing Li2,✉

1Shanghai Jiao Tong University
2State Key Laboratory for General Artificial Intelligence, BIGAI  ·  3Beijing Institute of Technology

✉ Corresponding authors

Training-free, plug-and-play video-RAG that gives GUI agents domain expertise from web tutorials — +4.47 to +7.48 pp on OSWorld, no fine-tuning.

See it work

GUIDE in one loop

A task comes in, a tutorial is retrieved live, a VLM turns keyframes into knowledge, and the agent navigates to the right menu.

The agent is handed a task on software it barely saw in training.
  • +7.48 pp: best OSWorld gain (Seed-1.8)
  • 3 agent architectures improved
  • +10–12 pp: cross-benchmark gain (WindowsAgentArena)
  • 0 model parameters changed

Abstract

Large vision-language models give GUI agents strong general skills, yet they stumble on specific applications they rarely saw in training. This domain bias shows up two ways: agents do not know an app's operation workflow (planning) or its UI layout (grounding).

GUIDE is a training-free, plug-and-play framework that closes this gap by learning from web tutorial videos. A subtitle-driven Video-RAG pipeline searches YouTube live and progressively filters candidates through domain classification, topic extraction, and relevance matching. A fully automated VLM pairwise annotation pipeline then reads consecutive keyframes enhanced with UI element detection to infer transferable planning and grounding knowledge, injected directly into the agent's modules. On OSWorld, GUIDE delivers +4.47 to +7.48 percentage-point gains across three agent architectures and transfers cross-benchmark to WindowsAgentArena — all without modifying model weights.

The Problem

Domain bias is an alignment gap, not a capability gap

GUI agents have strong general reasoning and perception, but they lack familiarity with specific software. The model does not lack capability — it lacks domain knowledge. Conventional fixes such as manual annotation, expert rules, or domain-specific fine-tuning are costly, narrow, and cannot keep pace with continuously evolving interfaces.

Annotated GIMP Image menu marked as the wrong menu with no contrast control
Planning bias

Knows "adjust brightness," but reaches for Image → Adjustments (a Photoshop habit). In GIMP, contrast lives under Colors.

Annotated GIMP dialog with the Contrast slider boxed as the target among similar sliders
Grounding bias

Recognizes "a slider," but cannot pick out the correct Contrast control among several visually similar ones.

Method · How GUIDE works

Three collaborating agents, zero fine-tuning

A Retrieval Agent finds the right tutorial, an Annotation Agent turns it into knowledge, and that knowledge is injected plug-and-play into the downstream GUI agent.

GUIDE pipeline overview: Retrieval Agent, Annotation Agent, and Agent Integration
Overview of GUIDE. (1) A Retrieval Agent filters YouTube candidates via three subtitle-driven stages to select top-K videos. (2) An Annotation Agent applies VLM pairwise annotation on keyframe pairs with UI element graphs, topic, and subtitle context, producing planning and grounding knowledge. (3) Knowledge is injected into the GUI agent — supporting multi-agent (Mode A) and single-model (Mode B) architectures.
1. Subtitle-driven Video-RAG

An LLM turns the task into a search query and pulls 50+ YouTube candidates. Subtitles narrate the steps and name the UI elements, giving a text handle on otherwise opaque video content.

50 candidates → ≤2 videos
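The three-stage funnel can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Candidate` fields, the first-sentence topic heuristic, the word-overlap score, and the `min_overlap` threshold are all hypothetical stand-ins for the LLM-based domain classification, topic extraction, and relevance matching described above.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A retrieved video; every field here is a hypothetical stand-in."""
    title: str
    subtitles: str          # transcript text used by every stage
    is_gui_tutorial: bool   # stage 1 verdict (a classifier in GUIDE)
    topic: str = ""         # filled in by stage 2


def extract_topic(c: Candidate) -> str:
    # Stage 2 stand-in: take the first subtitle sentence as the topic.
    return c.subtitles.split(".")[0].strip().lower()


def relevance(task: str, topic: str) -> float:
    # Stage 3 stand-in: fraction of task words that appear in the topic.
    task_words = set(task.lower().split())
    return len(task_words & set(topic.split())) / max(len(task_words), 1)


def retrieve(task: str, candidates: list[Candidate],
             top_k: int = 2, min_overlap: float = 0.3) -> list[Candidate]:
    # Stage 1: keep only videos classified as GUI tutorials.
    kept = [c for c in candidates if c.is_gui_tutorial]
    # Stage 2: attach a topic to each survivor.
    for c in kept:
        c.topic = extract_topic(c)
    # Stage 3: score, threshold, and keep at most top_k videos.
    scored = [(relevance(task, c.topic), c) for c in kept]
    scored = [(s, c) for s, c in scored if s >= min_overlap]
    scored.sort(key=lambda sc: sc[0], reverse=True)
    return [c for _, c in scored[:top_k]]
```

Each stage only narrows the candidate list, so a task can end with zero, one, or two videos, which matches the "≤2 videos" outcome stated above.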
2. VLM pairwise annotation

Whisper, MOG2 keyframes, and OmniParser feed a VLM that compares frame pairs to infer each action in a transferable, coordinate-free format.

keyframes + OmniParser → JSON
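A rough sketch of the pairwise step, assuming keyframes, detected UI element labels, and subtitle snippets have already been produced upstream. The function names, the context fields, and the record layout are hypothetical, and `stub_describe_action` is a toy stand-in for the VLM call; the actual GUIDE prompt and JSON schema are not reproduced here.

```python
import json
from typing import Callable


def consecutive_pairs(keyframes: list[str]) -> list[tuple[str, str]]:
    """Pair each keyframe with the one that follows it."""
    return list(zip(keyframes, keyframes[1:]))


def annotate(
    keyframes: list[str],
    elements: dict[str, list[str]],   # frame id -> detected UI element labels
    subtitles: dict[str, str],        # frame id -> nearby subtitle text
    describe_action: Callable[[dict], dict],  # stand-in for the VLM call
) -> str:
    records = []
    for step, (before, after) in enumerate(consecutive_pairs(keyframes), start=1):
        context = {
            "elements_before": elements.get(before, []),
            "elements_after": elements.get(after, []),
            "subtitle": subtitles.get(before, ""),
        }
        # Records name elements by label only; no pixel coordinates are kept.
        records.append({"step": step, **describe_action(context)})
    return json.dumps(records, indent=2)


def stub_describe_action(context: dict) -> dict:
    # Toy heuristic: elements that appear only in the later frame are
    # treated as the result of the action narrated by the subtitle.
    revealed = [e for e in context["elements_after"]
                if e not in context["elements_before"]]
    return {"narration": context["subtitle"], "revealed": revealed}
```

With N keyframes the loop yields N − 1 records, one inferred action per consecutive pair.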
3. Plug-and-play injection

An LLM splits each trajectory into two channels, injected into the planning and grounding modules as reference — never as directives.

{video_planning}{video_grounding}
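One way to picture the injection is as plain template filling. The placeholder names `{video_planning}` and `{video_grounding}` come from the page; the template wording, the `build_prompts` helper, and the way the two modes are combined are assumptions made for illustration only.

```python
# Hypothetical templates: the knowledge is framed as reference material
# that the agent must check against its own screenshot.
PLANNER_TEMPLATE = (
    "Task: {task}\n"
    "Reference workflow from a tutorial video (verify against the current "
    "screenshot before acting):\n"
    "{video_planning}\n"
)
GROUNDER_TEMPLATE = (
    "Reference UI element descriptions from a tutorial video (the live "
    "interface may differ):\n"
    "{video_grounding}\n"
)

EMPTY = "(no tutorial knowledge retrieved)"


def build_prompts(task: str, video_planning: str, video_grounding: str,
                  multi_agent: bool) -> dict[str, str]:
    planner = PLANNER_TEMPLATE.format(
        task=task, video_planning=video_planning or EMPTY)
    grounder = GROUNDER_TEMPLATE.format(
        video_grounding=video_grounding or EMPTY)
    if multi_agent:
        # Mode A: separate planning and grounding modules get one channel each.
        return {"planner": planner, "grounder": grounder}
    # Mode B: a single model receives both channels in one prompt.
    return {"agent": planner + "\n" + grounder}
```

Because the knowledge only enters through prompt text, nothing in this sketch touches model weights, and a task with no retrieved video still produces a valid prompt.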
Retrieval Agent stages:

  • Stage 1 · Domain Classification
  • Stage 2 · Topic Extraction
  • Stage 3 · Relevance Matching

Two knowledge channels

Planning tells what; Grounding tells where

Planning

The domain-specific operational workflow: what steps to take, in what order, and which menus and panels to navigate. Deliberately coordinate-free so it transfers across resolutions and layout versions.

  • Execution flow: coherent step sequences and stage objectives
  • Key considerations: distilled expert insight to avoid pitfalls
  • Coordinate-free abstraction, decoupled from display resolution
  • Accounts for ~86–91% of the total improvement

Grounding

Domain-specific UI element descriptions so the agent knows where to act — described by appearance and function rather than absolute coordinates, identifiable across interface states.

  • Catalogs up to 15 key interactive elements per video
  • Icon/control name, appearance & position, predicted function
  • Complementary +0.69–0.80 pp, strongest in dense UIs (GIMP, Calc)
  • Cuts exploration steps on success (VLC −5.0, Calc −3.7)
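The two channels could be held in containers like the ones below. This is only a sketch: the class and field names are hypothetical and mirror the bullet points above, and the cap of 15 elements is taken from the "up to 15 key interactive elements per video" figure.

```python
from dataclasses import dataclass, field

MAX_ELEMENTS = 15  # per-video cap reported on the page


@dataclass
class UIElement:
    name: str        # icon or control name
    appearance: str  # look and relative position, never absolute coordinates
    function: str    # predicted function


@dataclass
class PlanningKnowledge:
    execution_flow: list[str]                       # ordered steps and stage objectives
    key_considerations: list[str] = field(default_factory=list)


@dataclass
class GroundingKnowledge:
    elements: list[UIElement] = field(default_factory=list)

    def add(self, element: UIElement) -> bool:
        """Append an element unless the per-video cap is reached."""
        if len(self.elements) >= MAX_ELEMENTS:
            return False
        self.elements.append(element)
        return True
```

Keeping both structures free of pixel coordinates is what lets the same record apply across resolutions and layout versions.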

Results

Same knowledge, three very different agents

GUIDE is architecture-agnostic. Evaluated on OSWorld (361 tasks, 10 application domains), it improves every agent it plugs into.

OSWorld average score (%). Gain is the improvement of the best configuration over the baseline, in percentage points.

| Agent | Type | Baseline | + Planning | + Plan. & Gnd. | Gain |
|---|---|---|---|---|---|
| Seed-1.8 | Single-model | 37.14 | 43.93 | 44.62 | +7.48 |
| Qwen3-VL-8B | Single-model · 8B (open) | 33.90 | 38.93 | 39.73 | +5.83 |
| AgentS3 | Multi-agent · GPT-5.2 + Seed-1.8 | 50.18 | - | 54.65 | +4.47 |
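The planning and grounding shares can be recomputed from the two single-model rows, which report both configurations. The snippet below is plain arithmetic on the numbers in the table; the `decompose` helper is an illustrative name, not part of the GUIDE code.

```python
# (baseline, + planning, + planning & grounding), OSWorld average score in %
results = {
    "Seed-1.8": (37.14, 43.93, 44.62),
    "Qwen3-VL-8B": (33.90, 38.93, 39.73),
}


def decompose(baseline: float, planning: float, full: float):
    """Return (total gain pp, grounding gain pp, planning share of gain in %)."""
    total = full - baseline
    planning_gain = planning - baseline
    grounding_gain = full - planning
    share = 100 * planning_gain / total
    return round(total, 2), round(grounding_gain, 2), round(share, 1)


for agent, scores in results.items():
    print(agent, decompose(*scores))
# Seed-1.8 (7.48, 0.69, 90.8)
# Qwen3-VL-8B (5.83, 0.8, 86.3)
```

These values line up with the ~86–91% planning share and the +0.69–0.80 pp grounding contribution quoted on this page.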
"Planning alone delivers ~86–91% of the gain; grounding is the complementary specialist — strongest where the UI is dense, as in GIMP and Calc."

Cross-benchmark transfer — WindowsAgentArena

The same coordinate-free knowledge transfers to native Windows widgets on 154 tasks, with no WAA-specific tuning.

| Backbone | Baseline | + GUIDE | Gain |
|---|---|---|---|
| AgentS3 + GPT-5.2 | 49.00 | 59.21 | +10.21 |
| Qwen3-VL-32B-Instruct | 31.70 | 44.16 | +12.46 |
Per-domain results & ablation controls
Seed-1.8 per-domain (OSWorld, %). + GUIDE = full Planning & Grounding.
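| Config | Chrome | GIMP | Calc | Impress | Writer | OS | ThBrd | VLC | VSCode | Multi | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 36.87 | 26.92 | 29.79 | 43.09 | 34.77 | 45.83 | 66.67 | 47.06 | 60.87 | 26.88 | 37.14 |
| + GUIDE | 47.74 | 42.31 | 48.94 | 45.31 | 56.51 | 50.00 | 73.33 | 52.32 | 65.22 | 25.74 | 44.62 |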
AgentS3 per-domain (OSWorld, %). GPT-5.2 Worker + Seed-1.8 Grounding.
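| Config | Chrome | GIMP | Calc | Impress | Writer | OS | ThBrd | VLC | VSCode | Multi | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 41.18 | 38.46 | 51.06 | 44.62 | 52.17 | 70.83 | 73.33 | 73.91 | 73.91 | 40.32 | 50.18 |
| + GUIDE | 49.85 | 53.85 | 65.96 | 46.88 | 65.22 | 70.83 | 80.00 | 56.25 | 82.61 | 37.10 | 54.65 |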
Retrieval matching controls — 300 videos, 3 annotators, 1.0/0.5/0.0 scale.
| Retrieval mode | Mean | Acc. ≥ 0.5 | 1.0 / 0.5 / 0.0 (%) |
|---|---|---|---|
| Full GUIDE | 0.867 | 96.00% | 77.33 / 18.67 / 4.00 |
| Title-only | 0.782 | 88.67% | 67.67 / 21.00 / 11.33 |
| Random | 0.628 | 80.33% | 45.33 / 35.00 / 19.67 |
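The Mean and Acc. ≥ 0.5 columns are consistent with the score distribution in the last column. The check below assumes that Mean is the distribution-weighted average of the 1.0/0.5/0.0 ratings and that Acc. ≥ 0.5 is the share of videos rated 0.5 or higher; how the three annotators' ratings were aggregated per video is not stated here, and `summarize` is an illustrative helper name.

```python
def summarize(pct_full: float, pct_partial: float, pct_none: float):
    """Return (mean score, accuracy at >= 0.5 in %) from a rating distribution."""
    total = pct_full + pct_partial + pct_none
    if abs(total - 100.0) > 0.05:
        raise ValueError(f"distribution sums to {total}, expected 100")
    mean = (1.0 * pct_full + 0.5 * pct_partial + 0.0 * pct_none) / 100
    acc = pct_full + pct_partial
    return round(mean, 3), round(acc, 2)


print(summarize(77.33, 18.67, 4.00))    # Full GUIDE -> (0.867, 96.0)
print(summarize(67.67, 21.00, 11.33))   # Title-only -> (0.782, 88.67)
print(summarize(45.33, 35.00, 19.67))   # Random     -> (0.628, 80.33)
```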

Structured conversion matters. A Watch & Learn-style raw-trajectory control on the same Qwen3-VL-8B backbone scores 31.96% — below the 33.90% baseline — confirming the gain comes from converting tutorials into structured Planning/Grounding knowledge, not from raw video. Overall, successful tasks grow from 134 to 161 (+20.1%).

Retrieval coverage. Of 361 OSWorld tasks, 82.8% retrieve at least one relevant video and 42.7% of covered tasks retrieve a second for multi-perspective reference. GUI classification reaches 100% precision (94.3% accuracy); subtitle-driven topic extraction reaches mean 0.867 with 96% acceptable.

Qualitative · GIMP contrast task

Watching the knowledge guide the agent

Two-panel walkthrough: Planning redirects to the Colors menu; Grounding locates the Contrast slider
(a) Planning redirects the agent from the conventional "Image" menu to GIMP's correct "Colors" path. (b) Grounding identifies the Contrast slider among visually similar controls. The agent verifies against its own screenshot at every step.
Annotated GIMP Colors menu: numbered boxes mark Colors and Brightness-Contrast along the correct path
Planning knowledge

"In GIMP, contrast is under Colors, NOT the Image menu." The agent follows the workflow ① Colors → ② Brightness-Contrast instead of the Photoshop-style Image → Adjustments it would otherwise guess.

Open resources

Dataset, code & paper

🤗

Dataset

Annotated tutorial videos across 10 application domains, plus pre-computed Planning + Grounding knowledge for all 361 OSWorld tasks, ready for direct injection.

299 videos · 453 MP4s · 10 domains

Browse on Hugging Face
💻

Code

The full GUIDE pipeline (Video-RAG, annotation, knowledge injection) with OSWorld and WindowsAgentArena runners and reproduction scripts.

Apache-2.0 · pipeline + evaluation

View on GitHub
📄

Paper

The full ECCV 2026 paper with method details, all tables, ablations, and the human-evaluation protocol for retrieval and annotation quality.

ECCV 2026 · camera-ready

Read the PDF

Citation

BibTeX

@article{xie2026guide,
  title   = {{GUIDE}: Resolving Domain Bias in {GUI} Agents through
             Real-Time Web Video Retrieval and Plug-and-Play Annotation},
  author  = {Xie, Rui and Gao, Zhi and Shi, Chenrui and Shang, Zirui
             and Chen, Lu and Li, Qing},
  journal = {arXiv preprint arXiv:2603.26266},
  year    = {2026}
}