ECCV 2026

GUIDE

GUI Unbiasing via Instructional-video Driven Expertise

Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation

Rui Xie^1,2, Zhi Gao^2,3,✉, Chenrui Shi^2,3, Zirui Shang^2,3, Lu Chen^1,✉, Qing Li^2,✉

¹Shanghai Jiao Tong University
²State Key Laboratory for General Artificial Intelligence, BIGAI · ³Beijing Institute of Technology

✉ Corresponding authors

Paper arXiv Code 🤗 Dataset BibTeX

Training-free, plug-and-play video-RAG that gives GUI agents domain expertise from web tutorials — +4.47 to +7.48 pp on OSWorld, no fine-tuning.

See it work

GUIDE in one loop

A task comes in, a tutorial is retrieved live, a VLM turns keyframes into knowledge, and the agent navigates to the right menu.

▶Adjust the brightness of this image in GIMP

The agent is handed a task on software it barely saw in training.

Keyframe s_t: GIMP Colors menu with OmniParser-detected UI elements boxed and numbered — OmniParser detects UI elements (cyan boxes) in each keyframe; the VLM compares the consecutive pair `s_t → s_t+1` with their element graphs `E_t, E_t+1` and subtitle context to infer the action — *"click Brightness-Contrast."*

Keyframe s_t+1: Brightness-Contrast dialog with OmniParser-detected UI elements boxed and numbered — OmniParser detects UI elements (cyan boxes) in each keyframe; the VLM compares the consecutive pair `s_t → s_t+1` with their element graphs `E_t, E_t+1` and subtitle context to infer the action — *"click Brightness-Contrast."*

Annotated: agent opens GIMP's Image menu, marked as having no contrast control

Annotated: agent follows the correct path through the Colors menu to Brightness-Contrast

+7.48 pp

best OSWorld gain
(Seed-1.8)

agent architectures
improved

+10–12 pp

cross-benchmark
(WindowsAgentArena)

model parameters
changed

Abstract

Large vision-language models give GUI agents strong general skills, yet they stumble on specific applications they rarely saw in training. This domain bias shows up two ways: agents do not know an app's operation workflow (planning) or its UI layout (grounding).

GUIDE is a training-free, plug-and-play framework that closes this gap by learning from web tutorial videos. A subtitle-driven Video-RAG pipeline searches YouTube live and progressively filters candidates through domain classification, topic extraction, and relevance matching. A fully automated VLM pairwise annotation pipeline then reads consecutive keyframes enhanced with UI element detection to infer transferable planning and grounding knowledge, injected directly into the agent's modules. On OSWorld, GUIDE delivers +4.47 to +7.48 percentage-point gains across three agent architectures and transfers cross-benchmark to WindowsAgentArena — all without modifying model weights.

The Problem

Domain bias is an alignment gap, not a capability gap

GUI agents have strong general reasoning and perception, but they lack familiarity with specific software. The model does not lack capability — it lacks domain knowledge. Conventional fixes such as manual annotation, expert rules, or domain-specific fine-tuning are costly, narrow, and cannot keep pace with continuously evolving interfaces.

Annotated GIMP Image menu marked as the wrong menu with no contrast control

Planning bias

Knows "adjust brightness," but reaches for Image → Adjustments (a Photoshop habit). In GIMP, contrast lives under Colors.

Annotated GIMP dialog with the Contrast slider boxed as the target among similar sliders

Grounding bias

Recognizes "a slider," but cannot pick out the correct Contrast control among several visually similar ones.

Method · How GUIDE works

Three collaborating agents, zero fine-tuning

A Retrieval Agent finds the right tutorial, an Annotation Agent turns it into knowledge, and that knowledge is injected plug-and-play into the downstream GUI agent.

GUIDE pipeline overview: Retrieval Agent, Annotation Agent, and Agent Integration — **Overview of GUIDE.** (1) A Retrieval Agent filters YouTube candidates via three subtitle-driven stages to select top-K videos. (2) An Annotation Agent applies VLM pairwise annotation on keyframe pairs with UI element graphs, topic, and subtitle context, producing planning and grounding knowledge. (3) Knowledge is injected into the GUI agent — supporting multi-agent (Mode A) and single-model (Mode B) architectures.

Subtitle-driven Video-RAG

An LLM turns the task into a query and pulls 50+ YouTube candidates. Subtitles narrate the steps and UI names, bridging into otherwise opaque video.

50 candidates → ≤2 videos

VLM pairwise annotation

Whisper, MOG2 keyframes, and OmniParser feed a VLM that compares frame pairs to infer each action in a transferable, coordinate-free format.

keyframes + OmniParser → JSON

Plug-and-play injection

An LLM splits each trajectory into two channels, injected into the planning and grounding modules as reference — never as directives.

{video_planning}{video_grounding}

✓ Stage 1 · Domain Classification → ✓ Stage 2 · Topic Extraction → ✓ Stage 3 · Relevance Matching

Two knowledge channels

Planning tells what; Grounding tells where

Planning

The domain-specific operational workflow: what steps to take, in what order, and which menus and panels to navigate. Deliberately coordinate-free so it transfers across resolutions and layout versions.

Execution flow: coherent step sequences and stage objectives
Key considerations: distilled expert insight to avoid pitfalls
Coordinate-free abstraction, decoupled from display resolution
Accounts for ~86–91% of the total improvement

Grounding

Domain-specific UI element descriptions so the agent knows where to act — described by appearance and function rather than absolute coordinates, identifiable across interface states.

Catalogs up to 15 key interactive elements per video
Icon/control name, appearance & position, predicted function
Complementary +0.69–0.80 pp, strongest in dense UIs (GIMP, Calc)
Cuts exploration steps on success (VLC −5.0, Calc −3.7)

Results

Same knowledge, three very different agents

GUIDE is architecture-agnostic. Evaluated on OSWorld (361 tasks, 10 application domains), it improves every agent it plugs into.

OSWorld average score (%). Best per agent in **bold**; gain over baseline in green.
Agent	Type	Baseline	+ Planning	+ Plan. & Gnd.	Gain
Seed-1.8	Single-model	37.14	43.93	44.62	+7.48
Qwen3-VL-8B	Single-model · 8B (open)	33.90	38.93	39.73	+5.83
AgentS3	Multi-agent · GPT-5.2 + Seed-1.8	50.18	—	54.65	+4.47

"Planning alone delivers ~86–91% of the gain; grounding is the complementary specialist — strongest where the UI is dense, as in GIMP and Calc."

Cross-benchmark transfer — WindowsAgentArena

The same coordinate-free knowledge transfers to native Windows widgets on 154 tasks, with no WAA-specific tuning.

Backbone	Baseline	+ GUIDE	Gain
Agents3 + GPT-5.2	49.00	59.21	+10.21
Qwen3-VL-32B-Instruct	31.70	44.16	+12.46

Per-domain results & ablation controls

**Seed-1.8** per-domain (OSWorld, %). + GUIDE = full Planning & Grounding.
Config	Chrome	GIMP	Calc	Impress	Writer	OS	ThBrd	VLC	VSCode	Multi	Overall
Baseline	36.87	26.92	29.79	43.09	34.77	45.83	66.67	47.06	60.87	26.88	37.14
+ GUIDE	47.74	42.31	48.94	45.31	56.51	50.00	73.33	52.32	65.22	25.74	44.62

**AgentS3** per-domain (OSWorld, %). GPT-5.2 Worker + Seed-1.8 Grounding.
Config	Chrome	GIMP	Calc	Impress	Writer	OS	ThBrd	VLC	VSCode	Multi	Overall
Baseline	41.18	38.46	51.06	44.62	52.17	70.83	73.33	73.91	73.91	40.32	50.18
+ GUIDE	49.85	53.85	65.96	46.88	65.22	70.83	80.00	56.25	82.61	37.10	54.65

**Retrieval matching controls** — 300 videos, 3 annotators, 1.0/0.5/0.0 scale.
Retrieval mode	Mean	Acc. ≥ 0.5	1.0 / 0.5 / 0.0
Full GUIDE	0.867	96.00%	77.33 / 18.67 / 4.00
Title-only	0.782	88.67%	67.67 / 21.00 / 11.33
Random	0.628	80.33%	45.33 / 35.00 / 19.67

Structured conversion matters. A Watch & Learn-style raw-trajectory control on the same Qwen3-VL-8B backbone scores 31.96% — below the 33.90% baseline — confirming the gain comes from converting tutorials into structured Planning/Grounding knowledge, not from raw video. Overall, successful tasks grow from 134 to 161 (+20.1%).

Retrieval coverage. Of 361 OSWorld tasks, 82.8% retrieve at least one relevant video and 42.7% of covered tasks retrieve a second for multi-perspective reference. GUI classification reaches 100% precision (94.3% accuracy); subtitle-driven topic extraction reaches mean 0.867 with 96% acceptable.

Qualitative · GIMP contrast task

Watching the knowledge guide the agent

Two-panel walkthrough: Planning redirects to the Colors menu; Grounding locates the Contrast slider — **(a) Planning** redirects the agent from the conventional "Image" menu to GIMP's correct "Colors" path. **(b) Grounding** identifies the Contrast slider among visually similar controls. The agent verifies against its own screenshot at every step.

Annotated GIMP Colors menu: numbered boxes mark Colors and Brightness-Contrast along the correct path

Planning knowledge

"In GIMP, contrast is under Colors, NOT the Image menu." The agent follows the workflow ① Colors → ② Brightness-Contrast instead of the Photoshop-style Image → Adjustments it would otherwise guess.

Open resources

Dataset, code & paper

🤗

Dataset

Annotated tutorial videos across 10 application domains, plus pre-computed Planning + Grounding knowledge for all 361 OSWorld tasks, ready for direct injection.

299 videos · 453 MP4s · 10 domains

Browse on Hugging Face

💻

Code

The full GUIDE pipeline (Video-RAG, annotation, knowledge injection) with OSWorld and WindowsAgentArena runners and reproduction scripts.

Apache-2.0 · pipeline + evaluation

View on GitHub

📄

Paper

The full ECCV 2026 paper with method details, all tables, ablations, and the human-evaluation protocol for retrieval and annotation quality.

ECCV 2026 · camera-ready

Read the PDF

Citation

BibTeX

@article{xie2026guide,
  title   = {{GUIDE}: Resolving Domain Bias in {GUI} Agents through
             Real-Time Web Video Retrieval and Plug-and-Play Annotation},
  author  = {Xie, Rui and Gao, Zhi and Shi, Chenrui and Shang, Zirui
             and Chen, Lu and Li, Qing},
  journal = {arXiv preprint arXiv:2603.26266},
  year    = {2026}
}