GUI-Libra

Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

1UIUC   2Microsoft   3UNC-Chapel Hill
GUI-Libra Overview
Figure 1: Overview of GUI-Libra. Using existing open-source GUI trajectories, we tackle key limitations through action-aligned reasoning data curation, action-aware SFT, and conservative RL.

Overview

Open-source native GUI agents have made rapid progress in visual grounding and low-level control, but still fall short on long-horizon navigation tasks that require both high-level planning and precise execution. We pinpoint two limitations: scarce high-quality reasoning data aligned with actions, and post-training pipelines that ignore the specifics of GUI agents. In particular, long chain-of-thought SFT tends to degrade grounding accuracy, while step-wise RLVR-style training suffers from partial verifiability (multiple valid actions exist per step, yet only one is used for verification). GUI-Libra tackles these issues with a tailored recipe: an 81K curated reasoning dataset, action-aware SFT that balances reasoning and direct-action supervision with token reweighting, and conservative RL with KL regularization and success-adaptive negative gradient scaling. On both offline and online benchmarks (including AndroidWorld, Online-Mind2Web, and WebArena-Lite-v2), GUI-Libra-4B and GUI-Libra-8B consistently outperform strong native GUI agents and even proprietary models.

Key Contributions

81K Curated Dataset

Action-aligned reasoning data with a construction and filtering pipeline: agreement filtering via action re-prediction, and coordinate alignment via bounding-box verification.
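The two filters can be sketched as follows. The sample schema, the `repredict` stand-in for a model call, and the `n_votes` majority size are illustrative assumptions, not settings from the paper:

```python
def inside_bbox(click_xy, bbox):
    """Coordinate alignment: keep a grounding sample only if the click
    lands inside the target element's bounding box (x0, y0, x1, y1)."""
    x, y = click_xy
    x0, y0, x1, y1 = bbox
    return x0 <= x <= x1 and y0 <= y <= y1

def agreement_filter(samples, repredict, n_votes=3):
    """Agreement filtering: re-predict each sample's action and keep the
    sample only if a strict majority of re-predictions agree with it."""
    kept = []
    for s in samples:
        votes = [repredict(s["observation"]) for _ in range(n_votes)]
        if sum(v == s["action"] for v in votes) * 2 > n_votes:
            kept.append(s)
    return kept
```

Both checks are cheap relative to trajectory collection, which is what makes re-prediction-based filtering practical at the 81K scale.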

Action-Aware SFT

Mixes reasoning-then-action and direct-action supervision, with action-aware token reweighting that emphasizes action and grounding tokens to mitigate CoT-induced grounding degradation.
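A minimal sketch of the reweighted objective, assuming each target token carries a type label; the α values and the exact token partition are illustrative, not the paper's settings:

```python
import numpy as np

def action_aware_loss(logits, targets, token_type, alpha_a=2.0, alpha_g=2.0):
    """Token-reweighted cross-entropy: upweight action tokens (type 1) and
    grounding/coordinate tokens (type 2) relative to reasoning tokens
    (type 0). logits: (T, V); targets, token_type: (T,)."""
    z = logits - logits.max(axis=1, keepdims=True)          # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]           # per-token loss
    weights = np.ones_like(nll)
    weights[token_type == 1] = alpha_a                      # action tokens
    weights[token_type == 2] = alpha_g                      # grounding tokens
    return (weights * nll).sum() / weights.sum()
```

Setting alpha_a = alpha_g = 1 recovers plain token-averaged cross-entropy, so the reweighting shifts emphasis toward action and grounding tokens without changing the objective's scale.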

Conservative RL

KL-regularized GRPO that improves offline-to-online predictability under partial verifiability, with success-adaptive negative gradient scaling to reduce the bias from ambiguously negative rewards.
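One plausible reading of this recipe, sketched below: group-relative (GRPO-style) advantages, a per-sample KL penalty toward the reference policy, and negative advantages shrunk by the group's success rate so that unverified-but-possibly-valid actions are penalized less. The linear scaling rule and `beta` are illustrative assumptions; the paper gives the exact form.

```python
import numpy as np

def conservative_advantages(rewards, logp, logp_ref, beta=0.05):
    """Group-relative advantages with a KL penalty and success-adaptive
    scaling of negative advantages. Under partial verifiability a zero
    reward may still correspond to a valid action, so negative gradients
    are down-weighted when few samples in the group match the single
    verified action. All arrays range over one group of sampled responses."""
    rewards = np.asarray(rewards, dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    success_rate = (rewards > 0).mean()
    adv = np.where(adv < 0, adv * success_rate, adv)   # shrink ambiguous negatives
    kl = np.asarray(logp) - np.asarray(logp_ref)       # per-sample KL estimate
    return adv - beta * kl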

Dataset Analysis

GUI-Libra-81K Dataset Analysis
Figure 3: (a)(b) Data source distribution for SFT and RL. (c) Action type distribution of GUI-Libra-81K. (d) Data filtering for RL mitigates early-step bias.

Results

GUI-Libra consistently improves both step-wise accuracy and end-to-end task completion across web and mobile benchmarks.

AndroidWorld (20 steps)

Acc. = final average success rate. * from original papers.

Model                            Acc.   Δ vs Base
Qwen2.5-VL-3B (Baseline)          3.5   –
GUI-Libra-3B (Ours)              25.2   +21.7
Qwen2.5-VL-7B (Baseline)          7.8   –
GUI-Libra-7B (Ours)              29.6   +21.8
Qwen3-VL-4B (Baseline)           27.0   –
GUI-Libra-4B (Ours)              42.6   +15.6
Qwen3-VL-8B (Baseline)           30.4   –
GUI-Libra-8B (Ours)              42.6   +12.2

Representative baselines
UI-TARS-1.5-7B                   16.5   –
Qwen2.5-VL-32B                   29.6   –
Qwen2.5-VL-72B                   32.2   –
Qwen3-VL-32B                     34.8   –
GPT-4.1 + UGround-v1-7B          37.4   –
GPT-5-mini + UGround-v1-7B       40.9   –
GPT-4o + UGround-v1-7B           42.6   –
GPT-5 + UGround-v1-7B            48.7   –

WebArena-Lite-v2 (15 steps)

Average = final average success rate across GitLab, MAP, Reddit, Shopping, ShoppingAdmin. * from Liu et al. (2025c).

Model                         Average   Δ vs Base
Qwen2.5-VL-3B (Baseline)          0.8   –
GUI-Libra-3B (Ours)              16.7   +15.9
Qwen2.5-VL-7B (Baseline)          4.9   –
GUI-Libra-7B (Ours)              22.6   +17.7
Qwen3-VL-4B (Baseline)           11.9   –
GUI-Libra-4B (Ours)              24.4   +12.5
Qwen3-VL-8B (Baseline)           15.3   –
GUI-Libra-8B (Ours)              26.6   +11.3

Representative baselines
Qwen2.5-VL-72B*                  15.6   –
UI-TARS-1.5-7B*                  20.8   –
ScaleCUA-7B                      23.9   –
UI-TARS-72B-DPO*                 23.4   –
ScaleCUA-32B                     24.0   –
GPT-4o + ScaleCUA-7B*            28.6   –

Online-Mind2Web (30 steps)

Task success is judged by o4-mini and WebJudge-7B; Avg. Overall = average of the two judges' Overall scores.

Model                       o4-mini Overall   WebJudge-7B Overall   Avg. Overall   Δ vs Base
Qwen2.5-VL-3B (Baseline)                1.3                   8.3            4.8   –
GUI-Libra-3B (Ours)                    13.7                  29.0           21.3   +16.5
Qwen2.5-VL-7B (Baseline)                9.7                  22.0           15.8   –
GUI-Libra-7B (Ours)                    17.7                  33.3           25.5   +9.7
Qwen3-VL-4B (Baseline)                 15.7                  27.7           21.7   –
GUI-Libra-4B (Ours)                    20.0                  31.3           25.7   +4.0
Qwen3-VL-8B (Baseline)                 11.0                  27.7           19.3   –
GUI-Libra-8B (Ours)                    19.3                  36.7           28.0   +8.7

Representative baselines
Qwen2.5-VL-32B                          7.3                  19.7           13.5   –
ScaleCUA-7B                            17.0                  30.3           23.7   –
ScaleCUA-32B                           17.0                  30.0           23.5   –
Qwen3-VL-32B                           19.3                  34.3           26.8   –
GPT-4.1 + UGround-v1-7B                22.7                  36.7           29.7   –

Mitigating Grounding Degradation

Long chain-of-thought outputs tend to degrade grounding accuracy. Action-aware SFT (ASFT) with token reweighting (α_a, α_g) and direct-action data effectively preserves grounding under long CoT. GUI-Libra (ASFT+RL) fully mitigates the grounding degradation in Reason mode, outperforming the No-Reason mode despite generating longer CoT.

Grounding degradation mitigation: accuracy vs response length and model comparison
Figure 8 & Table 8: (Left) Grounding correctness vs response length. (Right) Ablations of ASFT and RL on average grounding accuracy and avg tokens in both Reason and No-Reason modes.

KL Regularization Improves Offline-to-Online Predictability

KL regularization substantially strengthens the correlation between offline and online performance by constraining policy drift under ambiguous rewards, making offline evaluation a more reliable predictor of online results.

KL regularization impact on offline-to-online predictability
Figure 10: (a) Online vs offline performance scatter plot (Pearson r = 0.76). (b) Comparison of correlation coefficients: KL regularization (blue) increases Pearson from 0.63 to 0.89 and Spearman from 0.53 to 0.83, improving predictability.
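The predictability numbers in Figure 10 are standard correlation coefficients over paired (offline score, online score) checkpoints; a minimal sketch, with tie handling in the rank transform omitted for brevity:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two paired score lists."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum()))

def spearman(x, y):
    """Spearman correlation: Pearson computed on rank-transformed scores."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(np.asarray(x)), rank(np.asarray(y)))
```

Spearman's rank transform makes the metric sensitive only to whether offline ordering of checkpoints is preserved online, which is the property that matters for model selection.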

Case Study

Example trajectories of GUI-Libra on AndroidWorld and WebArena-Lite-v2.

โ† Swipe to view WebArena example โ†’

Citation

@misc{yang2026guilibratrainingnativegui,
      title={GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL}, 
      author={Rui Yang and Qianhui Wu and Zhaoyang Wang and Hanyang Chen and Ke Yang and Hao Cheng and Huaxiu Yao and Baoling Peng and Huan Zhang and Jianfeng Gao and Tong Zhang},
      year={2026},
      eprint={2602.22190},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.22190}, 
}