AGENTIC-SYSTEMS | April 14, 2026

MobiFlow: Real-World Mobile Agent Benchmarking through Trajectory Fusion

MobiFlow is a new evaluation framework for mobile agents that uses multi-trajectory fusion to benchmark performance on real-world third-party applications, overcoming the limitations of system-level API dependencies.

Mobile agents can autonomously complete user-assigned tasks through GUI interactions. However, existing mainstream evaluation benchmarks, such as AndroidWorld, connect to a system-level Android emulator and derive their evaluation signals from the state of system resources. Many real-world third-party applications do not expose system-level APIs for determining whether a task has succeeded, so these benchmarks diverge from real-world usage and make it difficult to evaluate model performance accurately.

To address these issues, we propose MobiFlow, an evaluation framework built on tasks drawn from arbitrary third-party applications. Using an efficient graph-construction algorithm based on multi-trajectory fusion, MobiFlow can effectively compress the state space, support dynamic interaction, and better align with real-world third-party application scenarios.
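To make the idea of trajectory fusion concrete, here is a minimal Python sketch. It is not MobiFlow's actual algorithm, which this summary does not describe. It assumes each recorded trajectory is a list of `(state_key, action)` steps, that two screens with the same `state_key` count as the same state, and that actions are deterministic. The names `fuse_trajectories`, `reaches_goal`, and all state and action labels are hypothetical.

```python
def fuse_trajectories(trajectories):
    """Merge several recorded trajectories into one state graph.

    Assumption: each trajectory is a list of (state_key, action) pairs,
    where `action` is the step taken from that state (None at the end).
    States that share a key are merged, which shrinks the state space.
    """
    graph = {}  # state_key -> {action: next_state_key}
    for traj in trajectories:
        for i, (state, action) in enumerate(traj):
            graph.setdefault(state, {})
            if action is not None and i + 1 < len(traj):
                graph[state][action] = traj[i + 1][0]
    return graph


def reaches_goal(graph, start, actions, goal_states):
    """Replay an agent's actions on the fused graph and check the end state."""
    state = start
    for action in actions:
        next_state = graph.get(state, {}).get(action)
        if next_state is None:  # action not covered by any recorded trajectory
            return False
        state = next_state
    return state in goal_states


# Two hypothetical recordings of the same task via different routes.
recordings = [
    [("home", "tap_search"), ("search", "type_query"),
     ("results", "tap_item"), ("detail", None)],
    [("home", "tap_category"), ("category", "tap_item"), ("detail", None)],
]
graph = fuse_trajectories(recordings)

print(sum(len(t) for t in recordings))  # 7 recorded steps in total
print(len(graph))                       # 5 distinct states after fusion
print(reaches_goal(graph, "home", ["tap_category", "tap_item"], {"detail"}))  # True
```

Because both recordings pass through `home` and end at `detail`, the fused graph accepts either route as a successful completion without querying any system-level API. A real implementation would also need a robust way to decide that two screens are the same state, which this sketch sidesteps by using string keys.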

MobiFlow covers 20 widely used third-party applications and comprises 240 diverse real-world tasks, with enriched evaluation metrics. Compared with AndroidWorld, MobiFlow's evaluation results show higher alignment with human assessments and can guide the training of future GUI-based models under real workloads.
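The summary does not say how alignment with human assessments is measured. One simple possibility, shown here purely as an assumption, is the fraction of tasks on which a benchmark's pass/fail verdict matches a human rater's verdict. The function name `agreement_rate` and the toy verdict lists are hypothetical and are not results from the paper.

```python
def agreement_rate(benchmark_verdicts, human_verdicts):
    """Fraction of tasks where the benchmark and the human rater agree.

    Assumption: both inputs are equal-length lists of booleans,
    one entry per task (True = task judged successful).
    """
    if len(benchmark_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not benchmark_verdicts:
        raise ValueError("at least one task is required")
    matches = sum(b == h for b, h in zip(benchmark_verdicts, human_verdicts))
    return matches / len(benchmark_verdicts)


# Toy values for illustration only.
benchmark = [True, False, True, True]
human = [True, False, False, True]
print(agreement_rate(benchmark, human))  # 0.75
```

A chance-corrected statistic such as Cohen's kappa would be a stricter choice when most tasks share the same verdict.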

Source: arXiv cs.AI Recent
