AGENTIC-SYSTEMS | April 17, 2026 | 1 min read

Reward Design for Physical Reasoning in Vision-Language Models

A systematic study explores how different reward designs in GRPO-based training can enhance the physical reasoning capabilities of Vision-Language Models (VLMs). The research introduces a novel attention-based reward that significantly boosts spatial reasoning without requiring manual spatial annotations.

Physical reasoning over visual inputs demands tight integration of visual perception, domain knowledge, and multi-step symbolic inference. Yet even state-of-the-art Vision-Language Models (VLMs) fall far short of human performance on physics benchmarks. While post-training algorithms such as Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) have demonstrated strong reasoning gains in language models, how reward design shapes VLM physical reasoning behavior remains poorly understood.

We present a systematic reward ablation study for GRPO-based VLM training on physical reasoning. We compare four reward signals of increasing semantic richness: format compliance, answer accuracy, a composite rubric reward (answer correctness, physics principle identification, and unit consistency), and a novel internal reward derived from model attention weights over input image regions.
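To make the first three reward signals concrete, here is a minimal sketch of how they might be scored for a single completion. The tag-based output format, the keyword match for physics principles, the unit check, and the rubric weights are all assumptions chosen for illustration; the abstract does not specify how the study implements any of them.

```python
import re

# Assumed output format (not from the paper): reasoning inside <think>...</think>,
# followed by the final answer inside <answer>...</answer>.
FORMAT_RE = re.compile(r"^<think>.+?</think>\s*<answer>(.+?)</answer>$", re.DOTALL)


def extract_answer(completion: str):
    """Return the text inside <answer> tags, or None if the format is violated."""
    match = FORMAT_RE.match(completion.strip())
    return match.group(1).strip() if match else None


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed tag format, else 0.0."""
    return 1.0 if extract_answer(completion) is not None else 0.0


def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 if the extracted answer matches the reference (case-insensitive)."""
    answer = extract_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if answer.lower() == gold.strip().lower() else 0.0


def rubric_reward(completion, gold, principle_keywords, unit,
                  weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted sum of correctness, principle identification, and unit consistency.

    The weights and the keyword/suffix heuristics are hypothetical.
    """
    text = completion.lower()
    answer = (extract_answer(completion) or "").lower()
    correct = accuracy_reward(completion, gold)
    principle = 1.0 if any(k.lower() in text for k in principle_keywords) else 0.0
    units_ok = 1.0 if unit and answer.endswith(unit.lower()) else 0.0
    w_correct, w_principle, w_units = weights
    return w_correct * correct + w_principle * principle + w_units * units_ok


sample = "<think>Use conservation of energy.</think> <answer>4.9 m/s</answer>"
scores = (
    format_reward(sample),
    accuracy_reward(sample, "4.9 m/s"),
    rubric_reward(sample, "4.9 m/s", ["conservation of energy"], "m/s"),
)
```

In this sketch the rubric can give partial credit (for naming the right principle or using the right unit) even when the final answer is wrong, which is one plausible way a rubric reward could improve reasoning structure without moving accuracy.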

We evaluate on PhyX, a 3,000-problem benchmark spanning six physics domains and six reasoning types across multiple-choice and open-ended formats, using IBM Granite Vision 3.3 (2B). Across both formats, GRPO with accuracy-based rewards outperforms SFT on most domains, though gains vary substantially by reward type and domain. Reward design does not uniformly improve performance. Instead, it induces domain-specific reasoning behaviors. Accuracy-based rewards provide the strongest overall gains. Rubric rewards improve structured reasoning quality without consistent accuracy improvements.
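For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage step that turns raw rewards into a training signal: several completions are sampled for the same prompt, and each reward is normalized against the group's mean and standard deviation. The function name, the use of population standard deviation, and the epsilon value are assumptions; the abstract does not describe the study's training configuration.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its sampling group: (r - mean) / (std + eps).

    `rewards` holds one scalar per completion sampled for the same prompt.
    eps avoids division by zero when every completion gets the same reward.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mean) / (std + eps) for r in rewards]


# Four completions for one prompt, scored by a binary accuracy reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If all completions score the same, every advantage is zero and the
# group contributes no gradient signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Because advantages are relative within a group, a reward only shapes behavior on prompts where sampled completions differ in score, which may help explain why gains vary by reward type and domain.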

Attention-based rewards enhance spatial reasoning while degrading performance in symbolic domains. Our internal attention-weight reward requires no spatial annotations and improves spatial relation accuracy from 0.27 to 0.50, suggesting that supervising where the model attends during generation is a promising direction for visually grounded physical reasoning.
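One way such an internal reward could be computed is as the share of attention mass that falls on image-token positions, averaged over generated tokens. The sketch below illustrates that idea only; the function name, the input layout, and the averaging scheme are assumptions, and the abstract does not give the paper's actual formula.

```python
def attention_reward(attn_rows, image_positions):
    """Average share of attention mass placed on image tokens.

    attn_rows: one row per generated token; each row lists that token's
        attention weights over the input positions (hypothetical layout,
        e.g. already averaged over heads and layers).
    image_positions: indices of input positions that hold image tokens.
    Returns a value in [0, 1]; needs no spatial annotations.
    """
    image_positions = set(image_positions)
    shares = []
    for row in attn_rows:
        total = sum(row)
        if total <= 0:
            continue  # skip degenerate rows rather than divide by zero
        on_image = sum(w for i, w in enumerate(row) if i in image_positions)
        shares.append(on_image / total)
    return sum(shares) / len(shares) if shares else 0.0


# Two generated tokens attending over three input positions;
# positions 1 and 2 are image tokens in this toy example.
toy_rows = [[0.1, 0.4, 0.5], [0.2, 0.2, 0.6]]
toy_reward = attention_reward(toy_rows, image_positions=[1, 2])
```

A reward of this shape pushes the model to look at the image while generating, which is consistent with the reported pattern: it could help spatial questions and hurt symbolic ones where the answer depends more on text and equations than on the picture.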

Source: arXiv cs.AI Recent
