NOW LET US – AI RAG SaaS Studio TP.HCM

Falcon Perception


Falcon Perception is a 0.6B-parameter early-fusion Transformer for open-vocabulary grounding and segmentation. It outperforms SAM 3 on the SA-Co benchmark and introduces PBench for diagnostic performance analysis.

TL;DR: Falcon Perception is a 0.6B-parameter early-fusion Transformer for open-vocabulary grounding and segmentation from natural language prompts. The model processes image patches and text in one sequence using a hybrid attention mask, and produces a variable number of instances through a small, structured token interface and lightweight output heads. On SA-Co, Falcon Perception reaches 68.0 Macro-F1 (vs. 62.3 for SAM 3); the main remaining gap is presence calibration (MCC 0.64 vs. 0.82). We also introduce PBench, a diagnostic benchmark that breaks down performance by capability (attributes, OCR-guided disambiguation, spatial constraints, relations) and by dense, long-context crowded scenes.
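For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of their textbook definitions. It assumes presence is scored as a binary decision (does the prompted concept appear in the image or not) and that Macro-F1 is an unweighted mean of per-category F1; the exact SA-Co protocol may differ, and the counts below are invented purely for illustration.

```python
import math
from typing import Iterable, Tuple


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when any marginal is empty
    return (tp * tn - fp * fn) / denom


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def macro_f1(per_category: Iterable[Tuple[int, int, int]]) -> float:
    """Unweighted mean of per-category F1 (each item is (tp, fp, fn))."""
    scores = [f1(tp, fp, fn) for tp, fp, fn in per_category]
    return sum(scores) / len(scores) if scores else 0.0


# Toy counts, NOT benchmark data.
presence_mcc = mcc(tp=40, fp=10, fn=10, tn=40)
toy_macro_f1 = macro_f1([(8, 2, 2), (3, 1, 3)])
print(presence_mcc, toy_macro_f1)
```

MCC uses all four cells of the confusion matrix, so a model that over-predicts presence is penalized for its false positives even when its localization quality on true positives is high.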

We also release Falcon OCR, a 0.3B-parameter model that scores 80.3 on the olmOCR benchmark and 88.6 on OmniDocBench, while having the highest throughput of any open-source OCR model.

Many open-vocabulary perception systems are built as modular pipelines: an often-frozen vision backbone extracts features, a separate fusion/decoder stage combines them with language, and additional components handle matching and post-processing. We asked a simpler question: can a single early-fusion Transformer backbone handle both perception and language modeling, if we choose the right attention pattern, output interface, and training signal?

In our experiments, the answer is largely yes. At its core, Falcon Perception is a dense Transformer that processes image patches and text tokens in a shared parameter space from the first layer. We address the structural differences between the two modalities with a hybrid attention mask: image tokens attend to all other image tokens bidirectionally, while text and task tokens attend causally to everything before them.
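The hybrid mask described above can be sketched in a few lines. This is an illustrative construction, not the released implementation: it assumes image tokens come first in the sequence, followed by text and task tokens, and that image tokens do not attend to the later text. The function names are hypothetical.

```python
from typing import List


def hybrid_attention_mask(num_image: int, num_text: int) -> List[List[bool]]:
    """Return mask[q][k], True when query position q may attend to key k.

    Assumed layout: image tokens first, then text/task tokens.
    """
    total = num_image + num_text
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < num_image:
                # Image query: bidirectional over the image block only.
                mask[q][k] = k < num_image
            else:
                # Text/task query: causal, sees itself and all earlier tokens.
                mask[q][k] = k <= q
    return mask


def render(mask: List[List[bool]]) -> str:
    """Draw the mask with 'x' for allowed and '.' for blocked positions."""
    return "\n".join(
        "".join("x" if allowed else "." for allowed in row) for row in mask
    )


demo_mask = hybrid_attention_mask(num_image=3, num_text=2)
print(render(demo_mask))
```

With three image tokens and two text tokens, the top-left 3×3 block is fully filled (bidirectional), while the last two rows form the usual lower-triangular causal pattern over the whole sequence.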

We use a small structured interface, Chain-of-Perception, which decomposes each instance into three steps: <coord> → <size> → <seg>. This ordering is deliberate. Committing to geometry first reduces ambiguity and turns the mask prediction step into something closer to pixel refinement conditioned on the resolved object.
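To make the interface concrete, here is a minimal sketch of how a decoded stream could be grouped into instances while enforcing the <coord> → <size> → <seg> order. Only the step order comes from the article; the `(step, payload)` stream representation, the `Instance` class, and the payload values are hypothetical stand-ins for whatever the output heads actually produce.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

STEP_ORDER = ("coord", "size", "seg")  # order stated in the article


@dataclass
class Instance:
    coord: Any  # e.g. an object center (assumed payload)
    size: Any   # e.g. a width/height pair (assumed payload)
    seg: Any    # e.g. a handle to a predicted mask (assumed payload)


def group_instances(stream: List[Tuple[str, Any]]) -> List[Instance]:
    """Group a flat (step, payload) stream into instances, checking order."""
    if len(stream) % len(STEP_ORDER) != 0:
        raise ValueError("stream ends in the middle of an instance")
    instances = []
    for start in range(0, len(stream), len(STEP_ORDER)):
        chunk = stream[start:start + len(STEP_ORDER)]
        steps = tuple(step for step, _ in chunk)
        if steps != STEP_ORDER:
            raise ValueError(f"expected {STEP_ORDER}, got {steps}")
        coord, size, seg = (payload for _, payload in chunk)
        instances.append(Instance(coord, size, seg))
    return instances


# Two made-up instances; a variable number is handled the same way.
stream = [
    ("coord", (0.25, 0.40)), ("size", (0.10, 0.20)), ("seg", "mask_0"),
    ("coord", (0.70, 0.55)), ("size", (0.30, 0.25)), ("seg", "mask_1"),
]
instances = group_instances(stream)
print(instances)
```

Because each instance is a fixed three-step group, the number of instances is simply determined by how many groups the model emits before stopping.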

We introduce PBench, a diagnostic benchmark that separates samples by the dominant capability required (Simple objects, Attributes, OCR-guided, Spatial understanding, Relations, and Crowdedness).

Falcon Perception is initialized via multi-teacher distillation from DINOv3 and SigLIP2. The training set is built through a multi-stage pipeline that includes hierarchical clustering, VLM-driven listing, and negative mining to combat hallucination.

© 2026 Now Let Us. All rights reserved.

Source: Hugging Face Blog

