Falcon Perception

Falcon Perception is a 0.6B-parameter early-fusion Transformer for open-vocabulary grounding and segmentation. It outperforms SAM 3 on the SA-Co benchmark and introduces PBench for diagnostic performance analysis.

TL;DR: Falcon Perception is a 0.6B-parameter early-fusion Transformer for open-vocabulary grounding and segmentation from natural language prompts. The model processes image patches and text in a single sequence using a hybrid attention mask, and produces a variable number of instances through a small, structured token interface and lightweight output heads. On SA-Co, Falcon Perception reaches 68.0 Macro-F1 (vs. 62.3 for SAM 3); the main remaining gap is presence calibration (MCC 0.64 vs. 0.82). We also introduce PBench, a diagnostic benchmark that breaks down performance by capability (attributes, OCR-guided disambiguation, spatial constraints, relations) and on dense, crowded long-context scenes.

We also release Falcon OCR, a 0.3B-parameter model that scores 80.3 on the olmOCR benchmark and 88.6 on OmniDocBench, while having the highest throughput of any open-source OCR model.

Many open-vocabulary perception systems are built as modular pipelines: an (often frozen) vision backbone extracts features, a separate fusion/decoder stage combines them with language, and additional components handle matching and post-processing. We asked a simpler question: can a single early-fusion Transformer backbone handle both perception and language modeling if we choose the right attention pattern, output interface, and training signal?

In our experiments, the answer is largely yes. At its core, Falcon Perception is a dense Transformer that processes image patches and text tokens in a shared parameter space from the first layer. We address the structural differences between the two modalities with a hybrid attention mask: image tokens attend to all other image tokens bidirectionally, while text and task tokens attend causally to everything before them.
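To make the hybrid attention pattern concrete, here is a minimal sketch of such a mask in plain Python. It assumes that all image tokens come first in the sequence, followed by the text and task tokens; the function name and this layout are illustrative assumptions, not the released implementation.

```python
from typing import List


def hybrid_attention_mask(num_image: int, num_text: int) -> List[List[bool]]:
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumed layout (not confirmed by the source): positions [0, num_image)
    are image tokens, the remaining positions are text/task tokens.
    """
    n = num_image + num_text
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < num_image:
                # Image query: bidirectional attention over the image block only.
                mask[q][k] = k < num_image
            else:
                # Text/task query: causal attention over everything up to itself,
                # which includes every image token.
                mask[q][k] = k <= q
    return mask


# Three image tokens followed by two text tokens.
for row in hybrid_attention_mask(3, 2):
    print("".join("1" if allowed else "0" for allowed in row))
# 11100
# 11100
# 11100
# 11110
# 11111
```

The top-left block is fully dense (image tokens see each other in both directions), while the lower rows form the usual causal triangle over image and text positions combined.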

We use a small structured interface, Chain-of-Perception, which decomposes each instance into three steps: <coord> → <size> → <seg>. This ordering is deliberate. Committing to geometry first reduces ambiguity and turns the mask prediction step into something closer to pixel refinement conditioned on the already-resolved object.
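The per-instance ordering can be illustrated with a small serialization sketch. Only the three step tokens and their order come from the text above; the `Instance` fields, the payload attached to each step, and the `serialize` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Step order taken from the text: geometry first, mask last.
CHAIN = ("<coord>", "<size>", "<seg>")


@dataclass
class Instance:
    center: Tuple[float, float]  # assumed payload for <coord>
    size: Tuple[float, float]    # assumed payload for <size>
    mask_id: int                 # hypothetical handle for the mask decoded at <seg>


def serialize(instances: List[Instance]) -> List[Tuple[str, object]]:
    """Flatten a variable number of instances into one ordered step sequence."""
    steps: List[Tuple[str, object]] = []
    for inst in instances:
        payloads = (inst.center, inst.size, inst.mask_id)
        # Each instance always emits the same three steps in the same order.
        steps.extend(zip(CHAIN, payloads))
    return steps


sequence = serialize([
    Instance(center=(0.25, 0.40), size=(0.10, 0.20), mask_id=0),
    Instance(center=(0.70, 0.55), size=(0.30, 0.15), mask_id=1),
])
print([token for token, _ in sequence])
```

Under this sketch, a prompt that matches nothing in the image simply yields an empty sequence, and each additional instance appends one more `<coord>`, `<size>`, `<seg>` triple.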

We introduce PBench, a diagnostic benchmark that separates samples by the dominant capability required (Simple objects, Attributes, OCR-guided, Spatial understanding, Relations, and Crowdedness).

Falcon Perception is initialized via multi-teacher distillation from DINOv3 and SigLIP2. The training set is built through a multi-stage pipeline that includes hierarchical clustering, VLM-driven listing, and negative mining to combat hallucination.

Source: Hugging Face Blog