SemHash-LLM: A Multi-Granularity Semantic Hashing Framework for Document Deduplication

Researchers have introduced SemHash-LLM, a multi-granularity semantic hashing framework for efficient large-scale document deduplication. It combines semantic projection hashing and attention-weighted MinHash with selective LLM-based adjudication, keeping neural verification cost under 1% while achieving strong duplicate detection quality.

Computer Science > Artificial Intelligence

Abstract: Large-scale document deduplication must preserve semantic equivalence while remaining efficient over massive corpora. We present SemHash-LLM, a multi-granularity framework that unifies semantic projection hashing, attention-weighted MinHash, contrastive boundary learning, and selective LLM-based adjudication. The method combines character-, token-, and document-level signals through gated fusion, then applies a cascaded filtering pipeline for efficient candidate reduction. Semantic projection hashing learns compact binary codes in distilled LLM embedding space, while attention-weighted MinHash suppresses boilerplate and emphasizes informative content. Adaptive decision boundaries and uncertainty estimation further improve robustness across template pollution, short-text perturbation, containment, and viral fragments. Experiments show that SemHash-LLM achieves strong duplicate detection quality with less than one percent neural verification cost.
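The cascaded filtering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain, unweighted MinHash over character shingles instead of attention-weighted MinHash, and the names `cascade` and `verify` and the `low`/`high` thresholds are hypothetical choices made only for illustration. The point is that a cheap signature comparison settles the clear cases, so the expensive verifier (a stand-in for LLM-based adjudication) is called only on the ambiguous band.

```python
import hashlib
from itertools import combinations

NUM_HASHES = 64  # signature length; an illustrative choice, not from the paper


def shingles(text, k=5):
    """Character k-grams of the lowercased, whitespace-normalized text."""
    norm = " ".join(text.lower().split())
    if len(norm) <= k:
        return {norm}
    return {norm[i:i + k] for i in range(len(norm) - k + 1)}


def minhash(text, num_hashes=NUM_HASHES):
    """Plain MinHash signature: the minimum hash value per seeded hash function."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def cascade(docs, verify, low=0.3, high=0.9):
    """Two-stage filter (hypothetical thresholds).

    sim >= high        : accepted as duplicate without verification
    low <= sim < high  : sent to the expensive `verify` callable
    sim < low          : rejected without verification
    Returns (duplicate index pairs, number of verifier calls).
    """
    sigs = [minhash(d) for d in docs]
    duplicates, verifier_calls = [], 0
    for i, j in combinations(range(len(docs)), 2):
        sim = estimated_jaccard(sigs[i], sigs[j])
        if sim >= high:
            duplicates.append((i, j))
        elif sim >= low:
            verifier_calls += 1
            if verify(docs[i], docs[j]):
                duplicates.append((i, j))
    return duplicates, verifier_calls
```

Calling `cascade(docs, verify=my_llm_check)` with any callable that takes two strings and returns a boolean yields the duplicate pairs and the number of verifier calls, which is the quantity the abstract reports as staying below one percent. The all-pairs loop is quadratic and is kept only for clarity; a large corpus would need a candidate-generation step such as LSH banding in front of it.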

Source: arXiv cs.AI Recent
