NOW LET US – AI RAG SaaS Studio TP.HCM

Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models


Researchers have proposed SAGE-PTQ, an ultra-low-bit post-training quantization framework for LLMs that minimizes hidden scaling overhead, meaning the extra bits spent storing quantization scales. According to the abstract, it uses less than 50% of BiLLM's GPU memory on LLaMA-3-8B while reaching a much lower WikiText2 perplexity (6.74 versus 55.8), and it provides 1.5x faster decoding on LLaMA-2-70B.
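The scaling overhead mentioned above can be made concrete with a little arithmetic. The sketch below is not taken from the paper: the helper `scale_bits_per_weight`, the 4096 x 4096 matrix size, the group size of 128, and the 16-bit scale width are all illustrative assumptions. It only shows how the number of stored scales turns into extra bits per weight.

```python
def scale_bits_per_weight(num_scales: int, num_weights: int,
                          bits_per_scale: int = 16) -> float:
    """Average number of scale bits stored per weight (hypothetical helper)."""
    return num_scales * bits_per_scale / num_weights


# Assumed example matrix: 4096 output channels x 4096 input features.
rows, cols = 4096, 4096
num_weights = rows * cols

# Group-wise scaling: one 16-bit scale for every 128 weights (assumed group size).
groupwise = scale_bits_per_weight(num_weights // 128, num_weights)

# Per-channel scaling: one 16-bit scale per output channel (row).
per_channel = scale_bits_per_weight(rows, num_weights)

print(f"group-wise:  {groupwise:.6f} scale bits per weight")
print(f"per-channel: {per_channel:.6f} scale bits per weight")
```

Under these assumptions, group-wise scales add 0.125 bits to every weight, which is a large share of a roughly 1-bit budget, while per-channel scales add about 0.0039 bits. That is the same order of magnitude as the 0.004 scaling bits the abstract reports, although the paper's own accounting may differ.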


Abstract: Post-training quantization (PTQ) is critical for the efficient deployment of large language models (LLMs). Recent ultra-low-bit PTQ methods rely on rigid weight-saliency assumptions or position heuristics, introducing substantial hidden scaling overhead. We propose SAGE-PTQ (Saliency-Aware Graph-guided Efficient PTQ), a novel ultra-low-bit quantization framework for LLMs that minimizes hidden scaling cost. SAGE-PTQ separates salient and unsalient weights using distributional statistics, then models subsampled unsalient weights as a sparse graph to estimate the optimal number of groups per layer. SAGE-PTQ applies dual-mode quantization, assigning multi-bit precision to salient weights and binarizing unsalient weights. To reduce scaling overhead, SAGE-PTQ uses one per-channel scale for salient weights and one scalar per unsalient group. Finally, SAGE-PTQ implements adaptive saliency thresholding to select the optimal saliency ratio per matrix. SAGE-PTQ achieves 1.03 weight bits and only 0.004 scaling bits per matrix on average, outperforming state-of-the-art methods such as BiLLM and PB-LLM. On LLaMA-3-8B, SAGE-PTQ achieves 6.74 WikiText2 perplexity, compared to 55.8 for BiLLM, while using less than 50% of BiLLM's GPU memory. On LLaMA-2-70B, SAGE-PTQ provides 1.5x faster decoding on one NVIDIA L40 GPU, demonstrating practical inference efficiency.
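The dual-mode idea in the abstract (multi-bit salient weights with one scale per channel, binarized unsalient weights with one scalar per group) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The function name `dual_mode_quantize`, the magnitude-quantile saliency rule, the fixed `salient_ratio`, and the quantile-based grouping with a fixed `n_groups` are all stand-ins. The paper instead uses distributional statistics, a sparse graph to estimate the group count, and adaptive thresholding, none of which the abstract specifies in detail.

```python
import numpy as np


def dual_mode_quantize(W: np.ndarray, salient_ratio: float = 0.05,
                       salient_bits: int = 4, n_groups: int = 2):
    """Illustrative dual-mode quantization (assumed rules, not SAGE-PTQ itself)."""
    W = np.asarray(W, dtype=np.float64)
    absW = np.abs(W)

    # Assumption: the largest-magnitude weights are treated as salient.
    thresh = np.quantile(absW, 1.0 - salient_ratio)
    salient = absW >= thresh
    unsalient = ~salient

    # Salient weights: symmetric uniform quantization, one scale per row (channel).
    qmax = 2 ** (salient_bits - 1) - 1
    row_max = np.where(salient, absW, 0.0).max(axis=1, keepdims=True)
    ch_scale = np.where(row_max > 0, row_max / qmax, 1.0)
    q = np.clip(np.round(W / ch_scale), -qmax, qmax)
    W_hat = np.where(salient, q * ch_scale, 0.0)

    # Unsalient weights: binarize to +/- alpha with one scalar alpha per group.
    # Assumption: groups are magnitude bands; n_groups is fixed by hand here.
    edges = np.quantile(absW[unsalient], np.linspace(0.0, 1.0, n_groups + 1))
    group_id = np.clip(np.searchsorted(edges, absW, side="right") - 1,
                       0, n_groups - 1)
    signs = np.where(W >= 0, 1.0, -1.0)
    group_scales = np.zeros(n_groups)
    for g in range(n_groups):
        mask = unsalient & (group_id == g)
        if mask.any():
            # mean(|w|) is the least-squares optimal scale for sign binarization.
            group_scales[g] = absW[mask].mean()
            W_hat[mask] = signs[mask] * group_scales[g]

    return W_hat, salient, ch_scale.ravel(), group_scales


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 64))
W_hat, salient, ch_scale, group_scales = dual_mode_quantize(W)
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"salient fraction: {salient.mean():.3f}, relative error: {rel_err:.3f}")
```

In this sketch the only stored scales are one value per row for the salient weights and `n_groups` scalars for everything else, which is the kind of layout that keeps scaling bits small relative to the weight bits.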


© 2026 Now Let Us. All rights reserved.

Source: arXiv cs.AI Recent

