AGENTIC-SYSTEMS | June 6, 2026 | 1 min read | 10 views

Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models

Researchers have proposed SAGE-PTQ, an ultra-low-bit post-training quantization (PTQ) framework for LLMs that minimizes hidden scaling overhead. On LLaMA-3-8B it reaches a WikiText2 perplexity of 6.74, against 55.8 for BiLLM, while using less than 50% of BiLLM's GPU memory; on LLaMA-2-70B it delivers 1.5x faster decoding on one NVIDIA L40 GPU.

Abstract: Post-training quantization (PTQ) is critical for the efficient deployment of large language models (LLMs). Recent ultra-low-bit PTQ methods rely on rigid weight-saliency assumptions or position heuristics, introducing substantial hidden scaling overhead. We propose SAGE-PTQ (Saliency-Aware Graph-guided Efficient PTQ), a novel ultra-low-bit quantization framework for LLMs that minimizes hidden scaling cost. SAGE-PTQ separates salient and unsalient weights using distributional statistics, then models subsampled unsalient weights as a sparse graph to estimate the optimal number of groups per layer. SAGE-PTQ applies dual-mode quantization, assigning multi-bit precision to salient weights and binarizing unsalient weights. To reduce scaling overhead, SAGE-PTQ uses one per-channel scale for salient weights and one scalar per unsalient group. Finally, SAGE-PTQ implements adaptive saliency thresholding to select the optimal saliency ratio per matrix. SAGE-PTQ achieves 1.03 weight bits and only 0.004 scaling bits per matrix on average, outperforming state-of-the-art methods such as BiLLM and PB-LLM. On LLaMA-3-8B, SAGE-PTQ achieves 6.74 WikiText2 perplexity, compared to 55.8 for BiLLM, while using less than 50% of BiLLM's GPU memory. On LLaMA-2-70B, SAGE-PTQ provides 1.5x faster decoding on one NVIDIA L40 GPU, demonstrating practical inference efficiency.
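The dual-mode scheme and the scale accounting described in the abstract can be illustrated with a rough NumPy sketch. This is not the authors' implementation. The saliency rule (a fixed number of standard deviations from the mean), the magnitude-quantile grouping used in place of the graph-guided group estimate, the 4-bit salient width, the FP16 scale size, and all function names are assumptions made for illustration only.

```python
import numpy as np


def dual_mode_quantize(W, salient_sigma=3.0, salient_bits=4, num_groups=4):
    """Return (dequantized W, salient mask, number of stored scales)."""
    W = np.asarray(W, dtype=np.float64)
    # ASSUMPTION: saliency split from distributional statistics,
    # here simply |w - mean| > salient_sigma * std.
    salient = np.abs(W - W.mean()) > salient_sigma * W.std()
    W_hat = np.zeros_like(W)

    # Salient weights: symmetric uniform multi-bit quantization with
    # one scale per output channel (row).
    qmax = 2 ** (salient_bits - 1) - 1
    row_max = np.max(np.abs(np.where(salient, W, 0.0)), axis=1, keepdims=True)
    scale = np.where(row_max > 0, row_max / qmax, 1.0)
    q = np.clip(np.round(W / scale), -qmax, qmax)
    W_hat[salient] = (q * scale)[salient]

    # Unsalient weights: binarize to sign(w) * alpha, one scalar alpha per group.
    # ASSUMPTION: groups are magnitude-quantile buckets, a stand-in for the
    # graph-guided grouping, which is not reproduced here.
    unsalient = ~salient
    mags = np.abs(W[unsalient])
    edges = np.quantile(mags, np.linspace(0.0, 1.0, num_groups + 1)[1:-1])
    group_id = np.searchsorted(edges, mags, side="right")
    alphas = np.array([
        mags[group_id == g].mean() if np.any(group_id == g) else 0.0
        for g in range(num_groups)
    ])
    W_hat[unsalient] = np.sign(W[unsalient]) * alphas[group_id]

    # One scale per row for salient weights plus one scalar per unsalient group.
    num_scales = W.shape[0] + num_groups
    return W_hat, salient, num_scales


def scale_bits_per_weight(rows, cols, num_groups, scale_bits=16):
    """Scale storage amortized over all weights (ASSUMPTION: FP16 scales)."""
    return scale_bits * (rows + num_groups) / (rows * cols)


def avg_weight_bits(salient_ratio, salient_bits):
    """Average weight bits when unsalient weights use 1 bit each.

    Ignores the cost of storing the saliency mask and group assignments.
    """
    return salient_ratio * salient_bits + (1.0 - salient_ratio) * 1.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(64, 128))
    W_hat, salient, num_scales = dual_mode_quantize(W)
    print("salient fraction:", salient.mean())
    print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
    print("scale bits/weight (4096x4096, 8 groups):",
          scale_bits_per_weight(4096, 4096, 8))
    print("avg weight bits (1% salient at 4 bits):", avg_weight_bits(0.01, 4))
```

Under these assumed sizes, a 4096 x 4096 matrix with FP16 scales costs roughly 0.0039 scale bits per weight, and a 1% salient ratio at 4 bits averages 1.03 weight bits. Both figures are of the same order as the averages reported in the abstract, though the paper's own accounting and settings may differ.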

Source: arXiv cs.AI Recent
