A Quantization-Aware Optimization Framework for Efficient Deep Neural Network Inference
Authors
School of Computer Science, Nanjing University of Information Science and Technology, Nanjing (China)
Article Information
DOI: 10.47772/IJRISS.2026.10100169
Subject Category: Computer Science
Volume/Issue: 10/1 | Page No: 2141-2156
Publication Timeline
Submitted: 2026-01-15
Accepted: 2026-01-20
Published: 2026-01-29
Abstract
The growing demand for deploying deep neural network (DNN) inference on resource-constrained platforms has intensified challenges related to computational cost, memory footprint, and energy efficiency [1], [2]. Quantization is widely adopted to address these constraints; however, conventional low-bit quantization methods often suffer from severe accuracy degradation, commonly referred to as the performance cliff phenomenon [3], [4].
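As background for the performance-cliff discussion, a minimal uniform affine quantizer makes the bit-width/error trade-off concrete. This is a standard textbook formulation, not the paper's implementation; the `quantize`/`dequantize` helpers below are illustrative assumptions.

```python
import numpy as np

def quantize(x, num_bits=8):
    """Uniform affine quantization: map floats onto integers in [0, 2^b - 1]."""
    qmax = 2 ** num_bits - 1
    scale = (x.max() - x.min()) / qmax          # step size of the integer grid
    zero_point = round(-x.min() / scale)        # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.int32)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover an approximate float tensor from integers."""
    return (q.astype(np.float64) - zero_point) * scale
```

Running this at 8 bits versus 2 bits on the same tensor shows the reconstruction error growing sharply as the bit-width shrinks, which is the mechanism behind the accuracy cliff at very low precision.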
In this work, we propose a unified Quantization-Aware Optimization Framework (QAOF) that bridges high-precision floating-point training and efficient integer-only inference. The framework incorporates a multi-level, layer-wise sensitivity analysis based on the average Hessian trace to characterize loss curvature and guide precision allocation across the network [5]. To mitigate accuracy loss caused by inter-channel and inter-layer distribution mismatch in hybrid architectures, we further introduce Quantization-Aware Distribution Scaling (QADS), which adaptively aligns weight and activation distributions prior to quantization. In addition, computationally expensive operations are replaced with piecewise linear, integer-friendly formulations to enable efficient execution on low-power hardware [6].
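The Hessian-trace sensitivity analysis described above can be illustrated with a Hutchinson-style stochastic trace estimator, a standard technique for approximating the average Hessian trace without materializing the full matrix. The sketch below operates on explicit NumPy matrices and is not the paper's implementation; the `allocate_bits` heuristic (higher precision for the most curvature-sensitive layers) is an assumed stand-in for QAOF's actual precision-allocation rule.

```python
import numpy as np

def hutchinson_trace(H, num_samples=64, seed=0):
    """Estimate tr(H) with Rademacher probes v: E[v^T H v] = tr(H)."""
    rng = np.random.default_rng(seed)
    n = H.shape[0]
    total = 0.0
    for _ in range(num_samples):
        v = rng.choice([-1.0, 1.0], size=n)
        total += v @ H @ v   # in practice, a Hessian-vector product suffices
    return total / num_samples

def allocate_bits(avg_traces, low=4, high=8):
    """Assign the high bit-width to the half of the layers with the
    largest average Hessian trace (most curvature-sensitive)."""
    order = np.argsort(avg_traces)[::-1]
    bits = [low] * len(avg_traces)
    for i in order[: len(avg_traces) // 2]:
        bits[i] = high
    return bits
```

In a real network the probe product `v @ H @ v` would be computed via Hessian-vector products (e.g., double backpropagation) rather than an explicit Hessian, keeping the cost linear in the parameter count.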
Extensive evaluations on representative architectures, including ResNet, MobileNet, and Vision Transformers (ViT), demonstrate that QAOF achieves substantial efficiency gains with minimal accuracy impact. Across standard benchmarks, the proposed method delivers up to 4.2× inference speedup and up to 75% memory reduction, while maintaining accuracy loss below 0.4%. Finally, we provide practical guidelines for selecting between post-training quantization and quantization-aware training under diverse hardware deployment scenarios [7], [8].
Keywords
Quantization, Aware, Optimization, Framework
References
1. A. Gholami et al., “A survey of quantization methods for efficient neural network inference,” arXiv:2103.13630, 2021.
2. M. Nagel et al., “Mixed precision quantization: A survey,” IEEE Access, 2021.
3. Z. Yao et al., “HAWQ-V2: Hessian-aware trace-weighted quantization of neural networks,” in Proc. NeurIPS, 2020.
4. S. Uhlich et al., “Bit-width search for mixed-precision neural networks,” in Proc. ICLR, 2020.
5. S. Esser et al., “LSQ: Learned step size quantization,” in Proc. ICLR, 2020.
6. H. Cai, L. Zhu, and S. Han, “Once-for-all: Train one network and specialize it for efficient deployment,” in Proc. ICLR, 2020.
7. T.-J. Yang, Y.-H. Chen, and V. Sze, “Hardware-aware neural architecture search: A survey,” ACM Computing Surveys, 2020.
8. K. Wang, Z. Liu, and J. Lin, “Joint neural architecture and quantization search,” in Proc. CVPR, 2020.
9. M. Rusci et al., “Post-training quantization for deep neural networks on microcontrollers,” in Proc. DATE, 2020.
10. J. Park and W. Sung, “Efficient low-bit neural network inference with INT4 precision,” in NeurIPS Workshops, 2020.
11. A. Dosovitskiy et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” in Proc. ICLR, 2021.
12. Y. Li and T. Chen, “On the pitfalls of differentiable mixed-precision quantization,” arXiv:2203.01245, 2022.
13. NVIDIA, “TensorRT: High-performance deep learning inference platform,” NVIDIA Developer Documentation, 2022.
14. Intel, “OpenVINO toolkit: Optimizing deep learning inference on CPUs,” Intel White Paper, 2021.
15. T. Dettmers and L. Zettlemoyer, “Outlier-aware quantization for transformer models,” in Proc. ICLR, 2023.
16. G. Xiao et al., “SmoothQuant: Accurate and efficient post-training quantization for large language models,” in Proc. ICML, 2023.
17. Y.-H. Chen, J. Emer, and V. Sze, “Efficient deployment of deep neural networks on heterogeneous hardware,” IEEE Micro, 2021.
18. J. Frantar et al., “GPTQ: Accurate post-training quantization for generative pre-trained transformers,” in Proc. NeurIPS, 2022.
19. H. Liu et al., “Mixed-precision post-training quantization for neural networks,” IEEE Transactions on Neural Networks and Learning Systems, 2021.
20. S. Lin et al., “MCUNet: Tiny deep learning on IoT devices,” in Proc. NeurIPS, 2020.
21. Q. Chen et al., “Fully integer quantization for deep neural networks,” IEEE Transactions on Circuits and Systems for Video Technology, 2021.
22. R. Jain et al., “Accurate and efficient post-training quantization for vision transformers,” in Proc. CVPR, 2022.
23. S. Migacz, “8-bit inference with TensorRT,” NVIDIA GTC Technical Report, 2020.
24. J. Lin et al., “Q-ViT: Accurate and fully quantized low-bit vision transformer,” in Proc. NeurIPS, 2022.
25. H. Wei et al., “ActQ: Activation-aware weight quantization for transformers,” in Proc. ICLR, 2023.
26. Y. Sheng et al., “AWQ: Activation-aware weight quantization for LLM compression and acceleration,” in Proc. MLSys, 2024.
27. Z. Liu et al., “Rethinking outlier suppression noticing activation sparsity in transformer quantization,” in Proc. ICML, 2024.