A Quantization-Aware Optimization Framework for Efficient Deep Neural Network Inference

Authors

Mohamed Almoudane

School of Computer Science, Nanjing University of Information Science and Technology, Nanjing (China)

Article Information

DOI: 10.47772/IJRISS.2026.10100169

Subject Category: Computer Science

Volume/Issue: 10/1 | Page No: 2141-2156

Publication Timeline

Submitted: 2026-01-15

Accepted: 2026-01-20

Published: 2026-01-29

Abstract

The growing demand for deploying deep neural network (DNN) inference on resource-constrained platforms has intensified challenges related to computational cost, memory footprint, and energy efficiency [1], [2]. Quantization is widely adopted to address these constraints; however, conventional low-bit quantization methods often suffer from severe accuracy degradation, commonly referred to as the performance cliff phenomenon [3], [4].
In this work, we propose a unified Quantization-Aware Optimization Framework (QAOF) that bridges high-precision floating-point training and efficient integer-only inference. The framework incorporates a multi-level, layer-wise sensitivity analysis based on the average Hessian trace to characterize loss curvature and guide precision allocation across the network [5]. To mitigate accuracy loss caused by inter-channel and inter-layer distribution mismatch in hybrid architectures, we further introduce Quantization-Aware Distribution Scaling (QADS), which adaptively aligns weight and activation distributions prior to quantization. In addition, computationally expensive operations are replaced with piecewise linear, integer-friendly formulations to enable efficient execution on low-power hardware [6].
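For a concrete picture of the first two ingredients, the following minimal sketch (written in PyTorch, which the paper does not prescribe) illustrates (i) a Hutchinson-style estimate of the average Hessian trace used to rank layer sensitivity and (ii) a per-channel scale that balances weight and activation ranges before quantization. The function names, the alpha balancing exponent, and the sample count are illustrative assumptions, not details taken from the paper.

import torch

def avg_hessian_trace(loss, params, n_samples=8):
    # Hutchinson estimator of the average Hessian trace: E[v^T H v] over
    # random +/-1 (Rademacher) vectors v, divided by the parameter count.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    trace_sum = 0.0
    for _ in range(n_samples):
        vs = [(torch.rand_like(p) > 0.5).to(p.dtype) * 2 - 1 for p in params]
        g_dot_v = sum((g * v).sum() for g, v in zip(grads, vs))
        hvs = torch.autograd.grad(g_dot_v, params, retain_graph=True)
        trace_sum += sum((hv * v).sum().item() for hv, v in zip(hvs, vs))
    n_params = sum(p.numel() for p in params)
    return trace_sum / (n_samples * n_params)

def distribution_scale(weight, act_absmax, alpha=0.5, eps=1e-8):
    # Per-input-channel scale s: activations are divided by s and the matching
    # weight columns multiplied by s, leaving the layer output unchanged while
    # giving both tensors more balanced, easier-to-quantize ranges.
    w_absmax = weight.abs().amax(dim=0)  # weight shape: (out_features, in_features)
    return (act_absmax.clamp(min=eps) ** alpha) / (w_absmax.clamp(min=eps) ** (1 - alpha))

In such a scheme, layers with a larger average trace would be kept at higher bit-width, and applying the returned scale jointly to activations and weights narrows the ranges seen by the quantizer, in the spirit of SmoothQuant-style scaling [16].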
Extensive evaluations on representative architectures, including ResNet, MobileNet, and Vision Transformers (ViT), demonstrate that QAOF achieves substantial efficiency gains with minimal accuracy impact. Across standard benchmarks, the proposed method delivers up to 4.2× inference speedup and up to 75% memory reduction, while keeping accuracy loss below 0.4%. Finally, we provide practical guidelines for selecting between post-training quantization and quantization-aware training under diverse hardware deployment scenarios [7], [8].

Keywords

Quantization, Aware, Optimization, Framework

References

1. A. Gholami et al., “A survey of quantization methods for efficient neural network inference,” arXiv:2103.13630, 2021.

2. M. Nagel et al., “Mixed precision quantization: A survey,” IEEE Access, 2021.

3. Z. Yao et al., “HAWQ-V2: Hessian-aware trace-weighted quantization of neural networks,” in Proc. NeurIPS, 2020.

4. S. Uhlich et al., “Bit-width search for mixed-precision neural networks,” in Proc. ICLR, 2020.

5. S. Esser et al., “LSQ: Learned step size quantization,” in Proc. ICLR, 2020.

6. H. Cai, L. Zhu, and S. Han, “Once-for-all: Train one network and specialize it for efficient deployment,” in Proc. ICLR, 2020.

7. T.-J. Yang, Y.-H. Chen, and V. Sze, “Hardware-aware neural architecture search: A survey,” ACM Computing Surveys, 2020.

8. K. Wang, Z. Liu, and J. Lin, “Joint neural architecture and quantization search,” in Proc. CVPR, 2020.

9. M. Rusci et al., “Post-training quantization for deep neural networks on microcontrollers,” in Proc. DATE, 2020.

10. J. Park and W. Sung, “Efficient low-bit neural network inference with INT4 precision,” in NeurIPS Workshops, 2020.

11. A. Dosovitskiy et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” in Proc. ICLR, 2021.

12. Y. Li and T. Chen, “On the pitfalls of differentiable mixed-precision quantization,” arXiv:2203.01245, 2022.

13. NVIDIA, “TensorRT: High-performance deep learning inference platform,” NVIDIA Developer Documentation, 2022.

14. Intel, “OpenVINO toolkit: Optimizing deep learning inference on CPUs,” Intel White Paper, 2021.

15. T. Dettmers and L. Zettlemoyer, “Outlier-aware quantization for transformer models,” in Proc. ICLR, 2023.

16. G. Xiao et al., “SmoothQuant: Accurate and efficient post-training quantization for large language models,” in Proc. ICML, 2023.

17. Y.-H. Chen, J. Emer, and V. Sze, “Efficient deployment of deep neural networks on heterogeneous hardware,” IEEE Micro, 2021.

18. E. Frantar et al., “GPTQ: Accurate post-training quantization for generative pre-trained transformers,” in Proc. ICLR, 2023.

19. H. Liu et al., “Mixed-precision post-training quantization for neural networks,” IEEE Transactions on Neural Networks and Learning Systems, 2021.

20. J. Lin et al., “MCUNet: Tiny deep learning on IoT devices,” in Proc. NeurIPS, 2020.

21. Q. Chen et al., “Fully integer quantization for deep neural networks,” IEEE Transactions on Circuits and Systems for Video Technology, 2021.

22. R. Jain et al., “Accurate and efficient post-training quantization for vision transformers,” in Proc. CVPR, 2022.

23. S. Migacz, “8-bit inference with TensorRT,” NVIDIA GTC Technical Report, 2020.

24. J. Lin et al., “Q-ViT: Accurate and fully quantized low-bit vision transformer,” in Proc. NeurIPS, 2022.

25. H. Wei et al., “ActQ: Activation-aware weight quantization for transformers,” in Proc. ICLR, 2023.

26. J. Lin et al., “AWQ: Activation-aware weight quantization for LLM compression and acceleration,” in Proc. MLSys, 2024.

27. Z. Liu et al., “Rethinking outlier suppression: noticing activation sparsity in transformer quantization,” in Proc. ICML, 2024.
