KV Cache Recycling to Expand Usable Context Capacity in Low Parameter LLMs
Authors
Department of Computer Science & Engineering, Invertis University (India)
Article Information
DOI: 10.51244/IJRSI.2025.12120067
Subject Category: Computer Science
Volume/Issue: 12/12 | Page No: 809-818
Publication Timeline
Submitted: 2026-01-05
Accepted: 2026-01-05
Published: 2026-01-05
Abstract
We investigate whether attention key-value (KV) states computed for one prompt on a small LLM (not an SLM, as it is built on an LLM architecture) can be reused to accelerate inference on a new, similar prompt, effectively expanding the model's usable context memory through an approach we call token recycling. Using a standard Hugging Face setup with DialoGPT-medium (a 345M-parameter GPT-2-style decoder trained on 147M Reddit exchanges, 2005-2017) as the testbed, we build a cache of past activations, retrieve entries via sentence embeddings, and reuse cached past key values when the cached prompt is an exact prefix of the new input. We compare recycled and baseline runs on latency and output fidelity, and log reuse depth in tokens. The approach requires no model modifications: cached KVs are serialized to the CPU, reloaded, and supplied to the generate function to continue decoding from the cached prefix. In tests, we observe consistent speedups whenever prefix overlap exists, with no material degradation in output semantics; when overlap is absent, behavior matches the baseline.
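The exact-prefix lookup described above can be sketched in plain Python. This is a minimal illustration, not the authors' actual code: the class and method names are hypothetical, and the KV payload is a placeholder standing in for the serialized past_key_values tensors that would be moved to CPU for storage and back to the model device on a hit.

```python
# Hypothetical sketch of the exact-prefix KV-reuse lookup: cache entries
# are keyed by the cached prompt's token ids, and a hit requires the
# cached prompt to be an exact token-level prefix of the new input.

class KVRecycler:
    def __init__(self):
        # token-id tuple of a cached prompt -> opaque KV payload
        # (in practice, past_key_values tensors serialized to CPU)
        self._cache = {}

    def store(self, token_ids, kv_states):
        """Cache the KV states computed for `token_ids`."""
        self._cache[tuple(token_ids)] = kv_states

    def lookup(self, token_ids):
        """Return (reuse_depth, kv_states) for the longest cached prompt
        that is an exact token-level prefix of `token_ids`, else (0, None)."""
        best_len, best_kv = 0, None
        ids = tuple(token_ids)
        for cached_ids, kv in self._cache.items():
            n = len(cached_ids)
            if n > best_len and ids[:n] == cached_ids:
                best_len, best_kv = n, kv
        return best_len, best_kv


rec = KVRecycler()
rec.store([50256, 11, 22], "kv-for-prefix")    # placeholder KV payload
depth, kv = rec.lookup([50256, 11, 22, 7, 9])  # new prompt shares the prefix
print(depth)  # -> 3 (reuse depth in tokens)
```

On a hit, the cached past key values would be passed to the model's generate function so decoding continues after the first `depth` tokens; on a miss (reuse depth 0), generation proceeds exactly as in the baseline, matching the behavior reported in the abstract.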
Keywords
Computer Science & Engineering
References
1. Zhang, Y., Sun, S., Galley, M., Chen, Y.-C., Brockett, C., Gao, X., Dolan, W.B.: DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation. arXiv preprint arXiv:1911.00536 (2019). https://arxiv.org/abs/1911.00536
2. Gurnee, W., Horsley, T., Guo, Z.C., Kheirkhah, T.R., Sun, Q., Hathaway, W., Nanda, N., Bertsimas, D.: Universal Neurons in GPT-2 Language Models. arXiv preprint arXiv:2401.12181 (2024). https://arxiv.org/abs/2401.12181
3. Kim, Y., Kang, J., Park, S.: KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction. arXiv preprint arXiv:2505.23416 (2025). https://arxiv.org/abs/2505.23416
4. Kwon, W., Lee, S., Li, S., Luo, Z., Zheng, L., Liu, Z., Zhang, H., Stoica, I.: Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv preprint arXiv:2309.06180 (2023). https://arxiv.org/abs/2309.06180
5. Not Lain: KV Caching Explained: Optimizing Transformer Inference Efficiency. Hugging Face Blog, Jan. 30, 2025. https://huggingface.co/blog/not-lain/kv-caching
6. Thomson, J., Shah, A., Tewari, L.: Introducing New KV Cache Reuse Optimizations in NVIDIA TensorRT-LLM. NVIDIA Developer Blog, Jan. 16, 2025. https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/
7. NVIDIA: KV Cache Reuse. NVIDIA NIM Large Language Models Documentation (2025). https://docs.nvidia.com/nim/large-language-models/latest/kv-cache-reuse.html
8. Xu, Z., Goyal, V.S., Rush, A.: Recycled Attention: Efficient Inference for Long-Context Language Models. arXiv preprint arXiv:2411.05787 (2024). https://arxiv.org/abs/2411.05787
9. Wu, J., Zhang, O., Chen, Y.: Layer-Condensed KV Cache for Efficient Inference of Large Language Models. ACL 2024 Long Papers. https://aclanthology.org/2024.acl-long.602.pdf
10. Ge, Y., Zhang, Z., Chen, B.: Adaptive KV Cache Compression for Efficient LLM Inference. arXiv preprint arXiv:2310.01801 (2023). https://arxiv.org/pdf/2310.01801
11. Li, Y., Jiang, H., Wu, Q., Luo, X., Ahn, S., Zhang, C., Abdi, A., Li, D., Gao, J., Yang, Y., Qiu, L.: SCBench: A KV Cache-Centric Analysis of Long-Context Methods. arXiv preprint arXiv:2412.10319v2 (2025). https://arxiv.org/html/2412.10319v2
12. Hu, J., Li, X., Cheng, Y., Ray, S., Huang, Y., Zhang, Q., Du, K., Yao, J., Lu, S., Ananthanarayanan, G., Hoffmann, H., Jiang, J.: Chelsea: Efficient Long-Context LLM Inference via Online KV-Cache Clustering. arXiv preprint arXiv:2506.11418 (2025). https://arxiv.org/abs/2506.11418
13. Behnam, P., Fu, Y., Zhao, R., Tsai, P.-A., Yu, Z., Tumanov, A.: RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression. arXiv preprint arXiv:2502.14051 (2025). https://arxiv.org/abs/2502.14051
14. Chen, Y., Wang, G., Li, Z., Xu, H., Liu, W., He, X., Geng, Z.: NACL: A General and Effective KV-Cache Eviction Framework for Long-Context Inference. ACL 2024 Long Papers. https://aclanthology.org/2024.acl-long.428.pdf
15. Liu, Y., Li, H., Cheng, Y., Ray, S., Huang, Y., Zhang, Q., Du, K., Yao, J., Lu, S., Ananthanarayanan, G., Hoffmann, H., Maire, M.: CacheGen: KV Cache Compression and Streaming for Fast LLM Serving. SIGCOMM 2024. https://cs.stanford.edu/keithw/sigcomm2024/sigcomm24-final1571-acmpaginated.pdf
16. Chen, Y., You, Z., Zhang, S., Li, H., Li, Y., Wang, Y., Tan, M.: Core Context Aware (CCA) Attention: Efficient Long-Range Context Modeling for Transformers. arXiv preprint arXiv:2412.12465 (2024). https://arxiv.org/abs/2412.12465
17. Wang, G., Upasani, S., Wu, C., Gandhi, D., Li, J., Hu, C., Li, B., Thakker, U.: LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference. arXiv preprint arXiv:2503.08879 (2025). https://arxiv.org/abs/2503.08879
18. Wu, W., Li, A., Zhang, Y., Chen, H.: TokenSelect: Dynamic Token-Level KV Cache Selection for Efficient Long-Context Inference. EMNLP 2025. https://aclanthology.org/2025.emnlp-main.1079.pdf
Similar Articles
- What the Desert Fathers Teach Data Scientists: Ancient Ascetic Principles for Ethical Machine-Learning Practice
- Comparative Analysis of Some Machine Learning Algorithms for the Classification of Ransomware
- Comparative Performance Analysis of Some Priority Queue Variants in Dijkstra’s Algorithm
- Transfer Learning in Detecting E-Assessment Malpractice from a Proctored Video Recordings.
- Dual-Modal Detection of Parkinson’s Disease: A Clinical Framework and Deep Learning Approach Using NeuroParkNet