KV Cache Recycling to Expand Usable Context Capacity in Low Parameter LLMs
Authors
Department of Computer Science & Engineering, Invertis University (India)
Article Information
DOI: 10.51244/IJRSI.2025.12120067
Subject Category: Computer Science
Volume/Issue: 12/12 | Page No: 809-818
Publication Timeline
Submitted: 2026-01-05
Accepted: 2026-01-05
Published: 2026-01-05
Abstract
We investigate whether attention key-value (KV) states computed for one prompt on a small LLM (not an SLM, as it is built on an LLM architecture) can be reused to accelerate inference on a new, similar prompt, effectively expanding the model's usable context memory through an approach we call token recycling. Using a standard Hugging Face setup with DialoGPT-medium (a 345M-parameter GPT-2-style decoder trained on 147M Reddit exchanges, 2005-2017) as the testbed, we build a cache of past activations, retrieve entries via sentence embeddings, and reuse cached past key values when the cached prompt is an exact prefix of the new input. We compare recycled and baseline runs on latency and output fidelity, and log reuse depth in tokens. The approach requires no model modifications: cached KVs are serialized to the CPU, reloaded, and supplied to the generate function to continue decoding from the cached prefix. In tests, we observe consistent speedups whenever prefix overlap exists, with no material degradation in output semantics; when overlap is absent, behavior matches the baseline.
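The exact-prefix lookup described above can be sketched in plain Python. This is a minimal illustration, not the authors' actual code: the class and method names are hypothetical, and the KV payload is a placeholder standing in for the serialized past_key_values tensors that would be moved to CPU for storage and back to the model device on a hit.

```python
# Hypothetical sketch of the exact-prefix KV-reuse lookup: cache entries
# are keyed by the cached prompt's token ids, and a hit requires the
# cached prompt to be an exact token-level prefix of the new input.

class KVRecycler:
    def __init__(self):
        # token-id tuple of a cached prompt -> opaque KV payload
        # (in practice, past_key_values tensors serialized to CPU)
        self._cache = {}

    def store(self, token_ids, kv_states):
        """Cache the KV states computed for `token_ids`."""
        self._cache[tuple(token_ids)] = kv_states

    def lookup(self, token_ids):
        """Return (reuse_depth, kv_states) for the longest cached prompt
        that is an exact token-level prefix of `token_ids`, else (0, None)."""
        best_len, best_kv = 0, None
        ids = tuple(token_ids)
        for cached_ids, kv in self._cache.items():
            n = len(cached_ids)
            if n > best_len and ids[:n] == cached_ids:
                best_len, best_kv = n, kv
        return best_len, best_kv


rec = KVRecycler()
rec.store([50256, 11, 22], "kv-for-prefix")    # placeholder KV payload
depth, kv = rec.lookup([50256, 11, 22, 7, 9])  # new prompt shares the prefix
print(depth)  # -> 3 (reuse depth in tokens)
```

On a hit, the cached past key values would be passed to the model's generate function so decoding continues after the first `depth` tokens; on a miss (reuse depth 0), generation proceeds exactly as in the baseline, matching the behavior reported in the abstract.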
Keywords
Computer Science & Engineering
References
1. Zhang, Y., Sun, S., Galley, M., Chen, Y.-C., Brockett, C., Gao, X., Dolan, W.B.: DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation. arXiv preprint arXiv:1911.00536 (2019). https://arxiv.org/abs/1911.00536
2. Gurnee, W., Horsley, T., Guo, Z.C., Kheirkhah, T.R., Sun, Q., Hathaway, W., Nanda, N., Bertsimas, D.: Universal Neurons in GPT-2 Language Models. arXiv preprint arXiv:2401.12181 (2024). https://arxiv.org/abs/2401.12181
3. Kim, Y., Kang, J., Park, S.: KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction. arXiv preprint arXiv:2505.23416 (2025). https://arxiv.org/abs/2505.23416
4. Kwon, W., Lee, S., Li, S., Luo, Z., Zheng, L., Liu, Z., Zhang, H., Stoica, I.: Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv preprint arXiv:2309.06180 (2023). https://arxiv.org/abs/2309.06180
5. Not Lain: KV Caching Explained: Optimizing Transformer Inference Efficiency. Hugging Face Blog, Jan. 30, 2025. https://huggingface.co/blog/not-lain/kv-caching
6. Thomson, J., Shah, A., Tewari, L.: Introducing New KV Cache Reuse Optimizations in NVIDIA TensorRT-LLM. NVIDIA Developer Blog, Jan. 16, 2025. https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/
7. NVIDIA: KV Cache Reuse. NVIDIA NIM Large Language Models Documentation (2025). https://docs.nvidia.com/nim/large-language-models/latest/kv-cache-reuse.html
8. Xu, Z., Goyal, V.S., Rush, A.: Recycled Attention: Efficient Inference for Long-Context Language Models. arXiv preprint arXiv:2411.05787 (2024). https://arxiv.org/abs/2411.05787
9. Wu, J., Zhang, O., Chen, Y.: Layer-Condensed KV Cache for Efficient Inference of Large Language Models. ACL 2024 Long Papers. https://aclanthology.org/2024.acl-long.602.pdf
10. Ge, Y., Zhang, Z., Chen, B.: Adaptive KV Cache Compression for Efficient LLM Inference. arXiv preprint arXiv:2310.01801 (2023). https://arxiv.org/pdf/2310.01801
11. Li, Y., Jiang, H., Wu, Q., Luo, X., Ahn, S., Zhang, C., Abdi, A., Li, D., Gao, J., Yang, Y., Qiu, L.: SCBench: A KV Cache-Centric Analysis of Long-Context Methods. arXiv preprint arXiv:2412.10319v2 (2025). https://arxiv.org/html/2412.10319v2
12. Hu, J., Li, X., Cheng, Y., Ray, S., Huang, Y., Zhang, Q., Du, K., Yao, J., Lu, S., Ananthanarayanan, G., Hoffmann, H., Jiang, J.: Chelsea: Efficient Long-Context LLM Inference via Online KV-Cache Clustering. arXiv preprint arXiv:2506.11418 (2025). https://arxiv.org/abs/2506.11418
13. Behnam, P., Fu, Y., Zhao, R., Tsai, P.-A., Yu, Z., Tumanov, A.: RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression. arXiv preprint arXiv:2502.14051 (2025). https://arxiv.org/abs/2502.14051
14. Chen, Y., Wang, G., Li, Z., Xu, H., Liu, W., He, X., Geng, Z.: NACL: A General and Effective KV-Cache Eviction Framework for Long-Context Inference. ACL 2024 Long Papers. https://aclanthology.org/2024.acl-long.428.pdf
15. Liu, Y., Li, H., Cheng, Y., Ray, S., Huang, Y., Zhang, Q., Du, K., Yao, J., Lu, S., Ananthanarayanan, G., Hoffmann, H., Maire, M.: CacheGen: KV Cache Compression and Streaming for Fast LLM Serving. SIGCOMM 2024. https://cs.stanford.edu/keithw/sigcomm2024/sigcomm24-final1571-acmpaginated.pdf
16. Chen, Y., You, Z., Zhang, S., Li, H., Li, Y., Wang, Y., Tan, M.: Core Context Aware (CCA) Attention: Efficient Long-Range Context Modeling for Transformers. arXiv preprint arXiv:2412.12465 (2024). https://arxiv.org/abs/2412.12465
17. Wang, G., Upasani, S., Wu, C., Gandhi, D., Li, J., Hu, C., Li, B., Thakker, U.: LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference. arXiv preprint arXiv:2503.08879 (2025). https://arxiv.org/abs/2503.08879
18. Wu, W., Li, A., Zhang, Y., Chen, H.: TokenSelect: Dynamic Token-Level KV Cache Selection for Efficient Long-Context Inference. EMNLP 2025. https://aclanthology.org/2025.emnlp-main.1079.pdf
Similar Articles
- What the Desert Fathers Teach Data Scientists: Ancient Ascetic Principles for Ethical Machine-Learning Practice
- Comparative Analysis of Some Machine Learning Algorithms for the Classification of Ransomware
- Comparative Performance Analysis of Some Priority Queue Variants in Dijkstra’s Algorithm
- Transfer Learning in Detecting E-Assessment Malpractice from a Proctored Video Recordings.
- Dual-Modal Detection of Parkinson’s Disease: A Clinical Framework and Deep Learning Approach Using NeuroParkNet