Five Lines, One Question: A Micro-Benchmark for Evaluating Structural Code Understanding
Authors
Faculty of Science and Computing, Department of Computer Science, North Eastern University Gombe (Nigeria)
Department of Software Engineering, Nile University of Nigeria, Abuja (Nigeria)
Article Information
DOI: 10.51244/IJRSI.2025.12120061
Subject Category: Computer Science
Volume/Issue: 12/12 | Page No: 717-726
Publication Timeline
Submitted: 2025-12-22
Accepted: 2025-12-29
Published: 2026-01-05
Abstract
While large language models demonstrate impressive fluency in code generation, their ability to perform precise, structural reasoning about code remains poorly understood. This paper introduces Five Lines, One Question (5L1Q), a minimalist benchmark that isolates fundamental reasoning capabilities by presenting models with trivial five-line code snippets and atomic questions about their structure. Our evaluation of state-of-the-art models reveals a striking reasoning-comprehension gap: although models excel at pattern-based tasks like token localization (up to 92% accuracy), their performance dramatically degrades on tasks requiring structural reasoning such as change detection (as low as 22% accuracy) and symbolic substitution. This consistent failure pattern, where models struggle with operations that are trivial for traditional program analysis tools, suggests current architectures lack robust internal mechanisms for representing and manipulating code structure. Our benchmark provides a lightweight, interpretable diagnostic tool for evaluating this critical dimension of code understanding, offering a pathway toward models that combine generative fluency with genuine analytical capability.
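The benchmark items themselves are not reproduced on this page, but the task families named in the abstract can be illustrated with a hypothetical item. In the sketch below, the snippet, the question phrasing, and the helper names are our own invention, not the paper's actual format; the point is that ground truth for token localization and symbolic substitution is recovered trivially by a traditional program-analysis tool (here, Python's standard `ast` module), which is exactly the contrast the abstract draws.

```python
# Hypothetical 5L1Q-style item: a five-line snippet paired with atomic
# structural questions. Ground truth comes from Python's ast module --
# the kind of traditional analysis for which these tasks are trivial.
import ast

SNIPPET = """\
def scale(values, factor):
    total = 0
    for v in values:
        total += v * factor
    return total
"""

def identifier_lines(source: str, name: str) -> list[int]:
    """Token localization: every line where `name` occurs as an identifier."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Name) and node.id == name) or \
           (isinstance(node, ast.arg) and node.arg == name):
            lines.append(node.lineno)
    return sorted(lines)

def rename_identifier(source: str, old: str, new: str) -> str:
    """Symbolic substitution: rename every occurrence of `old` to `new`."""
    class Renamer(ast.NodeTransformer):
        def visit_Name(self, node):
            if node.id == old:
                node.id = new
            return node
        def visit_arg(self, node):
            if node.arg == old:
                node.arg = new
            return node
    return ast.unparse(Renamer().visit(ast.parse(source)))

# "On which lines does `factor` appear?" -> [1, 4]
print(identifier_lines(SNIPPET, "factor"))
# "Rewrite the snippet with `factor` renamed to `k`."
print(rename_identifier(SNIPPET, "factor", "k"))
```

A change-detection item could be scored the same way, for instance by comparing `ast.dump` of the original and edited snippets, underscoring that every task family has an exact, mechanically checkable answer.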
Keywords
code understanding, large language models, benchmark, structural reasoning, programming languages.
References
1. Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., ... Zaremba, W. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. https://doi.org/10.48550/arXiv.2107.03374
2. Chen, X., Lin, M., Schärli, N., and Zhou, D. (2024). CodeAgent: Enhancing code generation with tool-integrated agent systems. Proceedings of the ACM SIGPLAN Symposium on Programming Languages, 88–102. https://doi.org/10.1145/3671236.3674567
3. Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N., Prabhakaran, V., ... Fiedel, N. (2023). PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240), 1–113. https://jmlr.org/papers/v24/22-1144.html
4. Finnie-Ansley, J., Denny, P., Becker, B. A., Luxton-Reilly, A., and Prather, J. (2022). The robots are coming: Exploring the implications of OpenAI Codex on introductory programming. Proceedings of the 24th Australasian Computing Education Conference, 10–19. https://doi.org/10.1145/3511861.3511863
5. Haluptzok, P., Bowers, M., and Kalai, A. T. (2022). Language models can teach themselves to program better. arXiv preprint arXiv:2207.14502. https://doi.org/10.48550/arXiv.2207.14502
6. Jain, N., Vaidyanath, S., Iyer, A., Natarajan, N., Parthasarathy, S., Rajamani, S., and Sharma, R. (2022). Jigsaw: Large language models meet program synthesis. Proceedings of the 44th International Conference on Software Engineering (ICSE 2022), 1219–1231. https://doi.org/10.1145/3510003.3510203
7. Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. (2024). SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770. https://doi.org/10.48550/arXiv.2310.06770
8. Lai, Y., Li, C., Wang, Y., Zhang, T., Zhong, R., Zettlemoyer, L., Yih, W., Fried, D., Wang, S., and Yu, T. (2023). DS-1000: A natural and reliable benchmark for data science code generation. Proceedings of the 40th International Conference on Machine Learning, 18319–18345. https://proceedings.mlr.press/v202/lai23a.html
9. Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., Hubert, T., Choy, P., de Masson d'Autume, C., Babuschkin, I., Chen, X., Huang, P., Welbl, J., Gowal, S., Cherepanov, A., ... Vinyals, O. (2022). Competition-level code generation with AlphaCode. Science, 378(6624), 1092–1097. https://doi.org/10.1126/science.abq1158
10. Liu, J., Xia, C. S., Wang, Y., and Zhang, L. (2023). Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. Proceedings of the 37th International Conference on Neural Information Processing Systems. https://openreview.net/forum?id=1qvx610Cu7
11. Lu, S., Guo, D., Ren, S., Huang, J., Svyatkovskiy, A., Blanco, A., Clement, C., Drain, D., Jiang, D., Tang, D., Li, G., Zhou, L., Shou, L., Zhou, L., Tufano, M., Gong, M., Zhou, M., Duan, N., Sundaresan, N., ... Tufano, M. (2021). CodeXGLUE: A benchmark dataset and open challenge for code intelligence. arXiv preprint arXiv:2102.04664. https://doi.org/10.48550/arXiv.2102.04664
12. Ni, A., Iyer, S., Radev, D., Stoyanov, V., Yih, W.-t., Wang, S., and Lin, X. V. (2023). LEVER: Learning to verify language-to-code generation with execution. Proceedings of the 40th International Conference on Machine Learning (ICML 2023), 26106–26128. https://proceedings.mlr.press/v202/ni23a.html
13. Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Bieber, D., Dohan, D., Lewkowycz, A., Bosma, M., Luan, D., Sutton, C., and Odena, A. (2021). Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114. https://doi.org/10.48550/arXiv.2112.00114
14. OpenAI. (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774. https://doi.org/10.48550/arXiv.2303.08774
15. Rozière, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Remez, T., Rapin, J., Kozhevnikov, A., Evtimov, I., Bitton, J., Bhatt, M., Ferrer, C. C., Grattafiori, A., Xiong, W., Défossez, A., Copet, J., ... Synnaeve, G. (2023). Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950. https://doi.org/10.48550/arXiv.2308.12950
16. Shrivastava, D., Larionov, A., Shinde, P., and Allamanis, M. (2023). Retrieval-based localization for code generation with transformers. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 345–357. https://doi.org/10.18653/v1/2023.emnlp-main.23
17. Siddiq, M. L., Hassan, S., Latif, M. A., and Shahriyar, R. (2024). An empirical study of code smell detection by LLMs: Challenges and opportunities. Proceedings of the 21st International Conference on Mining Software Repositories (MSR 2024), 234–245. https://doi.org/10.1145/3643991.3644892
18. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., ... Scialom, T. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. https://doi.org/10.48550/arXiv.2307.09288
19. Valmeekam, K., Olmo, A., Sreedharan, S., and Kambhampati, S. (2023). On the planning abilities of large language models: A critical investigation. Advances in Neural Information Processing Systems, 36, 75993–76005. https://proceedings.neurips.cc/paper_files/paper/2023/hash/0c22d45b5cabc6b8c7d3c6ac9a37f7b7-Abstract-Conference.html
20. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2019). GLUE: A multi-task benchmark and analysis platform for natural language understanding. Proceedings of the 7th International Conference on Learning Representations. https://openreview.net/forum?id=rJ4km2R5t7
21. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., and Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837. https://proceedings.neurips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html
22. Zan, D., Chen, B., Lin, Z., Guan, B., Yong, J., and Lou, J.-G. (2023). Large language models for code repair: A benchmark and some initial observations. Proceedings of the 1st Workshop on Natural Language Processing for Software Engineering (NLP4SE 2023), 24–33. https://aclanthology.org/2023.nlp4se-1.3/
Similar Articles
- What the Desert Fathers Teach Data Scientists: Ancient Ascetic Principles for Ethical Machine-Learning Practice
- Comparative Analysis of Some Machine Learning Algorithms for the Classification of Ransomware
- Comparative Performance Analysis of Some Priority Queue Variants in Dijkstra’s Algorithm
- Transfer Learning in Detecting E-Assessment Malpractice from a Proctored Video Recordings
- Dual-Modal Detection of Parkinson’s Disease: A Clinical Framework and Deep Learning Approach Using NeuroParkNet