Interpretable, Expert-Aligned Composite Metric with Domain-Aware Calibration for Evaluating Natural Language Generation
Authors
College of Computing Studies, Information and Communication Technology, Isabela State University (Philippines)
College of Computing Studies, Information and Communication Technology, Isabela State University (Philippines)
Article Information
DOI: 10.51244/IJRSI.2026.13020053
Subject Category: Computer Science
Volume/Issue: 13/2 | Page No: 598-609
Publication Timeline
Submitted: 2026-02-11
Accepted: 2026-02-19
Published: 2026-02-27
Abstract
Automated metrics for natural language generation (NLG) often show weak or unstable alignment with expert judgment in domain-specific settings that require interpretability and tunability. This study therefore designs and validates an interpretable composite metric that can be calibrated to expert consensus while remaining transparent. The researchers propose Comprehensive Quality Scoring (CQS), a hierarchical metric integrating contextual coherence and continuity (C3) with five interpretable linguistic factors (relevance, readability, conciseness, structure, and information density), and introduce CLARION-G, a constrained calibrator that learns a nonnegative simplex weight vector while preserving factor-level attribution. Evaluation uses 20 agriculture-oriented farmer FAQ items with responses generated by a local LLaMA 3.1 (8B) model and scored by expert panels across Agriculture, Linguistics, and Information Technology using a rubric based on MetricEval. Expert ratings are z-scored per rater and aggregated into a consensus target, with reliability assessed via ICC(2,1). To prevent leakage under N=20, calibration is performed strictly within leave-one-out cross-validation (LOOCV): train on N−1 items, freeze the weights, and score the held-out item. Uncertainty is quantified via Fisher-z confidence intervals and bootstrap resampling (B=1000). CLARION-G maximizes a penalized correlation objective with fixed coefficients λ₁=0.01, λ_ent=0.005, and λ_var=0.003, optimized using Differential Evolution (population=15, maxiter=50, tol=10⁻⁴, polish=True) with optional L-BFGS-B refinement (maxiter=300–500, ftol=10⁻⁶–10⁻⁸). In Agriculture, calibrated CQS achieves Pearson's r=0.688 with 95% CI [0.353, 0.867], surpassing baselines (e.g., BERTScore, Prometheus, METEOR) with statistically significant dependent-correlation gains. The learned top-level weights allocate 0.4 to C3 and 0.6 to linguistic quality, emphasizing relevance and information density.
Bland-Altman analysis shows no fixed bias, with limits of agreement of ±0.1134, and runtime remains practical (≈1.254 ms/item), supporting CQS/CLARION-G as an interpretable and operationally lightweight framework for expert-aligned NLG evaluation in specialized domains.
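The calibration pipeline described in the abstract — learning a nonnegative simplex weight vector over factor scores by maximizing a penalized Pearson correlation with Differential Evolution, strictly inside LOOCV, then reporting a Fisher-z confidence interval — can be sketched as follows. This is not the authors' released code: the synthetic data, the single L2 penalty (the paper's objective has additional terms), and the `to_simplex` reparameterization are illustrative assumptions.

```python
# Hedged sketch of CLARION-G-style calibration under LOOCV.
# Data, factor weights, and penalty form are illustrative assumptions.
import numpy as np
from scipy.optimize import differential_evolution
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
N, K = 20, 6                         # 20 items; C3 plus 5 linguistic factors
X = rng.random((N, K))               # per-item factor scores in [0, 1]
true_w = np.array([0.4, 0.2, 0.1, 0.05, 0.05, 0.2])
y = X @ true_w + 0.02 * rng.standard_normal(N)   # synthetic consensus target

def to_simplex(v):
    """Map a nonnegative box vector onto the probability simplex."""
    w = np.abs(v) + 1e-12
    return w / w.sum()

def objective(v, X_tr, y_tr, lam1=0.01):
    # Negated penalized Pearson correlation (DE minimizes).
    w = to_simplex(v)
    r, _ = pearsonr(X_tr @ w, y_tr)
    return -(r - lam1 * np.sum(w ** 2))

preds = np.empty(N)
for i in range(N):                   # LOOCV: fit on N-1, freeze, score held-out
    mask = np.arange(N) != i
    res = differential_evolution(
        objective, bounds=[(0.0, 1.0)] * K, args=(X[mask], y[mask]),
        popsize=15, maxiter=50, tol=1e-4, polish=True, seed=0)
    preds[i] = X[i] @ to_simplex(res.x)

r_loocv, _ = pearsonr(preds, y)

# Fisher-z 95% confidence interval for the held-out correlation.
z, se = np.arctanh(r_loocv), 1.0 / np.sqrt(N - 3)
ci = (np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se))
print(f"LOOCV r = {r_loocv:.3f}, 95% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```

Because the weights are frozen before the held-out item is scored, the reported correlation reflects out-of-sample alignment rather than in-sample fit, which is the leakage safeguard the abstract emphasizes for N=20.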
Keywords
Automated evaluation, Comprehensive Quality Scoring
References
1. Aickin, M., & Gensler, H. (1996). Adjusting for multiple testing when reporting research results: The Bonferroni vs Holm methods. American Journal of Public Health, 86(5), 726–728. https://doi.org/10.2105/AJPH.86.5.726 [Google Scholar] [Crossref]
2. Aynetdinov, A., & Akbik, A. (2024). SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity. http://arxiv.org/abs/2401.17072 [Google Scholar] [Crossref]
3. Banerjee, S., & Lavie, A. (2005). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 65–72. [Google Scholar] [Crossref]
4. Bartko, J. J. (1966). The intraclass correlation coefficient as a measure of reliability. Psychological Reports, 19(1), 3–11. https://doi.org/10.2466/pr0.1966.19.1.3 [Google Scholar] [Crossref]
5. Bavaresco, A., Bernardi, R., Bertolazzi, L., Elliott, D., Fernández, R., Gatt, A., Ghaleb, E., Giulianelli, M., Hanna, M., Koller, A., Martins, A. F. T., Mondorf, P., Neplenbroek, V., Pezzelle, S., Plank, B., Schlangen, D., Suglia, A., Surikuchi, A. K., Takmaz, E., & Testoni, A. (2024). LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks. http://arxiv.org/abs/2406.18403 [Google Scholar] [Crossref]
6. Chung, G., & Baker, E. L. (2003). Issues in the reliability and validity of automated scoring of constructed responses. Automated Essay Grading: A Cross-Disciplinary Approach, 23–40. [Google Scholar] [Crossref]
7. Colombo, P., Peyrard, M., Noiry, N., West, R., & Piantanida, P. (2023). The Glass Ceiling of Automatic Evaluation in Natural Language Generation. IJCNLP-AACL 2023 - 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, Findings of the Association for Computational Linguistics: IJCNLP-AACL, 178–183. https://doi.org/10.18653/v1/2023.findings-ijcnlp.16 [Google Scholar] [Crossref]
8. Cox, N. J. (2008). Speaking Stata: Correlation with confidence, or Fisher's z revisited. Stata Journal, 8(3), 413–439. https://doi.org/10.1177/1536867x0800800307 [Google Scholar] [Crossref]
9. Davoodijam, E., & Alambardar Meybodi, M. (2024). Evaluation metrics on text summarization: Comprehensive survey. Knowledge and Information Systems, 66(12), 7717–7738. https://doi.org/10.1007/s10115-024-02217-0 [Google Scholar] [Crossref]
10. Gao, M., Hu, X., Ruan, J., Pu, X., & Wan, X. (2024). LLM-based NLG Evaluation: Current Status and Challenges. http://arxiv.org/abs/2402.01383 [Google Scholar] [Crossref]
11. Giavarina, D. (2015). Understanding Bland Altman analysis. Biochemia Medica, 25(2), 141–151. https://doi.org/10.11613/BM.2015.015 [Google Scholar] [Crossref]
12. Graham, Y., & Baldwin, T. (2014). Testing for significance of increased correlation with human judgment. EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, 172–176. https://doi.org/10.3115/v1/d14-1020 [Google Scholar] [Crossref]
13. Gunawan, D., Sembiring, C. A., & Budiman, M. A. (2018). The Implementation of Cosine Similarity to Calculate Text Relevance between Two Documents. Journal of Physics: Conference Series, 978(1). https://doi.org/10.1088/1742-6596/978/1/012120 [Google Scholar] [Crossref]
14. Hu, X., Gao, M., Hu, S., Zhang, Y., Chen, Y., Xu, T., & Wan, X. (2024). Are LLM-based Evaluators Confusing NLG Quality Criteria? Proceedings of the Annual Meeting of the Association for Computational Linguistics, 1, 9530–9570. https://doi.org/10.18653/v1/2024.acl-long.516 [Google Scholar] [Crossref]
15. Kalinauskaitė, D. (2018). Detecting information-dense texts: Towards an automated analysis. CEUR Workshop Proceedings, 2145, 95–98. [Google Scholar] [Crossref]
16. Karch, J. (2020). Improving on adjusted R-squared. Collabra: Psychology, 6(1), 1–11. https://doi.org/10.1525/collabra.343 [Google Scholar] [Crossref]
17. Kim, S., Shin, J., Cho, Y., Jang, J., Longpre, S., Lee, H., Yun, S., Shin, S., Kim, S., Thorne, J., & Seo, M. (2024). Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models. 12th International Conference on Learning Representations, ICLR 2024, 1–37. [Google Scholar] [Crossref]
18. Kim, S., Suk, J., Welleck, S., Neubig, G., Longpre, S., Yuchen, B., Jamin, L., Lee, M., Lee, K., Seo, M., & Ai, K. (2024). PROMETHEUS 2: An Open Source Language Model Specialized in Evaluating Other Language Models. [Google Scholar] [Crossref]
19. Kirschen, R. H., O'Higgins, E. A., & Lee, R. T. (2000). A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. American Journal of Orthodontics and Dentofacial Orthopedics, 118(4), 456–461. https://doi.org/10.1067/mod.2000.109032 [Google Scholar] [Crossref]
20. Lin, C.-Y. (2004). Looking for a Few Good Metrics: ROUGE and its Evaluation. NTCIR Workshop, 1–8. [Google Scholar] [Crossref]
21. Meijer, R. J., & Goeman, J. J. (2013). Efficient approximate k-fold and leave-one-out cross-validation for ridge regression. Biometrical Journal, 55(2), 141–155. https://doi.org/10.1002/bimj.201200088 [Google Scholar] [Crossref]
22. O'Neill, J., & Bollegala, D. (2020). Learning to Evaluate Neural Language Models. In Computational Linguistics (L.-M. Nguyen, X.-H. Phan, K. Hasida, & S. Tojo (eds.); pp. 123–133). Springer Singapore. [Google Scholar] [Crossref]
23. Oyama, M., & Shimodaira, H. (2023). Revisiting Cosine Similarity via Normalized ICA-transformed Embeddings. https://doi.org/10.48550/arXiv.2406.10984 [Google Scholar] [Crossref]
24. Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 311–318. https://doi.org/10.3115/1073083.1073135 [Google Scholar] [Crossref]
25. Schmidtova, P., Mahamood, S., Balloccu, S., Dusek, O., Gatt, A., Gkatzia, D., Howcroft, D. M., Platek, O., & Sivaprasad, A. (2025). Automatic Metrics in Natural Language Generation: A Survey of Current Evaluation Practices. 557–583. https://doi.org/10.18653/v1/2024.inlg-main.44 [Google Scholar] [Crossref]
26. Steiger, J. H. (1980). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87(2), 245–251. https://doi.org/10.1037/0033-2909.87.2.245 [Google Scholar] [Crossref]
27. Susoy, Z. (2023). Lexical Density, Lexical Diversity and Academic Vocabulary Use: Differences in Dissertation Abstracts. Acuity: Journal of English Language Pedagogy, Literature, and Culture, 8(2), 198–210. https://doi.org/10.35974/acuity.v8i2.3079 [Google Scholar] [Crossref]
28. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). LLaMA: Open and Efficient Foundation Language Models. http://arxiv.org/abs/2302.13971 [Google Scholar] [Crossref]
29. Vajjala, S., & Meurers, D. (2016). Readability-based Sentence Ranking for Evaluating Text Simplification. http://arxiv.org/abs/1603.06009 [Google Scholar] [Crossref]
30. Verduijn, M., Peek, N., de Keizer, N. F., van Lieshout, E. J., de Pont, A. C. J. M., Schultz, M. J., de Jonge, E., & de Mol, B. A. J. M. (2008). Individual and Joint Expert Judgments as Reference Standards in Artifact Detection. Journal of the American Medical Informatics Association, 15(2), 227–234. https://doi.org/10.1197/jamia.M2493 [Google Scholar] [Crossref]
31. Xiao, Z., Zhang, S., Lai, V., & Liao, Q. V. (2023). Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory. EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings, 10967–10982. https://doi.org/10.18653/v1/2023.emnlp-main.676 [Google Scholar] [Crossref]
32. Yang, Y., Zhong, J., Wang, C., & Li, Q. (2022). Exploring Relevance and Coherence for Automated Text Scoring using Multi-task Learning. Proceedings of the International Conference on Software Engineering and Knowledge Engineering, SEKE, 323–328. https://doi.org/10.18293/SEKE2022-024 [Google Scholar] [Crossref]
33. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). BERTScore: Evaluating Text Generation with BERT. 8th International Conference on Learning Representations, ICLR 2020, 1–43. [Google Scholar] [Crossref]
Similar Articles
- What the Desert Fathers Teach Data Scientists: Ancient Ascetic Principles for Ethical Machine-Learning Practice
- Comparative Analysis of Some Machine Learning Algorithms for the Classification of Ransomware
- Comparative Performance Analysis of Some Priority Queue Variants in Dijkstraโs Algorithm
- Transfer Learning in Detecting E-Assessment Malpractice from a Proctored Video Recordings.
- Dual-Modal Detection of Parkinsonโs Disease: A Clinical Framework and Deep Learning Approach Using NeuroParkNet